ML-model

Part I

Prediction and Analysis of Global Epidemiological Trends in Bladder Cancer

1. Materials and Methods

1.1. Data Source

The global challenge of bladder cancer necessitates prospective epidemiological data to guide prevention and control efforts. Establishing effective prediction models is therefore crucial for assessing disease trends and formulating interventions. Data for this module were obtained from the Global Burden of Disease (GBD 2021) database, developed by the Institute for Health Metrics and Evaluation (IHME). This platform provides detailed statistics from 1990 to 2021 on diseases, injuries, and risk factors globally and regionally, allowing analysis by age, sex, and region, thus supporting public health research and policy. The general methodology for GBD 2021 data extraction is detailed on the official GBD website (http://www.healthdata.org/gbd/). This study extracted global data on bladder cancer Disability-Adjusted Life Years (DALYs), prevalence, and deaths for the period 1990–2021.

Where YLLs denote years of life lost due to premature mortality, and YLDs represent years lived with disability.DALYs are a composite metric quantifying the healthy life years lost due to premature mortality (Years of Life Lost, YLLs) and disability (Years Lived with Disability, YLDs) from various diseases. It combines the loss from premature death (difference between actual age at death and standard life expectancy at that age) and the loss of health due to disability (Formula 1).Where YLLs denote years of life lost due to premature mortality, and YLDs represent years lived with disability.

Formula 1

1.2. Data Processing

A database of global bladder cancer DALYs, prevalence, and death counts and rates (per 100,000 population) from 1990 to 2021 was constructed using Excel 2019 to analyze trends in disease burden. The dataset contained no missing values, and no outliers were identified using the interquartile range (IQR) method (with outliers defined as values beyond Q1-1.5IQR or Q3+1.5IQR). The data were then partitioned into a training set (1990–2016) and a validation set (2017–2021) for assessing and optimizing prediction model performance. To enhance prediction accuracy and stability, three distinct forecasting methods—ARIMA, LSTM, and NNAR—were employed as base models, leveraging their respective strengths and applicability. Model development and comparison were implemented using the "forecast" package in R (v4.3.1) and sklearn (Python v3.11.1), among other tools.

1.3. Autoregressive Integrated Moving Average (ARIMA) Model

The ARIMA model is suitable for stationary time series data, integrating Autoregressive (AR), Integrated (I - differencing), and Moving Average (MA) components. It effectively captures linear relationships and demonstrates strong capability in modeling trending data, making it appropriate for fitting the rising incidence and mortality trends of bladder cancer[1]. Its basic principle can be expressed as:

Formula 2

Where:\(\phi\) represents the AR coefficients,\(\theta\) represents the MA coefficients,\(\epsilon_t\) is the error term.

1.4. Long Short-Term Memory (LSTM) Network

LSTM is a specialized type of Recurrent Neural Network (RNN). This model learns a function that maps historical data points to future observations. It utilizes sequences derived from the data to predict future values. The model requires a list of features (\(X\)), corresponding labels (\(y\)), and a window size (\(time\_steps\)). The process involves sliding a \(time\_steps\)-sized window across the data to generate target values (\(y_s\)) and input sequences (\(X_s\)). \(y_s\) is the value immediately following the \(X_s\) window, which consists of \(time\_steps\) consecutive feature values.LSTM's unique gating mechanism enables it to handle complex nonlinear dynamics and long-term dependencies, making it ideal for processing long sequence data. The fundamental algorithm is:

Formula 3

\(f_t\) represents the forget gate, \(W_f\) is a weight coefficient, \(b_f\) is the bias, \(x_t\) represents the input. Subsequently, The model decides which values to keep (\(r_t\)) and generates candidate values (\(\tilde{C}_t\)) via the \(\tanh\) function. Combining these results yields the updated cell state:

Formula 4

Formula 5

Formula 6

In Formula (4), \(W_c\) represents the weight coefficients for the input gate.

Formula 7

Formula 8

This process yields the output of the LSTM model, whose basic architecture is shown in Figure 1. The original time series data were normalized to the [0, 1] range using the MinMaxScaler method to eliminate the influence of feature scale differences on model training, as expressed by the formula:

Formula 9

Subsequently, the time series was transformed into a supervised learning format using a sliding window mechanism. The model design employed a single-layer LSTM network with 50 hidden units to capture long-term dependencies and nonlinear dynamics within the sequence. A fully connected layer was added after the LSTM layer to map the temporal features to a scalar prediction value. During training, the Adam optimizer and Mean Squared Error (MSE) loss function were used. Training proceeded for 100 epochs with a batch size of 1 (online learning), aiming to minimize the error between predicted and true values.

Figure 1: LSTM algorithm schematic diagram

1.5. Neural Network Autoregression (NNAR) Model

The NNAR model is a nonlinear time series model that utilizes a single-hidden-layer feedforward neural network structure to capture nonlinear relationships within fluctuating sequences. Its default setting of 20 hidden nodes allows it to capture complex nonlinear features in the data. The model uses lagged values of the time series as input features, constructing a specific type of neural network denoted as NNAR(p, k), where p represents the number of lagged inputs and k is the number of nodes in the hidden layer[2]. NNAR can also provide more stable prediction results when combined with other models, showing potential for model improvement[3].

1.6. Model Performance Comparison

In this study, ARIMA, LSTM, and NNAR models were used separately to perform short-term trend predictions for bladder cancer DALYs, prevalence, and death-related data (counts and rates). The reliability of the three models was assessed using data from 2017–2021 as the validation set. Three evaluation metrics—Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE)—were employed to quantify the prediction error of each model. Subsequently, based on the validation results, the optimal model was selected, retrained on the full dataset, and used to forecast trends up to 2035.

RMSE is the square root of the mean of the squared differences between predicted and actual values. It is sensitive to large errors and can indicate the presence of outliers. MAE is the mean of the absolute differences. It is less sensitive to outliers and intuitively reflects the accuracy of the predicted trend. MAPE is the mean of the absolute percentage differences. It reflects relative error and is suitable for comparing model performance across datasets with different scales. The calculations are as follows:

Formula 10

Formula 11

Formula 12

Where \(\hat{y}_{i}\) is the predicted value, \(y_{i}\) is the actual value, and \(n\) is the number of samples.

2. Results

2.1. Overview of Global Bladder Cancer Incidence and Mortality, 1990–2021

In 1990, the global number of bladder cancer cases was 1.3291 million, and deaths were 123,100. In 2021, the number of cases reached 3.0256 million, and deaths exceeded 221,900 (Figure 2). Compared to 1990, the global number of cases, deaths, and DALYs increased by 127.65%, 80.21%, and 60.89%, respectively.

Bladder cancer incidence and mortality rates showed a continuous upward trend globally (Figure 3). Due to difficulties in early diagnosis and treatment resistance, bladder cancer ranked as the 14th leading cause of cancer death globally in 2021.

Figure 2: Global bladder cancer epidemiological data from 1990 to 2021

Figure 3: Changes in global bladder cancer mortality rates over 30 years

2.2. Model Performance Comparison

Comparing the three prediction performance metrics, the ARIMA model outperformed the NNAR and LSTM models in predicting global bladder cancer epidemiological trends, indicated by lower values for all three metrics (Table 1). Particularly for prevalence prediction, the MAPE for the ARIMA model was 0.352%, and its RMSE was only 17.9% of that of the NNAR model. For death data predictions, the performance of ARIMA and NNAR models was similar.

Table 1: Predictive Model Evaluation Performance (DALYs)

Table 2: Predictive Model Evaluation Performance (Prevalence)

Table 3: Predictive Model Evaluation Performance (Deaths)

Comparative analysis of the aforementioned models revealed that the ARIMA model holds a significant advantage in capturing the average evolutionary patterns of linear trends but faces limitations in resolving nonlinear dynamics. Conversely, the LSTM model excels at modeling nonlinear patterns. Consequently, this study proposed constructing an ARIMA-LSTM ensemble model to integrate the strengths of both approaches, enabling the identification and characterization of both linear and nonlinear components within the data[4].

In integrating the LSTM and ARIMA models, since they handle nonlinear and linear prediction tasks respectively, direct weighted averaging might lead to accumulated prediction errors. To enhance the ensemble model's fitting accuracy, this study employed the LSTM model to process the prediction residuals from the ARIMA model, aiming to eliminate potential nonlinear factors within the residuals and further optimize prediction performance[5]. The construction strategy was as follows:

First, the ARIMA model was applied to analyze the linear component of the bladder cancer DALYs, prevalence, and death data (counts and rates). Residual diagnosis via autocorrelation/partial autocorrelation function plots and the Ljung-Box test indicated that the residual series satisfied the white noise property (p > 0.05), suggesting the ARIMA model had sufficiently extracted information from the time series (Figure 4).

Figure 4: Autocorrelation/Partial Autocorrelation Function Plots

Subsequently, the residuals from the ARIMA model, which often contain nonlinear features, were extracted. These residuals were fed into the LSTM model, which can handle complex nonlinear relationships and correct the residuals, thereby enhancing prediction accuracy. The final expression is:

Formula 13

Where \(\hat{Y}_{\text{ARIMA}}\) is the linear prediction from the ARIMA model, and \(\hat{Y}_{\text{LSTM}}\) is the LSTM prediction of the ARIMA residuals[6]. Thus, this ensemble model improves overall prediction accuracy through joint modeling of linear and nonlinear components.

Figure 5 illustrates the fitting performance of the ARIMA, LSTM, and ARIMA-LSTM models on the test set data, showing the clear upward trend in global bladder cancer incidence and mortality rates. Compared to the single ARIMA model, the ARIMA-LSTM ensemble model more accurately captured this characteristic. For DALYs prediction on the test set, the MAPE of the ARIMA-LSTM ensemble model was only 0.301%, nearly doubling the prediction accuracy compared to the ARIMA model (Table 4).

Figure 5: Fitting performance of the three models on the test set data

Table 4: Predictive Model Evaluation Performance (DALYs) - Comparison including Ensemble

2.3. ARIMA-LSTM Ensemble Model Prediction Results

Figure 6: ARIMA-LSTM model prediction results

The established ARIMA-LSTM model was used to forecast the absolute numbers and incidence rates of global bladder cancer DALYs, prevalence, and deaths for the next decade (Figure 6).

For prevalence, the model results indicate a continuing upward trend in the coming decades. The number of cases is projected to increase from 3.0256 million in 2021 to 3.7955 million in 2035, with the prevalence rate per 100,000 expected to grow by 15.75% over the same period, showing significant average annual increases in both number and rate (Table 5). This growth trend is closely associated with persistent risk factors such as population aging, high smoking rates, and occupational carcinogen exposure.

Simultaneously, predictions from the ARIMA-LSTM ensemble model suggest a stable yet rising trend in the global disease burden attributable to bladder tumors. The DALYs rate per 100,000 is projected to increase by 3.63%, and bladder cancer deaths are expected to exceed 265,000 by 2035. This trend underscores that improving patient outcomes has become a critical issue urgently needing address in global public health. Future efforts must strengthen early diagnosis, standardized treatment, and long-term management to mitigate the health loss and social burden caused by this disease through multi-level interventions.

In summary, bladder cancer constitutes a significant public health challenge worldwide. The prediction results emphasize the necessity for increased investment and attention towards bladder cancer prevention and control to alleviate its potential impact on public health systems and socioeconomic development.

Table 5: Prediction Analysis Data

Part II

Screening of Diagnostic Genes and Analysis of Immune Cell Infiltration Characteristics in Bladder Cancer

1. Materials and Methods

1.1. Data Preparation

Microarray datasets GSE13507 (GPL6102, normal tissue=10, tumor tissue=165) and GSE40355 (GPL13497, normal tissue=8, tumor tissue=16) were selected and downloaded from the Gene Expression Omnibus (GEO) database, yielding 18 normal and 181 bladder tumor tissue samples. The GEO data served as the training set. RNA-seq FPKM data for bladder cancer were obtained from The Cancer Genome Atlas (TCGA) database, comprising 19 normal and 411 bladder tumor tissue samples, which served as the validation set. Clinical data for TCGA bladder cancer cases were also downloaded, and survival information was compiled for 404 patients. During data processing, missing values were detected in some samples. To preserve data distribution characteristics and minimize information loss, missing values were imputed using the supervised k-Nearest Neighbors (KNN) algorithm[7]. Compared to deleting samples or using global mean/median imputation, KNN imputes missing values based on the feature values of the k most similar complete samples, calculated via sample similarity[8]. Low-variance genes were filtered out, retaining the top 80% of genes by variance.

1.2. Screening of Differentially Expressed Genes (DEGs)

Batch effects in the training set data were corrected using the "sva" package. Differential expression analysis was performed using the "limma" package. Based on literature precedents[9], the filtering criteria for DEGs were set as \( \left| \log FC \right| > 2 \), (FC: fold change) and adjusted p-value (adj. P.Val) < 0.05. All R packages in this study were run on the R 4.1.0 platform.

1.3. Gene Function and Pathway Enrichment Analysis

Functional annotation of DEGs was performed using the "clusterProfiler" R package against the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) databases. Gene Set Enrichment Analysis (GSEA) was conducted to analyze functional differences between the experimental (tumor) and control (normal) groups. A p-value < 0.05 indicated significant enrichment.

1.4. Machine Learning Algorithms for Screening Key Bladder Cancer Genes

Multiple studies advocate using multi-faceted feature selection strategies to construct high-accuracy cancer diagnostic prediction systems[10,11]. To further identify key genes for bladder tumors from the DEGs, we employed a combination of three machine learning algorithms: LASSO regression (Least Absolute Shrinkage and Selection Operator regression), SVM-RFE (Support Vector Machine-Recursive Feature Elimination), and Boruta, for screening diagnostic biomarkers. This approach leverages the complementary strengths of the algorithms to overcome selection bias inherent in single methods and ensures the biological robustness of the feature genes. The "glmnet" package was used for LASSO regression, performing 10-fold cross-validation to find the lambda value corresponding to the minimum error, thus obtaining the feature genes at that point. The SVM-RFE algorithm was implemented using the "e1071", "kernlab", and "caret" packages. The prediction function was set to "caretFuncs", the sampling method was "cv" (cross-validation), and the "svmRadial" method was used for feature selection. The Boruta algorithm allows for all-relevant feature importance assessment; its implementation process is summarized in Figure 7[12,13]. The feature genes identified by all three methods were intersected using the "venn" package to obtain the key diagnostic genes for bladder cancer.

Figure 7: Flowchart of the Boruta algorithm

1.5. Validation of Key Genes

The validation set data were used to verify the identified key genes. Boxplots comparing expression between normal and bladder tumor tissues were generated using the "ggpubr" package, and the Wilcoxon rank-sum test was used to assess statistical significance between groups. Receiver Operating Characteristic (ROC) curves for the key genes were plotted using the "pROC" package.

1.6. Survival Analysis of Key Genes

Based on the median expression value of each key gene, patients were divided into high-expression and low-expression groups. Survival curves were plotted using the "survival" package, and differences between groups were assessed using the Log-Rank test. Results were further validated using the GEPIA database (http://gepia.cancer-pku.cn/) .

1.7. Correlation Analysis Between Prognostic Genes and Immune Infiltrating Cells

The adjusted validation set data were uploaded to the CIBERSORT web portal (http://cibersort.stanford.edu/) for analysis of Tumor Immune Infiltrating Cells (TICs). Using the LM22 signature matrix (22 immune cell types) as a reference, 1000 permutations were run, and results were filtered with a significance threshold of p < 0.05. The "vioplot" package was used to analyze differences in immune cell abundance between normal and tumor tissues. The "limma", "reshape2", and "ggExtra" packages were used to assess the relationships between prognosis-related genes and immune infiltrating cells. Spearman's rank correlation test was used for statistical analysis, and correlation heatmaps were generated using "ggplot2".

1.8. Prognostic Analysis of Immune Infiltrating Cells

Based on the median abundance of each lymphocyte subset in the Tumor Microenvironment (TME), the 22 lymphocyte types were dichotomized into high and low groups. Survival analysis was performed again using TCGA clinical data to explore the impact of lymphocyte abundance on patient prognosis.

2. Results

2.1. Differentially Expressed Genes

A total of 176 differentially expressed genes (DEGs) were identified from the GEO database, comprising 21 upregulated and 155 downregulated genes. A volcano plot was generated for the top 50 genes ranked by absolute fold change, as shown in Figure 8.

Figure 8: Volcano plot of differentially expressed genes between groups in the training set

2.2. Identification of Biological Processes and Pathways for DEGs

GO functional annotation categorizes gene functions into Cellular Component (CC), Molecular Function (MF), and Biological Process (BP). The DEGs were significantly enriched in 46 functional terms, including 19 BPs, 15 CCs, and 12 MFs (Figure 9A). In BP, DEGs were enriched in pathways related to the muscle system process. In CC, DEGs were enriched in functions related to the collagen-containing extracellular matrix. In MF, DEGs were significantly enriched in activities such as actin binding. KEGG pathway enrichment analysis (Figure 9B) showed significant enrichment of DEGs in pathways like MAPK signaling, cell cycle, and focal adhesion. GSEA revealed the top 5 active pathways in the control (normal) group were calcium signaling pathway, MAPK signaling pathway, vascular smooth muscle contraction, etc. (Figure 9C). In contrast, the top 5 active pathways in the tumor group were cell cycle, DNA replication, proteasome, and pathways in cancer, etc. (Figure 9D), showing substantial differences from the control group.

Figure 9: A: GO functional annotation results for DEGs; B: KEGG enrichment analysis results; C-D: Differences in functions and pathways between tumor and control groups via GSEA

2.3. Key Genes

The linear fit result [from LASSO cross-validation] indicated that the cross-validation error was minimized when 11 genes were selected (Figure 10). LASSO regression identified 11 potential feature genes: CYTL1, SRPX, LIFR, etc (Table 6).

Figure 10: LASSO regression cross-validation linear fit result

The SVM-RFE algorithm result is shown in Figure 11. The x-axis represents the number of features, and the y-axis represents the cross-validation error. The error was minimized when the feature dimension was reduced to 10, indicating optimal model performance. Ten key genes were ultimately selected: PI16, HLF, PMP2, AURKB, CYTL1, etc (Table 6).

Figure 11: SVM-RFE cross-validation result

The Boruta algorithm finally confirmed 16 genes with significant diagnostic value (Figure 12 and Table 6), including CYTL1, IQGAP3, MOXD1, SRPX and so on.

Figure 12: Feature gene selection results using the Boruta algorithm

Intersection analysis of the three algorithms (Figure 13) revealed that five genes—CYTL1, PMP2, IQGAP3, ADH1B, and SRPX—were commonly identified as candidate diagnostic biomarkers for bladder cancer. Except for IQGAP3, which was highly expressed in the tumor group, the other genes showed low expression.

Figure 13: Intersection of results from the three algorithms

Figure 14: A: Boxplot of key gene expression in the validation set, showing significant differences between normal and bladder tumor tissues; B-F: ROC curves for the key genes

Table 6: List of Potential Genes Identified by Each Machine Learning Algorithm

2.4. Validation of Key Genes

Boxplots revealed significant differences in the expression levels of the key genes between tumor and normal groups in the validation set, with only IQGAP3 highly expressed in the tumor group (Figure 14A), consistent with the results from the training set. ROC curves for the key genes were plotted using the validation set data (Figure 14B-F). All key genes demonstrated high sensitivity and specificity. Their Area Under the Curve (AUC) values and 95% confidence intervals were: CYTL1 (0.89, 95% CI: 0.82–0.95), PMP2 (0.97, 95% CI: 0.96–0.99), IQGAP3 (0.92, 95% CI: 0.83–0.99), ADH1B (0.98, 95% CI: 0.96–0.99), and SRPX (0.94, 95% CI: 0.90–0.97).

2.5. Survival Analysis of Key Genes

Survival analysis results indicated that patients in the high-expression groups for PMP2 (p = 0.002) and CYTL1 (p = 0.032) had lower survival rates (Figure 15A-B). The same conclusion was reached using the GEPIA database, where high expression of CYTL1 (p = 0.018) and PMP2 (p = 0.044) was associated with worse prognosis (Figure 15C-D).

Figure 15: A-B: Survival curves based on CYTL1 and PMP2 expression levels; C-D: Survival curves validated using the GEPIA database

2.6. Correlation Analysis Between Key Genes and TICs

Macrophages M0 (p = 0.008), Macrophages M1 (p = 0.01), and Mast cells resting (p = 0.005) showed significant differences between bladder tumor and normal tissues. Resting mast cells were present at low levels in tumor tissues, while M0 and M1 macrophages were present at high levels. In the TME, the expression levels of CYTL1 and PMP2 were negatively correlated with T follicular helper cells and CD8 T cells, but positively correlated with resting memory CD4 T cells and naive B cells (Figure 16A).

2.7. Prognostic Analysis of TICs

Survival analysis of TICs found that only activated memory CD4 T cells (p = 0.017) and CD8 T cells (p = 0.049) were significantly associated with patient prognosis (Figure 16B-C). Patients with high infiltration levels of activated memory CD4 and CD8 T cells in the TME had higher survival rates.

Figure 16: A: Heatmap of correlations between 22 immune cell types and CYTL1/PMP2 genes; B-C: Survival curves for activated memory CD4 T cells and CD8 T cells

Part Ⅲ

Part Ⅲ: Predicting Key Molecules in Bladder Cancer via Deep Learning

To address the application of deep learning in bladder cancer analysis, we integrated multi-omics data while accounting for interrelationships among different omics types, constructing a heterogeneous network encompassing diseases and genes. To effectively represent both diseases and genes, we developed an unsupervised multi-layer graph neural network model based on a classical graph convolutional network architecture. This model not only predicts key genes involved in the development and progression of bladder cancer but also provides interpretability for the predictions.

1. Model Framework

Graph Neural Network (GNN) apply neural networks to graph-structured data to learn structural information and uncover latent semantic patterns, representing graph nodes or the entire graph as low-dimensional vectors. They are widely used for tasks such as graph clustering, classification, and link prediction. GNNs have demonstrated strong representational power in recommendation systems, drug discovery, molecular generation, and disease-gene prediction. The associations between various genes and diseases naturally form a complex network, and GNNs can efficiently learn features from such networks to predict disease-associated genes.

1.1. Feature Extraction Module

This part introduces improvements to the traditional GCN feature extraction module. The input data form a heterogeneous bipartite graph containing a set of gene nodes \(G\) and a set of disease nodes \(D\). The initial feature of a gene \(g_i\) is denoted as \(x^0_{g_i}\), and that of a disease \(d_i\) as \(x^0_{d_i}\).

The updated features after GNN processing should capture structural information from the graph. Therefore, for feature learning of node i, we consider both its own features and those of its neighbors. Traditional GCNs aggregate information from all neighbors, which is computationally inefficient for large-scale disease-gene networks with high node degrees. To improve efficiency, we randomly sample neighbors for each target node and aggregate information only from the sampled set. Each sampling and aggregation operation allows a node to receive information from its one-hop neighbors. In complex networks, however, information propagation occurs not only between directly connected nodes but also indirectly. While traditional GCNs perform two aggregation steps, our improved model conducts L(L=3) rounds of sampling and aggregation.

This design ensures that nodes receive sufficient structural information from longer paths without excessive aggregation, which could lead to over-smoothing and loss of feature distinctiveness. During each round of sampling and aggregation, node features are updated. Since features at different layers contain different structural information, we employ residual connections. The output of the feature extraction module is the concatenation of features from all L[14].

1.2. Prediction Module

For a gene g and a disease d , afterL layers of sampling and feature aggregation, we obtain feature sets \(\{x_{g}^{1}, x_{g}^{2}, \cdots, x_{g}^{L}\}\) and \(\{x_{d}^{1}, x_{d}^{2}, \cdots, x_{d}^{L}\}\). The output of the feature extraction module is:

Formula 14

Formula 15

where || denotes tensor concatenation. This operation enriches the initial features of genes or diseases and allows control over the breadth of information aggregation by adjusting L. Other aggregation methods such as weighted averaging, max pooling, or LSTM could be used, but we choose concatenation for simplicity and to avoid introducing excessive parameters.

Finally, the potential relationship between gene g and disease d is evaluated using the inner product of their feature vectors:

Formula 16

1.3. Optimization Objective

To accurately learn model parameters, we improve the loss function by adopting the Bayesian Personalized Ranking (BPR) loss, widely used in recommendation systems, instead of traditional cross-entropy loss. BPR aims to maximize the score difference between positive and negative samples, ensuring that positive samples receive higher scores than negative ones[15]. The objective function is formulated as:

Formula 17

where \( \mathcal{R} = \{(g, d^+, d^-) \mid (g, d^+) \in \mathcal{R}^+, (g, d^-) \in \mathcal{R}^-\} \) represents all possible gene-disease association pairs in the training set, \( \mathcal{R}^+ \) denotes existing associations, and \( \mathcal{R}^- \) denotes non-existing associations; \( \sigma \) is the sigmoid function; \( \lambda \|\mathbf{W}\|_2^2 \) is the \( L_2 \) regularization term, and \( \mathbf{W} \) represents all learnable parameters.

2. Model Training and Prediction

Training data were obtained from the Comparative Toxicogenomics Database (CTD, http://ctdbase.org), which integrates extensive data on interactions among genes, diseases, drugs, and functional phenotypes, providing support for disease-gene association prediction and disease target identification[17]. Data included gene-disease associations, disease-pathway associations, and gene-pathway associations. The constructed network contained 53,372 genes, 7,234 diseases, and 93,443,157 edges.

Since the GNN model requires initial node features, one-hot encoded vectors based on gene-pathway and disease-pathway associations were used as initial features for genes and diseases, respectively, both of dimension 2,547. The first part of the model is a feature mapping layer, where initial features are transformed into 256-dimensional vectors via a multi-layer perceptron, matching the dimension used in GNN feature updates. The mapped features, along with the gene-disease network, are fed into the feature extraction layer. The output—concatenated updated features—is then passed to the prediction module to complete the forward propagation process.

The model was trained with a learning rate of 10-5 and a batch size of 512. To demonstrate the necessity of the BPR loss, we compared its performance with cross-entropy loss, as shown in Figure 17. Cross-entropy loss provided very weak supervisory signals and remained stable, whereas BPR loss decreased significantly and stabilized after about 200 iterations, at which point training was terminated.

Figure 17: Loss function values during model training

Focusing on bladder cancer, we examined the correlation between features of five genes identified in Part II and known bladder cancer marker genes[18] during training. As shown in Figure 18, the horizontal axis represents training iterations, and the vertical axis represents feature correlation between genes and bladder cancer. Genes such as ADH1B, CYTL1, EGFR, and FGFR3 showed increasing correlation with bladder cancer over iterations, indicating the model's strong performance in identifying bladder cancer-associated genes.

Figure 18: Correlation changes between specific genes and bladder cancer during training

3. Results Analysis

Section 1 described the model structure and training process, and the model demonstrated strong performance in bladder cancer analysis. However, as a black-box model, deep learning requires interpretability in cancer research. We performed UMAP dimensionality reduction on the features of all 53,372 genes learned by the model, resulting in a 2D scatter plot (Figure 19). Genes with the top 15 highest correlations to bladder cancer are marked in red, and those with the lowest 15 correlations in green.

Among these, PIK3CA mutation is one of the most widespread mutations in cancer[19] and is significantly associated with poor survival in multiple malignancies[20]. The p110α subunit, encoded by PIK3CA and part of class IA phosphatidylinositol 3-kinase (PI3K), regulates signal transduction related to cell cycle, growth, and metabolism. Disruption of this process is highly oncogenic. In TCGA bladder cancer samples, PIK3CA has a mutation frequency of 22.1%. Mutations in RAS, KRAS, and NRAS are also common oncogenic drivers in human cancers. The variant frequency of KRAS alleles is cancer-specific, involving different oncogenic processes across cancers[21]. In TCGA bladder cancer samples, KRAS and HRAS have mutation frequencies of 3.9% and 1.9%, respectively. Previous studies have shown that ERBB2, MAPK1, and MAPK3 are highly expressed in cancer tissues compared to normal tissues. Silencing ERBB2 may inhibit the activation of the MAPK1/MAPK3 pathway, thereby suppressing cancer cell proliferation, invasion, and migration[22]. Notably, the gene CYTL1, predicted by our model to be highly associated with bladder cancer, has a mutation frequency of 47.3% in TCGA bladder cancer samples, second only to TP53.

Figure 19: Two-dimensional scatter plot of genes in the network

To functionally validate the predictions, we performed KEGG enrichment analysis on three gene sets: genes directly linked to bladder cancer (BLGs), genes not directly linked (NBLGs), and genes predicted to be associated by the model (PBLGs). Results are shown in Figures 20(a), 20(b), and 20(c). The size of points represents the number of genes enriched in each pathway, and significance (FDR) is transformed via -log10. BLGs were significantly enriched in cancer-related pathways, while PBLGs were enriched in immune and inflammation-related pathways. Immune and inflammatory responses are closely related to cancer development and are central to tumor microenvironment research[23]. The predictions suggest strong associations between immune/inflammation-related genes and bladder cancer, potentially offering new insights for immunotherapy target prediction. Many NBLGs were enriched in cancer-related pathways, possibly because NBLGs include potential bladder cancer-related genes, i.e., there is an intersection between NBLGs and PBLGs.

Figure 20: KEGG enrichment analysis results for BLGs, NBLGs, and PBLGs

Using gene-pathway relationships, we designed initial gene features thoughtfully. Visualization analysis of model parameters identified key signaling pathways in bladder cancer gene prediction (Figure 21). Among the top 20 pathways by importance, the metabolic pathway scored highest. Extensive studies have shown that metabolic dysregulation is a hallmark of cancer and may be a root cause of tumor formation. Altered metabolism is widespread across cancer subtypes, and many oncogenes or tumor suppressor genes function to maintain these metabolic states. Research on cancer metabolism may reveal critical diagnostic and therapeutic markers[24]. Immune system, cell cycle, and extracellular matrix (ECM) organization pathways also received high importance scores.

Figure 21: Bar chart of pathway importance scores

Tumor growth and development are closely related to changes in ECM density and composition. Cancer-associated fibroblasts interact with other cells in the tumor microenvironment, regulating ECM components and promoting carcinogenesis. Under regulatory influences, ECM undergoes stabilization or degradation, directly affecting cancer cell proliferation, migration, invasion, and angiogenesis. ECM stabilization also impedes drug penetration into tumors[25]. Therefore, maintaining ECM stability and promoting chemotherapeutic drug penetration represent potential avenues for future cancer therapy. In Part II, we identified differentially expressed genes from GEO and TCGA bladder cancer samples and recognized functional subtypes, with some genes significantly enriched in ECM organization pathways. Based on clustering and cell subtype annotations from single-cell sequencing data[26], we identified major cell types including endothelial, epithelial, and tissue stem cells. Marker genes for endothelial and tissue stem cells were also enriched in ECM organization pathways. By constructing a heterogeneous network of genes, diseases, and pathways and learning gene representations via GNN, we identified ECM organization as a key feature in predicting bladder cancer-associated genes. Thus, ECM may serve not only as a regulator in bladder cancer treatment but also as a therapeutic target, holding significant value for diagnosis and prognosis. In-depth study of ECM could enhance our understanding of critical factors and mechanisms in bladder cancer development and progression.