Method Article
* These authors contributed equally
This study evaluates prognostic systems for colorectal signet-ring cell carcinoma patients using machine learning models and competing risk analyses. It identifies log odds of positive lymph nodes as a superior predictor compared to pN staging, demonstrating strong predictive performance and aiding clinical decision-making through robust survival prediction tools.
Lymph node status is a critical prognostic predictor for patients; however, the prognosis of colorectal signet-ring cell carcinoma (SRCC) has garnered limited attention. This study investigates the prognostic predictive capacity of the log odds of positive lymph nodes (LODDS), lymph node ratio (LNR), and pN staging in SRCC patients using machine learning models (Random Forest [RF], XGBoost, and Neural Network [NN]) alongside competing risk models. Relevant data were extracted from the Surveillance, Epidemiology, and End Results (SEER) database. For the machine learning models, prognostic factors for cancer-specific survival (CSS) were identified through univariate and multivariate Cox regression analyses, followed by the application of three machine learning methods (XGBoost, RF, and NN) to ascertain the optimal lymph node staging system. In the competing risk model, univariate and multivariate competing risk analyses were employed to identify prognostic factors, and a nomogram was constructed to predict the prognosis of SRCC patients. The area under the receiver operating characteristic curve (AUC-ROC) and calibration curves were used to assess model performance. A total of 2,409 SRCC patients were included in this study. To validate the effectiveness of the model, an additional cohort of 15,122 colorectal cancer patients, excluding SRCC cases, was included for external validation. Both the machine learning models and the competing risk nomogram exhibited strong performance in predicting survival outcomes. Compared to pN staging, the LODDS staging system demonstrated superior prognostic capability. Upon evaluation, the machine learning models and competing risk models achieved excellent predictive performance characterized by good discrimination, calibration, and interpretability. Our findings may assist in informing clinical decision-making for patients.
Colorectal cancer (CRC) ranks as the third most prevalent malignant tumor globally1,2,3. Signet-ring cell carcinoma (SRCC), a rare subtype of CRC, comprises approximately 1% of cases and is characterized by abundant intracellular mucin that displaces the cell nucleus1,2,4. SRCC tends to occur in younger patients, is more prevalent in females, and is often diagnosed at an advanced tumor stage. Compared to colorectal adenocarcinoma, SRCC shows poorer differentiation, a higher risk of distant metastasis, and a 5-year survival rate of only 12%-20%5,6. Developing an accurate and effective prognostic model for SRCC is therefore crucial for optimizing treatment strategies and improving clinical outcomes.
This study aims to construct a robust prognostic model for SRCC patients using advanced statistical approaches, including machine learning (ML) and competing risk models. These methodologies can accommodate complex relationships in clinical data, offering individualized risk assessments and surpassing traditional methods in predictive accuracy. Machine learning models, such as Random Forest, XGBoost, and Neural Networks, excel in processing high-dimensional data and identifying intricate patterns. Studies have shown that AI models effectively predict survival outcomes in colorectal cancer, emphasizing ML's potential in clinical applications7,8. Complementing ML, competing risk models address multiple event types, such as cancer-specific mortality versus other causes of death, to refine survival analysis. Unlike traditional methods like the Kaplan-Meier estimator, competing risk models accurately estimate the marginal probability of events in the presence of competing risks, providing more precise survival assessments8. Integrating ML and competing risk analysis enhances predictive performance, offering a powerful framework for personalized prognostic tools in SRCC9,10,11.
Lymph node metastasis significantly influences prognosis and recurrence in CRC patients. While N-stage assessment in the TNM classification is critical, inadequate lymph node examination (reported in 48%-63% of cases) can lead to disease underestimation. To address this, alternative approaches such as the lymph node ratio (LNR) and the log odds of positive lymph nodes (LODDS) have been introduced. LNR, the ratio of positive lymph nodes (PLNs) to total lymph nodes (TLNs), is less affected by TLN count and serves as a prognostic factor in CRC. LODDS, the logarithm of the ratio of PLNs to negative lymph nodes (NLNs), has shown superior predictive ability in both gastric SRCC and colorectal cancer10,11. Machine learning has been increasingly applied in oncology, with models improving risk stratification and prognostic predictions across various cancers, including breast, prostate, and lung cancers12,13,14. However, its application in colorectal SRCC remains limited.
This study seeks to bridge this gap by integrating LODDS with ML and competing risk models to create a comprehensive prognostic tool. By evaluating the prognostic value of LODDS and leveraging advanced predictive techniques, this research aims to enhance clinical decision-making and improve outcomes for SRCC patients.
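The two lymph node metrics defined above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the 0.5 continuity correction and natural logarithm are assumptions borrowed from common practice in the LODDS literature; the source does not state which variant was used.

```python
import math


def lnr(positive: int, total: int) -> float:
    """Lymph node ratio: positive nodes divided by total examined nodes."""
    if total <= 0 or not 0 <= positive <= total:
        raise ValueError("need 0 <= positive <= total and total > 0")
    return positive / total


def lodds(positive: int, total: int, correction: float = 0.5) -> float:
    """Log odds of positive lymph nodes.

    ASSUMPTION: a continuity correction of 0.5 is added to both counts so
    that the value stays finite when there are no positive or no negative
    nodes; the source does not specify this constant.
    """
    if total <= 0 or not 0 <= positive <= total:
        raise ValueError("need 0 <= positive <= total and total > 0")
    negative = total - positive
    return math.log((positive + correction) / (negative + correction))
```

Under these assumptions, two node-negative patients with 5 and 30 examined nodes both have an LNR of 0, whereas their LODDS values differ (about -2.40 versus -4.11), which is one way to see how LODDS takes the number of negative nodes into account.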
This study does not involve ethical approval or consent to participate, as the data were obtained from databases. We included patients diagnosed with colorectal signet-ring cell carcinoma from 2004 to 2015, as well as patients with other types of colorectal cancer. Patients were excluded if they had a survival time of less than one month, incomplete clinicopathological information, or an unclear or unspecified cause of death.
1. Data acquisition
2. ML models development and verification
3. Competing risk model development and verification
Patient characteristics
This study focused on patients diagnosed with colorectal SRCC, using data from the SEER database spanning 2004 to 2015. Patients were excluded if they had a survival time of less than one month, incomplete clinicopathological information, or an unclear or unspecified cause of death. A total of 2,409 colorectal SRCC patients who met the inclusion criteria were randomly divided into a training cohort (N = 1686) and a validation cohort (N = 723). The demographic and clinical parameters of the training and validation cohorts were analyzed using R software, as shown in Table 1. Among all the patients included, the majority were over 60 years old, with a similar number of male and female patients. Most patients were white. Over half of the patients (56%) were married. The majority of tumors were graded as III-IV (76%). Most patients (82%) had tumors larger than 3.5 centimeters, and the largest share of patients belonged to the LODDS1 group (42%). In the entire cohort, 53% of patients received chemotherapy. Primary tumors were mainly located in the right colon (67%). After randomization, there were no statistically significant differences in baseline characteristics between the two cohorts.
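The random division into training and validation cohorts can be sketched as follows. This is only an illustration in Python (the authors worked in R); the 7:3 ratio is inferred from the reported cohort sizes, and the function name and seed are hypothetical.

```python
import random


def split_cohort(patient_ids, train_fraction=0.7, seed=2024):
    """Randomly split patient identifiers into training and validation sets.

    ASSUMPTION: a 7:3 split, inferred from N = 1686 and N = 723; the seed
    is arbitrary and only makes the illustration reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


train_ids, valid_ids = split_cohort(range(2409))
```

With 2,409 identifiers, this split yields 1686 training and 723 validation patients, matching the cohort sizes reported above.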
Identification of the prognostic clinical factors included in the ML models
We first screened significant variables for inclusion in the machine learning model using Cox regression analysis. The results of univariate Cox regression showed that survival time was significantly correlated with certain clinical variables, including sex, age, race, marital status, AJCC staging, pT staging, pN staging, pM staging, tumor size, CEA level, LNR classification, LODDS classification, and whether the patient received radiotherapy or chemotherapy (Table 2). Notably, LNR, LODDS, and pN staging all exhibited statistically significant hazard ratios (HR), indicating that these three LN staging systems are associated with prognosis. Subsequent multivariate Cox regression analysis was conducted to further determine the association between pN staging, LODDS, LNR, and CSS in SRCC patients. The results indicated that LODDS, LNR, and pN status significantly affected the CSS of SRCC patients (Figure 2).
Comparison of LN Systems
The prognostic predictive abilities of the three LN systems were similar across the training, validation, and external validation cohorts (the external cohort comprised colorectal cancers other than signet-ring cell carcinoma, selected in step 1.4; Table 3). In the training cohort, the C indices for LNR, LODDS, and pN were 0.309, 0.308, and 0.337, respectively, while in the validation cohort, the C indices were 0.288, 0.279, and 0.319, respectively. In the external validation cohort, the C indices for LNR, LODDS, and pN were 0.419, 0.420, and 0.424, respectively. Additionally, the AIC values for each system in the training cohort were 12667.56, 12670.57, and 12731.89, and in the validation cohort, they were 4575.36, 4559.13, and 4613.20, respectively. In the external validation cohort, the AIC values were 106554.68 for LODDS, 106581.85 for LNR, and 106915.45 for pN staging. These findings indicate that the three systems differ only minimally in discriminatory quality. Therefore, we used machine learning methods (RF, XGBoost, and NN) to further determine the optimal LN system in terms of predictive capability. The machine learning models were constructed from variables that were significant in univariate Cox regression and in at least one multivariate model (pN, LODDS, or LNR): pN, pT, pM, age, race, LNR classification, LODDS classification, and whether the patient received radiotherapy or chemotherapy.
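For readers unfamiliar with the two comparison metrics used above, the following Python sketch shows how a concordance index and an AIC value are generally computed. It is a plain illustration of Harrell's C index and the standard AIC formula, not the authors' R code; the function names are hypothetical, and pairs with tied survival times are simply skipped.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C index for right-censored data.

    A pair (i, j) is comparable when subject i had the event and a shorter
    time than subject j. The pair is concordant when the subject with the
    shorter time also has the higher predicted risk; ties in risk count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood
```

A C index of 0.5 corresponds to random ordering and 1.0 to perfect ordering; for AIC, only differences between models fitted to the same data are meaningful.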
We constructed RF, XGBoost, and NN models using the training dataset. The importance values for each variable are shown in Figure 3. In RF and XGBoost, LNR exhibited the highest importance, while LODDS also demonstrated considerable importance. However, in the NN model, LODDS displayed better predictive capability compared to pN and LNR. Considering the combined results from the three machine learning approaches, we conclude that the LODDS system may be the best system for assessing LN status in SRCC patients.
Performance of the ML models
As shown in Table 4 and Figure 4A-C, all three models predicted prognosis effectively, with AUCs ranging from 0.809 to 0.820 in the test dataset (XGBoost: AUC = 0.820, 95% CI = 0.789-0.851; RF: AUC = 0.819, 95% CI = 0.788-0.850; NN: AUC = 0.809, 95% CI = 0.777-0.841). The XGBoost, RF, and NN models showed high specificity (0.82, 0.825, and 0.815, respectively) and accuracy (0.762, 0.763, and 0.757, respectively). The calibration curves are shown in Figure 4D-F.
Construction and validation of competing risk model
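The metrics reported for the ML models can be illustrated with a short Python sketch. It computes the AUC as the fraction of case/non-case pairs ranked correctly, and derives specificity and accuracy from a confusion matrix. The function names and the 0.5 classification threshold are assumptions for illustration; the source does not state the threshold used.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # tied scores count as half a win
    return wins / (len(pos) * len(neg))


def specificity_and_accuracy(labels, scores, threshold=0.5):
    """Specificity and accuracy at a given threshold (ASSUMPTION: 0.5)."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return specificity, accuracy
```

Because the AUC depends only on the ranking of predicted scores, it is unaffected by the choice of threshold, whereas specificity and accuracy change with it.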
Given that the ML models did not account for the influence of competing risk factors, we constructed a competing risk model to further identify the LN system that performed best in terms of predictive capability. Death due to cancer is the event of interest for cancer-specific survival (CSS), while death from other causes (denoted OSS in this study) serves as the competing risk event. We used univariate and multivariate competing risk models to analyze the predictive factors for CSS in the training cohort. The univariate competing risk model indicated that the predictive factors for CSS included sex, age, race, marital status, AJCC staging, TNM staging, tumor size, LNR classification, LODDS classification, CEA level, whether the patient received radiotherapy or chemotherapy, and the location of the primary tumor. Finally, in multivariate competing risk analysis, T staging, N staging, M staging, LODDS classification, and the location of the primary tumor were identified as five independent prognostic markers for colorectal SRCC patients. The results of univariate and multivariate competing risk analyses are shown in Table 5, and the corresponding cumulative incidence function (CIF) curves for the independent risk factors are presented in Figure 5. Based on these five significant variables, we developed a prognostic nomogram (Figure 6A). In the nomogram, LODDS carried a higher weight than pN. This finding aligns with the previous results, suggesting that the LODDS system is the best system for assessing LN status in SRCC patients.
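To make the idea of a cumulative incidence function concrete, the following Python sketch estimates the CIF for one cause of death in the presence of a competing cause. It is a nonparametric estimate of the Aalen-Johansen type, given here only as an illustration; it is not the competing risk regression used for the nomogram, and the cause coding (0 = censored, 1 = cancer death, 2 = other-cause death) is an assumption.

```python
def cumulative_incidence(times, causes, cause_of_interest=1):
    """Nonparametric cumulative incidence for one cause with competing risks.

    ASSUMED coding: 0 = censored, 1 = cancer death, 2 = other-cause death.
    Returns a list of (time, cumulative incidence) at each event time.
    """
    data = sorted(zip(times, causes))
    n_at_risk = len(data)
    event_free = 1.0  # probability of no event of any cause so far
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d_interest = d_any = n_leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            cause = data[i][1]
            n_leaving += 1
            if cause != 0:
                d_any += 1
            if cause == cause_of_interest:
                d_interest += 1
            i += 1
        if d_any > 0:
            # Increment uses the event-free probability just before t.
            cif += event_free * d_interest / n_at_risk
            event_free *= 1.0 - d_any / n_at_risk
            curve.append((t, cif))
        n_at_risk -= n_leaving
    return curve


toy_times = [1, 2, 3, 4, 5]
toy_causes = [1, 2, 0, 1, 2]
cancer_curve = cumulative_incidence(toy_times, toy_causes, cause_of_interest=1)
other_curve = cumulative_incidence(toy_times, toy_causes, cause_of_interest=2)
```

On this toy data, the cumulative incidence of cancer death reaches 0.5. Treating other-cause deaths as censored in a Kaplan-Meier calculation would instead give 1 - 0.4 = 0.6, which shows how ignoring competing events inflates the estimated risk.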
To assess the accuracy of the model, we constructed calibration curves (Figures 6B-D). The results indicated that the model performed well in predicting the cancer-specific survival of patients at 1, 3, and 5 years. The close fit of the curves to the 45° line demonstrates strong agreement between predicted and observed outcomes. The ROC curve evaluation of the nomogram (Figures 6E-G) showed that the area under the curve (AUC) for predictions at 1, 3, and 5 years was greater than 0.75. These results suggest that the predictions, particularly at 3 and 5 years, are of significant benefit, indicating that the nomogram has valuable clinical applications and reference value.
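The logic behind a calibration curve can be sketched in Python as follows. This is a simplified illustration that assumes a binary outcome with no censoring, which is not the case for the survival data analyzed here; the function name and the equal-size binning are hypothetical choices.

```python
def calibration_table(predicted, observed, n_bins=5):
    """Group predictions into equal-size bins and compare with outcomes.

    SIMPLIFICATION: binary outcomes (0/1) without censoring. Returns a list
    of (mean predicted probability, observed event rate), one per bin.
    """
    pairs = sorted(zip(predicted, observed))
    n = len(pairs)
    table = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # fewer observations than bins
        mean_predicted = sum(p for p, _ in chunk) / len(chunk)
        observed_rate = sum(o for _, o in chunk) / len(chunk)
        table.append((mean_predicted, observed_rate))
    return table
```

Plotting the observed rate against the mean predicted probability gives the calibration curve; points close to the 45° line indicate that predicted and observed risks agree.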
Figure 1: Flow diagram presenting the screening process in the SEER database. Through our inclusion and exclusion criteria, we successfully selected colorectal cancer patients from the SEER database for a subsequent series of analyses based on R.
Figure 2: Association of LNR, LODDS, and pN staging with CSS in the training cohort. This figure illustrates the multivariate Cox regression analysis results for (A) LNR, (B) LODDS, and (C) pN staging, evaluated alongside other independent prognostic factors. The analysis includes Hazard Ratios (HR) and 95% confidence intervals (CI). The results demonstrate that LNR, LODDS, and pN status are significant prognostic factors for cancer-specific survival in patients with SRCC, with all HR values showing statistical significance (p < 0.05). *p < 0.05, **p < 0.01, ***p < 0.001. Error bars represent 95% CI.
Figure 3: Relative importance of variables. (A) XGBoost model, (B) RF model, and (C) NN model. This figure evaluates the relative importance of variables. In the XGBoost and RF models, LNR exhibited the highest importance, with LODDS also showing considerable significance. Conversely, in the NN model, LODDS demonstrated superior predictive capability compared to pN and LNR. Based on the combined results of the three models, the LODDS system is suggested to be the most effective for assessing LN status in SRCC patients.
Figure 4: ROC curves and calibration curves of the ML models. (A, D) The training cohort, (B, E) the validation cohort, and (C, F) the external validation cohort. An area under the curve (AUC) value closer to 1 indicates better classification performance of the model. The error bars represent the 95% confidence interval for the predicted probability of the actual event occurring. Abbreviation: XGBoost = eXtreme gradient boosting.
Figure 5: Cumulative incidence estimates of death according to patient characteristics in colorectal SRCC. Cumulative incidence function (CIF) curves of subgroups. Gray's test was applied.
Figure 6: Nomogram development and validation for CSS in colorectal SRCC patients. (A) A competing risk nomogram predicting 1-year, 3-year, and 5-year cancer-specific survival probability of patients with colorectal SRCC. The calibration curves of the nomograms for predicting 1-, 3-, 5-year cancer-specific survival in the (B) training cohort, (C) validation cohort, and (D) external validation cohort. Receiver operating characteristic curves for predicting 1-, 3-, and 5-year cancer-specific survival in the (E) training cohort, (F) validation cohort, and (G) external validation cohort.
Table 1: Clinical characteristics of patients with colorectal SRCC.
Table 2: Univariate Cox regression analysis of CSS in the training cohort.
Table 3: Prediction performance of the three lymph nodal staging systems.
Table 4: Predictive performance of the models in the validation and external validation cohorts.
Table 5: Univariate and multivariate competing risk analysis for cancer-specific mortality of colorectal SRCC in the training cohort.
Colorectal SRCC is a rare and distinct subtype of colorectal cancer (CRC) with a poor prognosis. Therefore, greater attention needs to be paid to the prognosis of SRCC patients. Accurate survival prediction for SRCC patients is crucial for determining their prognosis and making individualized treatment decisions. In this study, we explored the relationship between clinical features and prognosis in SRCC patients and identified the optimal LN staging system for SRCC patients using data from the SEER database. To our knowledge, this is the first study to determine a suitable LN system for colorectal SRCC patients through the combined use of machine learning and competing risk analysis methods and to construct a nomogram for prognostic prediction.
The number of metastatic LNs in CRC patients is an important indicator of prognosis and recurrence. Accurate LN staging plays a key role in determining treatment strategies and prognosis for SRCC patients. LNR and LODDS are alternative methods used to assess LN involvement in gastric cancer (GC), improve staging systems, and provide more accurate prognostic information10,13. We revealed the correlation of LODDS, LNR, and pN staging with CSS in SRCC patients using the SEER database. The predictive abilities of these three LN systems (LNR, LODDS, and pN) were compared using AUCs, AICs, BICs, and C indices. However, the differences between them were minimal. Therefore, we used three machine learning methods (XGBoost, RF, and NN) to rank feature importance and thereby select the optimal LN system. Based on the combined results of the three methods, we identified LODDS as the appropriate LN system.
However, death from other causes (OSS) is a competing risk event that affects the estimation of CSS. The Cox regression used to screen variables for the machine learning models did not consider the impact of OSS, which may lead to an inaccurate assessment of risk ratios12. Therefore, to further determine the optimal LN assessment system for SRCC patients, we constructed a competing risk model. The results once again confirmed that the LODDS staging system provides more accurate prognostic information than the pN system. During follow-up, among 2,409 patients, 1339 (56%) died of cancer, and 464 (19%) died of other causes. Furthermore, we developed a competing risk nomogram to predict cancer-specific mortality at 1, 3, and 5 years. We believe that this model has significant implications for clinical research involving colorectal SRCC patients.
Although the American Joint Committee on Cancer (AJCC) recommends the TNM system as the staging system for all histological types of colorectal cancer, it is primarily used for staging colorectal adenocarcinoma. The AJCC N stage is limited by TLN, while LNR does not consider the impact of NLN13,14,15,16. Reports indicate that LODDS is less affected by TLN and takes the number of NLNs into account10,17. Scarinci et al. demonstrated that LODDS predicts OS in CRC patients better than LNR and pN staging and suggested that future research should validate its role in different CRC subtypes18. In this study, we found that LODDS has a significant prognostic predictive effect on CSS in colorectal SRCC. Therefore, LODDS may be a valuable tool for assessing lymph node dissection and prognosis in colorectal SRCC patients. Although no optimal threshold for LODDS has been established yet, it was the most reliable LN staging system in our analyses. With increasing attention to LODDS, it may gain wider recognition in clinical settings in the foreseeable future.
Our study found that the location of the primary tumor is an important predictive factor for CSS, with a significantly poorer prognosis for rectal SRCC, which is consistent with previous studies12,19,20. Rectal SRCC may have unique clinical, pathological, and molecular characteristics21,22, warranting further investigation. Nomograms derived from predictive models are effective tools for clinical decision-making and patient counseling.
To our knowledge, this study is the first to integrate ML models and competing risk models to explore the optimal LN staging system for SRCC patients. We developed and validated three ML models to predict the prognosis of SRCC patients. In the test dataset, the XGBoost, RF, and NN models showed good prognostic predictive performance based on AUC values and corresponding metrics. Thus, ML models can assist in treatment decisions for SRCC patients by predicting prognosis. Additionally, we generated a competing risk nomogram based on proportional hazards models to analyze predictive factors for colorectal SRCC and assess the role of LODDS within it. We used the C index and calibration curves to evaluate the predictive performance of the nomogram. The nomogram incorporates common clinical variables, such as primary tumor site and LODDS grouping, and is an effective method for predicting 1-, 3-, and 5-year CSS in colorectal SRCC patients. This tool can aid clinicians in performing accurate, thorough, and timely prognostic assessments for each colorectal SRCC patient, enabling them to formulate personalized treatment plans23.
Finally, this study has several limitations. First, the patients in the study were diagnosed between 2004 and 2015, leading to a relatively short follow-up period. We anticipate that a longer follow-up period would help improve the accuracy of model predictions. Second, the study design used here is retrospective and relies on data obtained from the SEER database, which may introduce some inherent biases. Some information, such as the location of metastatic LNs, was not recorded. Lastly, the majority of patients in this study were white, necessitating broader studies involving diverse populations to confirm and reinforce these findings.
Conclusion
The study found that LODDS exhibits strong prognostic predictive ability for colorectal SRCC. Building on this foundation, we developed a nomogram based on a competing risk model to predict cancer-specific survival at 1, 3, and 5 years for colorectal SRCC patients. Following a series of evaluations and validations, the nomogram demonstrated significant clinical applicability and value, providing guidance for clinicians in treatment decision-making. Additionally, we constructed three ML models. These ML approaches have the potential to enhance prognostic predictive capabilities for SRCC and to help physicians understand how ML can be used to optimize treatment and follow-up strategies.
The authors have no financial conflicts of interest to disclose.
None
Name | Company | Catalog Number | Comments
SEER database | National Cancer Institute at NIH | |
X-tile software | Yale School of Medicine | |
RStudio | Posit | |
Copyright © 2025 MyJoVE Corporation. All rights reserved