Final Project: Breast Cancer Recurrence Prediction
Shuyi Chen, Ruilin Wu
Introduction
Breast cancer is one of the most common malignancies affecting women worldwide, and recurrence remains a serious concern, often occurring months or years after initial treatment. Machine learning models explored for predicting breast cancer recurrence range from Naive Bayes classifiers to complex deep learning approaches. While advanced models such as CNNs and NLP-based systems (Wang et al., 2020) show high accuracy, they often depend on unstructured data and complex infrastructure. In contrast, Kim et al. (2016) demonstrated that a Naive Bayesian model can effectively predict five-year recurrence, highlighting its clinical potential.
This report applies Bayesian logistic regression, an interpretable and widely used method, to a structured clinical dataset and evaluates how well it predicts breast cancer recurrence, with the aim of supporting transparent and reliable clinical decision-making.
Data Description
The breast cancer dataset was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. It consists of 286 instances with two outcomes: 201 instances with no recurrence events and 85 instances with recurrence events. The dataset contains missing values, which must be handled before analysis and modeling.
Each instance within the dataset is described by nine attributes, which include both linear and categorical variables. The attributes include age (categorized in decade ranges from 10 to 99 years), menopause status (categorized as less than 40 years, greater or equal to 40 years, or premenopausal), tumor size (categorized in increments of five millimeters, ranging from 0 to 59 mm), the number of involved axillary lymph nodes (categorized in increments of three, ranging from 0 to 39), presence or absence of capsular invasion (binary classification of yes or no), degree of malignancy (ordinal scale from 1 to 3, indicating increasing severity), the affected breast side (left or right), tumor location within breast quadrants (left-upper, left-lower, right-upper, right-lower, or central), and the administration of irradiation therapy (binary classification of yes or no).
Exploratory Data Analysis
After removing rows with missing values, the dataset contained 277 observations. We also excluded two categories that each occurred in only a single row (one instance of invnodes “24-26” and one instance of age “20-29”), since such singleton levels can cause problems in model fitting and cross-validation. This left a final total of 275 observations. The outcome variable is moderately imbalanced, with roughly 71% no-recurrence events and 29% recurrence events (Figure 1). Menopausal status is split mainly between “premenopause” and “greater or equal to 40,” while the “less than 40” group is sparse. The cross-tabulation (Table 1) shows that most patients in both outcome groups are between 30 and 69 years old, with the highest counts in the 40–59 age ranges.
Methodology
We fit a Bayesian logistic regression model to predict breast cancer recurrence, using stan_glm() from the rstanarm package with the binomial family and a logit link. All nine predictors (age, menopause, tumorsize, invnodes, nodecaps, degmalig, breast, breastquad, and irradiat) were included as fixed effects. We applied weakly informative normal priors, with μ = 0 and σ = 2.5 for the coefficients and μ = 0 and σ = 5 for the intercept, so as not to impose strong assumptions on effect sizes.
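For reference, the model described in this section can be written compactly as below. This is a sketch under assumptions: the symbols y_i, π_i, x_ij, β_j, and p are notation introduced here for illustration, and the priors are written as if applied directly on the stated scale (rstanarm may center predictors or rescale priors internally, depending on its settings).

```latex
\begin{align*}
y_i \mid \pi_i &\sim \text{Bernoulli}(\pi_i), \qquad i = 1, \dots, 275 \\
\log\frac{\pi_i}{1 - \pi_i} &= \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \\
\beta_0 &\sim \mathcal{N}(0,\, 5^2) \\
\beta_j &\sim \mathcal{N}(0,\, 2.5^2), \qquad j = 1, \dots, p
\end{align*}
```

Here y_i = 1 denotes a recurrence event for patient i, π_i is that patient's recurrence probability, and x_ij are the model-matrix columns (dummy variables for categorical predictors), with p the number of such columns.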
Four Markov Chain Monte Carlo (MCMC) chains were run for 10,000 iterations each, and we evaluated convergence via the Gelman–Rubin statistic (R-hat) and effective sample size. The posterior predictive check (Figure 2) suggested that the model adequately captured the distribution of observed outcomes, supporting its overall fit to the data. The MCMC density plot (Figure 3) showed well-behaved sampling across all parameters, with consistent overlap between chains and unimodal posterior distributions. After confirming convergence, we extracted 80% credible intervals for the model’s coefficients and exponentiated them so that they can be interpreted as odds ratios. Model performance was assessed by computing predicted probabilities for each observation, constructing a ROC curve, and identifying an optimal classification threshold using Youden’s index (Figure 4). Finally, 10-fold cross-validation (classification_summary_cv()) was performed at the chosen threshold (0.33) to estimate out-of-sample sensitivity, specificity, and overall accuracy.
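The threshold selection step can be illustrated with a short Python sketch. It is not the code used in this analysis (which was carried out in R): the function names and the toy data are hypothetical, and it assumes an observation is classified as a recurrence when its predicted probability is at or above the cutoff. Youden's index is J = sensitivity + specificity - 1, and the chosen threshold is the one that maximizes J.

```python
def roc_points(y_true, scores):
    """Return (threshold, sensitivity, specificity) for each distinct score used as a cutoff."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for t in sorted(set(scores)):
        # Classify as recurrence (1) when the predicted probability is >= t
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < t)
        points.append((t, tp / positives, tn / negatives))
    return points


def youden_threshold(y_true, scores):
    """Pick the cutoff that maximizes Youden's J = sensitivity + specificity - 1."""
    return max(roc_points(y_true, scores), key=lambda p: p[1] + p[2] - 1)


# Toy data invented for illustration only (1 = recurrence, 0 = no recurrence)
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.10, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60, 0.15, 0.80]

threshold, sens, spec = youden_threshold(y_true, scores)
print(f"threshold={threshold:.2f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

With an imbalanced outcome, the cutoff that maximizes J is often well below 0.5, which is consistent with the threshold of 0.33 reported here.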
Results
All chains converged satisfactorily, with R-hat values close to 1 and sufficiently large effective sample sizes. The 80% credible intervals for the exponentiated coefficients show that invnodes3–5, invnodes6–8, invnodes9–11, and degmalig lie entirely above 1, indicating higher odds of recurrence (relative to the reference lymph-node category, and with increasing degree of malignancy). In contrast, the interval for tumorsize10–14 lies entirely below 1, suggesting lower recurrence odds than the reference tumor-size bin. Most other predictors, such as the age brackets, menopausal status, and most tumor-size categories, have intervals that cross 1, meaning the model cannot rule out either increased or decreased risk given the data, so these effects remain uncertain. Finally, the intercept’s exponentiated interval falls far below 1, indicating that when all predictors are at their reference levels the baseline odds of recurrence are quite low; the other predictors then raise or lower this baseline as indicated by their respective odds ratios.
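The interpretation rule used in this section (exponentiate the log-odds interval, then check whether it excludes 1) can be sketched in Python as follows. The predictor names and interval endpoints are hypothetical placeholders, not the fitted values from this analysis, and the helper functions are introduced here purely for illustration.

```python
import math


def odds_ratio_interval(lower_logodds, upper_logodds):
    """Convert a credible interval on the log-odds scale to the odds-ratio scale."""
    return math.exp(lower_logodds), math.exp(upper_logodds)


def direction(or_lower, or_upper):
    """Describe what an odds-ratio interval says relative to the reference level."""
    if or_lower > 1:
        return "higher odds"   # entire interval above 1
    if or_upper < 1:
        return "lower odds"    # entire interval below 1
    return "uncertain"         # interval crosses 1


def baseline_probability(intercept_logodds):
    """Inverse logit: probability implied by the intercept at reference levels."""
    return 1 / (1 + math.exp(-intercept_logodds))


# Hypothetical 80% intervals on the log-odds scale (placeholders only)
intervals = {
    "predictor_a": (0.2, 1.4),
    "predictor_b": (-1.9, -0.1),
    "predictor_c": (-0.6, 0.5),
}

for name, (lo, hi) in intervals.items():
    or_lo, or_hi = odds_ratio_interval(lo, hi)
    print(f"{name}: OR in [{or_lo:.2f}, {or_hi:.2f}] -> {direction(or_lo, or_hi)}")
```

Because exponentiation is monotone, an interval that excludes 0 on the log-odds scale excludes 1 on the odds-ratio scale, so the two checks are equivalent.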
A ROC curve analysis on the entire dataset suggested a 0.33 probability threshold for optimal classification (Figure 4). Applying this threshold in the 10-fold cross-validation produced a sensitivity of roughly 67.1%, specificity of about 71.3%, and an overall accuracy near 69.5%.
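To make the three reported metrics concrete, the Python sketch below computes them from predicted probabilities at a fixed threshold. It is an illustration only: the function name and toy data are hypothetical, it does not reproduce classification_summary_cv(), and it omits the fold splitting and refitting that cross-validation involves.

```python
def classification_summary(y_true, probs, threshold):
    """Sensitivity, specificity, and accuracy when classifying at `threshold`."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, probs):
        pred = 1 if p >= threshold else 0  # 1 = predicted recurrence
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1 and pred == 0:
            fn += 1
        elif y == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),          # recurrences correctly flagged
        "specificity": tn / (tn + fp),          # non-recurrences correctly cleared
        "accuracy": (tp + tn) / len(y_true),    # all correct classifications
    }


# Toy data invented for illustration only (1 = recurrence)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.70, 0.40, 0.20, 0.50, 0.30, 0.10, 0.25, 0.05]

summary = classification_summary(y_true, probs, threshold=0.33)
print(summary)
```

On this toy data the sensitivity is 2/3, the specificity is 0.8, and the accuracy is 0.75. Within a single set of predictions, accuracy is the prevalence-weighted average of sensitivity and specificity; metrics averaged across cross-validation folds need not satisfy that identity exactly.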
Conclusion
On this breast cancer dataset, the Bayesian logistic regression model provided a moderate level of predictive power, in line with expectations for a relatively small clinical dataset. The model’s interpretability is its key advantage: practitioners can directly review posterior intervals and odds ratios to understand how clinical features (particularly tumor size, lymph node involvement, and degree of malignancy) influence breast cancer recurrence risk. While most variables had limited impact or wide credible intervals due to sparse data, the approach still offered transparent risk stratification. Future work might incorporate additional patient features such as genetic markers, or explore more robust priors, to potentially enhance performance.
References
Kim, Woojae, et al. “Nomogram of Naive Bayesian Model for Recurrence Prediction of Breast Cancer.” Healthcare Informatics Research, vol. 22, no. 2, 30 Apr. 2016, p. 89, https://doi.org/10.4258/hir.2016.22.2.89.
Wang, Hanyin, et al. “Prediction of Breast Cancer Distant Recurrence Using Natural Language Processing and Knowledge-Guided Convolutional Neural Network.” Artificial Intelligence in Medicine, vol. 110, 1 Nov. 2020, p. 101977, https://doi.org/10.1016/j.artmed.2020.101977.
Appendix
Figure 1. Distribution of Recurrence Events
Figure 2. PP Check Plot
Figure 3. MCMC Density Plot
Figure 4. ROC Curve