Classification of breast cancer using ensemble machine learning with Apache Spark

KROTHA, Durga Pujitha; SHAIK, Fathimabi

doi:10.14744/sigma.2025.00126

Durga Pujitha KROTHA ¹, Fathimabi SHAIK ¹

¹Department of Information Technology, Velagapudi Ramakrishna Siddhartha Engineering College, Vijayawada, India.

Sigma J Eng Nat Sci 2025; 43(4): 1385-1399 DOI: 10.14744/sigma.2025.00126

Abstract

Breast cancer is one of the most common and serious problems affecting people around the world. Detecting it early and correctly identifying whether a tumor is benign or malignant is essential. In this study, we developed a new model called the Logistic Ensemble Fusion Model to improve the accuracy of breast cancer diagnosis. This model combines three machine learning models, namely Support Vector Machine, Decision Tree, and Logistic Regression, into a single ensemble that improves significantly over traditional methods. We used Apache Spark with its Python API to handle large datasets quickly and efficiently. To select the most important features for prediction, we used Recursive Feature Elimination (RFE) driven by both a Support Vector Machine (SVM-RFE) and a Random Forest (RF-RFE). We evaluated the model by splitting the data into training and testing sets in an 80:20 ratio. The Logistic Ensemble Fusion Model achieved an accuracy of 99.13%, precision of 98.71%, recall of 99.91%, and an F1 score of 99.12%. The entire process, which involved running 12 Spark jobs, completed in 38 seconds. The model was compared with other models, including Random Forest, Gradient Boosting, Factorization Machine, One-vs-Rest, and Multilayer Perceptron. The main innovation of this study is the use of multiple machine learning models in a unified ensemble fusion approach, which provides improved classification performance and demonstrates a significant advancement over previous methods. This study underscores the potential of advanced ensemble machine learning techniques and big data technologies in refining breast cancer diagnosis and supporting more effective clinical decision-making.

Keywords: Breast Cancer; Classification; Ensemble Methods; Feature Selection; Machine Learning; Spark
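
The abstract does not state how the three base models are fused, so the following is only a minimal sketch of one common option, hard majority voting, together with the four reported metric types. It is plain Python rather than Spark, and everything in it is an assumption made for illustration: the function names `majority_vote` and `binary_metrics`, the toy label lists, and the choice of voting rule are hypothetical and are not taken from the study.

```python
from collections import Counter


def majority_vote(*model_predictions):
    """Fuse per-model label lists by taking the most common label per sample.

    Assumption: hard voting stands in for the paper's fusion rule, which the
    abstract does not specify. With three voters and two classes, ties cannot occur.
    """
    fused = []
    for votes in zip(*model_predictions):
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical toy labels (1 = malignant, 0 = benign); not data from the study.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
svm_pred = [1, 0, 1, 0, 0, 0, 1, 0]      # one error (sample 3)
tree_pred = [1, 0, 0, 1, 0, 1, 1, 0]     # two errors (samples 2 and 5)
logreg_pred = [1, 1, 1, 1, 0, 0, 1, 0]   # one error (sample 1)

fused = majority_vote(svm_pred, tree_pred, logreg_pred)
metrics = binary_metrics(y_true, fused)
```

In this toy case each base model errs on different samples, so the vote corrects every individual mistake; that is the intuition behind combining diverse classifiers, not a reproduction of the reported results.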