Imputation and Hyperparameter Optimization in Cancer Diagnosis
by Yi Liu 1, Wendy Wang 2,*, Haibo Wang 3
1 Department of Computer and Information Science, University of Massachusetts Dartmouth, MA 02747 USA
2 Department of Computer Science and Information Systems, University of North Alabama, Florence, AL 35632 USA
3 Division of International Business and Technology Studies, Texas A&M International University, TX 78041 USA
* Author to whom correspondence should be addressed.
Journal of Engineering Research and Sciences, Volume 2, Issue 8, Pages 1-18, 2023; DOI: 10.55708/js0208001
Keywords: Machine Learning, Cervical Cancer, Imputation, Hyperparameter Optimization
Received: 18 April 2023, Revised: 03 July 2023, Accepted: 17 August 2023, Published Online: 28 August 2023
Cancer is one of the leading causes of death worldwide, and accurate, timely detection can save lives. As more machine learning algorithms and approaches are applied to cancer diagnosis, there is a growing need to analyze their performance. This study compares the detection accuracy and speed of nineteen machine learning algorithms on a cervical cancer dataset. To keep the approach general enough to detect various types of cancer, the study intentionally excludes feature selection, a step commonly applied in most studies for a specific dataset or a certain type of cancer. Imputation and hyperparameter optimization are employed to improve the algorithms' performance. The results suggest that the algorithms tend to perform better when both imputation and hyperparameter optimization are applied than when either is employed alone or when both are absent. The majority of the algorithms show improved diagnostic accuracy, although at the cost of increased execution time. These findings demonstrate the potential of machine learning in cancer diagnosis, especially the possibility of developing versatile systems that can detect various types of cancer with satisfactory performance.
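The combination of imputation and hyperparameter optimization described above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the study: the use of scikit-learn, the synthetic data standing in for a clinical dataset, the 10% missing-value rate, the random forest as a single example classifier, and the small hypothetical search grid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a tabular screening dataset (not the study's data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Blank out roughly 10% of the entries to simulate missing values.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.10] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Imputation sits inside the pipeline so that, during cross-validation,
# the fill values are learned from the training folds only.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Hypothetical search grid: the imputation strategy is tuned together
# with the classifier's own hyperparameters.
param_grid = {
    "impute__strategy": ["mean", "median"],
    "clf__n_estimators": [50, 100],
    "clf__max_depth": [3, None],
}

# Exhaustive grid search with 3-fold cross-validation, scored by accuracy.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# With refit=True (the default), score() uses the best pipeline found.
test_accuracy = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Held-out accuracy:", round(test_accuracy, 3))
```

Tuning the imputation strategy in the same search as the classifier settings is one way to evaluate the two steps jointly rather than in isolation. The extra model fits that the search requires also reflect the trade-off noted in the abstract between accuracy and execution time.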