Machine learning for healthcare: a data-centric approach
Date
2024-06-25
Type
Doctoral thesis
Abstract
Machine learning models have the potential to revolutionize the healthcare sector by leveraging data continuously collected in health systems. Traditionally, these models are trained on large datasets, with performance improvements achieved through robust models and hyperparameter tuning. In this work, we propose a data-centric approach that focuses on improving the data itself.

Throughout this research, a set of health-related databases was created. These databases originate from four distinct sources, encompassing the prediction of severe cases of COVID-19 and dengue, as well as the authorization of specialized care in the Brazilian public health system. The datasets cover seven predictive tasks, each with separate training and test data. All problems were framed as binary classification tasks on tabular data.

The datasets were first characterized in terms of their hardness profiles, using an instance hardness measure proposed in previous work. This measure estimates the probability of an instance being misclassified by different machine learning algorithms. Our analysis considered seven classifiers with distinct biases: Gradient Boosting, Random Forest, Logistic Regression, Multilayer Perceptron, Support Vector Classifier (with linear and RBF kernels), and Bagging. The models were evaluated using a set of metrics (area under the ROC curve and per-class precision and recall) to provide a holistic view of model performance.

We also propose a new approach to generate post-hoc explanations for machine learning models. In this approach, we identify the instances where the models are most likely to fail and offer data-centric explanations for such failures. The patterns found explain the model errors, resulting in greater confidence in the predictions made. Additionally, we present a case study in which instance hardness analysis was adopted to improve the design of a prediction problem in collaboration with a data specialist.
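The hardness measure described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes the pool's predictions have already been obtained (in practice, typically via cross-validation), and the toy labels, predictions, and classifier attributions in the comments are invented for the example.

```python
def instance_hardness(true_labels, pool_predictions):
    """Hardness of each instance = fraction of classifiers in the
    pool that misclassify it (an estimate of its misclassification
    probability under algorithms with distinct biases)."""
    n_classifiers = len(pool_predictions)
    hardness = []
    for i, label in enumerate(true_labels):
        errors = sum(1 for preds in pool_predictions if preds[i] != label)
        hardness.append(errors / n_classifiers)
    return hardness

# Toy example with binary labels and a pool of three classifiers.
y = [0, 1, 1, 0]
pool = [
    [0, 1, 0, 0],  # predictions of classifier 1 (e.g. a tree ensemble)
    [0, 1, 1, 1],  # predictions of classifier 2 (e.g. a linear model)
    [0, 0, 0, 0],  # predictions of classifier 3
]
print(instance_hardness(y, pool))  # easy instances score near 0, hard ones near 1
```

Instances misclassified by most of the pool receive hardness close to 1 and are the natural targets of the data-centric analyses described in this work.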
Our work demonstrated that, through this approach, it was possible to improve data quality and, ultimately, model performance. Finally, we propose a generalized approach to enhance model performance when access to data experts is not possible. A two-step strategy was adopted: first, the training data are cleaned based on instance hardness values; then, a reject option is introduced for test instances on which the models do not offer high-confidence predictions. The results show that it is possible to improve model performance at the cost of rejecting some instances from the test set.
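The two-step strategy can be sketched as below. This is a simplified illustration under assumed thresholds (`max_hardness` and `min_confidence` are hypothetical values chosen for the example, not the ones used in the thesis), with model confidence taken as the predicted probability of the positive class.

```python
def clean_training_set(X, y, hardness, max_hardness=0.5):
    """Step 1: drop training instances whose hardness exceeds a threshold."""
    kept = [(xi, yi) for xi, yi, h in zip(X, y, hardness) if h <= max_hardness]
    X_clean = [xi for xi, _ in kept]
    y_clean = [yi for _, yi in kept]
    return X_clean, y_clean

def predict_with_reject(positive_probs, min_confidence=0.8):
    """Step 2: predict a label only when confidence is high enough,
    otherwise abstain (None) under the reject option."""
    decisions = []
    for p in positive_probs:
        confidence = max(p, 1 - p)
        if confidence >= min_confidence:
            decisions.append(1 if p >= 0.5 else 0)
        else:
            decisions.append(None)  # rejected: deferred to a human decision
    return decisions

# Toy example: the third training instance is hard and gets filtered out;
# at test time, the uncertain middle prediction is rejected.
X, y, hardness = [[0.1], [0.9], [0.5]], [0, 1, 1], [0.0, 0.1, 0.9]
print(clean_training_set(X, y, hardness))
print(predict_with_reject([0.95, 0.55, 0.10]))  # → [1, None, 0]
```

Tightening `min_confidence` rejects more test instances but raises accuracy on the ones that are answered, which is the trade-off reported in the results.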