Machine learning for healthcare: a data-centric approach

Valeriano, Maria Gabriela [UNIFESP]

dc.contributor.advisor	Lorena, Ana Carolina
dc.contributor.advisor-co	Kiffer, Carlos Roberto Veiga [UNIFESP]
dc.contributor.advisor-coLattes	http://lattes.cnpq.br/7021893874375037
dc.contributor.advisorLattes	http://lattes.cnpq.br/3451628262694747
dc.contributor.author	Valeriano, Maria Gabriela [UNIFESP]
dc.contributor.authorLattes	http://lattes.cnpq.br/7462488231975857
dc.coverage.spatial	São José dos Campos, SP
dc.date.accessioned	2024-08-28T14:00:40Z
dc.date.available	2024-08-28T14:00:40Z
dc.date.issued	2024-06-25
dc.description.abstract	Machine learning models have the potential to revolutionize the healthcare sector by leveraging data that health systems collect continuously. Traditionally, these models are trained on large datasets, with performance improvements achieved through robust models and hyperparameter tuning. In this work, we propose a data-centric approach that focuses on improving the data itself. Throughout this research, a set of health-related databases was created. These databases originate from four distinct sources and cover the prediction of severe cases of COVID-19 and dengue, as well as the authorization of specialized care in the public health system in Brazil. The datasets cover seven predictive tasks, each with separate training and testing data. All problems were framed as binary classification tasks on tabular data. The datasets were first characterized by their hardness profiles, using an instance hardness measure proposed in previous works. This measure considers the probability of an instance being misclassified by different machine learning algorithms. Our analysis considered seven classifiers with distinct biases: Gradient Boosting, Random Forest, Logistic Regression, Multilayer Perceptron, Support Vector Classifier (with linear and RBF kernels), and Bagging. The models were evaluated using the area under the ROC curve and per-class recall and precision, to give a broad view of model performance. We also propose a new approach to generate post-hoc explanations for machine learning models. In this approach, we identify the instances where the models are most likely to fail and offer data-centric explanations for such failures. The patterns found explain the model errors, resulting in greater confidence in the predictions made.
Additionally, we present a case study in which instance hardness analysis was used to improve the design of a prediction problem in collaboration with the data specialist. This work showed that the approach can improve data quality and, ultimately, model performance. Finally, we propose a generalized approach to enhance model performance when access to data experts is not possible. It follows a two-step strategy: first, cleaning the training data based on instance hardness values, and then introducing a reject option for test instances on which the models do not offer high-confidence predictions. The results show that it is possible to improve model performance at the cost of rejecting instances from the test set.
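As a rough illustration of the two-step strategy described in the abstract, the following Python sketch computes a per-instance hardness score, drops hard training instances, and applies a reject option at prediction time. It is not the thesis implementation. The hardness definition used here (one minus the mean probability that a pool of classifiers assigns to the true class), the function names, and both thresholds (`max_hardness`, `min_confidence`) are assumptions made for illustration only.

```python
from statistics import mean


def instance_hardness(y_true, prob_pos_per_model):
    """Hardness per instance, assumed here to be 1 - mean P(true class).

    y_true: list of 0/1 labels.
    prob_pos_per_model: one list per classifier, holding that classifier's
        predicted probability of class 1 for every instance.
    """
    hardness = []
    for i, label in enumerate(y_true):
        # Probability each classifier assigns to the instance's true class.
        p_true = [p[i] if label == 1 else 1.0 - p[i] for p in prob_pos_per_model]
        hardness.append(1.0 - mean(p_true))
    return hardness


def filter_hard_instances(rows, hardness, max_hardness=0.5):
    """Step 1: keep only training rows at or below the hardness threshold."""
    return [row for row, h in zip(rows, hardness) if h <= max_hardness]


def predict_with_reject(prob_pos, min_confidence=0.8):
    """Step 2: return 0/1 predictions, or None when confidence is too low."""
    predictions = []
    for p in prob_pos:
        confidence = max(p, 1.0 - p)
        if confidence >= min_confidence:
            predictions.append(1 if p >= 0.5 else 0)
        else:
            predictions.append(None)  # rejected: left for human review
    return predictions


# Toy data (made up): three training instances scored by two classifiers.
y_train = [1, 0, 1]
probs = [
    [0.9, 0.2, 0.3],  # classifier A, P(class 1) per instance
    [0.8, 0.4, 0.1],  # classifier B, P(class 1) per instance
]
train_rows = ["patient_a", "patient_b", "patient_c"]

hardness = instance_hardness(y_train, probs)
clean_rows = filter_hard_instances(train_rows, hardness)
test_predictions = predict_with_reject([0.95, 0.6, 0.1])
```

In this toy run the third training instance is the hardest, because both classifiers assign low probability to its true class, so it is removed before retraining. The middle test instance is rejected because neither class reaches the assumed 0.8 confidence threshold. Raising `min_confidence` rejects more test instances, which is the trade-off the abstract reports.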
dc.description.sponsorship	Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
dc.description.sponsorshipID	8887.507037/2020-00
dc.emailadvisor.custom	aclorena@gmail.com
dc.format.extent	214 f.
dc.identifier.uri	https://hdl.handle.net/11600/71671
dc.language	eng
dc.publisher	Universidade Federal de São Paulo
dc.rights	info:eu-repo/semantics/restrictedAccess
dc.subject	Machine learning
dc.subject	Healthcare
dc.subject	Data-centric
dc.title	Machine learning for healthcare: a data-centric approach
dc.title.alternative	Aprendizado de máquina para a área da saúde: uma abordagem centrada nos dados
dc.type	info:eu-repo/semantics/doctoralThesis
unifesp.campus	Instituto de Ciência e Tecnologia (ICT)
unifesp.graduateProgram	Pesquisa Operacional
unifesp.knowledgeArea	Data science
unifesp.researchArea	Machine learning

Files

Original bundle

Name: Tese_mgabriela.pdf
Size: 5.12 MB
Format: Adobe Portable Document Format

Collections

PPG - Pesquisa Operacional