Machine learning for healthcare: a data-centric approach

dc.contributor.advisor: Lorena, Ana Carolina
dc.contributor.advisor-co: Kiffer, Carlos Roberto Veiga [UNIFESP]
dc.contributor.advisor-coLattes: http://lattes.cnpq.br/7021893874375037
dc.contributor.advisorLattes: http://lattes.cnpq.br/3451628262694747
dc.contributor.author: Valeriano, Maria Gabriela [UNIFESP]
dc.contributor.authorLattes: http://lattes.cnpq.br/7462488231975857
dc.coverage.spatial: São José dos Campos, SP
dc.date.accessioned: 2024-08-28T14:00:40Z
dc.date.available: 2024-08-28T14:00:40Z
dc.date.issued: 2024-06-25
dc.description.abstract: Machine learning models have the potential to revolutionize the healthcare sector by leveraging data that health systems collect continuously. Traditionally, these models are trained on large datasets, with performance improvements achieved through robust models and hyperparameter tuning. In this work, we propose a data-centric approach that focuses on improving the data itself. Throughout this research, a set of health-related databases was created. These databases originate from four distinct sources and cover the prediction of severe cases of COVID-19 and dengue, as well as the authorization of specialized care in the Brazilian public health system. The resulting datasets cover seven predictive tasks, each with separate training and testing data. All problems were designed as binary classification tasks on tabular data. The datasets were first characterized by their hardness profiles, using an instance hardness measure proposed in previous work. This measure considers the probability of an instance being misclassified by different machine learning algorithms. Our analysis considered seven classifiers with distinct biases: Gradient Boosting, Random Forest, Logistic Regression, Multilayer Perceptron, Support Vector Classifier (with linear and RBF kernels), and Bagging. The models were evaluated using the area under the ROC curve and per-class recall and precision, to give a broad view of model performance. We proposed a new approach to generate post-hoc explanations for machine learning models. In this approach, we identified the instances where the models are most likely to fail and offered data-centric explanations for such failures. The patterns found explain the model errors, resulting in greater confidence in the predictions made.
Additionally, we present a case study in which instance hardness analysis was used, in collaboration with the data specialist, to improve the design of a prediction problem. This work demonstrated that the approach made it possible to improve data quality and, ultimately, model performance. Finally, we propose a generalized approach to enhance model performance when access to data experts is not possible. A two-step strategy was adopted: first, cleaning the training data based on instance hardness values, and then introducing a reject option for test instances on which the models did not offer high-confidence predictions. The results show that it is possible to improve model performance at the cost of rejecting instances from the test set.
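The two-step strategy described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the hardness estimate below (one minus the cross-validated probability of the true class, averaged over a small pool of classifiers), the synthetic data, the helper names `instance_hardness` and `predict_with_reject`, and the thresholds 0.6 and 0.75 are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split


def instance_hardness(X, y, classifiers, cv=5):
    """Average, over classifiers, of 1 - cross-validated P(true class)."""
    y = np.asarray(y)
    per_classifier = []
    for clf in classifiers:
        proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")
        # Columns follow the sorted class labels; labels are assumed to be 0/1.
        per_classifier.append(1.0 - proba[np.arange(len(y)), y])
    return np.mean(per_classifier, axis=0)


def predict_with_reject(model, X, confidence=0.75):
    """Predict labels; return -1 where the top class probability is too low."""
    proba = model.predict_proba(X)
    labels = model.classes_[proba.argmax(axis=1)]
    return np.where(proba.max(axis=1) >= confidence, labels, -1)


# Synthetic binary tabular data with some label noise (illustrative only).
X, y = make_classification(n_samples=600, n_features=10, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Step 1: estimate hardness on the training set and drop the hardest instances.
pool = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=100, random_state=0),
]
hardness = instance_hardness(X_train, y_train, pool)
keep = hardness < 0.6  # hypothetical cleaning threshold
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train[keep], y_train[keep])

# Step 2: reject test instances that the model is not confident about.
pred = predict_with_reject(model, X_test, confidence=0.75)
accepted = pred != -1
print(f"kept {keep.sum()} of {len(keep)} training instances")
print(f"rejected {(~accepted).sum()} of {len(pred)} test instances")
```

Metrics computed only on `y_test[accepted]` would then reflect the trade-off the abstract mentions: any gain in performance is paid for with the rejected test instances, so the rejection rate should be reported alongside it.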
dc.description.sponsorship: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
dc.description.sponsorshipID: 8887.507037/2020-00
dc.emailadvisor.custom: aclorena@gmail.com
dc.format.extent: 214 f.
dc.identifier.uri: https://hdl.handle.net/11600/71671
dc.language: eng
dc.publisher: Universidade Federal de São Paulo
dc.rights: info:eu-repo/semantics/restrictedAccess
dc.subject: Machine learning
dc.subject: Healthcare
dc.subject: Data-centric
dc.title: Machine learning for healthcare: a data-centric approach
dc.title.alternative: Aprendizado de máquina para a área da saúde: uma abordagem centrada nos dados
dc.type: info:eu-repo/semantics/doctoralThesis
unifesp.campus: Instituto de Ciência e Tecnologia (ICT)
unifesp.graduateProgram: Pesquisa Operacional
unifesp.knowledgeArea: Ciência de dados
unifesp.researchArea: Aprendizado de máquina
Files
Original bundle
Name: Tese_mgabriela.pdf
Size: 5.12 MB
Format: Adobe Portable Document Format