Impact of Sample Size on the Robustness of Machine Learning Algorithms for Detecting Loan Defaults Using Imbalanced Data

Boitumelo Tryphina Kobone; Tlhalitshi Volition Montshiwa

doi:10.47738/jads.v6i3.713

Abstract

This study assessed the impact of sample size on the robustness of five machine learning (ML) classifiers: Support Vector Machine (SVM), Random Forest (RF), Naïve Bayes (NB), Decision Tree (DT), and K-Nearest Neighbour (K-NN). Data-balancing techniques can help address class imbalance, but they have limitations, which are discussed in this paper. The study continues the established use of these five classifiers for credit default detection and contributes by examining whether increasing the sample size improves their performance when they are trained on an imbalanced loan default dataset that has not been the focus of previous studies; most ML algorithms are known to perform well when trained on large datasets. The study used a secondary, imbalanced loan default dataset from Kaggle.com in which 85% of participants made loan payments and 15% defaulted. Stratified random sampling, with the dependent variable as the stratum, was used to draw samples of increasing size: 2% of the total observations, then 5%, then 10% up to 90% of the dataset. No consistent change in the classification metrics was found as sample size changed. RF and DT achieved 100% performance regardless of sample size and are therefore recommended as the most robust to data imbalance in loan default detection. The average classification metrics for NB and K-NN ranged from 72% to 92%, and SVM produced the lowest averages, between 69% and 75%. NB, K-NN, and SVM yielded poor sensitivity rates of 0% to 53%, indicating poor prediction of loan payments, but specificity scores in the range of 84% to 86%, indicating good classification of loan defaults. Future studies should consider other sampling methods, as well as deep and hybrid learning methods, and compare them with RF and DT.
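The sampling design described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the synthetic rows, the `default` column name, the helper names `stratified_sample` and `per_class_recall`, and the 10% step between 10% and 90% are all assumptions made for illustration. Because the abstract does not state which class is treated as positive, the sketch reports recall separately for each class rather than naming one rate sensitivity.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_key, fraction, seed=0):
    """Draw roughly `fraction` of the rows from each class (stratum)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[label_key]].append(row)
    sample = []
    for members in strata.values():
        # Keep at least one row per class so no stratum disappears.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def per_class_recall(y_true, y_pred):
    """Share of each true class that was predicted correctly."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return {label: hits[label] / totals[label] for label in totals}


# Synthetic stand-in for the Kaggle data (assumption): 85% paid (0), 15% defaulted (1).
rows = [{"id": i, "default": 0} for i in range(850)]
rows += [{"id": 850 + i, "default": 1} for i in range(150)]

# 2%, 5%, then 10% to 90%; the 10-point step is an assumption.
fractions = [0.02, 0.05] + [step / 10 for step in range(1, 10)]

for fraction in fractions:
    subset = stratified_sample(rows, "default", fraction)
    default_share = sum(r["default"] for r in subset) / len(subset)
    print(f"{fraction:.0%} sample: n={len(subset)}, default share={default_share:.2f}")

# A classifier that always predicts "paid" shows why overall accuracy misleads
# on imbalanced data: it is right for every payer and for no defaulter.
subset = stratified_sample(rows, "default", 0.10)
y_true = [r["default"] for r in subset]
y_pred = [0] * len(y_true)
print(per_class_recall(y_true, y_pred))
```

Stratifying on the dependent variable keeps the 85/15 class split close to constant across sample sizes, so differences between runs reflect the amount of data rather than a shift in class balance.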

Keywords

Machine Learning Classifiers; Imbalanced Data; Sample Size; Loan Default

Journal of Applied Data Sciences

ISSN: 2723-6471 (Online)
Publisher: Bright Publisher
Website: http://bright-journal.org/JADS
Email: taqwa@amikompurwokerto.ac.id (principal contact); support@bright-journal.org (technical issues)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0
