Tackling a complex automatic classification problem? Struggling with imbalanced data? Trying to improve automatic classification across more than three classes? We are familiar with these challenges and are happy to take them on. Our data science team has trained a classifier that copes with highly complex classification problems.
Computers are good at classification tasks with clear-cut, predefined categories, for instance separating wanted e-mails from spam, or deciding which species an animal belongs to. Humans, however, routinely handle much more complex classification tasks, which involve mental processes that cannot simply be translated into category features. Computers still lag behind in modelling these cases. Another challenge that automatic classifiers regularly face is imbalanced training data. If certain classes are overrepresented in your training set, your algorithm is likely to overlearn them and perform worse than expected in actual use. We offer a solution to both problems by training an algorithm that classifies your data as similarly to humans as possible.
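To make the imbalance problem concrete, here is a minimal sketch in plain Python. The labels and class counts are made up for illustration and are not taken from any real data set. It shows how a "classifier" that always predicts the most frequent class can look accurate while never finding the rare classes.

```python
from collections import Counter

# Hypothetical toy labels: 90 items of class "a", 7 of "b", 3 of "c".
labels = ["a"] * 90 + ["b"] * 7 + ["c"] * 3

# A baseline that always predicts the most frequent class.
majority_class = Counter(labels).most_common(1)[0][0]
predictions = [majority_class] * len(labels)

# Accuracy: share of items whose prediction matches the true label.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def recall(cls):
    """Share of the true members of `cls` that were predicted as `cls`."""
    members = [p for p, y in zip(predictions, labels) if y == cls]
    return sum(p == cls for p in members) / len(members)


per_class_recall = {cls: recall(cls) for cls in sorted(set(labels))}

print(accuracy)
print(per_class_recall)
```

The baseline reaches 0.9 accuracy here, yet its recall for the two rare classes is zero, which is the kind of behaviour that tends to disappoint in actual use.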
One of our clients, Járókelő, provided us with data in which the human factor was prevalent. Based on previous studies, we chose to train a Random Forest classifier on four versions of the same training data: the original imbalanced set and three rebalanced variants (oversampled, undersampled, and a combination of the two). To be able to compare the results, we tested all of them on the same imbalanced data set. The results showed that balancing your data is worthwhile: oversampling, undersampling and the combination of the two each produced promising results.
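The idea behind the rebalanced training sets can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses simple random duplication and random removal, not necessarily the resampling methods used in the study, and the function names, category labels and sample texts are hypothetical.

```python
import random
from collections import Counter, defaultdict


def group_by_label(samples, labels):
    """Collect the samples of each class into one list per label."""
    groups = defaultdict(list)
    for sample, label in zip(samples, labels):
        groups[label].append(sample)
    return groups


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen items until every class is as large as the largest one."""
    rng = random.Random(seed)
    groups = group_by_label(samples, labels)
    target = max(len(items) for items in groups.values())
    out_samples, out_labels = [], []
    for label, items in groups.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class, as large as the smallest class."""
    rng = random.Random(seed)
    groups = group_by_label(samples, labels)
    target = min(len(items) for items in groups.values())
    out_samples, out_labels = [], []
    for label, items in groups.items():
        out_samples.extend(rng.sample(items, target))
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Hypothetical imbalanced training set: 6, 3 and 1 items per class.
labels = ["road"] * 6 + ["lighting"] * 3 + ["waste"] * 1
samples = [f"report {i}" for i in range(len(labels))]

over_samples, over_labels = random_oversample(samples, labels)
under_samples, under_labels = random_undersample(samples, labels)

print(Counter(over_labels))
print(Counter(under_labels))
```

Only the training data is resampled in a setup like this; the test set keeps its original, imbalanced distribution so that every version is judged on the same data.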
If you want a classifier with high precision, you should choose SMOTEENN, a method that combines over- and undersampling. In our study it raised precision to 0.667, compared with only 0.420 for the imbalanced training data. If accuracy matters most, you should work with oversampled training data, which increased accuracy from 0.479 to 0.491. Each version therefore performs well with respect to different metrics, so how you fine-tune your algorithm depends on your clients’ needs.
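Because the two metrics can point to different winners, it helps to see how they differ. The sketch below computes both for a multi-class problem in plain Python. It assumes macro-averaged precision (the unweighted mean of the per-class precisions), which may not be the averaging used in the study, and the labels and predictions are invented for illustration.

```python
def accuracy(y_true, y_pred):
    """Share of all items that received the correct label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_precision(y_true, y_pred):
    """Unweighted mean of per-class precision.

    Per-class precision: of the items predicted as a class, the share that
    truly belong to it. A class that is never predicted counts as 0.0 here,
    which is one common convention.
    """
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for cls in classes:
        truths_when_predicted = [t for t, p in zip(y_true, y_pred) if p == cls]
        if truths_when_predicted:
            hits = sum(t == cls for t in truths_when_predicted)
            per_class.append(hits / len(truths_when_predicted))
        else:
            per_class.append(0.0)
    return sum(per_class) / len(per_class)


# Hypothetical true labels and predictions for three classes.
y_true = ["a", "a", "a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "a", "b", "b", "a", "c", "a"]

print(accuracy(y_true, y_pred))
print(macro_precision(y_true, y_pred))
```

In this toy example accuracy is 0.625 while macro precision is about 0.7: accuracy is dominated by the largest class, whereas macro precision weights every class equally, so a resampling method can improve one metric more than the other.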