TY - JOUR
T1 - A study of the effect of different types of noise on the precision of supervised learning techniques
AU - Nettleton, David F.
AU - Orriols-Puig, Albert
AU - Fornells, Albert
N1 - Funding Information:
Acknowledgments: The authors are grateful to the following persons for their comments and suggestions with respect to the generation of noise and its incorporation into data sets: Dr. Vicenç Torra of The Institute for Investigation in Artificial Intelligence, Bellaterra, Catalunya, Spain, and Dr. Xingquan Zhu of the Department of Computer Science & Engineering, Florida Atlantic University, United States. Thank you to Dr. Mark Hall, previously of the University of Waikato (New Zealand), for his information about IBk. The authors also wish to acknowledge the Ministerio de Educación y Ciencia for its support under project TIN2008-06681-C06-05, and Generalitat de Catalunya for its support under grant 2009SGR-00183.
PY - 2010/3
Y1 - 2010/3
N2 - Machine learning techniques often have to deal with noisy data, which may affect the accuracy of the resulting data models. Therefore, effectively dealing with noise is a key aspect in supervised learning to obtain reliable models from data. Although several authors have studied the effect of noise for some particular learners, comparisons of its effect among different learners are lacking. In this paper, we address this issue by systematically comparing how different degrees of noise affect four supervised learners that belong to different paradigms. Specifically, we consider the Naïve Bayes probabilistic classifier, the C4.5 decision tree, the IBk instance-based learner and the SMO support vector machine. We have selected four methods which enable us to contrast different learning paradigms, and which are considered to be four of the top ten algorithms in data mining (Yu et al. 2007). We test them on a collection of data sets that are perturbed with noise in the input attributes and noise in the output class. As an initial hypothesis, we assign the techniques to two groups, NB with C4.5 and IBk with SMO, based on their proposed sensitivity to noise, the first group being the least sensitive. The analysis enables us to extract key observations about the effect of different types and degrees of noise on these learning techniques. In general, we find that Naïve Bayes appears to be the most robust algorithm, and SMO the least, relative to the other two techniques. However, we find that the underlying empirical behavior of the techniques is more complex, and varies depending on the noise type and the specific data set being processed. In general, noise in the training data set is found to give the most difficulty to the learners.
AB - Machine learning techniques often have to deal with noisy data, which may affect the accuracy of the resulting data models. Therefore, effectively dealing with noise is a key aspect in supervised learning to obtain reliable models from data. Although several authors have studied the effect of noise for some particular learners, comparisons of its effect among different learners are lacking. In this paper, we address this issue by systematically comparing how different degrees of noise affect four supervised learners that belong to different paradigms. Specifically, we consider the Naïve Bayes probabilistic classifier, the C4.5 decision tree, the IBk instance-based learner and the SMO support vector machine. We have selected four methods which enable us to contrast different learning paradigms, and which are considered to be four of the top ten algorithms in data mining (Yu et al. 2007). We test them on a collection of data sets that are perturbed with noise in the input attributes and noise in the output class. As an initial hypothesis, we assign the techniques to two groups, NB with C4.5 and IBk with SMO, based on their proposed sensitivity to noise, the first group being the least sensitive. The analysis enables us to extract key observations about the effect of different types and degrees of noise on these learning techniques. In general, we find that Naïve Bayes appears to be the most robust algorithm, and SMO the least, relative to the other two techniques. However, we find that the underlying empirical behavior of the techniques is more complex, and varies depending on the noise type and the specific data set being processed. In general, noise in the training data set is found to give the most difficulty to the learners.
KW - Attribute noise
KW - Class noise
KW - Machine learning techniques
KW - Noise impacts
UR - http://www.scopus.com/inward/record.url?scp=84898030282&partnerID=8YFLogxK
U2 - 10.1007/s10462-010-9156-z
DO - 10.1007/s10462-010-9156-z
M3 - Article
AN - SCOPUS:84898030282
SN - 0269-2821
VL - 33
SP - 275
EP - 306
JO - Artificial Intelligence Review
JF - Artificial Intelligence Review
IS - 4
ER -