TY - JOUR
T1 - Towards UCI+
T2 - A mindful repository design
AU - Macià, Núria
AU - Bernadó-Mansilla, Ester
PY - 2014/3/10
Y1 - 2014/3/10
N2 - Public repositories have contributed to the maturation of experimental methodology in machine learning. Publicly available data sets have allowed researchers to empirically assess their learners and, jointly with open source machine learning software, they have favoured the emergence of comparative analyses of learners' performance over a common framework. These studies have brought standard procedures to evaluate machine learning techniques. However, current claims - such as the superiority of enhanced algorithms - are biased by unsustained assumptions made throughout some praxes. In this paper, the early steps of the methodology, which refer to data set selection, are inspected. Particularly, the exploitation of the most popular data repository in machine learning - the UCI repository - is examined. We analyse the type, complexity, and use of UCI data sets. The study recommends the design of a mindful data repository, UCI+, which should include a set of properly characterised data sets consisting of a complete and representative sample of real-world problems, enriched with artificial benchmarks. The ultimate goal of the UCI+ is to lay the foundations towards a well-supported methodology for learner assessment.
KW - Classification
KW - Data complexity
KW - Data repository
KW - Synthetic data set
UR - http://www.scopus.com/inward/record.url?scp=84891781761&partnerID=8YFLogxK
U2 - 10.1016/j.ins.2013.08.059
DO - 10.1016/j.ins.2013.08.059
M3 - Article
AN - SCOPUS:84891781761
SN - 0020-0255
VL - 261
SP - 237
EP - 262
JO - Information Sciences
JF - Information Sciences
ER -