TY - GEN
T1 - Quantile Encoder
T2 - 18th International Conference on Modeling Decisions for Artificial Intelligence, MDAI 2021
AU - Mougan, Carlos
AU - Masip, David
AU - Nin, Jordi
AU - Pujol, Oriol
N1 - Funding Information:
This work was partially funded by the European Commission under contract numbers NoBIAS - H2020-MSCA-ITN-2019 project GA No. 860630. This work has been partially funded by the Spanish project PID2019-105093GB-I00 (MINECO/FEDER, UE).
Publisher Copyright:
© 2021, Springer Nature Switzerland AG.
PY - 2021
Y1 - 2021
AB - Regression problems have been widely studied in the machine learning literature, resulting in a plethora of regression models and performance measures. However, there are few techniques specifically dedicated to solving the problem of how to incorporate categorical features into regression problems. Usually, categorical feature encoders are general enough to cover both classification and regression problems. This lack of specificity results in underperforming regression models. In this paper, we provide an in-depth analysis of how to tackle high-cardinality categorical features with the quantile. Our proposal outperforms state-of-the-art encoders, including the traditional statistical mean target encoder, when considering the Mean Absolute Error, especially in the presence of long-tailed or skewed distributions. In addition, to deal with possible overfitting when there are categories with small support, our encoder benefits from additive smoothing. Finally, we describe how to expand the encoded values by creating a set of features with different quantiles. This expanded encoder provides a more informative output about the categorical feature in question, further boosting the performance of the regression model.
KW - Categorical features
KW - Machine learning
KW - Regression problems
KW - Statistical learning
UR - http://www.scopus.com/inward/record.url?scp=85115869730&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-85529-1_14
DO - 10.1007/978-3-030-85529-1_14
M3 - Conference contribution
AN - SCOPUS:85115869730
SN - 9783030855284
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 168
EP - 180
BT - Modeling Decisions for Artificial Intelligence - 18th International Conference, MDAI 2021, Proceedings
A2 - Torra, Vicenç
A2 - Narukawa, Yasuo
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 27 September 2021 through 30 September 2021
ER -