Constraint handling genetic algorithm for feature engineering in solving classification problems


Abstract

Feature engineering in machine learning is a promising but still insufficiently studied domain. Creating a new feature space from the original one can increase the accuracy of the machine learning algorithm chosen to solve complex data mining problems. Some existing selection methods can simultaneously increase accuracy and reduce the feature space; such reduction is an urgent task in big data problems.

The paper considers a novel machine learning approach for solving classification problems based on feature engineering methods. The approach constructs informative features using feature selection and extraction methods: the original data and the features obtained by principal component analysis form a new feature set, from which a genetic algorithm selects an effective subset of informative features. To avoid overfitting and building a trivial classifier, the fitness function is constrained to produce a given number of original features and a given number of features obtained by principal component analysis. The paper presents a comparative analysis of three classifiers, namely k-nearest neighbors, support vector machine, and random forest. To demonstrate the accuracy improvement, the authors examine several real-world problems from the UCI Machine Learning Repository. The accuracy measure used in the study is the macro F1-score.
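The constrained selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature counts, penalty weight, and the proxy score standing in for a classifier's cross-validated accuracy are all hypothetical, and the constraint is enforced with a simple static penalty in the fitness function.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sizes: 10 original features and 5 PCA components,
# pooled into one candidate set that a binary chromosome selects from.
n_orig, n_pca = 10, 5
n_total = n_orig + n_pca
k_orig, k_pca = 4, 2      # required counts of each feature type (assumed)
PENALTY = 1.0             # static penalty weight per unit of violation (assumed)

# Pretend these columns are the truly informative ones, so a cheap proxy
# score can replace the classifier-based fitness used in the paper.
informative = frozenset([0, 1, 2, 3, 10, 11])

def proxy_score(mask):
    """Toy stand-in for cross-validated classifier accuracy."""
    sel = set(np.flatnonzero(mask))
    if not sel:
        return 0.0
    return len(informative & sel) / max(len(informative), len(sel))

def fitness(mask):
    """Penalized fitness: score minus a penalty for deviating from the
    required counts of original and PCA-derived features."""
    violation = abs(int(mask[:n_orig].sum()) - k_orig) \
              + abs(int(mask[n_orig:].sum()) - k_pca)
    return proxy_score(mask) - PENALTY * violation

def run_ga(pop_size=40, generations=60, p_mut=1.0 / n_total):
    pop = rng.integers(0, 2, size=(pop_size, n_total))
    best = max(pop, key=fitness).copy()
    for _ in range(generations):
        # binary tournament selection
        idx = rng.integers(0, pop_size, size=(pop_size, 2))
        parents = np.array([pop[a] if fitness(pop[a]) >= fitness(pop[b])
                            else pop[b] for a, b in idx])
        # uniform crossover between consecutive parents
        swap = rng.random((pop_size, n_total)) < 0.5
        children = np.where(swap, parents, np.roll(parents, 1, axis=0))
        # bit-flip mutation
        flip = rng.random((pop_size, n_total)) < p_mut
        pop = np.where(flip, 1 - children, children)
        pop[0] = best                      # elitism keeps the best-so-far
        cand = max(pop, key=fitness)
        if fitness(cand) > fitness(best):
            best = cand.copy()
    return best

best_mask = run_ga()
```

Because the penalty grows with the distance from the required feature counts while the score term is bounded, chromosomes that satisfy both count constraints dominate the population, which is the effect the constrained fitness function is designed to achieve.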

The results of the numerical experiments show that the proposed approach outperforms both the classifiers trained on the original data set and random feature selection (the lower bound for the results). Moreover, the accuracy enhancement is obtained for all types of problems, including data sets that have more features than instances. All results are shown to be statistically significant.
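The macro F1-score used as the accuracy measure is the unweighted mean of per-class F1 scores, so rare classes weigh as much as frequent ones; a trivial classifier that predicts only the majority class therefore scores poorly. A minimal implementation of this standard definition:

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Macro F1: the unweighted mean of per-class F1 scores, so every
    class contributes equally regardless of its frequency."""
    classes = np.unique(np.concatenate([y_true, y_pred]))
    f1_per_class = []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_per_class.append(f1)
    return float(np.mean(f1_per_class))

# Example: class 2 is rare, yet its perfect F1 still carries 1/3 weight.
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])
print(macro_f1(y_true, y_pred))  # per-class F1 = 0.8, 0.8, 1.0 -> ~0.8667
```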

About the authors

Maksim A. Denisov

Reshetnev Siberian State University of Science and Technology

Author for correspondence.
Email: max_denisov00@mail.ru

Postgraduate

Russian Federation, 31, Krasnoiarskii Rabochi Prospekt, Krasnoyarsk, 660037

Evgenii A. Sopov

Reshetnev Siberian State University of Science and Technology

Email: evgenysopov@gmail.com

PhD (CS), Associate Professor

Russian Federation, 31, Krasnoiarskii Rabochi Prospekt, Krasnoyarsk, 660037



Copyright (c) 2021 Denisov M.A., Sopov E.A.

This work is licensed under a Creative Commons Attribution 4.0 International License.
