Informative attribute selection with hybrid self-adjusted evolutionary optimization algorithm



Abstract

The problem of informative attribute selection is considered. The problem is solved with a hybrid self-adjusted evolutionary algorithm, which is used as an optimization method for the bandwidth parameters in kernel regression. The algorithm is tested on test functions of various dimensions. The dependence of the reliability on the function dimension is presented, as are the results of the hybrid self-adjusted algorithm.

Full text

1. Introduction

Processes in technical and organizational systems are described by many attributes, but including non-informative attributes in a model is harmful for the research, so the processing of real data should start with the selection of informative attributes. The informative attribute selection problem is well known, and there are many classical methods for solving it, such as factor analysis and principal component analysis [1]. These methods, however, usually have limitations in practice; for instance, principal component analysis can be inappropriate for processes with strongly nonlinear relations among the variables. New methods are therefore being investigated.

In this paper the informative attribute selection problem is solved with regression models based on the Nadaraya-Watson nonparametric estimate (kernel regression) [2]. This estimate has two advantages: it does not require a search for the structure of a mathematical model of the process, and it can be applied to processes with nonlinear relations among the variables.

The nonparametric regression estimate depends on a "bandwidth" parameter (see section 2), so the successful use of kernel regression reduces to an optimization problem. If kernel regression is used for a multivariable function, a bandwidth must be selected for each variable. Classical optimization methods are difficult to apply here, because the problem has a non-analytical form and a high dimension. One possible method for this optimization problem is an evolutionary algorithm [3]. In this paper a genetic algorithm is proposed: the individuals are binary-coded representations of the vector of bandwidth parameters of the nonparametric regression estimate, and the fitness function is the average error of the estimate on a test sample.

The idea of informative attribute selection is based on one of the properties of the nonparametric estimate: the bandwidths of non-informative attributes tend to high values (see section 2). A high optimal (suboptimal) bandwidth value therefore indicates a low information content of the corresponding attribute.

A genetic algorithm has the disadvantage that its reliability greatly depends on the algorithm settings, and even an experienced researcher needs to spend a lot of time selecting effective settings, so it is necessary to use a self-adjusted algorithm. In this paper a hybrid self-adjusted genetic algorithm is proposed.

The paper is outlined as follows. Kernel regression is described in section 2. The genetic algorithm and the problem of its settings are introduced in section 3. The experiments and the results of the algorithm runs are discussed in section 4.

2. Kernel regression

Let us consider the Nadaraya-Watson kernel regression, a nonparametric statistical technique for estimating the conditional expectation of a random variable. Let $(x_1, x_2, \ldots, x_n)$ be a vector of variable values, $y$ a regression value, and $N$ the training sample size. The nonparametric regression estimate at a point $(x_1^*, x_2^*, \ldots, x_n^*)$ is

$$\hat{y}(x^*) = \frac{\sum_{i=1}^{N} y_i \prod_{j=1}^{n} \Phi\!\left(\frac{x_j^* - x_{ij}}{c_j}\right)}{\sum_{i=1}^{N} \prod_{j=1}^{n} \Phi\!\left(\frac{x_j^* - x_{ij}}{c_j}\right)}, \qquad (1)$$

where $c_j$ is a bandwidth parameter and $\Phi(\cdot)$ is a kernel (weighting) function. The quality of the regression estimate does not depend significantly on the choice of the kernel function; this choice is determined by the differentiability requirements of the estimate. The triangular kernel is one of the common types:

$$\Phi(t) = \begin{cases} 1 - |t|, & |t| < 1, \\ 0, & \text{otherwise.} \end{cases}$$
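Since the paper contains no code, a minimal Python sketch of estimate (1) with the triangular kernel may be helpful; the function names, the vectorized form, and the fallback for a query point with no training points inside the kernel support are our own assumptions, not part of the paper.

```python
import numpy as np

def triangular_kernel(t):
    """Triangular kernel: 1 - |t| for |t| < 1, and 0 otherwise."""
    return np.maximum(1.0 - np.abs(t), 0.0)

def nadaraya_watson(x_star, X, y, c):
    """Nadaraya-Watson estimate (1) at the point x_star.

    X: (N, n) training inputs; y: (N,) training outputs;
    c: (n,) per-variable bandwidth parameters c_j.
    """
    # Product over variables j of the kernel weights, for every training point i.
    weights = np.prod(triangular_kernel((x_star - X) / c), axis=1)
    denom = weights.sum()
    if denom == 0.0:
        # No training point falls inside the kernel support (an assumed fallback).
        return float(y.mean())
    return float(np.dot(weights, y) / denom)
```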
A quality criterion is used to assess the regression estimate. In this paper the quality criterion is the mean-square error

$$W = \frac{1}{N_t} \sum_{i=1}^{N_t} \left( y_i - \hat{y}_i \right)^2, \qquad (2)$$

where $N_t$ is the exam sample size, $y_i$ is the regression value from the exam sample, and $\hat{y}_i$ is the estimate of the regression value calculated by the kernel regression. The construction of the nonparametric regression estimate thus comes down to selecting the best values of the bandwidth parameters $c_j$, i.e. to minimizing the quality criterion $W$ with respect to the bandwidth parameters $c_j$.

It is necessary to note that the optimal bandwidth parameter $c_j$ increases for non-informative attributes [4]. If $c_j$ tends to infinity ($c_j \to \infty$), the argument of the kernel function $\Phi(t)$ tends to zero, and $\Phi(0) = 1$; in this situation the kernel regression (1) does not depend on the variable value $x_j$.

3. Hybrid self-adjusted genetic algorithm for optimizing the bandwidth parameters

Optimizing the bandwidth parameters is a complicated problem, because it has a non-analytical form and its dimension can be high, so an appropriate tool for solving it is a genetic algorithm. A genetic algorithm is a stochastic direct-search optimization method, which makes it widely applicable, and it can work with complex, discontinuous, implicitly defined functions. This optimization method imitates evolutionary processes and uses the idea of collective learning. A population is a set of individuals; each individual is a point of the search space and takes part in the collective learning. An individual has a fitness calculated from the value of the criterion function (the quality criterion). A genetic algorithm has such stages as generation formation, selection, crossover, and mutation. Individuals compete with the other individuals in the population for the transmission of their genetic information to the next population (the selection stage), and the selected individuals become parents of new offspring individuals (the crossover stage). In our case the individuals are binary-coded representations of the vector of bandwidth parameters of the nonparametric regression estimate, and the fitness function is the average error of the estimate on the training sample.
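The chromosome decoding and fitness evaluation described above might look as follows; this is a sketch reusing nadaraya_watson from the previous listing. The 16-bit encoding of each bandwidth, the bandwidth search range, and the leave-one-out evaluation of criterion (2) are assumptions, since the paper does not specify them.

```python
import numpy as np

BITS = 16                    # bits per bandwidth (assumed encoding granularity)
C_MIN, C_MAX = 0.01, 10.0    # assumed search interval for each bandwidth c_j

def decode(chromosome, n_vars):
    """Decode a binary chromosome (sequence of 0/1) into n_vars bandwidths."""
    c = np.empty(n_vars)
    for j in range(n_vars):
        gene = chromosome[j * BITS:(j + 1) * BITS]
        value = int("".join(map(str, gene)), 2) / (2 ** BITS - 1)
        c[j] = C_MIN + value * (C_MAX - C_MIN)
    return c

def fitness(chromosome, X_train, y_train, n_vars):
    """Criterion (2) over the training sample, computed leave-one-out
    (an assumption: the paper only says 'average error on the training sample')."""
    c = decode(chromosome, n_vars)
    n = len(X_train)
    errors = []
    for i in range(n):
        mask = np.arange(n) != i        # exclude point i from its own estimate
        y_hat = nadaraya_watson(X_train[i], X_train[mask], y_train[mask], c)
        errors.append((y_train[i] - y_hat) ** 2)
    return float(np.mean(errors))       # lower is better (W is minimized)
```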
There are various selection types (proportional, rank, tournament), various crossover types (one-point, two-point, uniform), and various mutation types (weak, average, strong), so there are many combinations of algorithm settings, and the behavior of a genetic algorithm greatly depends on them. There are no universal settings, because a genetic algorithm realizes two strategies. The first strategy is exploration; its aim is the search for new solution areas, and it is most valuable at the initial stages of the search. The second strategy is exploitation; it is needed for improving existing solutions and is most valuable at the final stages of the optimization. In a genetic algorithm, mutation realizes the exploration strategy and crossover realizes the exploitation strategy. The most effective algorithm settings also depend on the function topology, so it is necessary to implement an algorithm whose settings adjust themselves during the optimization process.

The self-adjustment method is based on Gomez's hybrid self-adaptive evolutionary algorithm, which combines a genetic algorithm with evolution strategies [5]; hence it is called a hybrid method. The idea of the algorithm is as follows: every individual has a personal probability of using each type of the genetic operators. The algorithm randomly selects a combination of settings (by proportional selection) and runs each individual in the population with its personal settings. At the end of a generation, the offspring is compared with its parent: if the fitness value of the offspring is better than the fitness value of the parent, the probabilities of the selected operator types are increased; otherwise they are decreased. The probability of using a selected operator type is updated as

$$p_k = \begin{cases} (1 + \delta)\, p_k, & \text{if } \mathrm{fitness}(\text{offspring}) > \mathrm{fitness}(ind_i), \\ (1 - \delta)\, p_k, & \text{if } \mathrm{fitness}(\text{offspring}) \le \mathrm{fitness}(ind_i), \end{cases} \quad k = 1, 2, 3, \qquad (3)$$

where $k$ is the sequence number of the operator (selection, crossover, mutation), $p_k$ is the probability of using the selected operator type, $\delta$ is a training parameter generated from the uniform distribution on the interval $[0, 1]$, offspring is the new individual, $ind_i$ is the parent individual, and $\mathrm{fitness}(ind)$ is the fitness value of the individual $ind$. After that, the probabilities of all types of all the genetic operators are normalized. Note that in the first population the probabilities of using the various genetic operators are equal. This hybrid self-adjusted evolutionary optimization algorithm is used for optimizing the bandwidth parameters of the nonparametric regression estimate.
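A minimal sketch of update rule (3) with the subsequent normalization is given below; the dictionary structures and names are our own, and since criterion (2) is minimized, a "better" offspring is interpreted here as one with a smaller fitness value.

```python
import random

# Each individual carries its own probability for every type of each operator.
OPERATOR_TYPES = {
    "selection": ("proportional", "rank", "tournament"),
    "crossover": ("one_point", "two_point", "uniform"),
    "mutation":  ("weak", "average", "strong"),
}

def update_operator_probs(probs, chosen, offspring_fitness, parent_fitness):
    """Rule (3): reward or punish the operator types the individual used.

    probs:  {operator: {type: probability}} for one individual;
    chosen: {operator: type actually used in this generation}.
    """
    delta = random.uniform(0.0, 1.0)                 # training parameter from [0, 1]
    improved = offspring_fitness < parent_fitness    # W is minimized: smaller is better
    factor = (1.0 + delta) if improved else (1.0 - delta)
    for op, used_type in chosen.items():             # k = 1, 2, 3
        probs[op][used_type] *= factor
        total = sum(probs[op].values())              # renormalize this operator's types
        for t in probs[op]:
            probs[op][t] /= total
    return probs
```

In the first population all three types of each operator would start with probability 1/3, matching the note above that the initial probabilities are equal.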
4. Experiments and results

Linear combinations of the input attributes are taken as test functions for the method:

1) $y(x) = 0.01 x_1 + 7 x_2 + 5 x_3$;
2) $y(x) = 0.01 x_1 + 7 x_2 + 5 x_3 + 12 x_4 + 8 x_5$;
3) $y(x) = 0.01 x_1 + 7 x_2 + 5 x_3 + 12 x_4 + 8 x_5 + 15 x_6 + 3 x_7$;
4) $y(x) = 0.01 x_1 + 7 x_2 + 5 x_3 + 12 x_4 + 8 x_5 + 15 x_6 + 3 x_7 + 9 x_8 + 13.5 x_9$.

Functions of various dimensions are examined, and every function contains a non-informative attribute, the one with the low weight coefficient. The aim of the research is to identify the non-informative attributes. The training sample (100 points) is generated randomly from the interval [0, 3] with the uniform distribution for each attribute; the test sample (100 points) is generated in the same way from the same interval. The algorithm is tested without noise and with 10 % noise: the noise is centered (zero mean) and added randomly with the uniform distribution. The algorithm uses 50 individuals for 50 generations.

Each setting of the genetic algorithm was tested 100 times on each problem, after which the reliability of the algorithm was calculated, i.e. the percentage of successful runs, a successful run being one in which the algorithm identifies the least important variable as the one with the highest value of the bandwidth parameter. The reliability values were obtained 10 times to ensure the statistical significance of the numerical experiments and the corresponding conclusions. The experiments were conducted with the same algorithm resource on functions of various dimensions, because the effectiveness of the self-adjusted genetic algorithm must be estimated under the same conditions, as must the fall of the algorithm reliability with increasing dimension. The implemented genetic algorithm runs 20 times for each combination of settings, after which the bandwidth parameters are averaged for each attribute. In each run the algorithm finds the non-informative attributes, and the mean-square error of the kernel regression with all attributes and without each attribute is calculated. The obtained statistical data are processed; the results are presented in the table and the figures.

First, the numerical research covered all combinations of the algorithm settings; genetic algorithms with different settings have different reliabilities on the test functions [6]. The results were averaged, as an expected value for the case when the researcher does not know the best algorithm settings. The most effective algorithm settings were found by exhaustive search, and the result with these settings was saved. Fig. 1 and 2 display the averaged results, the results with the best settings, and the results of the self-adjusted algorithm.

Fig. 1. Dependence of the algorithm reliability on the test function dimension, without noise

Fig. 2. Dependence of the algorithm reliability on the test function dimension, with 10 % noise on the training sample

Evidently the reliability of the hybrid self-adjusted genetic algorithm is above the averaged results and a little below the results with the best settings. So if the researcher does not know the most effective algorithm settings, the self-adjusted algorithm should be used. The hybrid self-adjusted genetic algorithm does not increase the running time compared with the standard algorithm, because most of the time is spent calculating the fitness function, and the two algorithms perform the same number of fitness function calculations per generation.

Attention should also be paid to the result of the informative attribute selection itself, obtained with the kernel regression [6]. Fig. 3 and the table present the results of the genetic algorithm on test function 4. Fig. 3 shows the values of the bandwidth parameters: the bandwidth parameter values increase as the weight of the corresponding attribute in the criterion function decreases. That is, if the bandwidth values are arranged in decreasing order, the attributes appear mainly in the order of increasing attribute weight. Only the mean-square error of the kernel regression without a non-informative attribute is comparable with the mean-square error with all attributes (see the table). Fig. 3 and the table show the final results for the different features, with the conclusion that features 1 and 7 are the least important ones.

Fig. 3. Bandwidth parameter values for test function 4

Mean-square error of the kernel regression on test function 4

                            Without noise   10 % noise
Without attribute 1              96.7         128.38
Without attribute 2             136.43        148.63
Without attribute 3             107.71        144.03
Without attribute 4             185.31        221.83
Without attribute 5             128.82        152.7
Without attribute 6             261.45        261.23
Without attribute 7              98.91        129.77
Without attribute 8             144.75        179.09
Without attribute 9             190.72        228.52
With all attributes              98.07        127.98

In addition, a parallel version of the method for multiprocessors was tested, based on parallelizing the genetic algorithm (the fitness function calculation). Multiprocessors with 2, 4, and 6 cores were used, with the following computing speed-up coefficients:

1) configuration with 2 cores: 1.81-1.85;
2) configuration with 4 cores: 3.20-3.76;
3) configuration with 6 cores: 3.63-4.14.

So the method is appropriate for parallelization and for use on multiprocessors.
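As an illustration of this parallelization, the following sketch distributes the fitness calculations of a population over worker processes with Python's standard multiprocessing module; the paper does not describe its parallel implementation, so this structure is an assumption. Since the fitness evaluations within a generation are independent, this is the natural point to parallelize.

```python
from functools import partial
from multiprocessing import Pool

def evaluate_population(population, fitness_fn, n_workers):
    """Evaluate the fitness of every individual in parallel.

    fitness_fn must be a picklable one-argument callable, e.g.
    partial(fitness, X_train=X_train, y_train=y_train, n_vars=9)
    built from the fitness function sketched in section 3.
    """
    with Pool(processes=n_workers) as pool:
        return pool.map(fitness_fn, population)
```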
5. Conclusions

A hybrid self-adjusted evolutionary algorithm has been implemented for informative attribute selection, and its reliability has been studied experimentally. The genetic algorithm effectively solves the optimization problem for the bandwidth parameters in kernel regression, and the hybrid self-adjusted algorithm removes the problem of selecting the algorithm settings. Thus the hybrid self-adjusted genetic algorithm solves the informative attribute selection problem on the test functions effectively. The algorithm also provides data for analyzing the information content of the attributes, and the method is appropriate for parallelization.

About the authors

Svetlana Volkova

Siberian State Aerospace University named after academician M. F. Reshetnev

Email: Svetlana.volkova.mail@yandex.ru
Master’s Degree student, Institute of Computer Science and Telecommunications

References

  1. Aivazyan S. A. et al. Applied Statistics: Classification and Reduction of Dimensionality. Moscow: Finansy i statistika, 1989. 607 p.
  2. Medvedev A. V. Non-parametric Systems of Adaptation. Novosibirsk: Nauka, 1983. 174 p.
  3. Goldberg D. E. Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison-Wesley, 1989.
  4. Hall P., Li Q., Racine J. S. Nonparametric estimation of regression functions in the presence of irrelevant regressors. Review of Economics and Statistics. 2007. Vol. 89. P. 784-789.
  5. Gomez J. Self adaptation of operator rates in evolutionary algorithms. Proc. of the Genetic and Evolutionary Computation Conference (GECCO). 2004. P. 1162-1173.
  6. Volkova S., Sergienko R. B. Informative attributes selection in non-parametric regression estimation by making use of genetic algorithms. Vestnik Tomskogo gosudarstvennogo universiteta. Upravlenie, vychislitel'naya tekhnika i informatika. 2013. № 1 (22). P. 40-48.


© Volkova S.S., 2014

This article is available under the Creative Commons Attribution 4.0 International License.
