ON THE PROBLEM OF NONPARAMETRIC ROBUST ESTIMATION OF THE REGRESSION FUNCTION FROM OBSERVATIONS



Abstract

Parametric and nonparametric statistical models are distinguished in the literature. These models differ in the level of prior uncertainty admitted in the statistical description of the observations. The difference between the two kinds of models tends to be smoothed by the introduction of transitional models. This is explained by the fact that a statistical model, like any other model, is inevitably an idealization and at best can only be a successful approximation of actual processes. Emphasizing this fact, Box writes: "All models are wrong, but some are useful". When statistical procedures are used, it is desirable to know which deviations have a decisive influence on the final conclusion of the statistical analysis. When the true distribution is not normal, the applicability of the reference procedures of normal theory is called into question. The research direction called "robust statistics", proposed by the American mathematician J. Tukey as a "third-generation statistics" after parametric and nonparametric statistics, is devoted to answering the questions formulated above and to creating statistical procedures insensitive to deviations from the assumptions. The number of publications in this direction grows constantly; monographs are already available, among them the first book by P. Huber, the book by F. Hampel and co-authors, and others; educational literature is also available. The term "robust", which corresponds to "sturdy, strong", was introduced into the statistical literature by Box in 1953, and since the mid-sixties it has become conventional for the branch of statistics in which statistical procedures insensitive to deviations from the accepted model assumptions are developed. The robust idea has a long history, described in Stigler's work; it appears in the works of C. Gauss, S. Newcomb, A. Eddington and others. However, the systematic development of robust ideas began with J. Tukey's works and, especially, after Huber's work of 1964. In the present work, the problem of estimating a function from data containing outliers is considered. Under nonparametric uncertainty the following steps are used to solve the problem: 1) the type of the regression function is set from the input data; 2) the function estimate is applied. We suggest the following reliable robust nonparametric estimation approach, whose main idea is to exclude the data that can corrupt the estimate.

Full Text

Introduction. The problem of restoring a regression function from observations containing outliers is considered [1-5]. To study this task [6; 7], we use the suggested robust estimation procedure, which corrects the training sample by freeing it of outliers [8-11].

In the last decades of the twentieth century, intensive development and application of nonparametric and robust methods of data processing began [12-17]. The reason is, on the one hand, the need to control, without parametric descriptions, complex economic and social structures as well as technical objects, for which the stability of the applied methods to failures and to noise in the recording equipment is important, and, on the other hand, the development of computing technology, which makes it possible to implement laborious algorithms [18].

Nonparametric estimation of the regression function from observations. For the one-dimensional case, the nonparametric estimate of the regression function from observations is the following [19; 20]:

$$Y_s(x) = \frac{\sum_{i=1}^{s} y_i\, \Phi\left(\frac{x - x_i}{c_s}\right)}{\sum_{i=1}^{s} \Phi\left(\frac{x - x_i}{c_s}\right)}, \qquad (1)$$

where $x_i, y_i$, $i = \overline{1, s}$, is the sample of observations and $\Phi(v)$ is the kernel: a finite, bell-shaped, square-integrable function of the variable $v$ satisfying the conditions [19; 20]

$$0 < \Phi(v) < \infty \;\; \forall v \in \Omega(v), \qquad \frac{1}{c_s}\int \Phi\left(\frac{x - x_i}{c_s}\right)dx = 1, \qquad \lim_{s \to \infty} \frac{1}{c_s}\,\Phi\left(\frac{x - x_i}{c_s}\right) = \delta(x - x_i), \qquad (2)$$

and $c_s$ is the blur coefficient (bandwidth), which satisfies the conditions

$$c_s > 0, \qquad \lim_{s \to \infty} s\, c_s^{k} = \infty, \qquad \lim_{s \to \infty} c_s = 0. \qquad (3)$$

In the case of multi-dimensional ($k$-dimensional) input data the estimate is

$$Y_s(x) = \frac{\sum_{i=1}^{s} y_i \prod_{j=1}^{k} \Phi\left(\frac{x^{j} - x_i^{j}}{c_s}\right)}{\sum_{i=1}^{s} \prod_{j=1}^{k} \Phi\left(\frac{x^{j} - x_i^{j}}{c_s}\right)}. \qquad (4)$$

Robust nonparametric estimation of the regression function from observations. The step-by-step scheme of the experiment is as follows:
1. The initial sample is obtained from an actual object.
2. We set the blur coefficient and choose the bell-shaped function.
3. We check each sample point for estimation quality. If the estimation quality is sufficient and the inequality $r_i > 2\sigma^2$ is not satisfied, the initial sample becomes the working sample. If the estimation quality is not sufficient and the inequality $r_i > 2\sigma^2$ is satisfied, the outliers are excluded from the initial sample, and the remaining points become the working sample.
4. We restore the regression function by means of the nonparametric estimate.

Computing experiment. The function $y = \sin^2(x)$ was chosen for the computing experiment. When the training sample was formed, outliers were added artificially. The triangular kernel is used as the bell-shaped function $\Phi(v)$:

$$\Phi(v) = \begin{cases} 1 - |v|, & |v| \le 1, \\ 0, & |v| > 1. \end{cases} \qquad (5)$$

We first work with the entire sample: we construct the function and its restoration and compute the accuracy criterion. As the criterion of accuracy of the nonparametric estimate we use the quadratic criterion

$$\sigma^2 = \sum_{i=1}^{s} \left(y_i - y_s(x_i)\right)^2, \qquad (6)$$

where $y_i$ is the true sample obtained by the formulas given above and $y_s(x_i)$ is the nonparametric estimate. After checking the accuracy criterion, we pay attention to the points at which the restoration error is large and which satisfy criterion (7).

For illustrative purposes, we also add a perturbation to some observations:

$$h_i = \lambda\, y_i\, \xi, \quad \xi \in [-1, 1], \qquad (8)$$

with noise level $\lambda = 5\,\%$.

The elements of the sample, its approximation and two outliers are shown in fig. 1; the restoration error is 0.36. Fig. 2 shows the same case with a 5 % noise level added; the restoration error is 0.40. It should be noted that the restoration error depends on whether noise is present in the function. Fig. 3 shows five outliers; the restoration error has noticeably grown and equals 0.54.
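As a minimal sketch (not the authors' code), the one-dimensional estimate (1) with the triangular kernel (5), the perturbation (8) and the quadratic criterion (6) can be written in Python as follows; the sample layout, the outlier positions and the value of the blur coefficient are illustrative assumptions:

```python
import numpy as np

def triangular_kernel(v):
    """Triangular kernel (5): 1 - |v| on [-1, 1], zero outside."""
    return np.maximum(1.0 - np.abs(v), 0.0)

def np_estimate(x, xs, ys, cs):
    """Nonparametric estimate (1) of the regression function at points x
    from the sample (xs, ys) with blur coefficient cs."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    w = triangular_kernel((x[None, :] - xs[:, None]) / cs)  # w[i, j] = Phi((x_j - x_i)/cs)
    return (ys[:, None] * w).sum(axis=0) / np.maximum(w.sum(axis=0), 1e-12)

# training sample for y = sin^2(x); outlier positions are assumed for illustration
rng = np.random.default_rng(0)
s = 100
xs = np.linspace(0.0, 2.0 * np.pi, s)
ys = np.sin(xs) ** 2
ys[[20, 70]] += np.array([3.0, -3.0])        # two artificial outliers

# perturbation (8): h_i = lambda * y_i * xi with xi uniform on [-1, 1], lambda = 5 %
lam = 0.05
ys_noisy = ys + lam * ys * rng.uniform(-1.0, 1.0, size=s)

cs = 0.5                                     # blur coefficient (assumed value)
fit = np_estimate(xs, xs, ys, cs)
sigma2 = np.sum((ys - fit) ** 2)             # quadratic criterion (6)
```

With the outliers present, the criterion value is dominated by the two contaminated points, which is exactly what the robust step below exploits.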
The elements of the training sample that satisfy the requirement

$$r_i = \left(y_i - y_s(x_i)\right)^2 > 2\sigma^2, \quad i = \overline{1, s}, \qquad (7)$$

are selected and excluded from the initial sample. Thus, using criterion (7), we exclude the outliers from the initial sample.

In the figures, curve 1 denotes the training sample and curve 2 the nonparametric estimate; the triangular kernel was used as the bell-shaped finite function.

We now present the results of the numerical experiment illustrating the effectiveness of the algorithm. We consider the restoration of the regression function from observations containing several outliers at a sample size of 100. Fig. 4 displays the work of the algorithm with robust estimation: the sample size decreased because the program excluded the outliers interfering with good restoration. The restoration error decreased significantly, from 0.36 and 0.54 to 0.06, which means that the given function was restored essentially completely. In fig. 5 a 5 % noise level is added to the restored function; the restoration error grew to 0.11, which is larger than in fig. 4.

As a further experiment, the same function with the same outliers but with a smaller sample size, 60, was considered.

Fig. 1. Function restored with two outliers
Fig. 2. Function restored with two outliers and a 5 % noise level
Fig. 3. Function restored with five outliers
Fig. 4. Function restored with the outliers excluded
Fig. 5. Function restored with the outliers excluded, with a 5 % noise level

The sample elements, their approximation and two outliers for this case are given in fig. 6; the restoration error grew to 0.45. Fig. 7 displays the sample elements with five outliers; the restoration error is 0.69. Fig. 8 presents the function restored without the outliers; the restoration error is 0.14.
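The exclusion step based on criterion (7) might be sketched in Python as below. This is an illustrative reading, not the authors' program: in particular, the threshold $2\sigma^2$ is computed here from the mean squared residual, so that it does not grow with the sample size (the exact normalization of (6) is an assumption of the sketch), and the sample, outlier positions and blur coefficient are assumed values:

```python
import numpy as np

def kernel(v):
    # triangular kernel (5): 1 - |v| on [-1, 1], zero outside
    return np.maximum(1.0 - np.abs(v), 0.0)

def estimate(x, xs, ys, cs):
    # nonparametric estimate (1) at points x from the sample (xs, ys)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    w = kernel((x[None, :] - xs[:, None]) / cs)
    return (ys[:, None] * w).sum(axis=0) / np.maximum(w.sum(axis=0), 1e-12)

def exclude_outliers(xs, ys, cs):
    """Step 3 of the scheme: fit on the full sample, flag the points whose
    squared residual r_i exceeds the threshold of criterion (7), and return
    the outlier-free working sample."""
    r = (ys - estimate(xs, xs, ys, cs)) ** 2   # squared residuals r_i
    sigma2 = r.mean()                          # assumed normalization: mean squared residual
    keep = r <= 2.0 * sigma2                   # drop points with r_i > 2*sigma^2
    return xs[keep], ys[keep]

# usage: y = sin^2(x) sample of size 100 with two gross outliers (assumed positions)
xs = np.linspace(0.0, 2.0 * np.pi, 100)
ys = np.sin(xs) ** 2
ys[[20, 70]] += np.array([3.0, -3.0])

xw, yw = exclude_outliers(xs, ys, cs=0.5)
y_rob = estimate(xs, xw, yw, cs=0.5)   # restoration on the working sample
```

A single pass suffices here because the gross outliers dominate the residuals; on heavier contamination the exclusion step could be iterated until no point violates (7).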
It is worth noting that the sample size considerably influences the restoration error: for example, with two outliers the error for the 100-element sample was 0.36, while for the 60-element sample it was 0.45.

For illustration we consider one more similar function, $y = \cos^2(x)\,\sin(x)$, with a sample size of 100. Sample elements with one and with three outliers are given in fig. 9 and fig. 11; the restoration error is 0.31 with one outlier and 0.41 with three. In fig. 10 a 5 % noise level was added to the one-outlier case; the restoration error is 0.33, so here the noise did not strongly affect the restoration. Fig. 12 shows the restoration of the function with the outliers excluded; the restoration error is 0.04. In fig. 13 a 5 % noise level is added as well; the restoration error is 0.12, i.e. in this case the noise affected the result significantly.

Fig. 6. Function restored with two outliers
Fig. 7. Function restored with five outliers
Fig. 8. Function restored with the outliers excluded
Fig. 9. Function restored with one outlier
Fig. 10. Function restored with one outlier and a 5 % noise level
Fig. 11. Function restored with three outliers
Fig. 12. Function restored with the outliers excluded
Fig. 13. Function restored with the outliers excluded, with a 5 % noise level

Conclusion. The main result of the article is that the robust estimation approach makes it possible to obtain a significantly better quality of function restoration from observations.
It is worth noting that the restoration error decreased considerably after the outliers were excluded. To illustrate the experiment, several functions were considered for restoration. For the first function two sample sizes, 100 and 60, were compared, and the results show that the sample size matters substantially for restoration: the restoration error is significantly lower when the sample size is 100 than when it is 60.

About the authors

L. N. Sopova

Reshetnev Siberian State University of Science and Technology

31, Krasnoyarsky Rabochy Av., Krasnoyarsk, 660037, Russian Federation

S. S. Chernova

Siberian Federal University, Institute of Space and Information Technologies

Email: chsvetlanas@gmail.com
26b, Academica Kirenskogo Str., Krasnoyarsk, 660074, Russian Federation

References

  1. Shulenin V. P. Robust Methods of Mathematical Statistics. Tomsk : NTL, 2016. 210 p. (In Russ.)
  2. Tarasenko F. P. Nonparametric Statistics. Tomsk : Tomsk Univ. Publ., 1976. 292 p. (In Russ.)
  3. Huber P. Robust Statistics. Moscow : Mir, 1989. 304 p. (Russ. transl.)
  4. Chernova S. S., Shishkina A. V. On nonparametric estimation of mutually ambiguous functions from observations // Molodoy uchenyy. 2017. No. 25. P. 13-20. (In Russ.)
  5. Korneeva A., Chernova S., Shishkina A. Nonparametric algorithms for recovery of mutually ambiguous functions on observations // Applied Methods of Statistical Analysis. Nonparametric methods in cybernetics and system analysis - AMSA'2017 (18-22 September). Krasnoyarsk. P. 64-72.
  6. Launer R. L., Wilkinson G. N. Robust Statistical Methods of Data Estimation : transl. from Engl., ed. by N. G. Volkov. Moscow : Mashinostroenie, 1984. 229 p. (In Russ.)
  7. Box G. E. P. Non-normality and tests on variances // Biometrika. 1953. Vol. 40. P. 318-335.
  8. Hampel F. et al. Robust Statistics. The Approach Based on Influence Functions. Moscow : Mir, 1989. 512 p. (Russ. transl.)
  9. Shulenin V. P. Mathematical Statistics. Part 1. Parametric Statistics : textbook. Tomsk : NTL, 2012. 540 p. (In Russ.)
  10. Shulenin V. P. Mathematical Statistics. Part 2. Nonparametric Statistics. Tomsk : NTL, 2012. 388 p. (In Russ.)
  11. Shulenin V. P. Mathematical Statistics. Part 3. Robust Statistics. Tomsk : NTL, 2012. 520 p. (In Russ.)
  12. Stigler S. M. Simon Newcomb, Percy Daniell and the history of robust estimation // J. Amer. Statist. Assoc. 1973. Vol. 68. P. 872-879.
  13. Tukey J. W. A survey of sampling from contaminated distributions // Contributions to Probability and Statistics / Ingram Olkin, ed. Stanford Univ. Press, 1960. P. 448-485.
  14. Tukey J. W. Bias and confidence in not-quite large samples (Abstract) // Ann. Math. Statist. 1958. Vol. 29. P. 614.
  15. Tukey J. W. Data Analysis, Computation and Mathematics // Quarterly of Applied Mathematics. 1972. Vol. XXX, No. 1. Special Issue: Symposium on the Future of Applied Mathematics. P. 51-65.
  16. Tukey J. W. Exploratory Data Analysis. Reading, Mass. : Addison-Wesley, 1977.
  17. Huber P. J. Robust estimation of a location parameter // Ann. Math. Statist. 1964. Vol. 35, No. 1. P. 73-101.
  18. Kitaeva A. V. Robust and Nonparametric Estimation of the Characteristics of Random Sequences : Dr. Sci. (Phys.-Math.) dissertation. Tomsk, 2009. 324 p. (In Russ.)
  19. Nadaraya E. A. Nonparametric Estimation of Probability Densities and Regression Curves. Tbilisi : Tbilisi State Univ. Publ., 1983. 194 p. (In Russ.)
  20. Medvedev A. V. Fundamentals of the Theory of Adaptive Systems. Krasnoyarsk : SibGAU Publ., 2015. 526 p. (In Russ.)


Copyright (c) 2017 Sopova L.N., Chernova S.S.

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
