Introduction
The universal and generally accepted criterion for testing hypotheses about the distributions of random variables, including hypotheses of their independence, is the Pearson criterion [1]. Its use requires partitioning the range of values of the random variables into multidimensional intervals and establishing the distribution law of the criterion, which determines the dependences between the probabilistic characteristics of the random variables. In [2–4] a new approach is proposed that simplifies testing the hypothesis of independence of random variables by means of a nonparametric kernel-type pattern recognition algorithm corresponding to the maximum likelihood criterion. The idea of the approach is to solve a two-alternative pattern recognition problem. The classes under consideration are defined by the assumptions of dependence and independence of the random variables. On this basis, a training sample is formed from the initial statistical data (observations of the random variables), and the pattern recognition problem is solved. The relation between the estimates of the recognition error probabilities for the introduced classes confirms or refutes the hypothesis under consideration.
The purpose of this paper is to generalise and develop the nonparametric method of testing the hypothesis of independence of random variables for large volumes of statistical data and to apply it to the analysis of remote sensing data on anthropogenic territories.
Methodology for testing the hypothesis of independence of random variables
Let there be a sample $V = (x_1^i, x_2^i,\ i = 1, \dots, n)$ of size $n$ composed of independent observations of a two-dimensional random variable $x = (x_1, x_2)$. Suppose that the sample is drawn from a general population characterised by the probability density $p(x_1, x_2)$ or $p(x_1)\,p(x_2)$. On the basis of the statistical data $V$ it is necessary to test the hypothesis
$$H_0:\ p(x_1, x_2) = p(x_1)\,p(x_2)$$
of independence of the random variables $x_1$, $x_2$.
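To test the hypothesis $H_0$, let us solve a two-alternative pattern recognition problem. The classes $\Omega_1$, $\Omega_2$ are understood as the domains of definition of the probability densities $p(x_1)\,p(x_2)$ and $p(x_1, x_2)$, respectively. Under these conditions, the Bayesian decision rule corresponding to the maximum likelihood criterion has the following form:
$$m(x):\ \begin{cases} x \in \Omega_1, & \text{if } p(x_1, x_2) \le p(x_1)\,p(x_2), \\ x \in \Omega_2, & \text{if } p(x_1, x_2) > p(x_1)\,p(x_2). \end{cases}$$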
In contrast to the traditional formulation of the pattern recognition problem, no a priori training sample indicating the class membership of the sample elements is available when the decision rule is synthesised. This information must be obtained in the course of the hypothesis testing methodology, which consists of the following steps.
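From the sample $V$, recover the probability densities $p(x_1, x_2)$, $p(x_1)$, $p(x_2)$ using their nonparametric Rosenblatt–Parzen type estimates [5; 6],
$$\bar p(x_1, x_2) = \frac{1}{n\,c_1 c_2} \sum_{i=1}^{n} \prod_{v=1}^{2} \Phi\!\left(\frac{x_v - x_v^i}{c_v}\right),$$
$$\bar p(x_v) = \frac{1}{n\,c_v} \sum_{i=1}^{n} \Phi\!\left(\frac{x_v - x_v^i}{c_v}\right), \quad v = 1, 2.$$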
In the statistics $\bar p(x_1, x_2)$, $\bar p(x_v)$, $v = 1, 2$, the kernel functions $\Phi(u)$ satisfy the conditions of positivity, symmetry and normalisation.
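The kernel estimates described above can be illustrated with a minimal sketch. It assumes a Gaussian kernel, which is only one function satisfying the positivity, symmetry and normalisation conditions; the function names `gaussian_kernel`, `kde_1d` and `kde_2d` are hypothetical and are not taken from the paper. The blurring coefficients are passed in as plain arguments and are not optimised here.

```python
import math


def gaussian_kernel(u):
    """Standard normal kernel: positive, symmetric, integrates to one."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde_1d(x, sample, c):
    """Kernel estimate of a one-dimensional density at point x.

    sample: list of observations, c: blurring coefficient (bandwidth).
    """
    total = sum(gaussian_kernel((x - xi) / c) for xi in sample)
    return total / (len(sample) * c)


def kde_2d(x1, x2, sample, c1, c2):
    """Product-kernel estimate of the joint density at (x1, x2).

    sample: list of (x1_i, x2_i) pairs, c1 and c2: blurring coefficients.
    """
    total = sum(
        gaussian_kernel((x1 - a) / c1) * gaussian_kernel((x2 - b) / c2)
        for a, b in sample
    )
    return total / (len(sample) * c1 * c2)


# Usage: density estimates at one point for a tiny illustrative sample.
pairs = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2)]
print(kde_1d(1.0, [a for a, _ in pairs], 0.5))
print(kde_2d(1.0, 1.0, pairs, 0.5, 0.5))
```

Smaller values of the blurring coefficients make the estimate follow individual observations more closely, which is why they are expected to decrease as the sample grows.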
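The values of the blurring coefficients $c_v$, $v = 1, 2$, of the kernel functions decrease as the sample size $n$ increases. The nonparametric decision rule for classifying the random variables is then written as follows:
$$\bar m(x):\ \begin{cases} x \in \Omega_1, & \text{if } \bar p(x_1, x_2) \le \bar p(x_1)\,\bar p(x_2), \\ x \in \Omega_2, & \text{if } \bar p(x_1, x_2) > \bar p(x_1)\,\bar p(x_2). \end{cases}$$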
The optimal blurring coefficients of the kernel functions in the decision rule are chosen by analysing the approximation properties of the nonparametric estimates of the probability densities $p(x_1, x_2)$, $p(x_1)$, $p(x_2)$, namely from the condition of the minimum of the estimates of their mean square deviations from the corresponding densities. For example, such a criterion has the form [7–11]
.
Let us determine the estimates $\bar\rho_1$, $\bar\rho_2$ of the probabilities of pattern recognition errors using the decision rule on the basis of the initial statistical data $V$, at the optimal blurring coefficients of the kernel functions of the corresponding statistics.
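The values $\bar\rho_j$, $j = 1, 2$, are calculated in the ‘rolling examination’ (leave-one-out) mode on the sample $V$ under the assumption that all of its elements belong to the class $\Omega_j$:
$$\bar\rho_j = \frac{1}{n} \sum_{i=1}^{n} 1\big(\sigma(i), \bar\sigma(i)\big),$$
where $\sigma(i)$ is the designation of the class to which the situation $x^i$ is assumed to belong;
$\bar\sigma(i)$ is the ‘decision’ of the algorithm $\bar m(x)$ on whether the situation $x^i$ belongs to one of the classes $\Omega_1$, $\Omega_2$.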
When $\bar\rho_j$ is calculated in accordance with the ‘rolling examination’ methodology, the situation $x^i$ from the sample $V$ that is submitted to the algorithm $\bar m(x)$ for control is excluded from the construction of the statistics $\bar p(x_1, x_2)$, $\bar p(x_v)$, $v = 1, 2$.
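The indicator function is defined by the expression
$$1\big(\sigma(i), \bar\sigma(i)\big) = \begin{cases} 0, & \text{if } \sigma(i) = \bar\sigma(i), \\ 1, & \text{if } \sigma(i) \ne \bar\sigma(i). \end{cases}$$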
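Let us denote by $\bar\rho_j$ the estimate of the probability of pattern recognition error obtained under the assumption that the sample elements belong to the class $\Omega_j$, $j = 1, 2$, where $\Omega_1$ corresponds to independent and $\Omega_2$ to dependent random variables. Let us compare the values $\bar\rho_1$, $\bar\rho_2$.
Then the hypothesis $H_0$ is valid if $\bar\rho_1 < \bar\rho_2$. Otherwise, at $\bar\rho_2 < \bar\rho_1$, the random variables $x_1$ and $x_2$ are dependent.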
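The whole procedure can be sketched as follows. This is an illustration under stated assumptions and not the authors' implementation: the kernel is Gaussian, the blurring coefficients come from a normal-reference rule of thumb instead of the optimisation criterion mentioned in the text, ties are assigned to the independence class, and all function names are hypothetical.

```python
import math
import random


def kernel(u):
    """Gaussian kernel (one admissible positive, symmetric, normalised kernel)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def rule_of_thumb_bandwidth(values):
    """ASSUMPTION: normal-reference rule, used here in place of the
    optimisation criterion referred to in the text."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return 1.06 * std * n ** (-0.2)


def error_estimates(sample):
    """Leave-one-out estimates (rho1, rho2) of the recognition error probabilities.

    rho1: share of points assigned to the 'dependent' class, i.e. the error
          made if every point is assumed to belong to the 'independent' class.
    rho2: share of points assigned to the 'independent' class.
    """
    n = len(sample)
    c1 = rule_of_thumb_bandwidth([a for a, _ in sample])
    c2 = rule_of_thumb_bandwidth([b for _, b in sample])
    assigned_dependent = 0
    for i, (x1, x2) in enumerate(sample):
        s1 = s2 = s12 = 0.0
        for j, (a, b) in enumerate(sample):
            if j == i:
                continue  # the checked observation is left out of the estimates
            k1 = kernel((x1 - a) / c1)
            k2 = kernel((x2 - b) / c2)
            s1 += k1
            s2 += k2
            s12 += k1 * k2
        m = n - 1
        joint = s12 / (m * c1 * c2)
        product = (s1 / (m * c1)) * (s2 / (m * c2))
        if joint > product:
            assigned_dependent += 1
    return assigned_dependent / n, (n - assigned_dependent) / n


def independence_accepted(sample):
    """Accept the independence hypothesis when rho1 < rho2."""
    rho1, rho2 = error_estimates(sample)
    return rho1 < rho2


# Usage: a strongly dependent synthetic sample (x2 is x1 plus small noise).
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)]
dependent_sample = [(x, x + 0.05 * rng.gauss(0.0, 1.0)) for x in xs]
rho1, rho2 = error_estimates(dependent_sample)
print(rho1, rho2, independence_accepted(dependent_sample))
```

Because every observation is assigned to exactly one of the two classes, the two estimates in this sketch sum to one, so comparing them amounts to checking whether more than half of the observations are assigned to the dependence class.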
When the sample size is limited, the problem arises of confidence estimation of the probabilities of pattern recognition errors. To solve it, the traditional methodology of confidence estimation of probabilities or the Kolmogorov–Smirnov criterion is used.
For example, when using the Kolmogorov–Smirnov criterion, the deviation $|\bar\rho_1 - \bar\rho_2|$ between the error estimates is compared with the threshold value [12]
.
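Here $\beta$ is the probability (risk) of rejecting the hypothesis that the error probabilities of the two classes are equal. If the deviation between $\bar\rho_1$ and $\bar\rho_2$ is below the threshold value, the hypothesis is valid and the risk of rejecting it does not exceed the value of $\beta$. If the deviation exceeds the threshold value, the hypothesis is rejected.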
Formation of sets of independent random variables
Let there be a sample $V = (x^i,\ i = 1, \dots, n)$ of size $n$ composed of statistically independent observations of the components of the multivariate random variable $x = (x_v,\ v = 1, \dots, k)$. The form of the probability density $p(x)$ is unknown a priori. Using the statistical data $V$ and the hypothesis testing criterion proposed above [13–16], it is necessary
to form from the components $x_v$, $v = 1, \dots, k$, the sets of independent random variables. The number of such sets of components of the random variable $x$ is unknown, and each set is specified by the set of numbers of the components that make it up.
The proposed methodology is based on performing the following steps:
- In accordance with the above recommendations, test the hypothesis of independence for each pair of components of the multivariate random variable $x$. For $k$ components, the number of such pairs is $k(k-1)/2$.
- Based on the results of step 1, construct an information graph whose set of vertices corresponds to the components of the random variable $x$. The set of edges is defined as follows: there is an edge between two vertices if the hypothesis of independence is accepted for the corresponding pair, i.e. the two components are independent.
- Analyse the information graph and determine its complete subgraphs. In a complete subgraph, every pair of vertices is joined by an edge, i.e. the corresponding components of the random variable $x$ are pairwise independent. Detect the complete subgraphs using algorithms for cutting the original graph, which are based on the analysis of its adjacency matrix. The components corresponding to the vertices of a complete subgraph form a set of independent random variables.
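The three steps above can be sketched as follows. The text refers to graph-cutting algorithms based on the adjacency matrix; this sketch swaps in the Bron–Kerbosch procedure (without pivoting) for enumerating maximal complete subgraphs, which is a different but standard technique. The list of independent pairs is a made-up input standing in for the results of the pairwise tests, and the function names are hypothetical.

```python
from itertools import combinations


def build_independence_graph(k, independent_pairs):
    """Adjacency sets of the information graph on vertices 0..k-1.

    An edge (a, b) means that the independence hypothesis was accepted
    for components a and b.
    """
    adjacency = {v: set() for v in range(k)}
    for a, b in independent_pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def maximal_cliques(adjacency):
    """Enumerate maximal complete subgraphs (Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for v in list(candidates):
            expand(current | {v},
                   candidates & adjacency[v],
                   excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return sorted(cliques)


# Step 1: every pair of k components has to be tested.
k = 4
all_pairs = list(combinations(range(k), 2))  # k * (k - 1) / 2 pairs

# Steps 2-3: hypothetical test outcomes, then the sets of independent components.
accepted = [(0, 1), (0, 2), (1, 2), (2, 3)]
graph = build_independence_graph(k, accepted)
independent_sets = maximal_cliques(graph)
print(len(all_pairs), independent_sets)
```

A component that is dependent on all others appears as a single-vertex subgraph, so every component ends up in at least one set.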
Modification of the method of testing the hypothesis of independence of random variables in conditions of large volumes of statistical data
With large volumes of statistical data, the proposed methodology uses regression estimates of the probability densities $p(x_1, x_2)$, $p(x_1)$, $p(x_2)$. These estimates are based on compressing the original information into a data array by decomposing the range of values into sampling intervals. For a one-dimensional random variable, for example, each element of the array consists of the centre of a sampling interval and the probability density estimate in that interval, which is obtained from the sampling interval length and the frequency with which values from the sample fall into the interval. The regression estimate of the probability density function built from this array then has the form [17; 18]
.
The proposed approach reduces by orders of magnitude the volume of initial statistical information used in estimating probability densities. The specific form of the regression estimate considerably simplifies the choice of the blurring coefficients of its kernel functions from the condition of the minimum of the criterion
.
The remaining probability densities are estimated by analogy. The regression estimates of the probability densities are then used in testing the hypothesis of independence of random variables according to the proposed methodology.
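The idea of data compression can be illustrated by a sketch for the one-dimensional case. The exact form of the regression estimate from [17; 18] is not reproduced here; the sketch assumes a frequency-weighted sum of Gaussian kernels placed at the interval centres, which is one common way to build a density estimate from compressed data. The function names and the choice of the number of intervals are hypothetical, and the sample is assumed to contain at least two distinct values.

```python
import math


def kernel(u):
    """Gaussian kernel (an assumed choice)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def compress_sample(sample, num_intervals):
    """Replace the sample by interval centres and relative frequencies."""
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / num_intervals
    counts = [0] * num_intervals
    for x in sample:
        # the largest value is placed in the last interval
        j = min(int((x - lo) / width), num_intervals - 1)
        counts[j] += 1
    centres = [lo + (j + 0.5) * width for j in range(num_intervals)]
    freqs = [cnt / len(sample) for cnt in counts]
    return centres, freqs


def binned_density(x, centres, freqs, c):
    """Density estimate built from the compressed array only.

    The cost depends on the number of intervals, not on the sample size.
    """
    return sum(f * kernel((x - z) / c) for z, f in zip(centres, freqs)) / c


# Usage: five observations compressed into two intervals.
centres, freqs = compress_sample([0.0, 0.1, 0.2, 0.9, 1.0], 2)
print(centres, freqs, binned_density(0.5, centres, freqs, 0.3))
```

After compression, each evaluation of the estimate sums over the intervals and no longer over all observations, which is the source of the reduction in computation for large samples.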
Analysing the results of the computational experiment
The effectiveness of the proposed method of testing the hypothesis of independence of two-dimensional random variables was compared with that of Pearson's criterion under conditions of ambiguous dependences and different volumes of statistical data [19–21]. The values of the random variables $x_1$, $x_2$ were generated from a uniformly distributed random variable by means of nonlinear transformations. Disturbances with a normal distribution law, zero mathematical expectation and standard deviation $\sigma$ were superimposed on the resulting values. An example of the values of the random variables $x_1$ and $x_2$ is shown in Fig. 1.
Fig. 1. Values of the random variables x1, x2 from the sample of initial statistical data V at n = 500 for σ = 0.5 (dark dots) and σ = 2 (grey dots), for dependences of varying complexity
When the hypothesis of independence of the components of a two-dimensional random variable is tested with the Pearson criterion, the results on the optimal choice of the number of sampling intervals are used [22–24]
.
The value , and is the length of the interval of values of the random variable $x_v$, $v = 1, 2$. Traditional formulas for discretising the range of values of random variables are considered in [25–27].
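For comparison, the Pearson statistic for a two-dimensional sample can be sketched as follows. The number of sampling intervals is passed in as a plain argument, because the optimal-selection formula from [22–24] is not reproduced here; the function name is hypothetical, and each variable is assumed to take at least two distinct values.

```python
def chi_square_statistic(sample, bins1, bins2):
    """Pearson statistic of independence and its degrees of freedom.

    sample: list of (x1, x2) pairs; bins1, bins2: numbers of equal-width
    sampling intervals for x1 and x2 (chosen by the caller).
    """
    n = len(sample)
    lo1, hi1 = min(a for a, _ in sample), max(a for a, _ in sample)
    lo2, hi2 = min(b for _, b in sample), max(b for _, b in sample)

    def interval(v, lo, hi, bins):
        # the largest value is placed in the last interval
        return min(int((v - lo) / (hi - lo) * bins), bins - 1)

    counts = [[0] * bins2 for _ in range(bins1)]
    for a, b in sample:
        counts[interval(a, lo1, hi1, bins1)][interval(b, lo2, hi2, bins2)] += 1

    row = [sum(r) for r in counts]
    col = [sum(counts[i][j] for i in range(bins1)) for j in range(bins2)]

    stat = 0.0
    for i in range(bins1):
        for j in range(bins2):
            expected = row[i] * col[j] / n  # count expected under independence
            if expected > 0:
                stat += (counts[i][j] - expected) ** 2 / expected
    return stat, (bins1 - 1) * (bins2 - 1)


# Usage: a balanced grid and a perfectly dependent toy sample.
print(chi_square_statistic([(0, 0), (0, 1), (1, 0), (1, 1)], 2, 2))
print(chi_square_statistic([(0, 0), (1, 1), (0, 0), (1, 1)], 2, 2))
```

To reach a decision, the statistic is compared with the quantile of the chi-squared distribution for the returned number of degrees of freedom at the chosen significance level. The outcome depends on how the range of values is partitioned, which is the issue the kernel-based method is intended to avoid.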
According to the results of the computational experiment, the proposed methodology and Pearson's criterion are comparable when analysing ambiguous dependences between random variables under conditions of relatively small volumes of statistical data and standard deviations σ of the disturbances, and both correctly detect the dependence of the random variables. This conclusion does not hold for the dependence shown in Fig. 1, a, for which the Pearson criterion does not detect the dependence at n = 100 and σ ∈ [0.5; 2]. As σ increases, the effectiveness of the compared criteria decreases. This is explained by the nature of ambiguous dependences at large values of σ, when the range of values of the random variables masks the sought dependence. As the volume n of the initial data increases, the effectiveness of the compared criteria for testing the hypothesis of independence of random variables increases. This is to be expected, since the asymptotic properties of the nonparametric estimates of probability densities and of the frequencies with which the random variables fall into their two-dimensional intervals improve as n increases. The advantage of the proposed methodology is observed at small values of σ, for both limited and large n. At large n and σ, Pearson's criterion often has the advantage, provided that the procedure of optimal discretisation of the range of values of the two-dimensional random variable is used [22].
Application of the proposed methodology in analysing remote sensing data
The developed methodology was tested in the analysis of remote sensing data [2; 28]. The object of the study is anthropogenic territories (a quarry and a suburban development) in the vicinity of the city of Krasnoyarsk. The initial information was formed from fragments of Sentinel-2 satellite imagery acquired on 26.08.2021 (Fig. 2). The spectral channels $x_v$, $v = 1, \dots, 9$, were used. These channels are characterised by the following wavelengths (nanometres): $x_1$ (458–523), $x_2$ (543–578), $x_3$ (650–680), $x_4$ (698–713), $x_5$ (733–748), $x_6$ (773–793), $x_7$ (785–899), $x_8$ (1565–1655), $x_9$ (2100–2280).
Fig. 2. Fragments of Sentinel-2 satellite imagery. Anthropogenic territories:
a – quarry; b – suburban development
The proposed methodology makes it possible to form pairs of independent and dependent random variables by changing the ratio between their parameters. Applying the methodology revealed 31 and 29 pairs of spectral features with a strong linear dependence for the objects ‘quarry’ and ‘suburban development’, respectively. The results are presented in Fig. 3.
Fig. 3. Illustration of a strong linear relationship between pairs of spectral features (xi, xj) characterised by correlation coefficient estimates greater than 0.9:
a – quarry; b – suburban development
Additionally, non-linear dependences between spectral features were found for the object ‘quarry’
, , , ,
and the object ‘suburban development’
.
The obtained results are reliable for all pairs of spectral features, since the deviation between the error estimates $\bar\rho_1$, $\bar\rho_2$ exceeds the threshold value 0.029 at the risk $\beta = 0.025$ of rejecting the hypothesis of their equality.
The problem of detecting anthropogenic territories from spectral data is considered. The error of their recognition in the space of all spectral features, estimated from the training sample, is equal to 0.012; the training sample contains $n_1 = 3377$ observations of the class ‘quarry’ and $n_2 = 5049$ observations of the class ‘suburban development’. When individual spectral features are excluded from the training sample, the estimates of the pattern recognition errors take, for example, the values 0.011, 0.01 and 0.008. The obtained reduction in the pattern recognition errors is not statistically reliable compared with the error estimate in the full feature space. Nevertheless, the result justifies the possibility of reducing the number of spectral features when synthesising decision-making algorithms and thereby simplifying their optimisation.
Conclusion
The methodology of testing the hypothesis of independence of pairs of random variables, based on a nonparametric pattern recognition algorithm, bypasses the problem of discretising the range of values of random variables into multidimensional intervals. This problem is inherent in the generally recognised Pearson criterion. The conditions of applicability of the proposed method and of Pearson's criterion in the analysis of unambiguous and ambiguous dependences between random variables have been determined. Using the apparatus of graph theory, the proposed method has been extended to the formation of sets of independent random variables. The results have been generalised to testing the hypothesis of independence of random variables for large volumes of statistical data on the basis of compression of the initial information, which increases the computational efficiency of the problems being solved by orders of magnitude. The effectiveness of the proposed methodology has been confirmed in the analysis of remote sensing data of anthropogenic territories and the assessment of their states. When a set of spectral features contains pairs characterised by a strong linear dependence, the number of spectral features used in the recognition of anthropogenic territories can be reduced, with a decrease in the estimate of the probability of recognition error.