Journal of Samara State Technical University, Ser. Physical and Mathematical Sciences, 2019, vol. 23, no. 1, pp. 113–130
ISSN 1991-8615 (print), ISSN 2310-7081 (online)
DOI: 10.14498/vsgtu1673

Original Article

A dual active set algorithm for optimal sparse convex regression

Aleksandr A. Gudkov (alex-good96@mail.ru), Sergei V. Mironov (Cand. Phys. & Math. Sci., Associate Professor; MironovSV@info.sgu.ru), Sergei P. Sidorov (Dr. Phys. & Math. Sci.; sidorovsp@yahoo.com, SidorovSP@info.sgu.ru), Sergey V. Tyshkevich (Cand. Phys. & Math. Sci., Associate Professor; tyszkiewicz@yandex.ru)

Saratov State University

Copyright © 2019, Samara State Technical University

Abstract. The shape-constrained problems in statistics have attracted much attention in recent decades. One of them is the task of finding the best fitting monotone regression. The problem of constructing monotone regression (also called isotonic regression) is to find the non-decreasing vector that best fits a given vector. Convex regression is the extension of monotone regression to the case of $2$-monotonicity (or convexity). Both isotonic and convex regression have applications in many fields, including non-parametric mathematical statistics and empirical data smoothing. The paper proposes an iterative algorithm for constructing a sparse convex regression, i.e. for finding a convex vector $z\in \mathbb{R}^n$ with the lowest squared error of approximation to a given vector $y\in \mathbb{R}^n$ (not necessarily convex). The problem can be rewritten in the form of a convex programming problem with linear constraints. Using the Karush–Kuhn–Tucker optimality conditions it is proved that optimal points should lie on a piecewise linear function.
It is proved that the proposed dual active-set algorithm for convex regression has polynomial complexity and obtains the optimal solution (the Karush–Kuhn–Tucker conditions are fulfilled).

Keywords: dual active set algorithm, pool-adjacent-violators algorithm, isotonic regression, monotone regression, convex regression.

Introduction. For a given vector $y = (y_1, \dots, y_n)$ from $\mathbb{R}^n$, $n \in \mathbb{N}$, the finite difference operator of order 2 is defined as follows:
$$\Delta^2_i y = \Delta^1_{i+1} y - \Delta^1_i y = y_{i+2} - 2y_{i+1} + y_i, \quad 1 \le i \le n-2,$$
where $\Delta^1_i y = y_{i+1} - y_i$ is the finite difference operator of order 1 at $i$. A vector $y = (y_1, \dots, y_n) \in \mathbb{R}^n$ is said to be convex if and only if $\Delta^2_i y \ge 0$ for each $1 \le i \le n-2$. A vector $y \in \mathbb{R}^n$ is called 1-monotone (or monotone) if $y_{i+1} - y_i \ge 0$, $i = 1, \dots, n-1$; convex vectors may also be called 2-monotone vectors.

The shape-constrained problems in statistics (the task of finding the best fitting monotone regression is one of them) have attracted much attention in recent decades [1–6]. One of the most examined problems is the problem of constructing monotone regression (also called isotonic regression), which is to find the non-decreasing vector that best fits a given vector. One can find a detailed review on isotonic regression in the work of Robertson, Wright, and Dykstra [7]. The papers of Barlow and Brunk [8], Dykstra [9], Best and Chakravarti [10], and Best, Chakravarti, and Ubhaya [11] consider the problem of finding monotone regression in quadratic and convex programming frameworks. Using the mathematical programming approach, the works [12–14] have recently provided some new results on the topic. The papers [15, 16] extend the problem to partial orders defined by the variables of a multiple regression. The recent paper [1] proposes and analyzes a dual active-set algorithm for regularized monotonic regression. $k$-monotone regression is the extension of monotone regression to the general case of $k$-monotonicity [17].
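As a small illustration of these definitions, the second-order differences and the convexity test can be coded directly. The following Python sketch is ours (the function names are not from the paper); it uses 0-based indexing:

```python
def second_differences(y):
    # Delta^2_i y = y[i+2] - 2*y[i+1] + y[i], for i = 0, ..., n-3 (0-based)
    return [y[i + 2] - 2 * y[i + 1] + y[i] for i in range(len(y) - 2)]

def is_convex(y, tol=1e-12):
    # A vector is convex iff all its second-order differences are non-negative.
    return all(d >= -tol for d in second_differences(y))

print(is_convex([1, 0, 1, 0, 1]))        # False: Delta^2 = (2, -2, 2)
print(is_convex([1, 1/3, 1/3, 1/3, 1]))  # True:  Delta^2 = (2/3, 0, 2/3)
```

Both test vectors reappear below in the counterexample on reducing convex regression to monotone regression.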
Both isotonic and $k$-monotone regression have applications in many fields, including non-parametric mathematical statistics [3, 18], empirical data smoothing [19–21], shape-preserving dynamic programming [22], and shape-preserving approximation [23–25].

Denote by $\Delta^2$ the set of all vectors from $\mathbb{R}^n$ which are convex. The task of constructing convex regression is to obtain a vector $z \in \mathbb{R}^n$ with the lowest squared error of approximation to the given vector $y \in \mathbb{R}^n$ (not necessarily convex) under the condition of convexity of $z$:
$$\varphi(z) = \frac{1}{2}\sum_{i=1}^{n}(z_i - y_i)^2 \to \min_{z \in \Delta^2}. \quad (1)$$

In this paper, using the ideas of [26], we develop a new algorithm (the dual active set algorithm) for finding sparse convex regression. We prove that the dual active set algorithm proposed in this paper is optimal, i.e. the solutions obtained by the algorithm are optimal.

(Author ORCIDs: Sergei P. Sidorov, https://orcid.org/0000-0003-4047-8239, Dr. Phys. & Math. Sci., Head of Dept. of Functions and Approximations Theory; Sergei V. Tyshkevich, https://orcid.org/0000-0001-5417-6165, Cand. Phys. & Math. Sci., Associate Professor, Dept. of Functions and Approximations Theory.)

1. The dual active set algorithm for convex regression.

Preliminary analysis. The problem (1) can be rewritten in the form of a convex programming problem with linear constraints as follows:
$$\varphi(z) = \frac{1}{2}\sum_{i=1}^{n}(z_i - y_i)^2 \to \min, \quad (2)$$
where the minimum is taken over all $z \in \mathbb{R}^n$ such that
$$g_i(z) := -(z_{i+2} - 2z_{i+1} + z_i) \le 0, \quad 1 \le i \le n-2. \quad (3)$$
The problem (2)–(3) is a strictly convex quadratic programming problem, and therefore it has a unique solution. Let $\hat z$ be the (unique) global solution of (2)–(3); then there is a Lagrange multiplier vector $\lambda = (\lambda_1, \dots, \lambda_{n-2}) \in \mathbb{R}^{n-2}$ such that
$$\nabla\varphi(\hat z) + \sum_{i=1}^{n-2}\lambda_i \nabla g_i(\hat z) = 0, \quad (4)$$
$$g_i(\hat z) \le 0, \quad 1 \le i \le n-2, \quad (5)$$
$$\lambda_i \ge 0, \quad 1 \le i \le n-2, \quad (6)$$
$$\lambda_i g_i(\hat z) = 0, \quad 1 \le i \le n-2, \quad (7)$$
where $\nabla$ denotes the gradient. The equations (4)–(7) are the well-known Karush–Kuhn–Tucker optimality conditions. It follows from (4) that
$$(z_j - y_j) - \lambda_{j-2} + 2\lambda_{j-1} - \lambda_j = 0, \quad 1 \le j \le n$$
(with the convention $\lambda_i = 0$ for $i < 1$ and $i > n-2$), i.e.
$$z_1 - y_1 - \lambda_1 = 0,$$
$$z_2 - y_2 - \lambda_2 + 2\lambda_1 = 0,$$
$$z_3 - y_3 - \lambda_3 + 2\lambda_2 - \lambda_1 = 0,$$
$$\dots$$
$$z_j - y_j - \lambda_j + 2\lambda_{j-1} - \lambda_{j-2} = 0,$$
$$\dots$$
$$z_{n-2} - y_{n-2} - \lambda_{n-2} + 2\lambda_{n-3} - \lambda_{n-4} = 0,$$
$$z_{n-1} - y_{n-1} + 2\lambda_{n-2} - \lambda_{n-3} = 0,$$
$$z_n - y_n - \lambda_{n-2} = 0. \quad (8)$$

One can ask the natural question: is it possible to reduce the problem of constructing convex regression for the points $y = (y_1, y_2, \dots, y_n)$ to the problem of constructing monotone regression for the differences $w = (y_2 - y_1, y_3 - y_2, \dots, y_n - y_{n-1})$? The answer to this question is negative. First of all, the Karush–Kuhn–Tucker optimality conditions for these problems are not identical. Secondly, we give a simple example showing that it is not possible to lower the order of monotonicity. If we take $y = (1, 0, 1, 0, 1)$, then $w = (-1, 1, -1, 1)$. The optimal monotone regression for $w$ is $\hat w = (-1/3, -1/3, -1/3, 1)$. The convex vector recovered from $\hat w$ (starting at $z_1 = y_1 = 1$) is $(1, 2/3, 1/3, 0, 1)$. Its squared error of approximation to $y = (1, 0, 1, 0, 1)$ is equal to $0 + (2/3)^2 + (2/3)^2 + 0 + 0 = 8/9$. On the other hand, if we take the convex sequence $z = (1, 1/3, 1/3, 1/3, 1)$, then the error of approximation to the given vector $y$ is equal to $0 + (1/3)^2 + (2/3)^2 + (1/3)^2 + 0 = 6/9$, which is less than the error for the convex vector recovered from the monotone regression of $w$.

If one sums up all equalities in (8), then
$$\sum_{i=1}^{n} z_i = \sum_{i=1}^{n} y_i. \quad (9)$$
If we multiply the $i$-th equality in (8) by $i$ and sum up the resulting equalities, we obtain
$$\sum_{i=1}^{n} i z_i = \sum_{i=1}^{n} i y_i. \quad (10)$$
Thus, equalities (9)–(10) are necessary conditions for the optimal solution.

Preliminary analysis of (4)–(7) shows that the second-order differences of the optimal solution $\hat z$ should be sparse, i.e. the sequence $\Delta^2_i \hat z$, $1 \le i \le n-2$, should have many zeroes. It follows from the Karush–Kuhn–Tucker conditions that the optimal points should lie on a piecewise linear function. Note that some of these linear pieces may be linear regressions constructed from the points of the corresponding blocks; in this case the necessary conditions (9)–(10) are fulfilled.

A dual active-set algorithm for convex regression. In this subsection we propose an algorithm such that:
– it has polynomial complexity, i.e.
the number of operations required to complete the algorithm for a given input from $\mathbb{R}^n$ is $O(n^k)$ for some nonnegative integer $k$;
– the solution is convex (the reachability analysis will show this);
– the solution is optimal (the Karush–Kuhn–Tucker conditions are fulfilled).

The proposed algorithm uses a so-called active set. The active set $A$ consists of blocks of the form $[p, q] \subset [1, n-2]$, such that $[p, q] \subset A$, $p - 1 \notin A$, $q + 1 \notin A$, and
$$A = [p_1, q_1] \cup [p_2, q_2] \cup \dots \cup [p_{m-1}, q_{m-1}] \cup [p_m, q_m],$$
where $1 \le p_1$, $q_m \le n-2$, $q_i + 3 \le p_{i+1}$ for $i \in [1, m-1]$, and $m$ is the number of blocks. If $p_i = q_i$ then the $i$-th block consists of only one point. The points $z_{p_i}, z_{p_i+1}, \dots, z_{q_i}, z_{q_i+1}, z_{q_i+2}$ corresponding to the $i$-th block (plus two points from the right) lie on one straight line, for each $i$.

At each iteration of the algorithm, an active set $A \subset [1, n-2]$ is chosen and the corresponding optimization problem is solved:
$$\frac{1}{2}\sum_{i=1}^{n}(z_i - y_i)^2 \to \min, \quad (11)$$
where the minimum is taken over all $z \in \mathbb{R}^n$ satisfying
$$z_i - 2z_{i+1} + z_{i+2} = 0, \quad i \in A. \quad (12)$$
Note that there exists a unique solution to the problem (11)–(12); let us denote it by $z(A)$.

The dual active-set algorithm for convex regression:
begin
· Input $y \in \mathbb{R}^n$;
· Active set $A := \varnothing$;
· Initial point $z(A) := y$;
· while $z(A) \notin \Delta^2$ do
· · set $A := A \cup \{i : z_i(A) - 2z_{i+1}(A) + z_{i+2}(A) < 0\}$;
· · solve (11)–(12) using points from the active set $A$;
· · update $z(A)$;
· Return $z(A)$;
end

The computational complexity of the dual active set algorithm for convex regression is $O(n^3)$. It follows from two remarks:
– at each iteration of the algorithm, the active set attaches at least one index from $[1 : n-2]$, which means that the number of while loop iterations cannot be greater than $n - 2$;
– the computational complexity of solving the problem (11)–(12) is $O(n^2)$.

2. The convergence analysis of the dual active set algorithm.

Lemmas. The next lemma finds the values of the Lagrange multipliers.

Lemma 1. Let $z$ be a global solution of (2)–(3). Then the values of the Lagrange multipliers $\lambda = (\lambda_1, \dots, \lambda_{n-2}) \in \mathbb{R}^{n-2}$ defined in (4)–(7) are
$$\lambda_i = \sum_{j=1}^{i}(i + 1 - j)(z_j - y_j), \quad 1 \le i \le n-2. \quad (13)$$

P r o o f. The lemma can be proved by induction.
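The scheme above admits a compact implementation: the equality-constrained subproblem (11)–(12) can be solved through its KKT linear system. The following Python sketch (assuming NumPy; the helper names are ours, and this is an illustration rather than the authors' reference implementation) mirrors the pseudocode with 0-based indices:

```python
import numpy as np

def solve_equality_qp(y, active):
    # Solve min 0.5*||z - y||^2 s.t. z[i] - 2*z[i+1] + z[i+2] = 0 for i in active,
    # via the KKT system  [[I, C^T], [C, 0]] [z; mu] = [y; 0].
    n, idx = len(y), sorted(active)
    m = len(idx)
    if m == 0:
        return y.copy()
    C = np.zeros((m, n))
    for r, i in enumerate(idx):
        C[r, i], C[r, i + 1], C[r, i + 2] = 1.0, -2.0, 1.0
    K = np.block([[np.eye(n), C.T], [C, np.zeros((m, m))]])
    rhs = np.concatenate([y, np.zeros(m)])
    return np.linalg.solve(K, rhs)[:n]

def convex_regression(y, tol=1e-10):
    # Dual active-set iteration: enlarge the active set with every index where
    # convexity is violated, re-solve (11)-(12), and repeat until z is convex.
    y = np.asarray(y, dtype=float)
    n = len(y)
    active, z = set(), y.copy()
    while True:
        d2 = z[:-2] - 2.0 * z[1:-1] + z[2:]
        violated = {i for i in range(n - 2) if d2[i] < -tol}
        if not violated - active:   # no new violations => z(A) is convex
            return z
        active |= violated
        z = solve_equality_qp(y, active)
```

On $y = (1, 0, 1, 0, 1)$ a single iteration activates the middle constraint and returns $(1, 1/3, 1/3, 1/3, 1)$, the convex regression from the example in Section 1; the necessary conditions (9)–(10) can be checked on the output.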
If $i = 1$ then it follows from (8) that
$$\lambda_1 = z_1 - y_1. \quad (14)$$
Then from (8) and (14) we obtain $\lambda_2 = (z_2 - y_2) + 2\lambda_1 = (z_2 - y_2) + 2(z_1 - y_1)$, which agrees with (13). Suppose that (13) is fulfilled for all indices up to $i$. Then it follows from (8) that $z_{i+1} - y_{i+1} - \lambda_{i+1} + 2\lambda_i - \lambda_{i-1} = 0$, and consequently we have
$$\lambda_{i+1} = (z_{i+1} - y_{i+1}) + 2\lambda_i - \lambda_{i-1} =$$
$$= (z_{i+1} - y_{i+1}) + 2\sum_{j=1}^{i}(i+1-j)(z_j - y_j) - \sum_{j=1}^{i-1}(i-j)(z_j - y_j) =$$
$$= (z_{i+1} - y_{i+1}) + \sum_{j=1}^{i}\bigl(2(i+1-j) - (i-j)\bigr)(z_j - y_j) =$$
$$= (z_{i+1} - y_{i+1}) + \sum_{j=1}^{i}(i+2-j)(z_j - y_j) = \sum_{j=1}^{i+1}(i+2-j)(z_j - y_j). \quad \square$$

Lemma 2. Let $1 \in A$, i.e. $\Delta^2_1 y < 0$, and assume that $2, 3 \notin A$. Let $z_1, z_2, z_3$ be the values of the linear regression constructed for the input points $(1, y_1)$, $(2, y_2)$, $(3, y_3)$ (see Fig. 1(a)). Then the values of the corresponding Lagrange multipliers defined in (13) are non-negative.

P r o o f. We will show that
$$\lambda_1 = z_1 - y_1 \ge 0,$$
$$\lambda_2 = 2(z_1 - y_1) + (z_2 - y_2) = 0,$$
$$\lambda_3 = 3(z_1 - y_1) + 2(z_2 - y_2) + (z_3 - y_3) = 0.$$
We have $\lambda_1 = z_1 - y_1 \ge 0$, since the points $y_1, y_2, y_3$ lie in a concave position. It follows from (9) and (10) that
$$\sum_{j=1}^{3}(z_j - y_j) = 0 \quad (15)$$
and
$$\sum_{j=1}^{3} j(z_j - y_j) = 0. \quad (16)$$
If we multiply (15) by 3 and subtract (16), we get $2(z_1 - y_1) + (z_2 - y_2) = 0$, and therefore $\lambda_2 = 0$. If we multiply (15) by 4 and subtract (16), we get $3(z_1 - y_1) + 2(z_2 - y_2) + (z_3 - y_3) = 0$, and therefore $\lambda_3 = 0$. $\square$

Figure 1. (a) The case $1 \in A$, i.e. $\Delta^2_1 y < 0$, and $2, 3 \notin A$; $z_1, z_2, z_3$ are the values of the linear regression constructed for the input points $(1, y_1)$, $(2, y_2)$, $(3, y_3)$. (b) The case $1, 3 \in A$, i.e. $\Delta^2_1 y < 0$, $\Delta^2_3 y < 0$, and $2, 4, 5 \notin A$; the $z_i$'s are the solutions to the problem (17)

Lemma 3. Let $y_1, \dots, y_5$ be such that $1, 3 \in A$, i.e. $\Delta^2_1 y < 0$, $\Delta^2_3 y < 0$, and assume that $2, 4, 5 \notin A$ (see Fig. 1(b)). Let $z$ be the solution to the optimization problem
$$\frac{1}{2}\sum_{i=1}^{5}(z_i - y_i)^2 \to \min, \quad (17)$$
where the minimum is taken over all $z = (z_1, \dots, z_5)$ such that $z_1 - 2z_2 + z_3 = z_3 - 2z_4 + z_5 = 0$. Then the values of the corresponding Lagrange multipliers $\lambda_1, \dots, \lambda_5$ defined in (13) are non-negative.

P r o o f. Since $\Delta^2_1 y < 0$ we have $\lambda_1 = z_1 - y_1 \ge 0$. The Lagrangian for the constrained optimization problem (17) is
$$L(z, \mu_1, \mu_2) = \frac{1}{2}\sum_{i=1}^{5}(z_i - y_i)^2 - \mu_1(z_1 - 2z_2 + z_3) - \mu_2(z_3 - 2z_4 + z_5),$$
where the multipliers $\mu_1, \mu_2$ satisfy
$$\frac{\partial L}{\partial z_1} = z_1 - y_1 - \mu_1 = 0,$$
$$\frac{\partial L}{\partial z_2} = z_2 - y_2 + 2\mu_1 = 0,$$
$$\frac{\partial L}{\partial z_3} = z_3 - y_3 - \mu_1 - \mu_2 = 0,$$
$$\frac{\partial L}{\partial z_4} = z_4 - y_4 + 2\mu_2 = 0,$$
$$\frac{\partial L}{\partial z_5} = z_5 - y_5 - \mu_2 = 0.$$
Then
$$\lambda_2 = 2(z_1 - y_1) + (z_2 - y_2) = 2\mu_1 - 2\mu_1 = 0,$$
$$\lambda_3 = 3(z_1 - y_1) + 2(z_2 - y_2) + (z_3 - y_3) = 3\mu_1 - 4\mu_1 + \mu_1 + \mu_2 = \mu_2,$$
$$\lambda_4 = 4(z_1 - y_1) + 3(z_2 - y_2) + 2(z_3 - y_3) + (z_4 - y_4) = 4\mu_1 - 6\mu_1 + 2\mu_1 + 2\mu_2 - 2\mu_2 = 0,$$
$$\lambda_5 = 5(z_1 - y_1) + 4(z_2 - y_2) + \dots + (z_5 - y_5) = 5\mu_1 - 8\mu_1 + 3\mu_1 + 3\mu_2 - 4\mu_2 + \mu_2 = 0.$$
Since $\Delta^2_3 y < 0$ we have $\mu_2 = z_5 - y_5 \ge 0$, and therefore $\lambda_3 \ge 0$. $\square$

If $z_2 - z_1 \le z_4 - z_3$ (as shown in Fig. 1(b)), then $\lambda_i \ge 0$ for all $i \in [1 : 3]$, and the algorithm has constructed a convex solution. If $z_2 - z_1 > z_4 - z_3$ (as in Fig. 2(a)), then the point 2 must be added to the active set $A = \{1, 3\}$ at the next iteration of the algorithm and the block $\{1, 2, 3\}$ of the active set will be formed. The solution to the optimization problem (11)–(12) corresponding to the block $\{1, 2, 3\}$ consists of points lying on the linear regression line constructed for the points $(1, z_1), \dots, (5, z_5)$ (see Fig. 2(b)). Note that the same regression will be constructed for the initial points $(1, y_1), \dots, (5, y_5)$.

Figure 2. (a) The case $1, 3 \in A$, i.e. $\Delta^2_1 y < 0$, $\Delta^2_3 y < 0$, and $2, 4, 5 \notin A$; the $z_i$'s are the solutions to the problem (17) with $z_2 - z_1 > z_4 - z_3$. (b) The solution to the optimization problem (11)–(12) corresponding to the block $\{1, 2, 3\}$: points lying on the linear regression line constructed for the points $(1, z_1), \dots, (5, z_5)$

Lemma 4. Let at an iteration of the algorithm the pairs $P = \{(1, y_1), \dots, (k+1, y_{k+1})\}$ be such that $[1 : k-2] \subset A$, $\Delta^2_i z = 0$ for all $i \in [1 : k-2]$, and $k-1, k, k+1 \notin A$ (see Fig. 3). Let $z^{(0)}_i$, $i \in [1 : k]$, be the solution to the optimization problem
$$\frac{1}{2}\sum_{i=1}^{k}(z_i - y_i)^2 \to \min,$$
where the minimum is taken over all $z \in \mathbb{R}^k$ such that $z_i - 2z_{i+1} + z_{i+2} = 0$ for all $i \in [1 : k-2]$. Assume that $z^{(0)}_{k-1} - 2z^{(0)}_k + y_{k+1} < 0$, i.e. the point $k-1$ should be added to the active set at the next iteration of the algorithm. Let $z^{(1)}_i$, $i \in [1 : k+1]$, be the solution to the optimization problem
$$\frac{1}{2}\sum_{i=1}^{k+1}(z_i - y_i)^2 \to \min,$$
where the minimum is taken over all $z \in \mathbb{R}^{k+1}$ satisfying $z_i - 2z_{i+1} + z_{i+2} = 0$ for all $i \in [1 : k-1]$. Then for every $i = 1, 2, \dots, k-2$ we have
$$\lambda^{(1)}_i \ge \lambda^{(0)}_i \ge 0,$$
where $\lambda^{(0)}_i$, $\lambda^{(1)}_i$ are the Lagrange multipliers for $z^{(0)}$ and $z^{(1)}$ respectively, and $\lambda^{(1)}_{k-1} = 0$.

Figure 3. The case $1, 2, \dots, k-2 \in A$ and $k-1 \notin A$; the $z^{(0)}_i$'s are the values of the linear regression constructed for the input points $(i, y_i)$, $i \in [1 : k]$. After the iteration, the point $k-1$ must be added to the block $[1 : k-2]$, since $z^{(0)}_{k-1} - 2z^{(0)}_k + y_{k+1} < 0$

P r o o f. Note that the $z^{(0)}_i$'s are the values, at the points $i \in [1 : k]$, of the linear regression which uses the input data points $\{(1, y_1), \dots, (k, y_k)\}$. By the same reason, the $z^{(1)}_i$'s are the values, at the points $i \in [1 : k+1]$, of the linear regression which uses the input data points $\{(1, y_1), \dots, (k+1, y_{k+1})\}$. It follows from (13) that
$$\lambda^{(0)}_i = \sum_{j=1}^{i}(i+1-j)(z^{(0)}_j - y_j), \qquad \lambda^{(1)}_i = \sum_{j=1}^{i}(i+1-j)(z^{(1)}_j - y_j),$$
and therefore we have
$$\lambda^{(1)}_i - \lambda^{(0)}_i = \sum_{j=1}^{i}(i+1-j)(z^{(1)}_j - z^{(0)}_j). \quad (18)$$
Without loss of generality we will assume that the points $z^{(0)}_j$, $j \in [1 : k]$, lie on a straight line passing through the origin with a slope $\theta$, i.e.
$$z^{(0)}_j = \theta j, \quad j = 1, \dots, k. \quad (19)$$
Since $z^{(0)}_{k-1} - 2z^{(0)}_k + y_{k+1} < 0$, there exists $\varepsilon > 0$ such that $y_{k+1} = \theta(k+1) - \varepsilon$. In this case the points $z^{(1)}_j$, $j = 1, \dots, k+1$, lie on a straight line with slope $\theta - \frac{6\varepsilon}{(k+1)(k+2)}$ and intercept $\frac{2\varepsilon}{k+1}$, i.e.
$$z^{(1)}_j = \Bigl(\theta - \frac{6\varepsilon}{(k+1)(k+2)}\Bigr) j + \frac{2\varepsilon}{k+1}, \quad j = 1, \dots, k+1. \quad (20)$$
It follows from (18), (19) and (20), using $\sum_{j=1}^{i}(i+1-j) = \frac{i(i+1)}{2}$ and $\sum_{j=1}^{i}(i+1-j)j = \frac{i(i+1)(i+2)}{6}$, that
$$\lambda^{(1)}_i - \lambda^{(0)}_i = \sum_{j=1}^{i}(i+1-j)\Bigl(\frac{2\varepsilon}{k+1} - \frac{6\varepsilon}{(k+1)(k+2)}\, j\Bigr) =$$
$$= \frac{2\varepsilon}{k+1}\cdot\frac{i(i+1)}{2} - \frac{6\varepsilon}{(k+1)(k+2)}\cdot\frac{i(i+1)(i+2)}{6} = \frac{\varepsilon\, i(i+1)}{k+1}\Bigl(1 - \frac{i+2}{k+2}\Bigr) = \frac{\varepsilon\, i(i+1)(k-i)}{(k+1)(k+2)} \ge 0. \quad \square$$

Remark. It follows from Lemmas 2 and 4 that the values of the Lagrange multipliers increase when a point is added to an existing block from the right-hand side.

Lemma 5. Suppose that before an iteration of the algorithm the pairs $P = \{(1, y_1), \dots, (k, y_k)\}$ are such that $[2 : k-2] \subset A$, i.e. $\Delta^2_i z = 0$ for all $i \in [2 : k-2]$, and $1, k-1 \notin A$. If the inequality $z_1 - 2z_2 + z_3 < 0$ holds (see Fig. 4), i.e. the point 1 must be added to the block $[2 : k-2]$ at the iteration, then we obtain $\lambda^{(1)}_i \ge 0$ for every $i \in [1 : k-2]$, where the $\lambda^{(1)}_i$'s correspond to the solutions $z^{(1)}_i$, $i \in [1 : k]$, of the optimization problem
$$\frac{1}{2}\sum_{i=1}^{k}(z_i - y_i)^2 \to \min,$$
where the minimum is taken over all $z \in \mathbb{R}^k$ such that $\Delta^2_i z = 0$ for all $i \in [1 : k-2]$.

P r o o f. If we take $A = \{1\}$, then it follows from Lemma 2 that $\lambda^{(0)}_1 \ge 0$, $\lambda^{(0)}_2 = \lambda^{(0)}_3 = 0$. Now let us consider the sequence of the basic operations of adding one point to a block from the right-hand side:
– $A = \{1, 2\}$: then $\lambda^{(1)}_1, \lambda^{(1)}_2 \ge 0$, $\lambda^{(1)}_3 = \lambda^{(1)}_4 = 0$ (by the Remark);
– $A = \{1, 2, 3\}$: then $\lambda^{(2)}_1, \lambda^{(2)}_2, \lambda^{(2)}_3 \ge 0$, $\lambda^{(2)}_4 = \lambda^{(2)}_5 = 0$ (by the Remark);
– . . .
– $A = \{1, 2, \dots, k-2\}$: then $\lambda_1, \dots, \lambda_{k-2} \ge 0$ (by the Remark).
At each step we add a point to an existing block of the active set from the right-hand side, and therefore the values of the Lagrange multipliers increase, i.e. they remain non-negative. $\square$

Figure 4. The case $[2 : k-2] \subset A$, i.e. $\Delta^2_i z = 0$ for all $i \in [2 : k-2]$, and $1, k-1 \notin A$. After the iteration, the point 1 must be added to the block since $z_1 - 2z_2 + z_3 < 0$. The points $z^{(1)}_i$, $i \in [1 : k]$, are the solutions to the corresponding optimization problem

Lemma 6. Suppose that before an iteration of the algorithm the pairs
$$P_1 = \{(1, y_1), \dots, (p-3, y_{p-3})\}, \quad P_2 = \{(p, y_p), \dots, (k, y_k)\}, \quad p \ge 4,$$
are such that $[1 : p-3] \subset A$, $[p : k-2] \subset A$, $p-2, p-1 \notin A$, i.e. $\Delta^2_i z = 0$ for all $i \in [1 : p-3]$ and for all $i \in [p : k-2]$. Assume that one of the following conditions is fulfilled:
– (the case 1) the inequality $z_{p-2} - 2z_{p-1} + z_p < 0$ holds (see Fig. 5), i.e. $p-2$ must be added to $A$ at the iteration;
– (the case 2) the inequality $z_{p-1} - 2z_p + z_{p+1} < 0$ holds (see Fig. 6), i.e. $p-1$ must be added to $A$ at the iteration.
Let $z^{(1)}_i$, $i \in [1 : k]$, be the solutions to the optimization problem
$$\frac{1}{2}\sum_{i=1}^{k}(z_i - y_i)^2 \to \min, \quad (21)$$
where the minimum is taken over all $z \in \mathbb{R}^k$ satisfying $z_i - 2z_{i+1} + z_{i+2} = 0$ for all $i \in [1 : k-2] \setminus \{p-1\}$. Then for every $i = 1, 2, \dots, k-2$ we have $\lambda^{(1)}_i \ge 0$, where the $\lambda^{(1)}_i$'s are the values of the Lagrange multipliers corresponding to the solutions $z^{(1)}_i$, $i \in [1 : k]$.

Figure 5. The case 1 of merging two blocks

Figure 6. The case 2 of merging two blocks

P r o o f. The case 1. If $z^{(1)}_i$, $i \in [1 : k]$, are the optimal solutions to the problem (21), then there exists a point $y^*_p$ such that
– $z^{(1)}_{p-2} - 2z^{(1)}_{p-1} + y^*_p \le 0$;
– the points $z^{(1)}_i$, $i \in [1 : p-1]$, lie on the regression line constructed by the points $(1, y_1), \dots, (p-1, y_{p-1}), (p, y^*_p)$.
It follows from Lemma 1 that $\lambda_1, \lambda_2, \dots, \lambda_{p-2} \ge 0$, $\lambda_{p-1} = 0$. There exists a point $y'_{k+1}$ such that
– the inequality $z^{(1)}_{k-1} - 2z^{(1)}_k + y'_{k+1} \le 0$ is fulfilled;
– the points $z^{(1)}_i$, $i \in [p : k]$, lie on the regression line constructed by the points $(p, y_p), (p+1, y_{p+1}), \dots, (k, y_k), (k+1, y'_{k+1})$.
Then it follows from Lemma 1 that $\lambda_p, \lambda_{p+1}, \dots, \lambda_{k-2} \ge 0$.

The case 2. If $z^{(1)}_i$, $i \in [1 : k]$, are the optimal solutions to the problem (21), then there exists a point $y^*_0$ such that
– the inequality $y^*_0 - 2z^{(1)}_1 + z^{(1)}_2 \le 0$ holds;
– the points $z^{(1)}_i$, $i \in [1 : p-1]$, lie on the regression line constructed by the points $(0, y^*_0), (1, y_1), \dots, (p-1, y_{p-1})$.
It follows from Lemma 5 that $\lambda_1, \lambda_2, \dots, \lambda_{p-2} \ge 0$. There exists a point $y'_p$ such that
– the inequality $y'_p - 2z^{(1)}_{p+1} + z^{(1)}_{p+2} \le 0$ is fulfilled;
– the points $z^{(1)}_i$, $i \in [p : k]$, lie on the regression line constructed by the points $(p, y'_p), (p+1, y_{p+1}), \dots, (k, y_k)$.
Then it follows from Lemma 5 that $\lambda_p, \lambda_{p+1}, \dots, \lambda_{k-2} \ge 0$. $\square$

Reachability and optimality analysis. When the algorithm is completed, the output vector has the form
$$(\underbrace{z_1, \dots, z_{\nu_1}}_{n_1 \text{ points}}, \underbrace{z_{\nu_1+1}, \dots, z_{\nu_2}}_{n_2 \text{ points}}, \dots, \underbrace{z_{\nu_{m-1}+1}, \dots, z_{\nu_m}}_{n_m \text{ points}}),$$
where
– $m$ is the number of segments;
– $n_i$ is the number of points in the $i$-th segment, $\nu_1 = n_1$, $\nu_2 = n_1 + n_2$, ..., $\nu_m = n_1 + n_2 + \dots + n_m = n$.
All points of each segment $[\nu_i + 1, \dots, \nu_{i+1}]$ lie on a straight line. A segment can consist of one point. The solution obtained as a result of the algorithm is convex, by design of the algorithm. It remains to show that the solution is optimal, i.e. that the Karush–Kuhn–Tucker conditions are fulfilled.

The algorithm starts with any active set $A_0$ such that $A_0 \subseteq A^*$, where $A^*$ is the active set corresponding to the optimal solution. At the first iteration we can always take $A_0 = \varnothing$ for simplicity.

Theorem. For any initial $A_0 \subseteq A^*$, the Algorithm converges to the optimal solution of the problem (1) in at most $n - 2 - |A_0|$ iterations.

P r o o f. At each while loop iteration, the active set $A$ is expanded by attaching at least one index from $[1, n-2]$ which did not previously belong to $A$. The Algorithm terminates when $z(A)$ becomes convex. If $A = [1, n-2]$, then the number of blocks is equal to 1 and, therefore, there is no violation of convexity. Hence the number of iterations cannot exceed $n - 2 - |A_0|$, where $|A_0|$ is the number of indices in the initial active set.

The optimality of the solution follows from Lemmas 2–6. At each iteration of the algorithm, the points at which convexity is violated are attached to the active set $A$. If one such point $i$ with a negative value of the second-order finite difference $\Delta^2_i z$ is isolated (i.e. $i-2, i-1, i+1, i+2 \notin A$), then the algorithm replaces $z_i, z_{i+1}, z_{i+2}$ with the values of the linear regression constructed by the points $(i, y_i)$, $(i+1, y_{i+1})$, $(i+2, y_{i+2})$. Lemma 2 states that the values of the corresponding Lagrange multipliers will be non-negative. If the second-order differences are negative at several consecutive neighboring points, e.g. $\Delta^2_j z < 0$ for $j = i, \dots, i+s$, while $\Delta^2_{i-2} z$, $\Delta^2_{i-1} z$, $\Delta^2_{i+s+1} z$, $\Delta^2_{i+s+2} z \ge 0$, then the algorithm replaces $z_i, z_{i+1}, \dots, z_{i+s+2}$ with the values of a linear regression constructed by the points $(i, y_i), (i+1, y_{i+1}), \dots, (i+s+2, y_{i+s+2})$. Lemma 4 states that the values of the corresponding Lagrange multipliers will be non-negative. When the algorithm works on subsequent iterations, it may turn out that a point with a negative second-order finite difference is not isolated.
In this case the point where the violation of convexity takes place is attached to one of the neighboring blocks of the active set as follows:
– Suppose that $[p : q] \subset A$ and $q+1, q+2 \notin A$ at the previous iteration of the algorithm. If the point $q+1$ must be added to the active set at the current iteration, i.e. $z_{q+1} - 2z_{q+2} + z_{q+3} < 0$ for the current solution $z$ (a point is attached to a block from the right-hand side), then it follows from Lemma 1 that the values of the corresponding Lagrange multipliers will increase, and therefore they will remain non-negative.
– Suppose that $[p : q] \subset A$ and $p-2, p-1 \notin A$ at the previous iteration of the algorithm. If the point $p-1$ must be added to the active set at the current iteration, i.e. $z_{p-1} - 2z_p + z_{p+1} < 0$ for the current solution $z$ (a point is attached to a block from the left-hand side), then it follows from Lemma 5 that the values of the corresponding Lagrange multipliers will increase, and therefore they will remain non-negative.
– Suppose that $[p : q] \subset A$, $[q+3 : r] \subset A$, and $q+1, q+2 \notin A$ at the previous iteration of the algorithm. If the point $q+1$ must be added to the active set at the current iteration, i.e. $z_{q+1} - 2z_{q+2} + z_{q+3} < 0$ for the current solution $z$ (the case 1 of merging two blocks), then it follows from Lemma 6 that the values of the corresponding Lagrange multipliers will increase, and therefore they will remain non-negative.
– Suppose that $[p : q] \subset A$, $[q+3 : r] \subset A$, and $q+1, q+2 \notin A$ at the previous iteration of the algorithm. If the point $q+2$ must be added to the active set at the current iteration, i.e. $z_{q+2} - 2z_{q+3} + z_{q+4} < 0$ for the current solution $z$ (the case 2 of merging two blocks), then it follows from Lemma 6 that the values of the corresponding Lagrange multipliers will increase, and therefore they will remain non-negative.
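The multiplier formula (13) that underlies these arguments is easy to check numerically: given $y$ and a candidate solution $z$, each $\lambda_i$ can be accumulated directly and tested for non-negativity, i.e. for the KKT condition (6). A minimal Python sketch (our own helper, not from the paper, written with 0-based lists but 1-based formula indices):

```python
def lagrange_multipliers(z, y):
    # Formula (13): lambda_i = sum_{j=1}^{i} (i + 1 - j) * (z_j - y_j),
    # for i = 1, ..., n-2 (the inputs are 0-based Python lists).
    n = len(y)
    return [sum((i + 1 - j) * (z[j - 1] - y[j - 1]) for j in range(1, i + 1))
            for i in range(1, n - 1)]

y = [1, 0, 1, 0, 1]
z = [1, 1/3, 1/3, 1/3, 1]   # the convex regression from the example in Section 1
lam = lagrange_multipliers(z, y)
print(lam)  # approximately [0, 1/3, 0]: non-negative, up to rounding
```

Note that the multipliers are positive exactly where the constraint is active ($\Delta^2_2 z = 0$) and zero where $\Delta^2_i z > 0$, as complementary slackness (7) requires.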
The remaining cases of mergers of two or more adjacent blocks and isolated points can be represented as sequences of these basic cases. $\square$

Conclusion. In this paper, using the ideas of [26], we have developed a new algorithm (the dual active set algorithm) for finding sparse convex regression. At each iteration the algorithm first determines the active set and then solves a standard least-squares subproblem of small size on the active set, which exhibits local superlinear convergence. Therefore, the algorithm is very efficient when coupled with parallel execution. Classical optimization algorithms for nonconvex problems (e.g. coordinate descent or proximal gradient descent) only possess sublinear convergence in general, or linear convergence under certain conditions [27]. On the other hand, the $k$-FWA and $k$-PAVA algorithms examined in [17] are not optimal.

References

[1] Burdakov O., Sysoev O., "A Dual Active-Set Algorithm for Regularized Monotonic Regression", J. Optim. Theory Appl., 172:3 (2017), 929–949.
[2] Brezger A., Steiner W. J., "Monotonic Regression Based on Bayesian P-Splines: An application to estimating price response functions from store-level scanner data", J. Bus. Econ. Stat., 26:1 (2008), 90–104.
[3] Chen Y., Aspects of Shape-constrained Estimation in Statistics, PhD Thesis, University of Cambridge, Cambridge, UK, 2013.
[4] Balabdaoui F., Rufibach K., Santambrogio F., "Least-squares estimation of two-ordered monotone regression curves", J. Nonparametric Stat., 22:8 (2010), 1019–1037.
[5] Hazelton M. L., Turlach B. A., "Semiparametric Regression with Shape-Constrained Penalized Splines", Comput. Stat. Data Anal., 55:10 (2011), 2871–2879.
[6] Lu M., "Spline estimation of generalised monotonic regression", J. Nonparametric Stat., 27:1 (2014), 19–39.
[7] Robertson T., Wright F. T., Dykstra R. L., Order Restricted Statistical Inference, Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, Chichester (UK) etc., 1988, xix+521 pp.
[8] Barlow R. E., Brunk H. D., "The Isotonic Regression Problem and its Dual", J. Amer. Stat. Assoc., 67:337 (1972), 140–147.
[9] Dykstra R. L., "An Isotonic Regression Algorithm", J. Stat. Plan. Inf., 5:1 (1981), 355–363.
[10] Best M. J., Chakravarti N., "Active set algorithms for isotonic regression: A unifying framework", Mathematical Programming, 47:3 (1990), 425–439.
[11] Best M. J., Chakravarti N., Ubhaya V. A., "Minimizing Separable Convex Functions Subject to Simple Chain Constraints", SIAM J. Optim., 10:3 (2000), 658–672.
[12] Ahuja R. K., Orlin J. B., "A Fast Scaling Algorithm for Minimizing Separable Convex Functions Subject to Chain Constraints", Operations Research, 49:1 (2001), 784–789.
[13] Strömberg U., "An Algorithm for Isotonic Regression with Arbitrary Convex Distance Function", Comput. Stat. Data Anal., 11:2 (1991), 205–219.
[14] Hansohm J., "Algorithms and Error Estimations for Monotone Regression on Partially Preordered Sets", J. Multivariate Analysis, 98:5 (2007), 1043–1050.
[15] Burdakow O., Grimvall A., Hussian M., "A Generalised PAV Algorithm for Monotonic Regression in Several Variables", COMPSTAT, Proceedings in Computational Statistics, 16th Symposium Held in Prague, Czech Republic, v. 10, ed. J. Antoch, Springer-Verlag, New York, 2004, 761–767.
[16] Dykstra R. L., Robertson T., "An Algorithm for Isotonic Regression for Two or More Independent Variables", Ann. Statist., 10:3 (1982), 708–716.
[17] Sidorov S. P., Faizliev A. R., Gudkov A. A., Mironov S. V., "Algorithms for Sparse k-Monotone Regression", Integration of Constraint Programming, Artificial Intelligence, and Operations Research, Lecture Notes in Computer Science, 10848, ed. W.-J. van Hoeve, Springer International Publishing, Cham, 2018, 546–556.
[18] Bach F. R., Efficient Algorithms for Non-convex Isotonic Regression through Submodular Optimization, 2017.
[19] Altmann D., Grycko Eu., Hochstättler W., Klützke G., Monotone smoothing of noisy data, Technical Report feu-dmo034.15, FernUniversität in Hagen, 2014.
[20] Gorinevsky D. M., Kim S.-J., Beard Sh., Boyd S. P., Gordon G., "Optimal Estimation of Deterioration From Diagnostic Image Sequence", IEEE Transactions on Signal Processing, 57:3 (2009), 1030–1043.
[21] Hastie T., Tibshirani R., Wainwright M., Statistical Learning with Sparsity: The Lasso and Generalizations, Monographs on Statistics and Applied Probability, 143, Chapman and Hall/CRC, New York, 2015, xvi+351 pp.
[22] Cai Y., Judd K. L., "Advances in Numerical Dynamic Programming and New Applications", Handbook of Computational Economics, v. 3, eds. K. Schmedders, K. L. Judd, Elsevier, 2014, 479–516.
[23] Boytsov D. I., Sidorov S. P., "Linear approximation method preserving k-monotonicity", Sib. Èlektron. Mat. Izv., 12 (2015), 21–27.
[24] Cullinan M. P., "Piecewise Convex–Concave Approximation in the Minimax Norm", Abstracts of Conference on Approximation and Optimization: Algorithms, Complexity, and Applications (June 29–30, 2017, Athens, Greece), eds. I. Demetriou, P. Pardalos, National and Kapodistrian University of Athens, Athens, 2017, 4.
[25] Shevaldin V. T., Approksimatsiia lokal'nymi splainami [Local approximation by splines], Ural Branch of RAS, Ekaterinburg, 2014, 198 pp. (In Russian).
[26] de Leeuw J., Hornik K., Mair P., "Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods", J. Stat. Softw., 32:5 (2009), 1–24.
[27] Li G., Pong T. K., "Calculus of the Exponent of Kurdyka–Łojasiewicz Inequality and Its Applications to Linear Convergence of First-Order Methods", Found. Comput. Math., 18:5 (2017), 1199–1232.