PARTIALLY OBSERVED DISTRIBUTED OPTIMIZATION UNDER UNKNOWN–BUT–BOUNDED DISTURBANCES

In this paper, we consider non-stationary distributed optimization with partially observed parameters, with acceleration based on the estimate sequence proposed by Y. Nesterov. We formulate this partial observability as a time-varying communication matrix defined for each parameter separately. We propose a new distributed algorithm combining accelerated Simultaneous Perturbation Stochastic Approximation (SPSA) with the described communication scheme and establish its theoretical properties. Simulations validate the proposed algorithm on a multi-sensor multi-target tracking problem over delayed channels.


Introduction
Nowadays, distributed networks emerge in many practical areas such as transportation, telecommunication, logistics, opinion dynamics, flocking behaviour, and multi-vehicle networks [Yu et al., 2010], [Ren et al., 2007], [Granichin et al., 2012]. The problems arising in network control systems are the subject of ever-growing research interest. For example, large-scale systems may be influenced by communication bottlenecks. In that sense, it is reasonable to impose communication constraints. These constraints can be accounted for through sparsification techniques. In optimization, sparsification has gained special interest due to the need for communication-efficient distributed learning. In this field, researchers have proposed compression operators that produce sparse vectors to be sent over communication channels, see, e.g., [Horváth and Richtarik, ] and references therein. On the other hand, a sparse structure may reflect a property of the system itself. In multi-area state estimation, the vector to be optimized is divided into local and boundary subsets. The local variables should stay private, while the boundary ones can be exchanged with neighboring computing nodes. Both approaches produce a decomposition of parameters and lead to partially observed distributed optimization. In general, distributed algorithms may be more efficient than centralized ones due to their resilience, local communication, and processing. While centralized approaches may produce communication and computation bottlenecks in large-scale systems, distributed ones successfully overcome such difficulties.
Another important problem in optimization is the improvement of the convergence rate. Acceleration techniques have been studied for several decades. Heavy ball, which is gradient descent with momentum, is asymptotically optimal among gradient-based methods on quadratics [Polyak, 1964]. Its stochastic variant, where the gradient is replaced by a stochastic estimator, is widely used in deep learning. Additionally, momentum and stepsize parameters can be estimated without knowledge of the Hessian's smallest singular value, in contrast to classical accelerated methods like Nesterov acceleration and Polyak momentum [Pedregosa and Scieur, 2020]. The heavy ball method with constant step-sizes has a long history. It is known, for example, to achieve the optimal black-box worst-case complexity of quadratic convex optimization [Nemirovsky, 1992]. In [Nesterov and Spokoiny, 2017], the authors consider random derivative-free methods and provide complexity bounds for different classes of convex optimization problems, as well as accelerated methods for smooth convex derivative-free optimization. In [Vorontsova et al., 2019], the authors propose an accelerated gradient-free method with a non-Euclidean proximal operator. Paper [Gorbunov et al., 2022] describes an accelerated method for smooth stochastic derivative-free optimization with two-point feedback. The latter paper additionally considers possibly adversarial noise in the objective function value and analyzes how this noise affects the convergence rate of the estimates. Stochastic scenarios are typically challenging: in contrast to classical optimization problems, stochastic ones bring additional difficulties with the convergence of the algorithms (for example, gradient descent or Nesterov's accelerated gradient) [Scieur, 2018].
Moreover, some accelerated methods like heavy ball or Nesterov's accelerated gradient may also exhibit nonmonotonic convergence due to peak effects [Polyak et al., 2018; Ahiyevich et al., 2018].
In this paper, we combine our research in accelerated techniques for non-stationary distributed optimization under unknown-but-bounded disturbances and partially observed reformulation for the consensus term. This paper continues the line of research devoted to the improved distributed Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm proposed in [Erofeeva and Granichin, 2023] and incorporates a new communication scheme.
The paper is organized as follows. The preliminary information is given in Section 2. A formal problem setting of non-stationary distributed optimization is given in Section 3. The partially observed communication matrix is described in Section 4. The main result is presented in Section 5. In Section 6, the efficiency of the proposed algorithm is illustrated through the numerical simulation. Section 7 concludes the paper.

Preliminaries
Let (Ω, F, P) be the underlying probability space corresponding to the sample space Ω, the set of all events F, and the probability measure P. E denotes mathematical expectation. Let F_{t−1} be the σ-algebra of all probabilistic events which happened up to time instant t = 1, 2, . . .; E_{F_{t−1}} denotes the conditional mathematical expectation with respect to the σ-algebra F_{t−1}.
Throughout the paper, we represent d-dimensional column vectors as lowercase bold symbols (e.g., x = [x_1, . . . , x_d]^T) and scalars as non-bold symbols. ⊗ is the Kronecker product. [·]^T is the matrix or vector transpose. |·| is the cardinality of a set, i.e., the number of its unique elements. 1_d ∈ R^d is the vector of all ones. I_d ∈ R^{d×d} is the identity matrix. 0_d ∈ R^d is the vector of all zeros.

Networked System
Consider a networked system consisting of n nodes. Nodes are able to communicate with each other through a network described by an undirected graph G = (N, E), where N = {1, . . . , n} is a set of vertices and E ⊆ N × N is a set of edges. The vertices correspond to the areas, while the edges represent the information flows between them. For node i ∈ N, the set of neighbors is defined as N_i = {j ∈ N : (i, j) ∈ E}. The degree of i ∈ N equals |N_i| and is denoted deg(i). A subgraph of G is a graph Ḡ = (N_Ḡ, E_Ḡ), where N_Ḡ ⊆ N and E_Ḡ ⊆ E.
Alternatively, we express the communication between nodes in matrix form through a communication matrix, for which we adopt the following definition from [Makhdoumi and Ozdaglar, 2017].

Definition 1 (Communication matrix). Let W be an m × m matrix whose entries satisfy the following property: for any i ∈ N, W_ij = 0 for j ∉ N_i. We refer to W as the communication matrix.
Next, we impose the assumptions on the communication matrix and graph G.
Assumption 2. Graph G is connected, i.e., there is a path between every pair of distinct vertices of G.
Taking into account Assumptions 1 and 2, one of the available choices for the communication matrix is the Laplacian of G, L(G) = (l_ij)_{i,j∈N}, with entries

l_ij = deg(i) if i = j; l_ij = −1 if j ∈ N_i; l_ij = 0 otherwise.
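As a concrete illustration, the Laplacian of a small undirected graph can be assembled directly from its edge list. The following is a minimal sketch; the helper name and the example path graph are ours, not from the paper:

```python
import numpy as np

def laplacian(n, edges):
    """Graph Laplacian L = D - A of an undirected graph on n vertices."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, j] -= 1.0  # off-diagonal entry: -1 for each edge (i, j)
        L[j, i] -= 1.0
        L[i, i] += 1.0  # diagonal entry: vertex degree
        L[j, j] += 1.0
    return L

# Path graph 0 - 1 - 2 satisfies Assumption 2 (connectedness),
# so L(G) has exactly one zero eigenvalue.
L = laplacian(3, [(0, 1), (1, 2)])
eigs = np.linalg.eigvalsh(L)
print(L)
print(eigs)
```

Row sums of a Laplacian are zero (L 1_n = 0_n), which is what makes it a valid consensus operator: it vanishes exactly when all nodes agree.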

Distributed Non-stationary Mean-risk Optimization
Let θ_t = [θ_t^1, . . . , θ_t^m]^T ∈ R^{md} be a vector of unknown parameters θ_t^j ∈ R^d, j = 1, . . . , m, to be estimated at time instant t = 1, 2, . . .. Each parameter evolves in accordance with a state-transition model

θ_t^j = A_t^j θ_{t−1}^j + ξ_t^j, (1)

where A_t^j ∈ R^{d×d} is a transition matrix and {ξ_t^j}, ξ_t ∈ Ξ, is a non-controllable deterministic (e.g., Ξ = N and ξ_t = t) or random sequence. In the latter case, we assume that a probability distribution of ξ_t exists and may be known or unknown. This sequence indicates abrupt changes in the dynamics (e.g., maneuvers of moving objects in target tracking problems, or sudden changes in system operating conditions due to power injections in power systems) or a disturbance (e.g., discretization and model approximation errors, or external signal injection during a cyber-attack).
Remark: model (1) is widely used in the works devoted to state estimation, e.g., power system dynamic state estimation [Zhao et al., 2019].
Each node collects measurements represented by a linear model

z_t^i = H^i θ̄_t + w_t^i, (2)

where H^i ∈ R^{l×md} is a measurement matrix and w_t^i ∈ R^l is noise following a Gaussian distribution with zero mean and standard deviation σ_w. Then, the problem is to find an estimate θ̂_t of the unknown parameter θ_t minimizing the function

f(θ̂_t) = ‖z_t − H(1_n ⊗ θ̂_t)‖^2 (3)

based on the aggregated measurements z_t^i received by a fusion center. Here, H ∈ R^{nl×nmd} is a block-diagonal matrix consisting of H^i on its diagonal, θ̄_t = 1_n ⊗ θ_t, and z_t = [z_t^1, . . . , z_t^n]^T. In many applications, first-order methods face a bottleneck due to the computation of ∇f. To motivate this statement, let us mention a few examples:
- Evaluating the gradient of (3) requires O(n^2 mdl) arithmetic operations. In large-scale applications, it becomes unrealistic to calculate the gradient in real time.
- In some cases, the computation of ∇f involves a black-box simulation procedure. Then, it is impossible to obtain the gradient explicitly.
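To make the gradient cost concrete, the sketch below evaluates a least-squares cost of the form assumed in (3) and its analytic gradient with NumPy. The problem sizes and the dense random H are illustrative assumptions, not the paper's exact setup (in the paper H is block-diagonal):

```python
import numpy as np

n, m, d, l = 3, 2, 2, 4                       # nodes, targets, dim, measurements per node
rng = np.random.default_rng(0)
H = rng.standard_normal((n * l, n * m * d))   # stands in for the stacked measurement matrix
z = rng.standard_normal(n * l)                # aggregated measurements

def f(theta):
    """Least-squares cost f(theta) = ||z - H (1_n ⊗ theta)||^2, theta ∈ R^{md}."""
    theta_bar = np.kron(np.ones(n), theta)    # 1_n ⊗ theta
    r = z - H @ theta_bar
    return r @ r

def grad_f(theta):
    """Analytic gradient; the H^T product is where the O(n^2 mdl) cost lives."""
    theta_bar = np.kron(np.ones(n), theta)
    g_bar = -2.0 * H.T @ (z - H @ theta_bar)      # gradient w.r.t. theta_bar
    return g_bar.reshape(n, m * d).sum(axis=0)    # chain rule through 1_n ⊗ (·)
```

A zeroth-order method replaces `grad_f` with a stochastic estimate built from two evaluations of `f`, which is exactly the route taken in the rest of the paper.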
This work considers zeroth-order optimization, where we have only measurements of the function to be optimized. We obtain an estimate of ∇f through stochastic approximation via finite differences instead of explicit calculations [Kiefer et al., 1952]. In finite differences, we observe how the function behaves around a current point. For this purpose, we introduce a sequence of controllable measurement points x_1, x_2, . . . chosen according to an observation plan, e.g., x_t = θ_t ± ϵ, where ϵ is a random variable drawn from a known distribution.
Remark: In the example of the observation plan, we use θ_t under the assumption that θ_t is observable. In some applications, we cannot directly measure the value of θ_t. However, it is possible to make an observation plan consisting of quantities that influence the parameter being estimated, so we can indirectly observe how it evolves. One such example can be found in [Amelina et al., 2015].
The values y_1, y_2, . . . of the functions f_{ξ_t}(·) are observable at every time instant t with additive external unknown-but-bounded noise v_t:

y_t = f_{ξ_t}(x_t) + v_t. (4)

Equation (4) is called a stochastic zeroth-order oracle that returns a noisy value of the function f_{ξ_t}(·). Problem (3) requires a centralized optimization procedure. In practice, a centralized framework is subject to performance limitations, such as a single point of failure, high communication requirements, and a substantial computation burden. All of these aspects have influenced the development of distributed approaches. Thereby, we consider an optimization problem in which the cost function F̄_t(θ̂_t, z_t) is expressed as the sum of local contributions, all of which have a common minimizer. Moreover, the minimizer θ_t^⋆ of F̄_t(θ̂_t, z_t) may vary over time. Formally, the non-stationary mean-risk optimization problem is as follows: estimate the time-varying minimum point θ_t^⋆ of the distributed function

F̄_t(θ̂_t, z_t) = Σ_{i=1}^n f_{ξ_t}^i(θ̂_t, z_t^i) → min. (5)

In the distributed setting, the sensors construct sequences of measurement points x_1^i, x_2^i, . . . and collect y_1^i, y_2^i, . . . independently from each other based on their own measurements z_t^i and current estimates θ̂_t^i.

Partially Observed Optimization: Motivation
In wide-area estimation problems, the underlying interconnected system is usually partitioned into regions based on some criteria, e.g., the range of sensors. Such problems occur, for example, in multi-target tracking. The system states of each region are monitored and managed by a local control unit referred to as a node. The overall goal is to estimate the states in each region in an optimal way. In general, the nodes can estimate their local states without communicating with neighboring regions. However, this may bring some issues: the estimated solution may be sub-optimal, and the estimates obtained by different nodes for the same target should be consistent, so there is a need for communication between those regions in order to utilize the measurements. For illustration purposes, consider Figure 1. There are three regions and nodes corresponding to them. Each node estimates the parameters of the targets detected in its range. The ranges intersect, and some targets are detected by several nodes simultaneously. The parameters of these targets form boundary variables for the neighboring nodes. Therefore, towards optimal estimation, the regions should communicate with each other. This could be done through information sharing of states related to boundary variables. Hence, for partially observed state estimation, each region would align shared states with neighboring regions when performing local estimation.

Partially Observed Parameters
The described formulation is a common one for distributed unconstrained optimization. However, in practice, we frequently have a decomposition of the optimization variable into private and boundary subsets. The private part is solely optimized by the node owning this information and the corresponding measurements. We express this part as a possibly sparse vector θ_t^i. The boundary part can be optimized by several nodes and requires information exchange with neighboring nodes. Thus, we transform the initial distributed problem into a partially observed distributed problem. The process is schematically represented in Figure 2 and described in the subsequent steps. Figure 2a illustrates the common scheme of distributed optimization between two arbitrary nodes i and j: both nodes optimize the whole vector and send it to the neighboring node. Next, we move to Figure 2b. Let X = {1, . . . , d} be the set of indices corresponding to the entries contained in θ_t^i. We divide this set into n subsets X_t^1, . . . , X_t^n such that

X_t^1 ∪ . . . ∪ X_t^n = X, X_t^i ∩ X_t^j = X_t^{i,j}.

Decomposition
The latter means that the intersection between two arbitrary sets X_t^i and X_t^j is the set X_t^{i,j}, which may contain from 0 to d elements.
Consider an arbitrarily chosen set of indices S_t ⊆ X and a vector s ∈ R^d. We denote by B_{S_t} = [e_{ω_1}, . . . , e_{ω_{|S_t|}}]^T the selection matrix. Here and after, e_l ∈ R^d is the canonical basis vector that has a unit entry at the selected index l and zeros elsewhere, and ω_l ∈ S_t. Then, we define a linear map

s ↦ B_{S_t}^T B_{S_t} s.

This linear map produces a sub-vector of s by taking the entries whose indices are contained in S_t; it then restores the initial vector dimension by filling the remaining entries with zeros.
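The selection matrix and the resulting masking map can be sketched in a few lines of NumPy; the function names below are ours:

```python
import numpy as np

def selection_matrix(S, d):
    """B_S: stacks the canonical basis rows e_w^T for indices w in S."""
    B = np.zeros((len(S), d))
    for row, w in enumerate(sorted(S)):
        B[row, w] = 1.0
    return B

def mask(S, s):
    """Linear map s -> B_S^T B_S s: keep the entries indexed by S, zero the rest."""
    B = selection_matrix(S, len(s))
    return B.T @ (B @ s)

s = np.array([10.0, 20.0, 30.0, 40.0])
print(mask({1, 3}, s))   # entries at indices 1 and 3 survive; the others are zeroed
```

Note that B_S^T B_S is a diagonal projection matrix, so applying the map twice changes nothing; this idempotence is what lets each node work with a fixed-dimension vector while effectively touching only its own index subset.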
Finally, we get a vector to be optimized by node i:

θ_t^i = B_{X_t^i}^T B_{X_t^i} θ_t.

Communication Matrix
The communication matrix for the partially observed parameters takes the form

W_t = Σ_{l ∈ X_b} L(Ḡ_t^l) ⊗ e_l e_l^T,

where X_b = ∪_{i∈N, j∈N_i} X_t^{i,j}, e_l ∈ R^d is the canonical basis vector that has a unit entry at the selected index l and zeros elsewhere, and Ḡ_t^l is a subgraph of G at time instant t associated with the parameter at index l.
The components of canonical vector can be formed in different ways. When communication constraints should be accounted for, we can artificially choose these components in deterministic or random manner. Also, these components may reflect the structure of the problem itself and appear naturally due to this factor (e.g., multiarea state estimation, target tracking with limited sensor range, etc.). The next subsection describes the randomized way of generating these components.

Random Components
Our previous works rely on a random sparsification strategy used to satisfy communication and sensing constraints. Here, we utilize the concept of compression operator, which includes sparsification as a special case and generalizes our previous approach.
Definition [Islamov et al., 2021]. A possibly randomized map C : R^d → R^d is a compression operator if there exists a constant ω ≥ 0 such that the following relations hold for all x ∈ R^d:

E[C(x)] = x, E‖C(x) − x‖^2 ≤ ω‖x‖^2.

In particular, the compression operator may take the form defined below. We use the same strategy as before, but enhance our theoretical analysis through a generalization based on the definition of the compression operator.
Definition [Stich et al., 2018]. For a parameter 1 ≤ p ≤ d, the compression operator C_p : R^d × Ω_d → R^d is defined componentwise as

(C_p(x, π))_i = x_i if i ∈ π, and 0 otherwise,

where π is a random p-element subset uniformly distributed on the discrete set Ω_d = {S ⊆ [d] : |S| = p} of all p-element subsets of [d]. The variance parameter associated with this operator is ω = 1 − p/d. We abbreviate C_p(x) whenever the second argument is chosen uniformly at random.
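A minimal sketch of this random sparsification operator follows; the function name is ours, and we implement the unscaled variant from the definition above (rescaling the surviving entries by d/p would make the operator unbiased in the sense of the previous definition):

```python
import numpy as np

def rand_p(x, p, rng):
    """Random sparsification C_p: keep p coordinates chosen uniformly at
    random without replacement, zero out the remaining d - p coordinates."""
    d = x.size
    pi = rng.choice(d, size=p, replace=False)   # random p-element subset of [d]
    out = np.zeros_like(x)
    out[pi] = x[pi]
    return out

rng = np.random.default_rng(1)
x = np.arange(1.0, 9.0)          # d = 8
cx = rand_p(x, p=3, rng=rng)     # exactly p = 3 entries survive
print(cx)
```

In the algorithm, the surviving indices of `rand_p` play the role of the randomly generated canonical-vector components, and the induced subgraphs determine which nodes exchange which coordinates.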
Then, the canonical vectors and the corresponding subgraphs can be generated by applying compression operator C p (x) to the vector to be optimized/estimated. The next section combines the new formulation of communication matrix and accelerated distributed SPSA-based consensus algorithm.

Main Result
This section introduces a new algorithm with partially observed parameters, starting with the distributed SPSA-based consensus algorithm, followed by its accelerated version. We end this section with the convergence analysis.

Recap of SPSA-based consensus algorithm
The SPSA-based consensus algorithm (DSPSA) is the Simultaneous Perturbation Stochastic Approximation equipped with a consensus-based procedure. In the sequel, we describe its main components and steps.
Let ∆_t^i, t = 1, 2, . . ., i ∈ N, be an observed sequence of independent random vectors in R^d, called the simultaneous test perturbation, drawn from a Bernoulli distribution: each component of the vector independently takes the value ±1/√d with probability 1/2. Let us take fixed non-random initial vectors θ̂_0^i ∈ R^d, a positive step-size h, a gain coefficient ω, and a parameter β > 0. We consider the algorithm with two observations of the distributed sub-functions f_{ξ_t}^i(θ) for each agent i ∈ N for constructing the sequences of measurement points {x_t^i} and estimates {θ̂_t^i}:

x_t^i = θ̂_{t−1}^i + β∆_t^i, x̃_t^i = θ̂_{t−1}^i − β∆_t^i,
θ̂_t^i = θ̂_{t−1}^i − h(∆_t^i (y_t^i − ỹ_t^i)/(2β) + ω Σ_{j∈N_i} b_{ij}(θ̂_{t−1}^i − θ̂_{t−1}^j)),

where y_t^i and ỹ_t^i are the noisy oracle values (4) observed at x_t^i and x̃_t^i, respectively, and b_{ij} are the entries of the adjacency matrix of G.

Accelerated DSPSA with Partially Observed Parameters

We modify the accelerated version of DSPSA presented in [Erofeeva and Granichin, 2023] by equipping it with the partially observed communication matrix.
Taking into account the assumptions from [Erofeeva and Granichin, 2023], where L is the Lipschitz constant and µ is the constant related to strong convexity, we define a list of variables. At each node, we choose an initial estimate θ̂_0^i ∈ R^d and parameters γ_0^i > 0, h > 0, β > 0, η ∈ (0, µ), α_0^i ∈ (0, 1). We also define z_0^i = θ̂_0^i and H = h − h^2 L/2, and pick α_x^i ∈ (0, 1). At each k > 0, we find α_k by solving the equation presented in the paper mentioned above, as well as γ_k^i. We present an algorithm that requires two measurements of the function f_{ξ_t}^i(·) taken subsequently. Using gradient approximation techniques, we produce estimates {θ̂_t^i} at k ≥ 1 and each node, where B_k is an adjacency matrix corresponding to W_k.
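The SPSA-with-consensus update underlying both variants can be sketched as follows. This is a simplified, non-accelerated sketch under our own assumptions: the consensus term uses a graph Laplacian directly, the function names are ours, and the oracle is noise-free:

```python
import numpy as np

def spsa_consensus_step(theta, oracle, L, h, beta, omega, rng):
    """One DSPSA-style step: each node i perturbs its estimate by
    ±beta*Delta_i, takes two function measurements, and combines the SPSA
    gradient estimate with a Laplacian consensus term. theta is (n, d)."""
    n, d = theta.shape
    # Bernoulli perturbations: each entry is ±1/sqrt(d) with probability 1/2.
    Delta = rng.choice([-1.0, 1.0], size=(n, d)) / np.sqrt(d)
    grad_est = np.zeros_like(theta)
    for i in range(n):
        y_plus = oracle(i, theta[i] + beta * Delta[i])
        y_minus = oracle(i, theta[i] - beta * Delta[i])
        grad_est[i] = Delta[i] * (y_plus - y_minus) / (2.0 * beta)
    # L @ theta vanishes when all rows agree, so consensus is a fixed point.
    return theta - h * (grad_est + omega * (L @ theta))

# Toy usage: three nodes on a path graph track the minimizer of ||x - c||^2.
L_path = np.array([[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]])
c = np.array([1.0, -2.0])
rng = np.random.default_rng(0)
theta = rng.standard_normal((3, 2))
for _ in range(200):
    theta = spsa_consensus_step(theta, lambda i, x: float(np.sum((x - c) ** 2)),
                                L_path, h=0.05, beta=0.1, omega=1.0, rng=rng)
```

Only two scalar function evaluations per node per step are needed, regardless of d, which is the practical advantage of the two-point SPSA scheme over explicit gradients.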

Convergence Analysis
Proposition 1: Let λ̄_m be defined as λ̄_m = λ_max^{1/2}(W^T W); then we obtain convergence properties similar to Theorem 1 in [Erofeeva and Granichin, 2023].
Proof: The convergence of the proposed algorithm depends on the spectral properties of the underlying communication matrix. We can obtain the result of Theorem 1 from [Erofeeva and Granichin, 2023] by redefining the constant related to the graph properties. Hence, let us analyze λ̄_m = λ_max^{1/2}(W^T W). Based on Lemma 2 in [Chezhegov et al., 2022], it follows that λ̄_m admits an upper bound determined by the graph properties. Substituting the redefined constant, we get the result of Theorem 1 in [Erofeeva and Granichin, 2023].
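Numerically, λ̄_m = λ_max^{1/2}(W^T W) is simply the spectral norm of W; for a symmetric communication matrix such as a Laplacian it coincides with the largest eigenvalue magnitude. A quick check (with the path-graph Laplacian as an illustrative W of our choosing):

```python
import numpy as np

# Laplacian of the path graph 0 - 1 - 2, used here as the communication matrix W.
W = np.array([[ 1.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  1.0]])

# lambda_bar = sqrt(lambda_max(W^T W)) = spectral norm of W.
lam_bar = np.sqrt(np.linalg.eigvalsh(W.T @ W).max())
print(lam_bar)                                     # path-graph eigenvalues are 0, 1, 3
print(np.isclose(lam_bar, np.linalg.norm(W, 2)))   # matches the spectral norm
```

Sparsifying the per-parameter subgraphs can only shrink this quantity relative to the full graph, which is what the redefined constant captures in the proposition.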

Simulations
We consider the multi-sensor multi-target tracking problem fully described in [Granichin et al., 2020]. The sensor network consisting of n nodes spatially distributed over an area of interest tracks m targets. The sensor nodes are assumed to be static; their state is represented by a position in the 2D plane. The state-transition model of targets is defined as in (1). The sensors can measure the distance between their positions and the positions of targets. The goal of the sensor network is to estimate the unknown target positions based on the measured distances. The sensors are able to communicate with each other over possibly delayed channels to obtain a common solution.
We estimate the target positions using the proposed accelerated SPSA algorithm with partially observed parameters. We set the following parameters of the algorithm: h = 0.08, ω = 1, β = 0.5, η = 0.95, α_x^i = 0.1, γ_0^i = 2, L = 2, µ = 2. The initial estimates θ̂_0^i are chosen randomly at each sensor. The states of the sensors are chosen randomly from the interval [100; 120]. The number of sensors is n = 3, and the number of targets is m = 10.
In the simulation, the new algorithm AP-DSPSA is compared with the previous one from [Granichin et al., 2020]. Figures 3, 4, and 5 show the cumulative tracking error in different delayed scenarios on a logarithmic scale. The delays are added to the new algorithm only. We generate them randomly, setting the maximum delay equal to 10, 50, and 150 iterations, respectively. As can be seen, the delays influence the convergence time. However, the algorithm is still able to achieve the accuracy level of the non-delayed version.

Conclusion
In this paper, we consider non-stationary distributed optimization with partially observed parameters. We formulate this partial observability as a time-varying communication matrix defined for each parameter separately. We propose the new AP-DSPSA algorithm combining the accelerated SPSA with the described communication scheme and show its theoretical properties. The simulation validates the proposed algorithm on a target tracking problem over delayed channels.