# Static Parameter Estimation using Kalman Filtering and Proximal Operators

## Vivak Patel

Department of Statistics, University of Chicago
December 2, 2015

# Acknowledgments

## Mihai Anitescu

• Senior Computational Mathematician in LANS at Argonne National Laboratory
• Part-time Professor at the University of Chicago

# Stationary Parameters

Problem.

• Observed Pairs $(Y_t,X_t) \in \mathbb{R} \times \mathbb{R}^n$ where $t=1,2,\ldots$
• Data Generation Process: $$\mathbb{P}(Y_t \vert X_t) = f(Y_t, X_t)$$
• Estimate $f$, which is stationary (changes only over long time scales)

# Stationary Parameters

Nonparametric Density Estimation

• Assume some level of smoothness of $f$
• Use orthonormal bases of function spaces, or use moving average
• Rate of Convergence (Curse of Dimensionality): $$\text{AMISE} \sim N^{-\frac{2s}{2s+n}}$$
• Replace smoothness assumption with sparsity

# Stationary Parameters

Parametric Density Estimation

• Assume a particular form for $f$ depending on a parameter $\beta$ $$\mathbb{P}\left(Y_t \vert X_t, \beta \right) = f(Y_t,X_t,\beta)$$
• Estimate $\beta$
• Rate of Convergence: $$MSE \sim N^{-1}$$

# Optimization

For statistical inference, the objective function is the negative log-likelihood: $$F_N(\beta) = -\sum_{t=1}^N \log f(Y_t,X_t,\beta)$$ Fisher (1922), Cramér (1946), Le Cam (1970), Hájek (1972)

Suppose the data are generated by some fixed $\beta^*$. Let $\beta_N = \arg\min F_N(\beta)$, and $$V^* = \mathbb{E}\left[\nabla_{\beta} \log f(Y,X,\beta^*) \nabla^T_{\beta} \log f(Y,X,\beta^*) \right]$$ If certain regularity conditions on $f$ hold, then $$\sqrt{N}(\beta_N - \beta^*) \Rightarrow \mathcal{N}\left(0, (V^*)^{-1} \right)$$

van der Vaart (2000)
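As a concrete instance of the matrix $V^*$ above (a standard textbook illustration, not from the talk): for the Gaussian linear model $f(Y,X,\beta) = \mathcal{N}(Y;\, X^T\beta, \sigma^2)$ with $\sigma^2$ known,

```latex
\nabla_\beta \log f(Y,X,\beta) = \frac{X\,(Y - X^T\beta)}{\sigma^2}
\quad\Longrightarrow\quad
V^* = \frac{\mathbb{E}\left[XX^T\,(Y - X^T\beta^*)^2\right]}{\sigma^4}
    = \frac{\mathbb{E}\left[XX^T\right]}{\sigma^2},
```

since $\mathbb{E}\left[(Y - X^T\beta^*)^2 \mid X\right] = \sigma^2$.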

# Optimization

Goals of Optimization Method

• Fast Rate of Convergence
• Computational Stability
• Well-defined stop condition
• Low computational requirements (floating point operations, memory)

Classical Methods for $F_N(\beta)$

| Method | Gradient Evals | Hessian Evals | Floating Point | Memory Cost | RoC | Stop Condition | Conditioning |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Newton | $N$ | $N$ | $\mathcal{O}(n^2 N)$ | $\mathcal{O}(n^2)$ | Quadratic | Deterministic | Mild |
| Quasi-Newton | $N$ | $0$ | $\mathcal{O}(n^2 + nN)$ | $\mathcal{O}(n^2)$ | Super-Linear | Deterministic | Mild |
| Gradient Descent | $N$ | $0$ | $\mathcal{O}(nN)$ | $\mathcal{O}(n)$ | Linear | Deterministic | Moderate |

Moral: If $N$ is large, a single pass through the data set before computing a new iterate is too expensive.

Basic Idea: Minimize $F_N(\beta) = -\sum_{t=1}^N \log f(Y_t,X_t,\beta)$ using: $$\beta_{k+1}^{SGD} = \beta_k^{SGD} + \alpha_k \nabla \log f(Y_t, X_t, \beta_k^{SGD}) =: \beta_k^{SGD} - \alpha_k \nabla l_{t,k}^{SGD}$$ where $l_{t,k}^{SGD} := -\log f(Y_t,X_t,\beta_k^{SGD})$.
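The SGD recursion above can be sketched for the linear model of the later slides, where $l_{t,k} = (Y_t - X_t^T\beta_k)^2/2$ and so $\nabla l_{t,k} = -(Y_t - X_t^T\beta_k)X_t$. The data, dimensions, and Robbins-Monro step sizes below are illustrative assumptions, not values from the talk.

```python
import numpy as np

# Minimal SGD sketch for the linear model l_t(beta) = (Y_t - X_t^T beta)^2 / 2.
# Illustrative synthetic data; beta_star plays the role of the true parameter.
rng = np.random.default_rng(0)
n, N = 3, 20000
beta_star = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(N, n))
Y = X @ beta_star + 0.1 * rng.normal(size=N)

beta = np.zeros(n)
for t in range(N):
    alpha = 1.0 / (10 + t)                   # diminishing (Robbins-Monro) steps
    grad = -(Y[t] - X[t] @ beta) * X[t]      # gradient from a single sample
    beta = beta - alpha * grad               # one cheap O(n) update per sample
```

One gradient evaluation and $\mathcal{O}(n)$ work per iterate, matching the SGD row of the table below; the sub-linear rate shows up as slow convergence of the final digits.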

History

Least Mean Squares
• Cotes (early 1700s)
• Legendre (1805)
• Gauss (1806)
• Macchi & Eweda (1983)
• Gerencsér (1995)

Stochastic Gradient Descent (SGD)
• Robbins & Monro (1951)
• Neveu (1975)
• Murata (1998)

Characterization per iteration

| Method | Gradient Evals | Hessian Evals | Floating Point | Memory Cost | RoC | Stop Condition | Conditioning |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Newton | $N$ | $N$ | $\mathcal{O}(n^2 N)$ | $\mathcal{O}(n^2)$ | Quadratic | Deterministic | Mild |
| Quasi-Newton | $N$ | $0$ | $\mathcal{O}(n^2 + nN)$ | $\mathcal{O}(n^2)$ | Super-Linear | Deterministic | Mild |
| Gradient Descent | $N$ | $0$ | $\mathcal{O}(nN)$ | $\mathcal{O}(n)$ | Linear | Deterministic | Moderate |
| Alt. Full Gradient-SGD | $N+m$ | $0$ | $\mathcal{O}(n(N+m))$ | $\mathcal{O}(n)$ | Linear | Probabilistic | Moderate |
| SGD | $1$ | $0$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | Sub-Linear | None | Severe |

Alternative Strategy?

Without increasing the number of gradient evaluations, can we:

• Increase the rate of convergence?
• Reduce sensitivity to conditioning?

Yes, by increasing the computational resources used per iteration.

# Proximal Operator

Proximal Form of SGD $$\beta_{k+1}^{SGD} = \arg\min_{\beta} \left\lbrace l_{t,k}^{SGD} + (\beta-\beta_k^{SGD})^T \nabla l_{t,k}^{SGD} + \frac{1}{2\alpha_k} \left\Vert \beta - \beta_k^{SGD} \right\Vert^2 \right\rbrace$$

Levenberg (1944), Marquardt (1963) Improvement \begin{aligned} \beta_{k+1}^{LM} &= \arg\min_{\beta} \left\lbrace \frac{1}{2} \left( l_{t,k}^{LM} + (\beta - \beta_k^{LM})^T \nabla l_{t,k}^{LM} \right)^2 + \frac{1}{2\alpha_k} \left\Vert \beta - \beta_k^{LM} \right\Vert^2 \right\rbrace \\ \beta_{k+1}^{LM} &= \beta_{k}^{LM} - \left( \nabla l_{t,k}^{LM} \nabla^T l_{t,k}^{LM} + \alpha_k^{-1} I \right)^{-1} \left(\nabla l_{t,k}^{LM} \right) l_{t,k}^{LM} \end{aligned}
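A numerical sanity check of the closed form above (illustrative values, not from the talk): the update solves the stationarity condition of the proximal subproblem, and by the Sherman-Morrison identity the matrix solve collapses to a scalar division, $\left(\nabla l \nabla^T l + \alpha^{-1} I\right)^{-1}\nabla l = \alpha \nabla l / (1 + \alpha \Vert \nabla l \Vert^2)$.

```python
import numpy as np

# Check that beta+ = beta - (g g^T + (1/alpha) I)^{-1} g * l is stationary for
#   0.5*(l + (b - beta)^T g)^2 + (1/(2*alpha)) ||b - beta||^2,
# and that Sherman-Morrison gives the same point without a matrix solve.
# All values below are illustrative stand-ins.
rng = np.random.default_rng(1)
n = 4
beta = rng.normal(size=n)    # current iterate beta_k
g = rng.normal(size=n)       # stand-in for grad l_{t,k}
l = 0.7                      # stand-in for the loss value l_{t,k}
alpha = 0.5                  # step-size / damping parameter

A = np.outer(g, g) + np.eye(n) / alpha
beta_plus = beta - np.linalg.solve(A, g * l)
beta_plus_sm = beta - alpha * g * l / (1 + alpha * (g @ g))  # Sherman-Morrison

# Gradient of the proximal objective at beta_plus; should vanish.
residual = (l + (beta_plus - beta) @ g) * g + (beta_plus - beta) / alpha
```

The scalar form explains why the rank-one structure keeps the per-iteration cost low despite the matrix inverse in the written update.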

Why is this an improvement?

# Proximal Operator

Recall: $$\beta_{k+1}^{LM} = \beta_{k}^{LM} - \left( \nabla l_{t,k}^{LM} \nabla^T l_{t,k}^{LM} + \alpha_k^{-1} I \right)^{-1} \left(\nabla l_{t,k}^{LM} \right) l_{t,k}^{LM}$$

Second Bartlett Identity: $$\mathbb{E}_{\beta_k^{LM}}\left[\nabla l_{t,k}^{LM} \nabla^T l_{t,k}^{LM} \right] = \mathbb{E}_{\beta_k^{LM}} \left[ \nabla^2 l_{t,k}^{LM}\right]$$ holds under some regularity conditions on $f$.
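For the Gaussian linear model with unit variance (a standard verification, not from the talk), the identity can be checked directly at the true parameter:

```latex
l(\beta) = \tfrac{1}{2}\left(Y - X^T\beta\right)^2
\;\Longrightarrow\;
\nabla l = -X\left(Y - X^T\beta\right), \qquad \nabla^2 l = XX^T,
```

so, using $\mathbb{E}\left[(Y - X^T\beta^*)^2 \mid X\right] = 1$,

```latex
\mathbb{E}\left[\nabla l\, \nabla^T l\right]
  = \mathbb{E}\left[XX^T \left(Y - X^T\beta^*\right)^2\right]
  = \mathbb{E}\left[XX^T\right]
  = \mathbb{E}\left[\nabla^2 l\right].
```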

Moral: We are using a regularized estimate of the Hessian/covariance.

Can we do better still?

# Proximal Operator

Levenberg (1944), Marquardt (1963) Improvement \begin{aligned} \beta_{k+1}^{LM} &= \arg\min_{\beta} \left\lbrace \frac{1}{2} \left( l_{t,k}^{LM} + (\beta - \beta_k^{LM})^T \nabla l_{t,k}^{LM} \right)^2 + \frac{1}{2\alpha_k} \left\Vert \beta - \beta_k^{LM} \right\Vert^2 \right\rbrace \\ \beta_{k+1}^{LM} &= \beta_{k}^{LM} - \left( \nabla l_{t,k}^{LM} \nabla^T l_{t,k}^{LM} + \alpha_k^{-1} I \right)^{-1} \left(\nabla l_{t,k}^{LM} \right) l_{t,k}^{LM} \end{aligned}

Possible Rescaling Improvement? \begin{aligned} \beta_{k+1} &= \arg\min_{\beta} \left\lbrace \frac{1}{2} \left( l_{t,k} + (\beta - \beta_k)^T \nabla l_{t,k} \right)^2 + \frac{1}{2} \left\Vert \beta - \beta_k \right\Vert^2_{M_k^{-1}} \right\rbrace \\ \beta_{k+1} &= \beta_{k} - \left( \nabla l_{t,k} \nabla^T l_{t,k} + M_k^{-1}\right)^{-1} \left(\nabla l_{t,k} \right) l_{t,k} \end{aligned}

# Proximal Operator

Recall: \begin{aligned} \beta_{k+1} &= \arg\min_{\beta} \left\lbrace \frac{1}{2} \left( l_{t,k} + (\beta - \beta_k)^T \nabla l_{t,k} \right)^2 + \frac{1}{2} \left\Vert \beta - \beta_k \right\Vert^2_{M_k^{-1}} \right\rbrace \\ \beta_{k+1} &= \beta_{k} - \left( \nabla l_{t,k} \nabla^T l_{t,k} + M_k^{-1}\right)^{-1} \left(\nabla l_{t,k} \right) l_{t,k} \end{aligned}

Optimizer Interpretation: $M_k$ should be the inverse Hessian at $\beta_k$.

First Generation Stochastic Quasi-Newton:

• No improvement in rate of convergence
• No stop condition
• High sensitivity to conditioning

Byrd, Hansen, Nocedal & Singer (arXiv, 2015)

# Proximal Operator

Recall: \begin{aligned} \beta_{k+1} &= \arg\min_{\beta} \left\lbrace \frac{1}{2} \left( l_{t,k} + (\beta - \beta_k)^T \nabla l_{t,k} \right)^2 + \frac{1}{2} \left\Vert \beta - \beta_k \right\Vert^2_{M_k^{-1}} \right\rbrace \\ \beta_{k+1} &= \beta_{k} - \left( \nabla l_{t,k} \nabla^T l_{t,k} + M_k^{-1}\right)^{-1} \left(\nabla l_{t,k} \right) l_{t,k} \end{aligned}

Statistician Interpretation: $M_k$ should estimate the inverse covariance.

Kalman Filter: simultaneously estimates parameter and its covariance matrix.

Kalman (1960)

# Kalman Filter

Control Theoretic Derivation

Suppose $$\beta_{k+1} = \beta_k - G_{k+1}l_{t,k}$$ and find the gain $G_{k+1}$ which minimizes $$\mathbb{E}\left[\left\Vert \beta_{k+1} - \beta^* \right\Vert^2 \,\middle\vert\, X_{1},\ldots,X_{k+1} \right]$$

Proximal Derivation

Given the "variance" of $l_{k,t}$ conditioned on $X_t$, $\sigma_t^2$, and $\mathcal{M}_k := \mathbb{E}\left[ \left. (\beta_{k} - \beta^*)(\beta_k - \beta^*)^T \right\vert X_{1},\ldots,X_{k} \right]$ $$\beta_{k+1} = \arg\min_{\beta} \left\lbrace \frac{1}{2 \sigma_t^2} \left( l_{t,k} + (\beta - \beta_k)^T \nabla l_{t,k} \right)^2 + \frac{1}{2} \left\Vert \beta - \beta_k \right\Vert^2_{\mathcal{M}_k^{-1}} \right\rbrace$$

# Kalman Filter

Update for $\mathcal{M}_k$: $$\mathcal{M}_{k+1}^{-1} = \frac{1}{\sigma_{t+1}^2}\nabla l_{t,k} \nabla^T l_{t,k} + \mathcal{M}_{k}^{-1}$$

Problem: neither $\sigma_t^2$ nor $\mathcal{M}_0$ is known a priori

Possible Solution: replace $\sigma_t^2$ with another sequence $\gamma_k^2$, replace $\mathcal{M}_0$ with $M_0$, and generate $$M_{k+1}^{-1} = \frac{1}{\gamma_{k+1}^2}\nabla l_{t,k} \nabla^T l_{t,k} + M_{k}^{-1}$$
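The recursion above can be maintained directly on $M_k$ (rather than $M_k^{-1}$) via the Sherman-Morrison formula, $M_{k+1} = M_k - (M_k g)(M_k g)^T / (\gamma_{k+1}^2 + g^T M_k g)$, at $\mathcal{O}(n^2)$ per step instead of an $\mathcal{O}(n^3)$ inversion. A sketch with illustrative data, a constant $\gamma_k^2$, and random vectors standing in for $\nabla l_{t,k}$:

```python
import numpy as np

# Run the inverse recursion and the Sherman-Morrison recursion side by side;
# M should stay equal to inv(M_inv) throughout.  Data are illustrative.
rng = np.random.default_rng(2)
n, steps, gamma2 = 5, 50, 1.0

M_inv = np.eye(n)   # M_0^{-1}
M = np.eye(n)       # M_0, updated without any matrix inversions
for _ in range(steps):
    g = rng.normal(size=n)                          # stand-in for grad l_{t,k}
    M_inv = np.outer(g, g) / gamma2 + M_inv         # recursion on the inverse
    Mg = M @ g
    M = M - np.outer(Mg, Mg) / (gamma2 + g @ Mg)    # rank-one Sherman-Morrison step
```

This is the source of the $\mathcal{O}(n^2)$ per-iteration cost reported for kSGD in the final table.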

Questions

• What values of $\gamma_k^2$ will result in $\beta_{k}$ converging to $\beta^*$?
• What values of $\gamma_k^2$ will result in $M_k$ approximating $\mathcal{M}_k$?

# Kalman Filter

Summary of Kalman Filtering Theory

• Randomness in the model is not assumed to exist. Thus, $\sigma_t^2 = 0$, and $\gamma_k^2$ can be chosen based on rate-of-convergence needs.
• There is a strict focus on dynamic parameter estimation. Approximating $\mathcal{M}_k$ is ignored in order to prove a linear convergence rate.

References

• Johnstone, Johnson, Bitmead & Anderson (1982)
• Bittanti, Bolzern & Campi (1990)
• Parkum, Poulsen & Holst (1992)
• Cao & Schwartz (2003)

# Kalman-based SGD

Assumptions

• Linear model: $l_{t,k} = (Y_t - X_t^T \beta_k)^2/2$ with $\mathbb{E}\left[l_{t,*}\right] < \infty$
• $(Y_t,X_t)$ are independent and identically distributed
• Regularity: $\mathbb{E}\left[ \left\Vert X_1 \right\Vert_2^2 \right] < \infty$
• Uniqueness: for every unit vector $v \in \mathbb{R}^n$, $\mathbb{P}\left[ X_1^T v = 0\right] < 1$

In the noise-free case, $Y_t = X_t^T\beta^*$, and kSGD recovers $\beta^*$ exactly once $n$ linearly independent examples have been assimilated (via Modified Gram-Schmidt).
In the noisy case, if $0 < \inf_{k} \gamma_k^2 \leq \sup_{k} \gamma_k^2 < \infty$ then almost surely $$\mathbb{E} \left[ \left. \left\Vert \beta_k - \beta^* \right\Vert \right\vert X_1,\ldots,X_{k}\right] \to 0$$
For every $\epsilon > 0$, asymptotically almost surely: $$\frac{1 + \epsilon}{\inf \gamma_k^2} M_k \succeq \frac{1}{\sigma^2} \mathcal{M}_k \succeq \frac{1 - \epsilon}{\sup \gamma_k^2} M_k$$
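The noise-free claim above can be sketched numerically. The recursion below is one natural exact ($\gamma_k^2 = 0$) limit of the covariance-weighted update, written in projection form; $M_0 = I$ and the Gaussian data are illustrative assumptions, not the precise procedure from the talk.

```python
import numpy as np

# Noise-free sketch: Y_t = X_t^T beta*, exact (gamma^2 = 0) Kalman-gain steps.
# After n linearly independent examples, beta equals beta* up to round-off.
rng = np.random.default_rng(3)
n = 6
beta_star = rng.normal(size=n)
X = rng.normal(size=(n, n))     # n samples; generically linearly independent

beta = np.zeros(n)
M = np.eye(n)                   # M_0 = I (illustrative choice)
for x in X:
    y = x @ beta_star           # noise-free observation
    Mx = M @ x
    denom = x @ Mx
    beta = beta + Mx * (y - x @ beta) / denom   # Kalman-gain correction
    M = M - np.outer(Mx, Mx) / denom            # deflate the assimilated direction
```

After step $k$, $M$ projects onto the orthogonal complement of the assimilated directions and the error $\beta_k - \beta^*$ is orthogonal to $X_1,\ldots,X_k$, so $n$ independent samples force the error to zero.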

# Kalman-based SGD

CMS Data Set ($n = 31$, $N = 2.8$ million)


# Kalman-based SGD

Characterization per iteration

| Method | Gradient Evals | Hessian Evals | Floating Point | Memory Cost | RoC | Stop Condition | Conditioning |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Newton | $N$ | $N$ | $\mathcal{O}(n^2 N)$ | $\mathcal{O}(n^2)$ | Quadratic | Deterministic | Mild |
| Quasi-Newton | $N$ | $0$ | $\mathcal{O}(n^2 + nN)$ | $\mathcal{O}(n^2)$ | Super-Linear | Deterministic | Mild |
| Gradient Descent | $N$ | $0$ | $\mathcal{O}(nN)$ | $\mathcal{O}(n)$ | Linear | Deterministic | Moderate |
| Alt. Full Gradient-SGD | $N+m$ | $0$ | $\mathcal{O}(n(N+m))$ | $\mathcal{O}(n)$ | Linear | Probabilistic | Moderate |
| kSGD | $1$ | $0$ | $\mathcal{O}(n^2)$ | $\mathcal{O}(n^2)$ | $>$Sub-Linear | Almost Sure | Mild |
| SGD | $1$ | $0$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | Sub-Linear | None | Severe |

Thank You
