Static Parameter Estimation using Kalman Filtering and Proximal Operators

Vivak Patel

Department of Statistics, University of Chicago
December 2, 2015

Acknowledgments

Mihai Anitescu

Madhukanta Patel (June 8, 1924 - Nov. 21, 2015)

Outline

I. Stationary Parameter Estimation (SPE)

II. Optimization

III. Proximal Operators

IV. Kalman Filtering Theory

V. Kalman-based SGD

Stationary Parameters

Problem.

Stationary Parameters

Nonparametric Density Estimation

Stationary Parameters

Parametric Density Estimation

Optimization

For statistical inference, the objective function is the negative log-likelihood: $$F_N(\beta) = -\sum_{t=1}^N \log f(Y_t,X_t,\beta)$$ Fisher (1922), Cramér (1946), Le Cam (1970), Hájek (1972)

Suppose the data are generated by some fixed $\beta^*$. Let $\beta_N = \arg\min_{\beta} F_N(\beta)$, and $$V^* = \mathbb{E}\left[\nabla_{\beta} \log f(Y,X,\beta^*) \nabla^T_{\beta} \log f(Y,X,\beta^*) \right]$$ If some regularity conditions on $f$ hold, then $$ \sqrt{N}(\beta_N - \beta^*) \Rightarrow \mathcal{N}\left(0, (V^*)^{-1} \right) $$
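As a concrete instance (a standard textbook example, not taken from these slides), consider the Gaussian linear model $Y_t = X_t^T \beta^* + \varepsilon_t$ with $\varepsilon_t \sim \mathcal{N}(0,\sigma^2)$ independent of $X_t$. Then $$ V^* = \frac{1}{\sigma^2}\mathbb{E}\left[XX^T\right], \qquad \sqrt{N}(\beta_N - \beta^*) \Rightarrow \mathcal{N}\left(0, \sigma^2\, \mathbb{E}\left[XX^T\right]^{-1}\right), $$ the familiar asymptotic distribution of the least-squares estimator.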

van der Vaart (2000)

Optimization

Goals of Optimization Method

  • Fast Rate of Convergence
  • Computational Stability
  • Well-defined stop condition
  • Low computational requirements (floating point operations, memory)

Classical Methods for $F_N(\beta)$

Method | Gradient Evals | Hessian Evals | Floating Point | Memory Cost | RoC | Stop Condition | Conditioning
Newton | $N$ | $N$ | $\mathcal{O}(n^2 N)$ | $\mathcal{O}(n^2)$ | Quadratic | Deterministic | Mild
Quasi-Newton | $N$ | $0$ | $\mathcal{O}(n^2 + nN)$ | $\mathcal{O}(n^2)$ | Super-Linear | Deterministic | Mild
Gradient Descent | $N$ | $0$ | $\mathcal{O}(nN)$ | $\mathcal{O}(n)$ | Linear | Deterministic | Moderate

Moral: If $N$ is large, a single pass through the data set before calculating a new iterate is too expensive.

Stochastic Gradients

Basic Idea: Minimize $F_N(\beta) = -\sum_{t=1}^N \log f(Y_t,X_t,\beta) = \sum_{t=1}^N l_t(\beta)$, where $l_t(\beta) := -\log f(Y_t,X_t,\beta)$, using: $$ \beta_{k+1}^{SGD} = \beta_k^{SGD} - \alpha_k \nabla l_t(\beta_k^{SGD}) =: \beta_k^{SGD} - \alpha_k \nabla l_{t,k}^{SGD}$$
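A minimal sketch of this update for the Gaussian linear model (an illustration only; the model choice, the function name, and the $1/k$ step-size schedule are assumptions, not taken from the slides):

```python
import numpy as np

def sgd_linear_model(X, Y, alpha0=1.0, seed=0):
    """Plain SGD on the negative log-likelihood of Y_t = X_t^T beta + Gaussian noise.

    X : (N, n) design matrix, Y : (N,) responses.
    One randomly sampled observation is used per iteration.
    """
    rng = np.random.default_rng(seed)
    N, n = X.shape
    beta = np.zeros(n)
    for k in range(N):
        t = rng.integers(N)                    # pick one observation
        # gradient of l_t at beta, up to the constant 1/sigma^2 absorbed into alpha
        grad = (X[t] @ beta - Y[t]) * X[t]
        beta = beta - alpha0 / (k + 1) * grad  # SGD step with alpha_k = alpha0/(k+1)
    return beta
```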

History

Least Mean Squares (LMS)
  • Cotes (early 1700s)
  • Legendre (1805)
  • Gauss (1806)
  • Macchi & Eweda (1983)
  • Gerencsér (1995)

Stochastic Gradient Descent (SGD)
  • Robbins & Monro (1951)
  • Neveu (1975)
  • Murata (1998)

Stochastic Gradients

Characterization per iteration

Method | Gradient Evals | Hessian Evals | Floating Point | Memory Cost | RoC | Stop Condition | Conditioning
Newton | $N$ | $N$ | $\mathcal{O}(n^2 N)$ | $\mathcal{O}(n^2)$ | Quadratic | Deterministic | Mild
Quasi-Newton | $N$ | $0$ | $\mathcal{O}(n^2 + nN)$ | $\mathcal{O}(n^2)$ | Super-Linear | Deterministic | Mild
Gradient Descent | $N$ | $0$ | $\mathcal{O}(nN)$ | $\mathcal{O}(n)$ | Linear | Deterministic | Moderate
Alt. Full Gradient-SGD | $N+m$ | $0$ | $\mathcal{O}(N+m)$ | $\mathcal{O}(n)$ | Linear | Probabilistic | Moderate
SGD | $1$ | $0$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | Sub-Linear | None | Severe


Alternative Strategy?
Without increasing the number of gradient evaluations per iteration, can we improve the rate of convergence, obtain a well-defined stop condition, and reduce the sensitivity to conditioning?

Yes, by increasing the computational resources used per iteration.

Proximal Operator

Proximal Form of SGD $$\beta_{k+1}^{SGD} = \arg\min_{\beta} \left\lbrace l_{t,k}^{SGD} + (\beta-\beta_k^{SGD})^T \nabla l_{t,k}^{SGD} + \frac{1}{2\alpha_k} \left\Vert \beta - \beta_k^{SGD} \right\Vert^2 \right\rbrace$$
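Setting the gradient of the bracketed objective to zero recovers the SGD step: $$\nabla l_{t,k}^{SGD} + \frac{1}{\alpha_k}\left(\beta - \beta_k^{SGD}\right) = 0 \quad\Longrightarrow\quad \beta = \beta_k^{SGD} - \alpha_k \nabla l_{t,k}^{SGD}.$$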

Levenberg (1944), Marquardt (1963) Improvement $$\begin{aligned} \beta_{k+1}^{LM} &= \arg\min_{\beta} \left\lbrace \frac{1}{2} \left( l_{t,k}^{LM} + (\beta - \beta_k^{LM})^T \nabla l_{t,k}^{LM} \right)^2 + \frac{1}{2\alpha_k} \left\Vert \beta - \beta_k^{LM} \right\Vert^2 \right\rbrace \\ \beta_{k+1}^{LM} &= \beta_{k}^{LM} - \left( \nabla l_{t,k}^{LM} \nabla^T l_{t,k}^{LM} + \alpha_k^{-1} I \right)^{-1} \left(\nabla l_{t,k}^{LM} \right) l_{t,k}^{LM} \end{aligned}$$
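The closed form follows from the first-order optimality condition (writing $d = \beta - \beta_k^{LM}$ and dropping the $LM$ superscripts on $l$): $$\left(l_{t,k} + d^T \nabla l_{t,k}\right)\nabla l_{t,k} + \frac{1}{\alpha_k} d = 0 \quad\Longrightarrow\quad \left(\nabla l_{t,k}\nabla^T l_{t,k} + \alpha_k^{-1} I\right) d = -\left(\nabla l_{t,k}\right) l_{t,k},$$ and solving for $d$ gives the stated update.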

Why is this an improvement?

Proximal Operator

Recall: $$\beta_{k+1}^{LM} = \beta_{k}^{LM} - \left( \nabla l_{t,k}^{LM} \nabla^T l_{t,k}^{LM} + \alpha_k^{-1} I \right)^{-1} \left(\nabla l_{t,k}^{LM} \right) l_{t,k}^{LM}$$

Second Bartlett Identity: $$ \mathbb{E}_{\beta_k^{LM}}\left[\nabla l_{t,k}^{LM} \nabla^T l_{t,k}^{LM} \right] = \mathbb{E}_{\beta_k^{LM}} \left[ \nabla^2 l_{t,k}^{LM}\right]$$ holds under some regularity conditions on $f$.
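For completeness (a standard argument, assuming differentiation under the integral sign is permitted), write $l = -\log f$. Differentiating $\int f \, dy = 1$ twice in $\beta$ gives $\int \nabla^2 f \, dy = 0$, and since $\nabla^2 l = -\frac{\nabla^2 f}{f} + \frac{\nabla f \, \nabla^T f}{f^2}$, taking expectations yields $$\mathbb{E}\left[\nabla^2 l\right] = \mathbb{E}\left[\nabla l \, \nabla^T l\right].$$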


Moral: We are using a regularized, rank-one estimate of the Hessian (equivalently, of the inverse covariance).

Can we do better still?

Proximal Operator

Levenberg (1944), Marquardt (1963) Improvement $$\begin{aligned} \beta_{k+1}^{LM} &= \arg\min_{\beta} \left\lbrace \frac{1}{2} \left( l_{t,k}^{LM} + (\beta - \beta_k^{LM})^T \nabla l_{t,k}^{LM} \right)^2 + \frac{1}{2\alpha_k} \left\Vert \beta - \beta_k^{LM} \right\Vert^2 \right\rbrace \\ \beta_{k+1}^{LM} &= \beta_{k}^{LM} - \left( \nabla l_{t,k}^{LM} \nabla^T l_{t,k}^{LM} + \alpha_k^{-1} I \right)^{-1} \left(\nabla l_{t,k}^{LM} \right) l_{t,k}^{LM} \end{aligned}$$

Possible Rescaling Improvement? $$\begin{aligned} \beta_{k+1} &= \arg\min_{\beta} \left\lbrace \frac{1}{2} \left( l_{t,k} + (\beta - \beta_k)^T \nabla l_{t,k} \right)^2 + \frac{1}{2} \left\Vert \beta - \beta_k \right\Vert^2_{M_k^{-1}} \right\rbrace \\ \beta_{k+1} &= \beta_{k} - \left( \nabla l_{t,k} \nabla^T l_{t,k} + M_k^{-1}\right)^{-1} \left(\nabla l_{t,k} \right) l_{t,k} \end{aligned}$$
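Note (a standard matrix identity, not stated on the slide): by the Sherman-Morrison formula the inverse never has to be formed explicitly, since $$\left(\nabla l_{t,k}\nabla^T l_{t,k} + M_k^{-1}\right)^{-1}\nabla l_{t,k} = \frac{M_k \nabla l_{t,k}}{1 + \nabla^T l_{t,k}\, M_k\, \nabla l_{t,k}},$$ so each iteration costs $\mathcal{O}(n^2)$ floating point operations once $M_k$ is stored.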

Proximal Operator

Recall: $$\begin{aligned} \beta_{k+1} &= \arg\min_{\beta} \left\lbrace \frac{1}{2} \left( l_{t,k} + (\beta - \beta_k)^T \nabla l_{t,k} \right)^2 + \frac{1}{2} \left\Vert \beta - \beta_k \right\Vert^2_{M_k^{-1}} \right\rbrace \\ \beta_{k+1} &= \beta_{k} - \left( \nabla l_{t,k} \nabla^T l_{t,k} + M_k^{-1}\right)^{-1} \left(\nabla l_{t,k} \right) l_{t,k} \end{aligned}$$

Optimizer Interpretation: $M_k$ should be the inverse Hessian at $\beta_k$.

First Generation Stochastic Quasi-Newton:

Byrd, Hansen, Nocedal & Singer (arXiv, 2015)

Proximal Operator

Recall: $$\begin{aligned} \beta_{k+1} &= \arg\min_{\beta} \left\lbrace \frac{1}{2} \left( l_{t,k} + (\beta - \beta_k)^T \nabla l_{t,k} \right)^2 + \frac{1}{2} \left\Vert \beta - \beta_k \right\Vert^2_{M_k^{-1}} \right\rbrace \\ \beta_{k+1} &= \beta_{k} - \left( \nabla l_{t,k} \nabla^T l_{t,k} + M_k^{-1}\right)^{-1} \left(\nabla l_{t,k} \right) l_{t,k} \end{aligned}$$

Statistician Interpretation: $M_k$ should estimate the covariance of the estimate $\beta_k$.

Kalman Filter: simultaneously estimates parameter and its covariance matrix.

Kalman (1960)

Kalman Filter

Control Theoretic Derivation

Suppose $$\beta_{k+1} = \beta_k - G_{k+1}\, l_{t,k}$$ and find $G_{k+1}$ which minimizes $$ \mathbb{E}\left[\left\Vert \beta_{k+1} - \beta^* \right\Vert^2 \,\middle\vert\, X_{1},\ldots,X_{k+1} \right]$$

Proximal Derivation

Given the "variance" of $l_{k,t}$ conditioned on $X_t$, $\sigma_t^2$, and $\mathcal{M}_k := \mathbb{E}\left[ \left. (\beta_{k} - \beta^*)(\beta_k - \beta^*)^T \right\vert X_{1},\ldots,X_{k} \right]$ $$\beta_{k+1} = \arg\min_{\beta} \left\lbrace \frac{1}{2 \sigma_t^2} \left( l_{t,k} + (\beta - \beta_k)^T \nabla l_{t,k} \right)^2 + \frac{1}{2} \left\Vert \beta - \beta_k \right\Vert^2_{\mathcal{M}_k^{-1}} \right\rbrace$$

Kalman Filter

Update for $\mathcal{M}_k$: $$\mathcal{M}_{k+1}^{-1} = \frac{1}{\sigma_{t+1}^2}\nabla l_{t,k} \nabla^T l_{t,k} + \mathcal{M}_{k}^{-1}$$

Problem: neither $\sigma_t^2$ nor $\mathcal{M}_0$ is known a priori.

Possible Solution: replace $\sigma_t^2$ with another sequence $\gamma_k^2$, replace $\mathcal{M}_0$ with $M_0$, and generate $$M_{k+1}^{-1} = \frac{1}{\gamma_{k+1}^2}\nabla l_{t,k} \nabla^T l_{t,k} + M_{k}^{-1}$$
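A minimal sketch of the resulting iteration (the function name, and the use of a single $\gamma_k^2$ in both the step and the $M_k$ update, are simplifying assumptions), written in the Sherman-Morrison form so that no matrix is ever inverted:

```python
import numpy as np

def kalman_prox_step(beta, M, grad_l, l_val, gamma2):
    """One Kalman-filter-style proximal update.

    beta   : current estimate beta_k, shape (n,)
    M      : current matrix M_k approximating the error covariance, shape (n, n)
    grad_l : gradient of l_{t,k} at beta_k, shape (n,)
    l_val  : value of l_{t,k} at beta_k (scalar innovation)
    gamma2 : tuning value gamma_k^2 standing in for the unknown sigma_t^2
    """
    Mg = M @ grad_l
    denom = gamma2 + grad_l @ Mg            # scalar: gamma^2 + g^T M_k g
    beta_new = beta - (Mg / denom) * l_val  # Kalman-gain-style parameter update
    M_new = M - np.outer(Mg, Mg) / denom    # Sherman-Morrison form of the M_k update
    return beta_new, M_new
```

Each call costs $\mathcal{O}(n^2)$ operations and $\mathcal{O}(n^2)$ memory, which matches the kSGD row of the characterization table later in the talk.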

Questions

Kalman Filter

Summary of Kalman Filtering Theory

References

Kalman-based SGD

Assumptions and Results

  • Noise-free case ($Y_t = X_t^T\beta^*$): kSGD recovers $\beta^*$ exactly once $n$ linearly independent examples have been assimilated (the proof uses modified Gram-Schmidt).
  • Noisy case: if $0 < \inf_{k} \gamma_k^2 \leq \sup_{k} \gamma_k^2 < \infty$, then almost surely $$ \mathbb{E} \left[ \left. \left\Vert \beta_k - \beta^* \right\Vert \right\vert X_1,\ldots,X_{k}\right] \to 0$$ (illustrated in the sketch after this list).
  • Covariance bounds: for every $\epsilon > 0$, asymptotically almost surely $$ \frac{1 + \epsilon}{\inf_k \gamma_k^2} M_k \succeq \frac{1}{\sigma^2} \mathcal{M}_k \succeq \frac{1 - \epsilon}{\sup_k \gamma_k^2} M_k $$
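A hypothetical driver loop for the noisy linear model, illustrating the convergence statement above (the data-generating model, the constant choice $\gamma_k^2 = 1$, and the use of the residual $X_t^T\beta_k - Y_t$ as the innovation $l_{t,k}$ with gradient $X_t$ are all assumptions made for illustration, not taken from the slides):

```python
import numpy as np

def kalman_prox_step(beta, M, grad_l, l_val, gamma2):
    # Same update as the earlier sketch (Sherman-Morrison form).
    Mg = M @ grad_l
    denom = gamma2 + grad_l @ Mg
    return beta - (Mg / denom) * l_val, M - np.outer(Mg, Mg) / denom

# Simulate Y_t = X_t^T beta* + noise and assimilate one observation per iteration.
rng = np.random.default_rng(1)
n, N = 10, 5000
beta_star = rng.normal(size=n)
X = rng.normal(size=(N, n))
Y = X @ beta_star + rng.normal(scale=0.5, size=N)

beta, M = np.zeros(n), np.eye(n)       # beta_0 and M_0 (initial guesses)
for t in range(N):
    resid = X[t] @ beta - Y[t]         # innovation l_{t,k} for this model
    beta, M = kalman_prox_step(beta, M, X[t], resid, gamma2=1.0)

print(np.linalg.norm(beta - beta_star))  # error should be small once N >> n
```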

Kalman-based SGD

CMS Data Set ($n = 31$, $N = 2.8$ million)

Kalman-based SGD

CMS Data Set ($n = 31$, $N = 2.8$ million)

Kalman-based SGD

Characterization per iteration

Method | Gradient Evals | Hessian Evals | Floating Point | Memory Cost | RoC | Stop Condition | Conditioning
Newton | $N$ | $N$ | $\mathcal{O}(n^2 N)$ | $\mathcal{O}(n^2)$ | Quadratic | Deterministic | Mild
Quasi-Newton | $N$ | $0$ | $\mathcal{O}(n^2 + nN)$ | $\mathcal{O}(n^2)$ | Super-Linear | Deterministic | Mild
Gradient Descent | $N$ | $0$ | $\mathcal{O}(nN)$ | $\mathcal{O}(n)$ | Linear | Deterministic | Moderate
Alt. Full Gradient-SGD | $N+m$ | $0$ | $\mathcal{O}(N+m)$ | $\mathcal{O}(n)$ | Linear | Probabilistic | Moderate
kSGD | $1$ | $0$ | $\mathcal{O}(n^2)$ | $\mathcal{O}(n^2)$ | >Sub-Linear | Almost Sure | Mild
SGD | $1$ | $0$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | Sub-Linear | None | Severe

Thank You
