Regression Estimation Under Resource Constraints: Fisher, Bartlett & Kalman

Vivak Patel

Department of Statistics, University of Chicago
January 15, 2016

Acknowledgments

Mihai Anitescu

Madhukanta Patel

Outline

I. Stationary Parameter Estimation (SPE)

II. Problem Class, Bartlett & Fisher

III. Optimization

IV. Stochastic Gradients

V. Proximal Operators & Kalman

VI. Kalman-based SGD

Stationary Parameters

Problem.

Example Problem. Linear Regression $$ \frac{1}{N} \sum_{t=1}^N c(Y_t, \mu(\beta,X_t)) = \frac{1}{N} \sum_{t=1}^N(Y_t - \mu(\beta,X_t))^2 = \frac{1}{N} \sum_{t=1}^N(Y_t - \beta^T X_t)^2$$
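
As a concrete reference point, here is a minimal NumPy sketch of this criterion together with a batch least-squares solve via the normal equations; the data, dimensions, and variable names are purely illustrative.

```python
import numpy as np

def least_squares_criterion(beta, X, Y):
    """Average squared error: (1/N) * sum_t (Y_t - beta^T X_t)^2."""
    residuals = Y - X @ beta
    return np.mean(residuals ** 2)

# Illustrative data: N observations of p covariates.
rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Batch least-squares estimate via the normal equations X^T X beta = X^T Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(least_squares_criterion(beta_hat, X, Y))
```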

Stationary Parameters

Examples of Statistical Criterion Functions

Problem Class

First Bartlett Identity $$\exists \beta^* ~\text{s.t.}~ \mathbb{E}\left[ \left. \nabla c^* \right\vert \sigma(X) \right] = 0 $$ Second Bartlett Identity $$\exists \beta^* ~\text{s.t.}~ \mathbb{E}\left[ \left. \nabla^2 c^* \right\vert \sigma(X)\right] = \mathbb{E}\left[\left. \nabla c^* \nabla^T c^* \right\vert \sigma(X)\right]$$

Fisher Criterion $$\exists V: \mathbb{R} \to \mathbb{R}_{> 0} ~\text{s.t.}~ \mathbb{E}\left[ \left. \nabla c^* \nabla^T c^* \right\vert \sigma(X)\right] = \frac{1}{V(\mu(\beta^*,X))}\nabla \mu(\beta^*,X) \nabla^T \mu(\beta^*,X)$$

Problem Class: Example

Non-Linear Logistic Regression
Suppose $Y_t \in \lbrace 0,1\rbrace$ and $\mu: \mathbb{R}^p \times \mathbb{R}^p \to [0,1]$. $$ \mathbb{P}\left[ Y_t \vert X_t\right] = \left(\mu(\beta,X_t)\right)^{Y_t}\left( 1 - \mu(\beta,X_t)\right)^{1-Y_t}$$ Taking the negative logarithm: $$ c_t = -Y_t \log\left[\frac{\mu(\beta,X_t)}{1 - \mu(\beta,X_t)} \right] - \log\left[1-\mu(\beta,X_t)\right]$$ Taking the derivative with respect to $\beta$: $$ \nabla c_t = \frac{\partial c_t}{\partial \mu } \nabla \mu = - \frac{Y_t - \mu}{\mu(1-\mu)} \nabla \mu$$ Notice that the first Bartlett Identity is satisfied at $\beta = \beta^*$, since $\mathbb{E}\left[ \left. \frac{\partial c_t}{\partial \mu} \right\vert \sigma(X_t)\right] = 0$ there.
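
To make the last step explicit: if the model is correctly specified, so that $\mathbb{E}\left[Y_t \vert X_t\right] = \mu(\beta^*, X_t)$, then $$ \mathbb{E}\left[ \left. \frac{\partial c_t}{\partial \mu} \right\vert \sigma(X_t)\right] = -\frac{\mathbb{E}\left[Y_t \vert X_t\right] - \mu(\beta^*,X_t)}{\mu(1-\mu)} = 0, $$ and hence $\mathbb{E}\left[ \left. \nabla c_t \right\vert \sigma(X_t)\right] = 0$ at $\beta^*$.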

Problem Class: Example

Non-Linear Logistic Regression
Now taking the second derivative with respect to $\beta$: $$\begin{aligned} \nabla^2 c_t &= \left(\nabla \frac{\partial c_t}{\partial \mu}\right) \nabla^T \mu + \frac{\partial c_t}{\partial \mu} \nabla^2 \mu \\ &= \frac{\partial^2 c_t}{\partial \mu^2} \nabla \mu \nabla^T \mu + \frac{\partial c_t}{\partial \mu} \nabla^2 \mu \\ &= \left(\frac{1}{\mu(1-\mu)} + \frac{(Y_t - \mu)(1-2\mu)}{\mu^2(1-\mu)^2} \right)\nabla \mu \nabla^T \mu + \frac{\partial c_t}{\partial \mu} \nabla^2 \mu \end{aligned}$$ Taking the conditional expectation at $\beta^*$: $$\mathbb{E}\left[ \nabla^2 c_t \vert \sigma(X_t) \right] = \frac{1}{\mu(1-\mu)}\nabla \mu \nabla^T \mu $$ Moreover, since $\mathrm{Var}(Y_t \vert X_t) = \mu (1-\mu)$, the second Bartlett identity is satisfied, $\mathbb{E}\left[ \left. \nabla c_t \nabla^T c_t \right\vert \sigma(X_t) \right] = \mathbb{E}\left[ \nabla^2 c_t \vert \sigma(X_t) \right]$, and hence Fisher's criterion holds with $V(\mu) = \mu(1-\mu)$.
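
For completeness, the outer-product side can be computed directly at $\beta^*$, using $\mathbb{E}\left[(Y_t - \mu)^2 \vert X_t\right] = \mu(1-\mu)$: $$\mathbb{E}\left[ \left. \nabla c_t \nabla^T c_t \right\vert \sigma(X_t)\right] = \mathbb{E}\left[ \left. \frac{(Y_t-\mu)^2}{\mu^2(1-\mu)^2} \right\vert \sigma(X_t)\right] \nabla \mu \nabla^T \mu = \frac{1}{\mu(1-\mu)}\nabla \mu \nabla^T \mu, $$ which matches the conditional expectation of the Hessian above.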

Problem Class: More Examples

Problem Class Criterion Functions

See Mykland 1999 for more information

Optimization

Classical Optimization Techniques

Classical Performance Metrics

Optimization

Classical Metrics for Classical Methods

| Method | Gradient Evals | Hessian Evals | Floating Point Ops | Memory Cost | Rate of Convergence | Stop Condition | Conditioning |
|---|---|---|---|---|---|---|---|
| Newton | $N$ | $N$ | $\mathcal{O}(p^2 N)$ | $\mathcal{O}(p^2)$ | Quadratic | Deterministic | Mild |
| Quasi-Newton | $N$ | $0$ | $\mathcal{O}(p^2 + nN)$ | $\mathcal{O}(p^2)$ | Super-Linear | Deterministic | Mild |
| IRLS | $N$ | $0$ | $\mathcal{O}(p^2 N)$ | $\mathcal{O}(p^2)$ | Super-Linear (?) | Deterministic | Mild |
| Gradient Descent | $N$ | $0$ | $\mathcal{O}(pN)$ | $\mathcal{O}(p)$ | Linear | Deterministic | Moderate |

Question 1. For large $p$ and $N$, do these metrics provide enough information?

Question 2. If not, what metrics should we add or reconsider to make better algorithmic choices?

Optimization

Suggested Additional Metrics under Resource Constraints

Additional Metrics for Classical Methods

| Method | p-Feasibility | N-Feasibility | Efficiency |
|---|---|---|---|
| Newton | Prohibitive | Prohibitive | $N$ |
| Quasi-Newton | Prohibitive | Moderate | $N$ |
| L-Quasi-Newton | Moderate | Moderate | $N$ |
| IRLS | Prohibitive | Prohibitive | $N$ |
| Gradient Descent | Mild | Mild | $N$ |

Stochastic Gradients

Basic Idea: Minimize $\frac{1}{N}\sum_{t=1}^N c(Y_t,\mu(\beta,X_t))$ using $$ \beta_{k+1}^{SGD} = \beta_k^{SGD} - \alpha_k \nabla_\beta c(Y_t, \mu(\beta_k^{SGD}, X_t)),$$ where $(Y_t, X_t)$ is the observation used at iteration $k$ and $\alpha_k > 0$ is a step size.
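
A minimal sketch of this update for the linear regression criterion from earlier, cycling through the data; the step-size schedule and the pass structure are illustrative choices, not prescriptions from the talk.

```python
import numpy as np

def sgd_linear_regression(X, Y, alpha0=0.01, passes=5):
    """Plain SGD on (1/N) sum_t (Y_t - beta^T X_t)^2, one observation per step."""
    N, p = X.shape
    beta = np.zeros(p)
    k = 0
    for _ in range(passes):
        for t in range(N):
            k += 1
            alpha_k = alpha0 / np.sqrt(k)               # illustrative step-size schedule
            grad = -2.0 * (Y[t] - X[t] @ beta) * X[t]   # gradient of c_t at beta_k
            beta = beta - alpha_k * grad
    return beta
```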

History

Simplifying Notation $$ c_{t} = c(Y_t, \mu(\beta, X_t)) $$ $$ c_{t,k}^{SGD} = c(Y_t, \mu(\beta_k^{SGD},X_t)) $$ $$ \nabla c_{t,k}^{SGD} = \nabla_\beta c(Y_t, \mu(\beta_{k}^{SGD},X_t))$$

Stochastic Gradients

Classical Metrics

| Method | Gradient Evals | Hessian Evals | Floating Point Ops | Memory Cost | Rate of Convergence | Stop Condition | Conditioning |
|---|---|---|---|---|---|---|---|
| Newton | $N$ | $N$ | $\mathcal{O}(p^2 N)$ | $\mathcal{O}(p^2)$ | Quadratic | Deterministic | Mild |
| Quasi-Newton | $N$ | $0$ | $\mathcal{O}(p^2 + nN)$ | $\mathcal{O}(p^2)$ | Super-Linear | Deterministic | Mild |
| L-Quasi-Newton | $N$ | $0$ | $\mathcal{O}(pm + nN)$ | $\mathcal{O}(pm)$ | Super-Linear (?) | Deterministic | Mild |
| Gradient Descent | $N$ | $0$ | $\mathcal{O}(pN)$ | $\mathcal{O}(p)$ | Linear | Deterministic | Moderate |
| SGD | $1$ | $0$ | $\mathcal{O}(p)$ | $\mathcal{O}(p)$ | Sub-Linear | None | Severe |

Additional Metrics

| Method | p-Feasibility | N-Feasibility | Efficiency |
|---|---|---|---|
| Newton | Prohibitive | Prohibitive | $N$ |
| Quasi-Newton | Prohibitive | Moderate | $N$ |
| L-Quasi-Newton | Moderate | Moderate | $N$ |
| IRLS | Prohibitive | Prohibitive | $N$ |
| Gradient Descent | Mild | Mild | $N$ |
| SGD | Mild | Very Mild | $1$ |

Stochastic Gradients

Incremental Optimization Methods

  • Goal is to calculate $\hat{\beta}_N$ within some error.
  • Treat currently available data as the population
  • SGD, Amari 2000, Schraudolph 2007, Roux 2012, Richtárik 2013, Sohl-Dickstein 2013, Zhang 2014

Incremental Estimator Methods

  • Goal is to find a computationally inexpensive estimator of the minimizer of $\mathbb{E}\left[c(Y,\mu(\beta,X))\right]$
  • Treat currently available data as a sample
  • SGD, Patel 2015

The remainder of this talk discusses how to move from SGD to a generalization of Patel 2015.

Proximal Operator

Recall $$\beta_{k+1}^{SGD} = \beta_{k}^{SGD} - \alpha_k \nabla c_{t,k}^{SGD}$$

Proximal Form of SGD $$\beta_{k+1}^{SGD} = \arg\min \left\lbrace c_{t,k}^{SGD} + (\beta-\beta_k^{SGD})^T \nabla c_{t,k}^{SGD} + \frac{1}{2\alpha_k} \left\Vert \beta - \beta_k^{SGD} \right\Vert^2: \beta \in \mathbb{R}^p \right\rbrace$$
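
The proximal form recovers the usual SGD step: the objective is strictly convex in $\beta$, so setting its gradient to zero gives $$ \nabla c_{t,k}^{SGD} + \frac{1}{\alpha_k}\left(\beta - \beta_k^{SGD}\right) = 0 \quad\Longrightarrow\quad \beta_{k+1}^{SGD} = \beta_k^{SGD} - \alpha_k \nabla c_{t,k}^{SGD}.$$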

Non-approximative Form $$\beta_{k+1} \in \arg\min \left\lbrace c_t + \frac{1}{2\alpha_k}\left\Vert \beta - \beta_{k} \right\Vert^2: \beta \in \mathbb{R}^p \right\rbrace$$

Proximal Operator

Reweighted Form

Let $\mathcal{M}_k = \mathbb{E}\left[ \left. (\beta_k - \beta^*)(\beta_k - \beta^*)^T \right\vert \sigma(X_1,\ldots,X_t) \right]$. Consider the update $$\beta_{k+1} \in \arg\min \left\lbrace c_t + \frac{1}{2\alpha_k} \left\Vert \beta - \beta_k \right\Vert_{\mathcal{M}_k^{-1}}^2: \beta \in \mathbb{R}^p \right\rbrace$$ Here, as $\beta_k$ becomes more accurate, the observation $(Y_t,X_t)$ becomes less influential on the update.

Example: Simple Linear Regression

We have that $2c_t = \left\Vert Y_t - X_t^T \beta \right\Vert^2$ and $V(\mu) = \sigma^2 > 0$. $$ \beta_{k+1} \in \arg\min \left\lbrace \frac{1}{2\sigma^2} \left\Vert Y_t - X_t^T \beta \right\Vert^2 + \frac{1}{2\alpha_k} \left\Vert \beta - \beta_k \right\Vert_{\mathcal{M}_k^{-1}}^2: \beta \in \mathbb{R}^p \right\rbrace$$
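
For this quadratic objective the minimizer is available in closed form: setting the gradient to zero and applying the Sherman-Morrison formula gives $$ \beta_{k+1} = \beta_k + \frac{\alpha_k \mathcal{M}_k X_t \left(Y_t - X_t^T\beta_k\right)}{\sigma^2 + \alpha_k X_t^T \mathcal{M}_k X_t}, $$ which already has the structure of a Kalman filter update, with a gain determined by $\mathcal{M}_k$ and $\sigma^2$.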

Proximal Operators

Problems with Reweighted Form

  1. $\mathcal{M}_k$ is certainly unknown.
  2. The form of $c_t$ may make calculating $\beta_{k+1}$ expensive compared to communication costs.

Estimating $\mathcal{M}_k$ (Kalman 1960)

Kalman Filter & Theory

Kalman Filter $$\beta_{k+1} = \arg\min \left\lbrace \frac{1}{2\sigma_{k+1}^2} \left\Vert Y_t - \mu_{k} - \nabla^T \mu_{k}(\beta-\beta_k) \right\Vert^2 + \frac{1}{2\alpha_k} \left\Vert \beta - \beta_k \right\Vert_{M_k^{-1}}^2: \beta \in \mathbb{R}^p \right\rbrace$$ $$M_{k+1} = \left(I - \frac{M_{k}\nabla \mu_{k} \nabla^T \mu_{k}}{\sigma_{k+1}^2 + \nabla^T \mu_{k} M_{k}\nabla \mu_k} \right) M_k$$
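
A minimal NumPy sketch of these two updates for a linear model, where $\mu_k = X_t^T\beta_k$ and $\nabla \mu_k = X_t$, taking $\alpha_k = 1$; the noise variance `sigma2` and the initial covariance `M` must be supplied by the user, and the variable names are illustrative.

```python
import numpy as np

def kalman_step(beta, M, x, y, sigma2):
    """One Kalman filter update for a linear model mu = x^T beta (alpha_k = 1).

    beta   : current parameter estimate, shape (p,)
    M      : current error-covariance estimate M_k, shape (p, p)
    x, y   : the new observation (X_t, Y_t)
    sigma2 : assumed observation-noise variance sigma_{k+1}^2
    """
    Mx = M @ x
    denom = sigma2 + x @ Mx          # sigma^2 + x^T M x
    gain = Mx / denom                # Kalman gain
    beta_new = beta + gain * (y - x @ beta)
    M_new = M - np.outer(gain, Mx)   # (I - M x x^T / denom) M
    return beta_new, M_new
```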

Bayesian "Convergence" Result. If $M_0$ and $\sigma_k^2$ are correct, then $M_k$ will be correct and $\beta_k \to \beta^*$.

Problem. $M_0$ and $\sigma_k^2$ will never be known in real problems.

Kalman Filter & Theory

Control Theoretic Assumptions.

Control Theory Results.

Problems.

Kalman Filter & Theory

(Our) Probabilistic Assumptions.

Probabilistic Results.

Kalman Filter & Experiment

Data Set

Linear Model

Experiment

Kalman Filter & Experiment

Convergence of Parameter Estimate.

Kalman Filter & Experiment

Convergence of Estimated Covariance.

Kalman Filter & Metrics

| Method | Gradient Evals | Hessian Evals | Floating Point Ops | Memory Cost | Rate of Convergence | Stop Condition | Conditioning |
|---|---|---|---|---|---|---|---|
| Newton | $N$ | $N$ | $\mathcal{O}(p^2 N)$ | $\mathcal{O}(p^2)$ | Quadratic | Deterministic | Mild |
| Quasi-Newton | $N$ | $0$ | $\mathcal{O}(p^2 + nN)$ | $\mathcal{O}(p^2)$ | Super-Linear | Deterministic | Mild |
| L-Quasi-Newton | $N$ | $0$ | $\mathcal{O}(pm + nN)$ | $\mathcal{O}(pm)$ | Super-Linear (?) | Deterministic | Mild |
| Gradient Descent | $N$ | $0$ | $\mathcal{O}(pN)$ | $\mathcal{O}(p)$ | Linear | Deterministic | Moderate |
| Incremental Optimization | $N+m$ | $0$ | $\mathcal{O}(p(N+m))$ | $\mathcal{O}(p)$ | Linear | Probabilistic | Moderate |

| Method | Gradient Evals | Hessian Evals | Floating Point Ops | Memory Cost | Rate of Convergence | Stop Condition | Conditioning |
|---|---|---|---|---|---|---|---|
| Kalman Filter | $1$ | $0$ | $\mathcal{O}(p^2)$ | $\mathcal{O}(p^2)$ | Sub-Linear | Almost Sure | Mild |
| SGD | $1$ | $0$ | $\mathcal{O}(p)$ | $\mathcal{O}(p)$ | Sub-Linear | None | Severe |

Kalman Filter & Metrics

| Method | p-Feasibility | N-Feasibility | Efficiency |
|---|---|---|---|
| Newton | Prohibitive | Prohibitive | $N$ |
| Quasi-Newton | Prohibitive | Moderate | $N$ |
| L-Quasi-Newton | Moderate | Moderate | $N$ |
| IRLS | Prohibitive | Prohibitive | $N$ |
| Gradient Descent | Mild | Mild | $N$ |
| Incremental Optimization | Varies | Varies | $(N+m)/2$ |

| Method | p-Feasibility | N-Feasibility | Efficiency |
|---|---|---|---|
| Kalman Filter | Prohibitive | Very Mild | $1$ |
| SGD | Mild | Very Mild | $1$ |

Kalman-based SGD

Recall: Problems with Reweighted Form

  1. True covariance is unknown. Estimate with $M_k$.
  2. The form of $c_t$ may make calculating $\beta_{k+1}$ expensive compared to communication costs.

Example: El Niño
We want to model atmospheric pressure $Y$ using the month of the year $X$. $$\begin{aligned} Y &= \beta_1 + \beta_2\cos\left(\frac{\pi X}{6}\right) + \beta_3\sin\left(\frac{\pi X}{6}\right) \\ &+ \beta_5\cos( 2\pi X\beta_4) + \beta_6\sin( 2\pi X\beta_4) + \beta_8\cos( 2\pi X\beta_7 ) + \beta_9\sin( 2\pi X\beta_7 )+\epsilon \end{aligned}$$ Because the model is non-linear in $\beta$, each update $$\beta_{k+1} \in \arg\min \left\lbrace c_t + \frac{1}{2\alpha_k} \left\Vert \beta - \beta_k \right\Vert_{M_k^{-1}}^2: \beta \in \mathbb{R}^9 \right\rbrace$$ would be too expensive to carry out exactly for large data sets.

Kalman-based SGD

Approximation

Replacing $c_t$ with a linear approximation: $$ \arg\min \left\lbrace c_{t,k} + \nabla^T c_{t,k}(\beta - \beta_k) + \frac{1}{2\alpha_k}\left\Vert \beta - \beta_{k} \right\Vert_{M_k^{-1}}^2 \right\rbrace$$

Replacing $c_t$ with a quadratic approximation: $$ \arg\min \left\lbrace c_{t,k} + \nabla^T c_{t,k}(\beta - \beta_k) + \frac{1}{2}(\beta - \beta_k)^T \nabla^2 c_{t,k} (\beta - \beta_{k}) + \frac{1}{2\alpha_k}\left\Vert \beta - \beta_{k} \right\Vert_{M_k^{-1}}^2 \right\rbrace$$

Replacing the Hessian in the quadratic approximation with its conditional expectation, $\frac{1}{V(\mu)}\nabla \mu_k \nabla^T \mu_k$ (the Fisher criterion): $$ \arg\min \left\lbrace c_{t,k} + \nabla^T c_{t,k}(\beta - \beta_k) + \frac{1}{2V(\mu)}\left[(\beta-\beta_k)^T \nabla \mu_k\right]^2 + \frac{1}{2\alpha_k}\left\Vert \beta - \beta_{k} \right\Vert_{M_k^{-1}}^2 \right\rbrace$$
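
A minimal NumPy sketch of one step under the third (conditional-expectation) replacement, solving the resulting quadratic in closed form via Sherman-Morrison; for $\alpha_k = 1$ the covariance update reduces to the Kalman filter update from the earlier slide with $V(\mu)$ in the role of $\sigma_{k+1}^2$. This is an illustration of the structure only, not the exact algorithm of Patel 2015, and the names are illustrative.

```python
import numpy as np

def ksgd_step(beta, M, grad_c, grad_mu, V, alpha=1.0):
    """One proximal step with the conditional-expectation (Fisher) quadratic term.

    beta    : current estimate beta_k, shape (p,)
    M       : current covariance estimate M_k, shape (p, p)
    grad_c  : gradient of c_t at beta_k, shape (p,)
    grad_mu : gradient of mu(beta_k, X_t) with respect to beta, shape (p,)
    V       : value of the variance function V(mu(beta_k, X_t))
    alpha   : step-size parameter alpha_k
    """
    Mg = M @ grad_mu
    denom = V + alpha * (grad_mu @ Mg)
    # H = (grad_mu grad_mu^T / V + M^{-1} / alpha)^{-1}, via Sherman-Morrison.
    H = alpha * M - (alpha ** 2) * np.outer(Mg, Mg) / denom
    beta_new = beta - H @ grad_c
    # Covariance update; for alpha = 1 this is the Kalman update with sigma^2 -> V.
    M_new = H / alpha
    return beta_new, M_new
```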

Thank You

Ongoing Projects

Slides
This slide deck and its iterations can be found at galton.uchicago.edu/~vpatel.

Reference

Patel, Vivak. "Kalman-based Stochastic Gradient Method with Stop Condition and Insensitivity to Conditioning." arXiv preprint arXiv:1512.01139 (2015).
