A Hidden Markov Model Approach to Statistical Estimation and Numerical Optimization under Uncertainty

Vivak Patel

Department of Statistics, University of Chicago
April 21, 2017

Part I: Statistical Estimation

I. Motivating Problems & Current Solutions

II. An HMM Formulation (not novel)

III. Some Results for Linear Regression (novel)

IV. Ongoing Work (novel)

Problems & Solutions

Linear Regression

Let $X,X_1,X_2,\ldots \in \mathbb{R}^{d}$ be independent, identically distributed, $L^2$ (square-integrable) random variables such that $$ 0 \prec \mathbb{E}\left[XX'\right] $$

Let $\epsilon,\epsilon_1,\epsilon_2,\ldots$ be independent, identically distributed, mean zero, finite variance random variables.

Let $\beta^* \in \mathbb{R}^d$, and define $$ Y = X'\beta^* + \epsilon \text{ and } Y_k = X_k'\beta^* + \epsilon_k$$

Problem: Given $\lbrace (Y_k,X_k):k=1,\ldots,N \rbrace$, find an estimator for $\beta^*$ whose computation scales well as $N \to \infty$.

Problems & Solutions

Linear Regression

Problem: Given $\lbrace (Y_k,X_k):k=1,\ldots,N \rbrace$, find an estimator for $\beta^*$ whose computation scales well as $N \to \infty$.


Solution 1: At each $N$ compute $$\beta_N = \arg \min_\beta \sum_{k=1}^N (Y_k - X_k'\beta)^2.$$

Solution 2: At each $N$ compute $$\beta_N = \beta_{N-1} + \alpha_{N-1}X_{N}(Y_N - X_{N}'\beta_{N-1}).$$
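For concreteness, a minimal sketch of both solutions on simulated data (the dimensions, zero initialization, and step sizes $\alpha_k = 1/(k+1)$ are illustrative assumptions, not prescriptions from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 5, 10_000
beta_star = rng.normal(size=d)
X = rng.normal(size=(N, d))              # row k is X_k'
Y = X @ beta_star + rng.normal(size=N)   # Y_k = X_k' beta* + eps_k

# Solution 1: re-solve the full least-squares problem with all N observations.
beta_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Solution 2: incremental (stochastic-gradient) updates, one observation at a time.
beta_sgd = np.zeros(d)
for k in range(N):
    alpha = 1.0 / (k + 1)                                    # assumed step-size schedule
    beta_sgd = beta_sgd + alpha * X[k] * (Y[k] - X[k] @ beta_sgd)
```

Solution 1 costs $O(Nd^2)$ each time it is recomputed, while Solution 2 costs $O(d)$ per new observation, which is the sense in which it scales with $N$.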

Problems & Solutions

Logistic Regression (Canonical Link, Linear Predictor)

Problem: Given $\lbrace (Y_k,X_k):k=1,\ldots,N \rbrace$, find an estimator for $\beta^*$ whose computation scales well as $N \to \infty$.


Solution 1: At each $N$ compute $$\beta_N = \arg \min_\beta \sum_{k=1}^N \log(1 + \exp(X_k'\beta)) - Y_k X_k'\beta.$$

Solution 2: At each $N$ compute $$\beta_N = \beta_{N-1} + \alpha_{N-1}X_N \left( Y_N - \frac{1}{1 + \exp(-X_N'\beta_{N-1})} \right).$$
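Again as an illustration, the incremental update (Solution 2) for the logistic model, with the same assumed step sizes and simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 5, 10_000
beta_star = rng.normal(size=d)
X = rng.normal(size=(N, d))
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_star)))  # Y_k ~ Bernoulli(sigmoid(X_k' beta*))

beta = np.zeros(d)
for k in range(N):
    alpha = 1.0 / (k + 1)                          # assumed step-size schedule
    mu = 1.0 / (1.0 + np.exp(-X[k] @ beta))        # fitted mean under the current estimate
    beta = beta + alpha * X[k] * (Y[k] - mu)
```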

An HMM Formulation

Statistical Filtering

If we view the estimation problem as a filtering problem, then we can use a statistical filter (Kalman, Extended Kalman, Unscented, Particle, etc.) to estimate the parameter.

Kalman Filter for Linear Regression $$ \beta_N = \beta_{N-1} + M_{N-1}X_{N}\frac{ Y_N - X_{N}'\beta_{N-1}}{\gamma + X_{N}'M_{N-1}X_{N}}$$ where $$ M_{N} = (I - \frac{M_{N-1}X_{N}X_{N}'}{\gamma + X_{N}'M_{N-1}X_N})M_{N-1}.$$
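In code, the recursion is a short loop (a minimal sketch; the initialization $\beta_0 = 0$, $M_0 = I$, and the default $\gamma = 1$ are assumptions made for illustration):

```python
import numpy as np

def kalman_linear_regression(X, Y, gamma=1.0):
    """One pass of the Kalman-filter updates for linear regression."""
    N, d = X.shape
    beta = np.zeros(d)   # assumed initialization beta_0 = 0
    M = np.eye(d)        # assumed initialization M_0 = I
    for k in range(N):
        x = X[k]
        Mx = M @ x
        denom = gamma + x @ Mx
        beta = beta + Mx * (Y[k] - x @ beta) / denom   # parameter update
        M = M - np.outer(Mx, x) @ M / denom            # covariance-like update
    return beta, M
```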

Some Results

Summary of Results for Linear Regression

  1. The parameter estimator is consistent.
  2. The covariance estimator is consistent up to a multiplicative factor.
  3. There is a robust set of choices for the tuning parameter $\gamma$.
  4. The method achieves the information-theoretic "lower bound" on the convergence rate.

What You Will Not Find

  1. Asymptotic distribution results
  2. A framework for extending the Kalman Filter to other estimation problems

Patel, Vivak. "Kalman-Based Stochastic Gradient Method with Stop Condition and Insensitivity to Conditioning." SIAM Journal on Optimization 26, no. 4 (2016): 2620-2648.

Some Results

Linear Regression (CMS Payments, Demographic + Service Type)

Some Results

Haar Wavelet Regression (Gas Concentration Proportion, Sensor Reading)

Some Results

Logistic Regression (Adult Income Category, Demographics)

Ongoing Work

I. Low-Memory Variant of the Kalman Filter

II. Consistency of the Kalman Filter for GLMs

III. A Kalman Filter for Models with Convex Constraints

IV. On Why SGD Fails in Practice: Divergence, Stalling, Conditioning

V. A Survey of Computationally Tractable Incremental Estimation

Part II: Optimization & Uncertainty

I. Motivating Problems & Current Solutions

II. An HMM Formulation (novel)

III. Some Numerical Experiments

Problems & Solutions

Composite Objective Function Minimization

Problem: Consider the following optimization problem $$ \min_\theta \sum_{i=1}^N f_i(\theta),$$ where $N$ is very large. For such problems, it is often infeasible to evaluate all $N$ component functions at every iteration owing to memory and communication limitations. This prevents us from using standard optimization methods such as gradient descent, quasi-Newton, etc.

Solution 1: Subsample the objective function and use an incremental estimator.

Solution 2: Subsample the objective function and use a stochastic method designed for a discrete, finite probability space.
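Both solutions rest on the same subsampling primitive; a minimal sketch of one subsampled-gradient step (the grad_f interface, batch size, and step size are illustrative assumptions):

```python
import numpy as np

def subsampled_gradient_step(theta, grad_f, N, batch_size, alpha, rng):
    """One step using a random subsample to estimate the full gradient sum.

    grad_f(i, theta) is assumed to return the gradient of the i-th component f_i.
    """
    idx = rng.choice(N, size=batch_size, replace=False)
    g = sum(grad_f(i, theta) for i in idx) * (N / batch_size)  # unbiased estimate of sum_i grad f_i(theta)
    return theta - alpha * g
```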

Problems & Solutions

Model-based Gradient Free Optimization

Problem: Consider the problem of optimizing an objective function $f(\theta)$, but any attempt to evaluate $f(\theta)$ only returns $$ f(\theta) + \epsilon,$$ where $\epsilon$ is a random variable which may be strongly correlated with $f(\theta)$. Moreover, the gradient of $f(\theta)$ cannot be evaluated, noisy or otherwise.

Solution: ...

Problems & Solutions

Linear Stochastic Program with Recourse

Problem: Let $T$ and $h$ be matrix-valued and vector-valued random variables. Consider $$\begin{aligned} \min_{\theta} ~& c'\theta + \mathbb{E}\left[\min_{y} q'y \right] \\ \text{subject to} ~& A\theta = b \\ & T\theta + Wy = h \\ & \theta \in \Theta \\ & y \in Y \end{aligned}$$

Solution: Generate samples of $T,h$ and replace the expectation with an empirical estimate.
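Concretely, with samples $(T_1,h_1),\ldots,(T_S,h_S)$ the standard sample-average reformulation (written here only for illustration) is $$\begin{aligned} \min_{\theta,\,y_1,\ldots,y_S} ~& c'\theta + \frac{1}{S}\sum_{s=1}^{S} q'y_s \\ \text{subject to} ~& A\theta = b \\ & T_s\theta + Wy_s = h_s, \quad s=1,\ldots,S \\ & \theta \in \Theta, \; y_s \in Y. \end{aligned}$$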

An HMM Formulation

Consider the following generic problem $$\begin{aligned} \min_{\theta}~& \mathbb{E}\left[ f(\theta,\xi) \right] \end{aligned}$$ where $\xi$ is a random variable which we can independently sample from at each iteration to produce $\xi_1,\xi_2,\ldots$.

Suppose also that $$ \nabla_\theta \mathbb{E}\left[ f(\theta,\xi) \right] = \mathbb{E}\left[ \nabla_\theta f(\theta,\xi) \right]$$

Finally, suppose that for given values of $\theta$ and $\xi$, $f(\theta,\xi)$ and $\nabla_\theta f(\theta,\xi)$ can be evaluated.

Note: We can use a model based approach if $\nabla_\theta f(\theta,\xi)$ cannot be evaluated. Other modifications can also be made if $f$ can be decomposed into a non-random component and random components.

An HMM Formulation

Hidden states: $\mathbb{E}[f]$ and $\mathbb{E}[\nabla f]$. Observed states: $f(\cdot,\xi_k)$ and $\nabla f(\cdot, \xi_k)$.

Transition Dynamics: $$ \mathbb{E}[f(\theta,\xi)] = \mathbb{E}[f(\theta_k,\xi_k)] + \nabla \mathbb{E}[f(\theta_k,\xi_k)]'(\theta - \theta_k) + \epsilon \Vert \theta - \theta_k \Vert_2^2,$$ where $\epsilon$ is mean zero and has some finite variance.

Likelihood Model: $$ f(\theta,\xi_k) = \mathbb{E}[f(\theta,\xi)] + \eta,$$ where $\eta$ is mean zero and has some finite variance.

Note: It is unlikely that $\eta$ is exactly Gaussian, but we can take the average of multiple independent samples and use the normal approximation (central limit theorem) to justify the relationship.
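For instance, averaging $m$ independent samples at a fixed $\theta$ gives (a standard central-limit heuristic, assuming the per-sample variance $\sigma^2(\theta) = \mathrm{Var}[f(\theta,\xi)]$ is finite) $$ \frac{1}{m}\sum_{j=1}^{m} f(\theta,\xi_j) = \mathbb{E}[f(\theta,\xi)] + \eta_m,$$ where $\eta_m$ is approximately $\mathcal{N}\left(0, \sigma^2(\theta)/m\right)$ for large $m$.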

Early Numerical Examples

Linear Regression

Early Numerical Examples

Rosenbrock Function





Thank You

vivakpatel.org
