# 16 Appendix

## 16.1 Notation
We use the following notation:

| Notation | Explanation |
|---|---|
| \(u\) | The random effects vector |
| \(\theta\) | Parameter vector (first part) |
| \(\beta\) | Parameter vector (second part) |
| \(f(u,\beta,\theta)\) | Joint negative log likelihood |
| \(x\) | Data |
| \(E(u \vert x)\) | Conditional expectation of the random effects given data |
| \(\hat u\) | The posterior mode \(\arg \min_{u} f(u,\beta,\theta)\) |
## 16.2 Profiling the inner problem

This section describes the theory underlying the `profile` argument to `MakeADFun`, which is intended to speed up and robustify linear mixed effects models with a large number of fixed effects. Given two common model properties (Assumptions 1 and 2 below), which must be checked by the user, the `profile` argument can be used to move outer parameters to the inner problem without affecting the model results.
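As a usage illustration, here is a minimal sketch, assuming a compiled template `model.cpp` with a random effect vector `u` and a linear fixed effect vector `beta` (the file name and the `data` and `parameters` lists are hypothetical):

```r
library(TMB)
compile("model.cpp")              ## hypothetical template with 'u' and 'beta'
dyn.load(dynlib("model"))

## Standard setup: 'beta' is estimated by the outer optimizer.
obj <- MakeADFun(data, parameters, random = "u", DLL = "model")

## Profiled setup: 'beta' is moved to the inner problem, so the outer
## optimization runs only over the remaining parameters (here theta).
obj <- MakeADFun(data, parameters, random = "u",
                 profile = "beta", DLL = "model")
opt <- nlminb(obj$par, obj$fn, obj$gr)
```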
**Theorem 1 (Profiling the inner problem).** Assume that for any \(\beta\) and \(\theta\):

- Assumption 1: The partial derivative \(\partial_{\beta} f(u,\beta,\theta)\) is a linear function of \(u\).
- Assumption 2: The posterior mean is equal to the posterior mode: \(E(u|x)=\hat u\).
Then the MLE
\[\hat \beta := \arg \max_{\beta} \left( \int \exp(-f(u,\beta,\theta)) \: du \right) \]
is a solution to the augmented system
\[ \begin{split} \partial_{u} f(u,\beta,\theta) &= 0 \\ \partial_{\beta} f(u,\beta,\theta) &= 0 \end{split} \]
The augmented system defines \(\hat \beta\) implicitly as a function of the posterior mode \(\hat u\).
**Proof.** Differentiating the negative log marginal likelihood gives
\[ \begin{split} \partial_{\beta} \left( -\log \int \exp(-f(u,\beta,\theta)) \: du \right) &= E(\partial_{\beta}f(u,\beta,\theta) |x) \\ &= \partial_{\beta} f(u,\beta,\theta)_{|u=\hat u(\beta,\theta)} \end{split} \]
where the first equality holds in general (differentiation under the integral sign yields the posterior expectation of \(\partial_{\beta} f\)), and the second follows from Assumptions 1 and 2: since \(\partial_{\beta} f\) is linear in \(u\), its posterior expectation equals \(\partial_{\beta} f\) evaluated at \(E(u|x) = \hat u\). At the MLE the left-hand side vanishes, while \(\partial_{u} f(\hat u,\beta,\theta) = 0\) holds by definition of the mode, so \((\hat u, \hat\beta)\) solves the augmented system.
\(\square\)
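As a concrete check, the following self-contained R sketch verifies Theorem 1 numerically in a toy Gaussian model where the marginal MLE has a closed form; all names and values are illustrative, and \(\theta=(\sigma,\tau)\) is held fixed:

```r
## Toy model: x | u ~ N(u, sigma^2 I),  u ~ N(A beta, tau^2 I).
set.seed(1)
n <- 50; p <- 2; sigma <- 1; tau <- 0.5
A <- cbind(1, rnorm(n))                        ## design matrix
x <- rnorm(n, A %*% c(1, -2), sqrt(sigma^2 + tau^2))

## MLE from the closed-form marginal x ~ N(A beta, (sigma^2 + tau^2) I)
beta_mle <- solve(crossprod(A), crossprod(A, x))

## The augmented system d_u f = 0, d_beta f = 0 is linear in (u, beta)
## because the joint negative log likelihood is quadratic:
M <- rbind(cbind(diag(1/sigma^2 + 1/tau^2, n), -A / tau^2),
           cbind(t(-A / tau^2),                crossprod(A) / tau^2))
sol <- solve(M, c(x / sigma^2, rep(0, p)))
rbind(mle = drop(beta_mle), augmented = sol[n + 1:p])   ## rows agree
```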
### 16.2.1 Example
The standard situation in which Assumption 1 holds is when the \(\beta\)s are the linear fixed effects of a mixed model. In this case the joint negative log density takes the form

\[ f(u,\beta,\theta) = \frac{1}{2}(u-A\beta)'\Sigma_{\theta}^{-1}(u-A\beta) + \cdots \]

for some design matrix \(A\), where '\(\cdots\)' collects terms that do not depend on \(\beta\). The derivative

\[ \partial_{\beta} f(u,\beta,\theta) = -A'\Sigma_{\theta}^{-1}(u-A\beta) \]

is thus a linear function of the random effect \(u\).
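A quick numerical illustration of this linearity (a toy check with \(\Sigma_{\theta}=I\); all values arbitrary): the gradient at an affine combination of two random effect vectors equals the same combination of the gradients.

```r
## Check that d_beta f is (affine-)linear in u for the quadratic form above.
set.seed(2)
n <- 4; p <- 2
A <- matrix(rnorm(n * p), n, p)
grad_beta <- function(u, beta) -t(A) %*% (u - A %*% beta)   ## Sigma = I
u1 <- rnorm(n); u2 <- rnorm(n); beta <- rnorm(p); w <- 0.3
all.equal(grad_beta(w * u1 + (1 - w) * u2, beta),
          w * grad_beta(u1, beta) + (1 - w) * grad_beta(u2, beta))  ## TRUE
```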
In general, Assumption 2 holds exactly for models with a symmetric (e.g. Gaussian) posterior distribution of \(u\) given \(x\).
## 16.3 Theory underlying sdreport

This section supplements the documentation of `?sdreport` with some missing details.
As before, we consider a general latent variable model with parameter vector \(\theta\), random effect vector \(u\), and observation vector \(x\). The TMB estimation procedure works as follows:
- The MLE \(\hat\theta=\hat\theta(x)\) is calculated and used as an estimator of \(\theta\).
- Denote by \(\hat u(\theta,x)\) the random effect mode, which depends on \(\theta\) and \(x\). Plugging in the MLE gives the estimator \(\hat u\left(\hat\theta(x),x\right)\) of \(u\).

In general, we assume that \(\hat\theta\) is a consistent estimator of \(\theta\). However, we do not in general require \(\hat u\) to be consistent for \(u\). The purpose of `sdreport` is, for a given realization of the pair \((u,x)\), to quantify the joint uncertainty of \((\hat u,\hat\theta)\) as an estimator of \((u,\theta)\). That is, we are interested in the variance matrix of the difference
\[D:=\begin{pmatrix}\hat u\left(\hat\theta(x),x\right) - u\\ \hat\theta(x) - \theta\end{pmatrix}\]
An important point of the uncertainty quantification is to account for the effect of plugging in \(\hat\theta\) rather than using the true \(\theta\).
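In code, this uncertainty quantification is what `sdreport` returns; a minimal sketch, assuming `obj` was created by `MakeADFun` with `random = "u"` as in the previous section:

```r
opt <- nlminb(obj$par, obj$fn, obj$gr)   ## outer optimization
sdr <- sdreport(obj)
summary(sdr, select = "fixed")    ## theta-hat with standard errors
summary(sdr, select = "random")   ## u-hat with standard errors
```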
We calculate the variance using the standard decomposition (law of total variance):
\[V[D]=E(V(D|x))+V(E(D|x))\]
Consider \(D\) conditionally on \(x\). The second component is then constant (it does not involve \(u\)), and in the first component \(\hat u\left(\hat\theta(x),x\right)\) is constant given \(x\), so only \(u\) contributes variance:
\[V[D|x]=\begin{pmatrix}V[u|x] & 0 \\ 0 & 0 \end{pmatrix}\]
It follows that
\[E(V[D|x])=\begin{pmatrix}E(V[u|x]) & 0 \\ 0 & 0 \end{pmatrix}\]
As a central estimator of \(E(V[u|x])\) we use \(V[u|x]\), which is approximated by the inverse random effect Hessian \(H_{uu}^{-1}\), based on the assumption that \(u|x\) is well approximated by a Gaussian distribution (a reasonable assumption given that we are using the Laplace approximation). This explains the first term of the variance formula in `?sdreport`:
\[E(V[D|x]) \approx \begin{pmatrix} H_{uu}^{-1} & 0 \\ 0 & 0 \end{pmatrix}\]
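For a fitted object, this block can be inspected directly; a sketch using TMB's internal sparse Hessian helper (illustrative usage, assuming `obj` from the previous sketch):

```r
## Hessian of the joint negative log likelihood w.r.t. the random effects,
## evaluated at the best parameters seen so far (internal TMB helper).
H_uu <- obj$env$spHess(obj$env$last.par.best, random = TRUE)
V_u  <- solve(H_uu)     ## H_uu^{-1}: the nonzero block of the first term
```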
Likewise,
\[E[D|x]=\begin{pmatrix}\hat u\left(\hat\theta(x),x\right) - E(u|x)\\ \hat\theta(x) - \theta\end{pmatrix}\]
Again, assuming a Gaussian approximation of \(u|x\), it follows that \(E(u|x) \approx \hat u(\theta,x)\):
\[E[D|x] \approx \begin{pmatrix}\hat u\left(\hat\theta(x),x\right) - \hat u(\theta,x)\\ \hat\theta(x) - \theta\end{pmatrix}\]
We approximate this expectation by linearizing \(\theta \mapsto \hat u(\theta,x)\) around \(\hat\theta(x)\). Writing \(J_x\) for the Jacobian of \(\theta \mapsto \left(\hat u(\theta,x), \theta\right)\) evaluated at \(\hat\theta(x)\) (its \(\theta\)-block is an identity matrix), we get
\[E[D|x] \approx J_x \cdot (\hat\theta(x) - \theta)\]
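The \(u\)-block of \(J_x\) is simply the derivative of the mode with respect to \(\theta\); a minimal numerical sketch for a toy shrinkage-form mode (hypothetical names, assuming the numDeriv package; not TMB's internal computation):

```r
## Toy mode u-hat(theta, x) = s*x + (1 - s)*theta (posterior mean of a
## Gaussian-Gaussian model); its Jacobian in theta is constant, 1 - s.
library(numDeriv)
set.seed(4)
s <- 0.5; x0 <- rnorm(5, 2, sqrt(2))
u_hat <- function(theta) s * x0 + (1 - s) * theta
J_u <- jacobian(u_hat, x = mean(x0))   ## 5 x 1 matrix with entries 1 - s
```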
We now have the second term of the variance formula in `?sdreport`:
\[V(E[D|x]) \approx J_x V(\hat\theta(x)) J_x'\]
This term becomes negligible when the amount of data is large, due to the assumed asymptotic consistency of \(\hat\theta\).
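Putting the pieces together, the following self-contained R sketch checks the full decomposition \(V[D] \approx \begin{pmatrix} H_{uu}^{-1} & 0 \\ 0 & 0 \end{pmatrix} + J_x V(\hat\theta) J_x'\) by Monte Carlo in a toy Gaussian model where every ingredient has a closed form (all names and values are illustrative):

```r
## Toy model: u_i ~ N(theta, tau^2), x_i | u_i ~ N(u_i, sigma^2);
## sigma and tau known, theta estimated by theta-hat = mean(x).
set.seed(3)
n <- 5; sigma <- 1; tau <- 1; theta <- 2
s <- tau^2 / (sigma^2 + tau^2)                 ## shrinkage weight in u-hat

## Closed-form ingredients of the variance formula
Huu_inv <- diag(sigma^2 * tau^2 / (sigma^2 + tau^2), n)  ## = V(u | x)
J       <- rbind(matrix(1 - s, n, 1), 1)       ## d/dtheta of (u-hat, theta-hat)
V_theta <- (sigma^2 + tau^2) / n               ## variance of mean(x)
V_formula <- rbind(cbind(Huu_inv, 0), 0) + J %*% V_theta %*% t(J)

## Monte Carlo estimate of V[D] with D = (u-hat - u, theta-hat - theta)
D <- replicate(1e5, {
  u <- rnorm(n, theta, tau); x <- rnorm(n, u, sigma)
  theta_hat <- mean(x)
  u_hat <- s * x + (1 - s) * theta_hat
  c(u_hat - u, theta_hat - theta)
})
max(abs(cov(t(D)) - V_formula))                ## small (Monte Carlo error)
```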