\documentclass[11pt, a4paper]{article}
\usepackage[a4paper, text={16cm,25cm}]{geometry}

%\VignetteIndexEntry{covMcd() -- Generalizing the FastMCD}
%\VignetteDepends{robustbase}
\SweaveOpts{prefix.string=mcd, eps=FALSE, pdf=TRUE, strip.white=true}
\SweaveOpts{width=6, height=4.1}

\usepackage{amsmath}
\usepackage{amsfonts}% \mathbb
\usepackage{mathtools}% -> \floor, \ceil
\usepackage[utf8]{inputenc}
%% The following is partly R's share/texmf/Rd.sty
\usepackage{color}
\usepackage{hyperref}
\definecolor{Blue}{rgb}{0,0,0.8}
\definecolor{Red}{rgb}{0.7,0,0}
\hypersetup{%
  hyperindex,%
  colorlinks={true},%
  pagebackref,%
  linktocpage,%
  plainpages={false},%
  linkcolor={Blue},%
  citecolor={Blue},%
  urlcolor={Red},%
  pdfstartview={Fit},%
  pdfview={XYZ null null null}%
}

\usepackage{natbib}
\usepackage[noae]{Sweave}
%----------------------------------------------------
\DeclarePairedDelimiter{\ceil}{\lceil}{\rceil}
\DeclarePairedDelimiter{\floor}{\lfloor}{\rfloor}
\DeclareMathOperator{\sign}{sign}
\newcommand{\abs}[1]{\left| #1 \right|}
\newtheorem{definition}{Definition}
\newcommand{\byDef}{\mathrm{by\ default}}
\newcommand{\R}{{\normalfont\textsf{R}}{}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand*{\pkg}[1]{\texttt{#1}}
\newcommand*{\CRANpkg}[1]{\href{https://CRAN.R-project.org/package=#1}{\pkg{#1}}}

%----------------------------------------------------
\begin{document}
\setkeys{Gin}{width=0.9\textwidth}
\setlength{\abovecaptionskip}{-5pt}

\title{covMcd() -- Considerations about Generalizing the FastMCD}
\author{Martin M\"achler}
\maketitle
%\tableofcontents
%%
%% Pison, G., Van Aelst, S., and Willems, G. (2002)
%% Small Sample Corrections for LTS and MCD.
%% Metrika % ~/save/papers/robust-diverse/Pison_VanAelst_Willems.pdf
%%
<<init, echo=FALSE>>=
# set margins for plots
options(SweaveHooks=list(fig=function() par(mar=c(3,3,1.4,0.7),
                         mgp=c(1.5, 0.5, 0))),
        width = 75)
@

\section{Introduction}
The context is robust multivariate ``location and scatter'' estimation,
which corresponds to estimating the first two moments in cases where they exist.
We assume data and a model
\begin{align}
  \label{eq:data-model}
  x_i & \in  \mathbb{R}^p, \ \ i=1,2,\dots,n \\
  x_i & \sim \mathcal{F}(\mu, \Sigma), \ \ \mathrm{i.i.d.};\ \
  \mu \in \mathbb{R}^p, \ \ \Sigma \in \mathbb{R}^{p \times p}, \ \textrm{positive definite},
\end{align}
where a conceptual null model is the $p$-dimensional normal distribution.
One typical assumption is that $\mathcal{F}$ is a mixture with the majority
component (``good data'') being $\mathcal{N}_p(\mu, \Sigma)$ and other components
modeling ``the outliers''.

In other words, we want estimates $\bigl(\hat{\mu}, \hat{\Sigma}\bigr)$ which should
be close to the true ``good data'' parameters $(\mu, \Sigma)$; we do not
specify this further here.

\section{MCD and ``the Fast'' MCD (= \textsc{fastmcd}) Algorithm}
The \CRANpkg{robustbase} \R{} package has featured the function
\code{covMcd()} since early on (Feb.~2006);
it has been an interface to the Fortran routine provided by the
original authors and (partly) described in \citet{RouPvD99}.
%% Rousseeuw, P. J. and van Driessen, K. (1999)
%% A fast algorithm for the minimum covariance determinant estimator.
%% Technometrics {41}, 212--223.
%% >> ~/save/papers/robust-diverse/Rousseeuw_VanD-FastMCD_1999.pdf
%     ------------------------------------------------------------
We briefly describe how the algorithm works, partly building on the
documentation provided in the source (R, S, and Fortran) codes:

%%  R CMD Rdconv --type=latex ../../man/covMcd.Rd > covMcd.tex
The minimum covariance determinant estimator of location and scatter (MCD)
implemented in \code{covMcd()} is similar to the \R{} function
\code{cov.mcd()} in \CRANpkg{MASS}.  The (``theoretical'') MCD looks for
the $h = h_\alpha$ ($> n/2$) out of $n$ observations whose classical covariance
matrix has the lowest possible determinant.  In more detail, we will use
$h = h_\alpha = h(\alpha,n,p) \approx \alpha \cdot (n+p+1)$,
whereas \citet{RouPvD99} mainly use (the default) $\alpha = \frac 1 2$,
for which $h = h(1/2, n, p) = \floor[\Big]{\frac{n+p+1}{2}}$.
For general $\alpha \ge \frac 1 2$, the \R{} implementation (derived
from their original S code) uses
$h = h(\alpha,n,p) =$ \code{h.alpha.n(alpha,n,p)} (a function in
\pkg{robustbase}), which is
\begin{eqnarray}
  \label{eq:def-h}
  h = h_\alpha = h(\alpha,n,p) := \floor{2 n_2 - n + 2 \alpha (n - n_2)}, \
  \mathrm{where\ } n_2 := \floor[\Big]{\frac{n+p+1}{2}}%
  %= (n+p+1)/2 \ \ (\mathrm{\ where ``/'' denotes \emph{integer} division})
  .
\end{eqnarray}
The fraction $\alpha \ge \frac 1 2$ can be chosen by the user,
where $\alpha = \frac 1 2$ is the most robust, and indeed, $h_{1/2} = n_2 =
\floor[\Big]{\frac{n+p+1}{2}}$.  Even in general, as long as $n \gg p$,
$\alpha$ is approximately the \emph{proportion} of the subsample size $h$ in
the full sample (size $n$):
\begin{equation}
  \label{eq:h.approx}
  h \approx \alpha \cdot n \iff \alpha \approx \frac{h}{n},
\end{equation}
<<h.alpha.ex>>=
require(robustbase)
n <- c(5, 10, 20, 30, 50, 100, 200, 500)
hmat <- function(alpha, p) cbind(n, h.alpha = h.alpha.n(alpha, n, p),
        h. = floor(alpha * (n + p + 1)), alpha.n = round(alpha * n))
hmat(alpha = 1/2, p = 3)
hmat(alpha = 3/4, p = 4)
@

The breakdown point (for $h > \frac{n}{2}$) then is
\begin{eqnarray}
  \label{eq:breakdown}
  \epsilon_{*} = \frac{n-h+1}{n},
\end{eqnarray}
which is less than but close to $\frac 1 2$ for $\alpha = \frac 1 2$; in
general, since $h/n \approx \alpha$, the breakdown point is approximately
\begin{eqnarray}
  \label{eq:eps-approx}
  \epsilon_{*} = \frac{n-h+1}{n} \approx \frac{n-h}{n} = 1 - \frac{h}{n}
  \approx 1 - \alpha.
\end{eqnarray}
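
For illustration, the finite sample breakdown point (\ref{eq:breakdown}) and
its approximation (\ref{eq:eps-approx}) are easily computed via
\code{h.alpha.n()}, here for $\alpha = 3/4$ and $p = 4$ (reusing \code{n} from
the chunk above), where $1 - \alpha = 0.25$:
<<bdp-ex>>=
h <- h.alpha.n(3/4, n, p = 4)
cbind(n, h, eps.star = round((n - h + 1)/n, 3))  # eps.star -> 1 - alpha = 0.25
@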


The raw MCD estimate of location, say $\hat{\mu}_0$, is then the average of these $h$ points,
whereas the raw MCD estimate of scatter, $\hat{\Sigma}_0$, is their covariance matrix,
multiplied by a consistency factor \code{.MCDcons(p, h/n)} and (by default)
a finite sample correction factor \code{.MCDcnp2(p, n, alpha)}, to make it
consistent at the normal model and unbiased in small samples.
%% Both rescaling factors (consistency and finite sample) are returned in the length-2 vector
%% \code{raw.cnp2}.

In practice, for reasonably sized $n$, $p$ and hence $h$, it is not
feasible to search the full space of all ${n \choose h}$ $h$-subsets of $n$
observations.  Rather, the implementation of \code{covMcd} uses the Fast
MCD algorithm of \citet{RouPvD99} to approximate the minimum covariance
determinant estimator, see Section~\ref{sec:fastMCD}.

Based on these raw MCD estimates, $\bigl(\hat{\mu}_0, \hat{\Sigma}_0\bigr)$,
% (unless argument \code{raw.only} is true),
a reweighting step is performed, i.e., \code{V <- cov.wt(x,w)},
where \code{w} are weights determined by ``outlyingness''
with respect to the scaled raw MCD, using the ``Mahalanobis''-like robust distances
$d_i\bigl(\hat{\mu}_0, \hat{\Sigma}_0\bigr)$, see (\ref{eq:Maha}).
Again, a consistency factor and
%(if \code{use.correction} is true)
a finite sample correction factor %(\code{.MCDcnp2.rew(p, n, alpha)})
are applied.
The reweighted covariance is typically considerably more efficient
than the raw one, see \citet{PisGvAW02}.
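
The following minimal sketch (ours, not the exact \code{covMcd()} code)
illustrates the reweighting for the \code{hbk} data: observations get weight 1
iff their squared distance (\ref{eq:Maha}), w.r.t.\ the scaled raw MCD, is at
most the 97.5\% quantile of $\chi^2_p$:
<<reweight-sketch>>=
x <- data.matrix(hbk[, 1:3])
set.seed(17)
cc <- covMcd(x)  # full MCD fit; contains both raw and reweighted estimates
d2 <- mahalanobis(x, cc$raw.center, cc$raw.cov)  # squared robust distances
w  <- as.numeric(d2 <= qchisq(0.975, df = ncol(x)))
V  <- cov.wt(x, wt = w)  # reweighted estimates (before the rescaling factors)
V$center
@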

The two rescaling factors for the reweighted estimates are returned in
\code{cnp2}.  Details for the computation of the finite sample
correction factors can be found in \citet{PisGvAW02}.
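
For instance, for the \code{hbk} data, the two pairs of rescaling factors are
easily inspected:
<<cnp2-ex>>=
cHbk <- covMcd(data.matrix(hbk[, 1:3]))
rbind(raw.cnp2 = cHbk$raw.cnp2, cnp2 = cHbk$cnp2)
@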

\section{Fast MCD Algorithm -- General notation}\label{sec:fastMCD}

\paragraph{Note:} In the following, apart from the mathematical notation, we
also use variable names, e.g., \code{kmini}, as they appear in the Fortran
(and sometimes \R{}) code of \R{} package \CRANpkg{robustbase}.

Instead of directly searching for $h$-subsets (among ${n \choose h} \approx
{n \choose n/2}$), the basic idea is to start with small subsets of size
$p+1$, compute their center $\mu$ and covariance matrix $\Sigma$, and from
these the $h$-subset of the $h$ observations with smallest (squared)
(``Mahalanobis''-like) distances
\begin{align}  \label{eq:Maha}
d_i = d_i(\mu,\Sigma) :=  (x_i - \mu)' \Sigma^{-1} (x_i - \mu), \ \ i=1,2,\dots,n,
\end{align}
and then use concentration
steps (``C~steps'') to (locally) improve the chosen set by iteratively
computing $\mu$, $\Sigma$, new distances $d_i$ and a new set of size $h$ with
smallest distances $d_i(\mu,\Sigma)$.  Each C~step provably decreases the
determinant $\det(\Sigma)$ whenever $\mu$ and $\Sigma$ change at all.
Consequently, convergence to a local minimum is guaranteed, as the number of $h$-subsets is finite.
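
A single C~step is easily expressed in \R{}.  The following self-contained
sketch (ours, for illustration only, not the actual Fortran implementation)
starts from a random $(p+1)$-subset of the \code{hbk} data and prints the
(non-increasing) determinants:
<<c-step-ex>>=
x  <- data.matrix(hbk[, 1:3])
nn <- nrow(x) ; p <- ncol(x)
h  <- h.alpha.n(1/2, nn, p)
cStep <- function(H) { # one concentration step, given an index set H
    mu <- colMeans(x[H, ]) ; Sigma <- cov(x[H, ])
    order(mahalanobis(x, mu, Sigma))[1:h] # the h smallest distances d_i
}
set.seed(17)
H <- cStep(sample(nn, p + 1)) # initial h-subset from a random (p+1)-subset
for(i in 1:4) {
    cat("det =", det(cov(x[H, ])), "\n")
    H <- cStep(H)
}
@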

To make the algorithm \emph{fast} for not so small sample sizes $n$, the data
set is split into ``groups'' or ``sub-datasets'' as soon as
\begin{eqnarray} \label{eq:nmini}
  n \ge 2 n_0, \ \mathrm{ where}\ \  n_0 := \mathtt{nmini} \ ( = 300, \byDef),
\end{eqnarray}
i.e., the default cutoff for ``non small'' is at $n = 600$.
%%
The \emph{number} of such subsets in the original algorithm is at most 5,
and we now use
\begin{eqnarray}  \label{eq:kmini}
  k_M = \code{kmini} \ (= 5, \byDef),
\end{eqnarray}
as the upper limit.  As above, we assume from now on that $n \ge 2 n_0$, and let
\begin{eqnarray} \label{eq:k-def}
  k := \floor[\Big]{\frac{n}{n_0}} \ge 2
\end{eqnarray}
and now distinguish the two cases,
\begin{eqnarray}
  \label{eq:cases}
  \begin{cases}
    A. & k   < k_M  \iff n  <  k_M \cdot n_0 \\
    B. & k \ge k_M  \iff n \ge k_M \cdot n_0
  \end{cases}
\end{eqnarray}
\begin{description}
\item[In case A] $k$ (\code{= ngroup}) subsets aka ``groups'' or ``sub-datasets''
  are used, $k \in\{2,3,\dots,k_M-1\}$, of group sizes $n_j$,
  $j=1,\dots,k$ (see below).  Note that case~A is empty when $k_M = 2$, as it
  requires $2 \le k < k_M$.  Hence, in case~A, we have $k_M \ge 3$.
\item[In case B] $k_M$ (\code{= ngroup}) groups each of size $n_0$ are
  built and in the first stage, only a \emph{subset} of $k_M \cdot n_0 \le n$
  observations is used.
\end{description}
In both cases, the disjoint groups (``sub-datasets'') are chosen at random
from the $n$ observations.
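
For illustration, with the defaults $n_0 = 300$ and $k_M = 5$, a few sample
sizes $n$ and the resulting $k$, \code{ngroup}, and case:
<<ngroup-ex>>=
n0 <- 300 ; kM <- 5 # defaults nmini and kmini
n  <- c(600, 900, 1499, 1500, 2000)
k  <- n %/% n0
data.frame(n, k, ngroup = pmin(k, kM), case = ifelse(k < kM, "A", "B"))
@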
%%
For the group sizes for case~A, $n_j$, $j=1,\dots,k$, we have
\begin{align}
  n_1 = \; & \floor[\Big]{\frac n k} =
             \floor[\bigg]{\frac{n}{\floor[\big]{\frac{n}{n_0}}}} \ \ (\; \ge n_0 \label{eq:n1})\\
  n_j = \; & n_1,\hspace*{2.8em} j = 2,\dots,j_* \\
  n_j = \; & n_1 + 1, \ \ \    j = j_* +1,\dots,k, \label{n1-plus-1}\\
  & \qquad \mathrm{where}\ \ j_* := k - r \ \in \{1,\dots,k\}, \label{jstar}\\
  & \qquad \mathrm{and}\ \  r := n - k n_1 = \label{r-rest}
                           n - k\floor[\big]{\frac n k} \in \{0,1,\dots,k-1\},
\end{align}
where the range of $j_*$, $1,\dots,k$ in (\ref{jstar}) is a consequence of the range of
the integer division remainder $r \in \{0,1,\dots,k-1\}$ in
(\ref{r-rest}).  Consequently, (\ref{n1-plus-1}) may be empty, namely iff
$r=0$ ($\iff n = k \cdot n_1$ is a multiple of $k$): then $j_* = k$, and all $n_j \equiv n_1$.
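
A direct \R{} transcription of (\ref{eq:n1})--(\ref{r-rest}) (our own little
helper, not part of \pkg{robustbase}) confirms that the $n_j$ partition $n$
and that $n_j \ge n_0$ in case~A:
<<grp-sizes>>=
grpSizes <- function(n, n0 = 300) { # group sizes n_j for case A
    k  <- n %/% n0
    n1 <- n %/% k
    r  <- n - k * n1                # integer division remainder
    nj <- c(rep(n1,     k - r),     # j = 1, ..., j* = k - r
            rep(n1 + 1, r))         # j = j* + 1, ..., k
    stopifnot(sum(nj) == n, nj >= n0)
    nj
}
grpSizes(601)  # k = 2
grpSizes(1499) # k = 4
@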

Considering the range of $n_j$ in case~A, the minimum $n_1 \ge n_0$ in
(\ref{eq:n1}) is easy to verify.  What is the maximal value of $n_j$, i.e.,
an upper bound for $n_{\max} := n_1+1 \ge \max_j n_j$? \
%%
%% This is all correct but not useful:
%% From (\ref{eq:n1}),     $ n/k - 1 < n_1 \le n/k  $, and
%% from (\ref{eq:k-def}),  $n/n_0 - 1 < k  \le n/n_0$.
%% Putting these two together, we get
%% \begin{eqnarray}
%%   \label{eq:n1-ineq}
%%   \frac{n^2}{n_0} - 1 \le n/k - 1 < n_1 \le n/k < \frac{n n_0}{n - n_0},
%% \end{eqnarray}
%% (the first $\le$ from $\frac{1}{k} \ge \frac{n_0}{n}$; the last $<$ from
%% $\frac{1}{k} < \frac 1{n/n_0 -1} = \frac{n_0}{n-n_0}$.) Also,
%% from (\ref{eq:k-def}), $n \ge k n_0$ and $n-n_0 \ge (k-1)n_0$ and since we
%% are in case~A, $n < n_0 k_M$, which combines to
%% \begin{eqnarray}
%%   \label{eq:nn0}
%%   \frac{n n_0}{n - n_0} < \frac{(n_0 k_M) n_0}{(k-1)n_0} = \frac{n_0 k_M}{k-1}.
%% \end{eqnarray}
Consider $n_{1,\max}(k) = \max_{n, \mathrm{given\ } k} n_1 = \max_{n, \mathrm{given\ } k} \floor{\frac n k}$.
Given $k$, the maximal $n$ still fulfilling $\floor[\big]{\frac{n}{n_0}} = k$
is $n = (k+1)n_0 - 1$, where $\floor[\big]{\frac{n}{n_0}} = k + \floor[\big]{1 - \frac{1}{n_0}} = k$.
Hence, $n_{1,\max}(k) = \floor[\big]{\frac{(k+1)n_0 - 1}{k}} = n_0 +
\floor[\big]{\frac{n_0 - 1}{k}}$, and as $k \ge 2$, the maximum is at $k=2$,
$\max n_1 = \max_k n_{1,\max}(k) = n_0 + \floor[\big]{\frac{n_0 - 1}{2}} = \floor[\big]{\frac{3 n_0 - 1}{2}}$.
Taken together, as $n_j = n_1+1$ is possible, we have
\begin{align}
  \label{eq:nj-range}
  n_0 \le & n_1  \le \floor[\Big]{\frac{3 n_0 - 1}{2}} \nonumber\\
  n_0 \le & n_j  \le \floor[\Big]{\frac{3 n_0 + 1}{2}}, \ \ j \ge 2.
\end{align}
Note that indeed, $\floor[\big]{\frac{3 n_0 + 1}{2}}$ is the length of the
auxiliary vector \code{subndex} in the Fortran code.
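
This bound is quickly checked numerically: over all $n$ belonging to case~A
(with the defaults $n_0 = 300$, $k_M = 5$), the largest occurring $n_1$ indeed
equals $\floor[\big]{\frac{3 n_0 - 1}{2}} = 449$:
<<n1-max-check>>=
n0 <- 300
n  <- (2*n0):(5*n0 - 1) # all n in case A
k  <- n %/% n0
c(max.n1 = max(n %/% k), bound = floor((3*n0 - 1)/2))
@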

\bibliographystyle{chicago}
\bibliography{robustbase}

\end{document}