\documentclass[11pt, a4paper]{article}
\usepackage[a4paper, text={16cm,25cm}]{geometry}

%\VignetteIndexEntry{covMcd() -- Generalizing the FastMCD}
%\VignetteDepends{robustbase}
\SweaveOpts{prefix.string=mcd, eps=FALSE, pdf=TRUE, strip.white=true}
\SweaveOpts{width=6, height=4.1}

\usepackage{amsmath}
\usepackage{amsfonts}% \mathbb
\usepackage{mathtools}% -> \floor, \ceil
\usepackage[utf8]{inputenc}
%% The following is partly R's share/texmf/Rd.sty
\usepackage{color}
\usepackage{hyperref}
\definecolor{Blue}{rgb}{0,0,0.8}
\definecolor{Red}{rgb}{0.7,0,0}
\hypersetup{%
  hyperindex,%
  colorlinks={true},%
  pagebackref,%
  linktocpage,%
  plainpages={false},%
  linkcolor={Blue},%
  citecolor={Blue},%
  urlcolor={Red},%
  pdfstartview={Fit},%
  pdfview={XYZ null null null}%
}

\usepackage{natbib}
\usepackage[noae]{Sweave}
%----------------------------------------------------
\DeclarePairedDelimiter{\ceil}{\lceil}{\rceil}
\DeclarePairedDelimiter{\floor}{\lfloor}{\rfloor}
\DeclareMathOperator{\sign}{sign}
\newcommand{\abs}[1]{\left| #1 \right|}
\newtheorem{definition}{Definition}
\newcommand{\byDef}{\mathrm{by\ default}}
\newcommand{\R}{{\normalfont\textsf{R}}{}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand*{\pkg}[1]{\texttt{#1}}
\newcommand*{\CRANpkg}[1]{\href{https://CRAN.R-project.org/package=#1}{\pkg{#1}}}

%----------------------------------------------------
\begin{document}
\setkeys{Gin}{width=0.9\textwidth}
\setlength{\abovecaptionskip}{-5pt}

\title{covMcd() -- Considerations about Generalizing the FastMCD}
\author{Martin M\"achler}
\maketitle
%\tableofcontents
%%
%% Pison, G., Van Aelst, S., and Willems, G. (2002)
%% Small Sample Corrections for LTS and MCD.
%% Metrika % ~/save/papers/robust-diverse/Pison_VanAelst_Willems.pdf
%%
<<init, echo=FALSE>>=
# set margins for plots
options(SweaveHooks=list(fig=function() par(mar=c(3,3,1.4,0.7),
                                            mgp=c(1.5, 0.5, 0))),
        width = 75)
@

\section{Introduction}
The context is robust multivariate ``location and scatter'' estimation,
which corresponds to estimating the first two moments in the cases where they exist.
We assume data and a model
\begin{align}
  \label{eq:data-model}
  x_i & \in \mathbb{R}^p, \ \ i=1,2,\dots,n \\
  x_i & \sim \mathcal{F}(\mu, \Sigma), \ \ \mathrm{i.i.d.};\ \
  \mu \in \mathbb{R}^p, \ \ \Sigma \in \mathbb{R}^{p \times p}, \ \textrm{positive definite},
\end{align}
where a conceptual null model is the $p$-dimensional normal distribution.
One typical assumption is that $\mathcal{F}$ is a mixture with the majority
component (``good data'') being $\mathcal{N}_p(\mu, \Sigma)$ and the other components
modeling ``the outliers''.

In other words, we want estimates $\bigl(\hat{\mu}, \hat{\Sigma}\bigr)$ which should
be close to the true ``good data'' parameters $(\mu, \Sigma)$ --- and do not say more here.

\section{MCD and ``the Fast'' MCD (= \textsc{fastmcd}) Algorithm}
The \CRANpkg{robustbase} \R{} package has featured the function
\code{covMcd()} since its beginnings (Feb.~2006);
it has been an interface to the Fortran routine provided by the
original authors and (partly) described in \citet{RouPvD99}.
%% Rousseeuw, P. J. and van Driessen, K. (1999)
%% A fast algorithm for the minimum covariance determinant estimator.
%% Technometrics {41}, 212--223.
%% >> ~/save/papers/robust-diverse/Rousseeuw_VanD-FastMCD_1999.pdf
% ------------------------------------------------------------
We briefly describe how the algorithm works, partly building on the
documentation provided in the source (R, S, and Fortran) codes:

%% R CMD Rdconv --type=latex ../../man/covMcd.Rd > covMcd.tex
The minimum covariance determinant (MCD) estimator of location and scatter
implemented in \code{covMcd()} is similar to the \R{} function
\code{cov.mcd()} in \CRANpkg{MASS}. The (``theoretical'') MCD looks for
the $h = h_\alpha$ (with $h > n/2$) out of $n$ observations whose classical
covariance matrix has the lowest possible determinant. In more detail, we will use
$h = h_\alpha = h(\alpha,n,p) \approx \alpha \cdot (n+p+1)$,
whereas \citet{RouPvD99} mainly use (the default) $\alpha = \frac 1 2$,
for which $h = h(1/2, n, p) = \floor[\Big]{\frac{n+p+1}{2}}$.
For general $\alpha \ge \frac 1 2$, the \R{} implementation (derived
from their original S code) uses
$h = h(\alpha,n,p) =$ \code{h.alpha.n(alpha,n,p)} (a function in
\pkg{robustbase}), which is
\begin{eqnarray}
  \label{eq:def-h}
  h = h_\alpha = h(\alpha,n,p) := \floor{2 n_2 - n + 2 \alpha (n - n_2)}, \
  \mathrm{where\ } n_2 := \floor[\Big]{\frac{n+p+1}{2}}%
  %= (n+p+1)/2 \ \ (\mathrm{\ where ``/'' denotes \emph{integer} division})
  .
\end{eqnarray}
The fraction $\alpha \ge \frac 1 2$ can be chosen by the user,
where $\alpha = \frac 1 2$ is the most robust, and indeed, $h_{1/2} = n_2 =
\floor[\Big]{\frac{n+p+1}{2}}$.
Even in general, as long as $n \gg p$,
$\alpha$ is approximately the \emph{proportion} of the subsample size $h$ in
the full sample (of size $n$):
\begin{equation}
  \label{eq:h.approx}
  h \approx \alpha \cdot n \iff \alpha \approx \frac{h}{n}.
\end{equation}
<<h.alpha.ex>>=
require(robustbase)
n <- c(5, 10, 20, 30, 50, 100, 200, 500)
hmat <- function(alpha, p)
    cbind(n, h.alpha = h.alpha.n(alpha, n, p),
          h. = floor(alpha * (n + p + 1)), alpha.n = round(alpha * n))
hmat(alpha = 1/2, p = 3)
hmat(alpha = 3/4, p = 4)
@

The breakdown point (for $h > \frac{n}{2}$) then is
\begin{eqnarray}
  \label{eq:breakdown}
  \epsilon_{*} = \frac{n-h+1}{n},
\end{eqnarray}
which is less than but close to $\frac 1 2$ for $\alpha = \frac 1 2$; in
general, as $h/n \approx \alpha$, the breakdown point is approximately
\begin{eqnarray}
  \label{eq:eps-approx}
  \epsilon_{*} = \frac{n-h+1}{n} \approx \frac{n-h}{n} = 1 - \frac{h}{n}
  \approx 1 - \alpha.
\end{eqnarray}

The raw MCD estimate of location, say $\hat{\mu}_0$, is then the average of these $h$ points,
whereas the raw MCD estimate of scatter, $\hat{\Sigma}_0$, is their covariance matrix,
multiplied by a consistency factor (\code{.MCDcons(p, h/n)}) and (by default)
a finite sample correction factor (\code{.MCDcnp2(p, n, alpha)}), to make it
consistent at the normal model and unbiased in small samples.
%% Both rescaling factors (consistency and finite sample) are returned in the length-2 vector
%% \code{raw.cnp2}.

In practice, for reasonably sized $n$, $p$, and hence $h$, it is not
feasible to search the full space of all ${n \choose h}$ $h$-subsets of the $n$
observations. Rather, the implementation of \code{covMcd()} uses the Fast
MCD algorithm of \citet{RouPvD99} to approximate the minimum covariance
determinant estimator; see Section~\ref{sec:fastMCD}.
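The approximations (\ref{eq:h.approx}) and (\ref{eq:eps-approx}) are easily
checked numerically. The following small illustration (the helper
\code{epsTab()} is ours, not part of \pkg{robustbase}) tabulates the exact
breakdown point $\epsilon_{*} = (n-h+1)/n$ against $1 - \alpha$, reusing the
vector \code{n} from above:
<<eps-breakdown>>=
## exact breakdown point eps* = (n-h+1)/n vs. its approximation 1 - alpha:
epsTab <- function(alpha, p) {
    h <- h.alpha.n(alpha, n, p)
    cbind(n, h, eps = (n - h + 1)/n, approx = 1 - alpha)
}
epsTab(alpha = 1/2, p = 3)
epsTab(alpha = 3/4, p = 4)
@
As expected, the agreement improves with growing $n$.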

Based on these raw MCD estimates, $\bigl(\hat{\mu}_0, \hat{\Sigma}_0\bigr)$,
% (unless argument \code{raw.only} is true),
a reweighting step is performed, i.e., \code{V <- cov.wt(x,w)},
where \code{w} are weights determined by the ``outlyingness''
with respect to the scaled raw MCD, using the ``Mahalanobis''-like robust distances
$d_i\bigl(\hat{\mu}_0, \hat{\Sigma}_0\bigr)$, see (\ref{eq:Maha}).
Again, a consistency factor and
%(if \code{use.correction} is true)
a finite sample correction factor %(\code{.MCDcnp2.rew(p, n, alpha)})
are applied.
The reweighted covariance is typically considerably more efficient
than the raw one, see \citet{PisGvAW02}.

The two rescaling factors for the reweighted estimates are returned in
\code{cnp2}. Details on the computation of the finite sample
correction factors can be found in \citet{PisGvAW02}.

\section{Fast MCD Algorithm -- General notation}\label{sec:fastMCD}

\paragraph{Note:} In the following, apart from the mathematical notation, we also use
variable names, e.g., \code{kmini}, as used in the Fortran and sometimes the \R{}
function code of the \R{} package \CRANpkg{robustbase}.

Instead of directly searching for $h$-subsets (among ${n \choose h} \approx
{n \choose n/2}$ of them), the basic idea is to start with small subsets of size
$p+1$, their center $\mu$ and covariance matrix $\Sigma$, and the corresponding
$h$-subset of the $h$ observations with smallest (squared) ``Mahalanobis''-like distances
\begin{align} \label{eq:Maha}
d_i = d_i(\mu,\Sigma) := (x_i - \mu)' \Sigma^{-1} (x_i - \mu), \ \ i=1,2,\dots,n,
\end{align}
and then use concentration
steps (``C~steps'') to (locally) improve the chosen set by iteratively
computing $\mu$, $\Sigma$, new distances $d_i$, and a new set of size $h$ with the
smallest distances $d_i(\mu,\Sigma)$. Each C~step is proven to decrease the
determinant $\det(\Sigma)$ if $\mu$ and $\Sigma$ changed at all.
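For illustration only, one C~step and its iteration can be sketched in a few
lines of \R{} (this is a didactic sketch with a random start, \emph{not} the
grouped Fortran implementation; the function \code{cStep()} and all object
names are ours):
<<cstep-sketch>>=
## one C-step: from the h-subset 'idx', compute (mu, Sigma) and keep the
## h observations with smallest squared distances d_i(mu, Sigma):
cStep <- function(x, idx, h) {
    xs <- x[idx, , drop = FALSE]
    order(mahalanobis(x, center = colMeans(xs), cov = cov(xs)))[1:h]
}
set.seed(17)
x <- matrix(rnorm(200 * 2), 200, 2)
h <- h.alpha.n(1/2, n = 200, p = 2)
idx <- sample(200, h)  # a random start (instead of a (p+1)-point start)
for(i in 1:20) {       # the determinants are non-increasing:
    idx.new <- cStep(x, idx, h)
    cat(i, ": det =", format(det(cov(x[idx.new, ]))), "\n")
    if(setequal(idx.new, idx)) break  # converged: the h-subset is stable
    idx <- idx.new
}
@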
Consequently, convergence to a local minimum is guaranteed, as the number of
$h$-subsets is finite.

To make the algorithm \emph{fast} for non-small sample sizes $n$, the data set is
split into ``groups'' or ``sub-datasets'' as soon as
\begin{eqnarray} \label{eq:nmini}
  n \ge 2 n_0, \ \mathrm{ where}\ \ n_0 := \mathtt{nmini} \ ( = 300, \byDef),
\end{eqnarray}
i.e., the default cutoff for ``non-small'' is at $n = 600$.
%%
The \emph{number} of such subsets in the original algorithm is maximally 5,
and we now use
\begin{eqnarray} \label{eq:kmini}
  k_M = \code{kmini} \ (= 5, \byDef),
\end{eqnarray}
as the upper limit. As above, we assume from now on that $n \ge 2 n_0$, and let
\begin{eqnarray} \label{eq:k-def}
  k := \floor[\Big]{\frac{n}{n_0}} \ge 2,
\end{eqnarray}
and now distinguish the two cases,
\begin{eqnarray}
  \label{eq:cases}
  \begin{cases}
    A. & k < k_M \iff n < k_M \cdot n_0 \\
    B. & k \ge k_M \iff n \ge k_M \cdot n_0
  \end{cases}
\end{eqnarray}
\begin{description}
\item[In case A,] $k$ (\code{= ngroup}) subsets, aka ``groups'' or ``sub-datasets'',
  are used, $k \in \{2,3,\dots,k_M-1\}$, of group sizes $n_j$,
  $j=1,\dots,k$ (see below). Note that case~A may be empty because of
  $2 \le k < k_M$, namely if $k_M = 2$. Hence, in case~A, we have $k_M \ge 3$.
\item[In case B,] $k_M$ (\code{= ngroup}) groups, each of size $n_0$, are
  built, and in the first stage, only a \emph{subset} of $k_M \cdot n_0 \le n$
  observations is used.
\end{description}
In both cases, the disjoint groups (``sub-datasets'') are chosen at random
from the $n$ observations.
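The case distinction (\ref{eq:cases}) is easily tabulated for the default
values $n_0 = 300$ and $k_M = 5$ (this small illustration is ours):
<<ngroup-ex>>=
n0 <- 300 ; kM <- 5      # the defaults  nmini  and  kmini
n  <- c(600, 900, 1499, 1500, 2000, 5000)
k  <- n %/% n0           # k = floor(n / n0)  (all >= 2 here)
data.frame(n, k, ngroup = pmin(k, kM),
           case = ifelse(k < kM, "A", "B"))
@
E.g., $n = 1499$ still gives $k = 4 < k_M$ (case~A), whereas $n = 1500$
reaches $k = k_M = 5$ (case~B).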
%%
For the group sizes in case~A, $n_j$, $j=1,\dots,k$, we have
\begin{align}
  n_1 = \; & \floor[\Big]{\frac n k} =
  \floor[\bigg]{\frac{n}{\floor[\big]{\frac{n}{n_0}}}} \ \ (\; \ge n_0 \label{eq:n1})\\
  n_j = \; & n_1,\hspace*{2.8em} j = 2,\dots,j_* \\
  n_j = \; & n_1 + 1, \ \ \ j = j_* +1,\dots,k, \label{n1-plus-1}\\
  & \qquad \mathrm{where}\ \ j_* := k - r \ \in \{1,\dots,k\}, \label{jstar}\\
  & \qquad \mathrm{and}\ \ r := n - k n_1 = \label{r-rest}
  n - k\floor[\big]{\frac n k} \in \{0,1,\dots,k-1\},
\end{align}
where the range of $j_*$, $1,\dots,k$, in (\ref{jstar}) is a consequence of the range of
the integer division remainder $r \in \{0,1,\dots,k-1\}$ in
(\ref{r-rest}). Consequently, (\ref{n1-plus-1}) may be empty, namely iff
$r=0$ ($\iff n = k \cdot n_1$ is a multiple of $k$): then $j_* = k$, and all $n_j \equiv n_1$.

Considering the range of $n_j$ in case~A, the minimum $n_1 \ge n_0$ in
(\ref{eq:n1}) is easy to verify. What is the maximal value of $n_j$, i.e.,
an upper bound for $n_{\max} := n_1+1 \ge \max_j n_j$? \
%%
%% This is all correct but not useful:
%% From (\ref{eq:n1}), $ n/k - 1 < n_1 \le n/k $, and
%% from (\ref{eq:k-def}), $n/n_0 - 1 < k \le n/n_0$.
%% Putting these two together, we get
%% \begin{eqnarray}
%%   \label{eq:n1-ineq}
%%   \frac{n^2}{n_0} - 1 \le n/k - 1 < n_1 \le n/k < \frac{n n_0}{n - n_0},
%% \end{eqnarray}
%% (the first $\le$ from $\frac{1}{k} \ge \frac{n_0}{n}$; the last $<$ from
%% $\frac{1}{k} < \frac 1{n/n_0 -1} = \frac{n_0}{n-n_0}$.)  Also,
%% from (\ref{eq:k-def}), $n \ge k n_0$ and $n-n_0 \ge (k-1)n_0$ and since we
%% are in case~A, $n < n_0 k_M$, which combines to
%% \begin{eqnarray}
%%   \label{eq:nn0}
%%   \frac{n n_0}{n - n_0} < \frac{(n_0 k_M) n_0}{(k-1)n_0} = \frac{n_0 k_M}{k-1}.
%% \end{eqnarray}
Consider $n_{1,\max}(k) = \max_{n, \mathrm{given\ } k} n_1 = \max_{n, \mathrm{given\ } k} \floor{\frac n k}$.
Given $k$, the maximal $n$ still fulfilling $\floor[\big]{\frac{n}{n_0}} = k$
is $n = (k+1)n_0 - 1$, where $\floor[\big]{\frac{n}{n_0}} = k + \floor[\big]{1 - \frac{1}{n_0}} = k$.
Hence, $n_{1,\max}(k) = \floor[\big]{\frac{(k+1)n_0 - 1}{k}} = n_0 +
\floor[\big]{\frac{n_0 - 1}{k}}$, and as $k \ge 2$, the maximum is at $k=2$:
$\max n_1 = \max_k n_{1,\max}(k) = n_0 + \floor[\big]{\frac{n_0 - 1}{2}} = \floor[\big]{\frac{3 n_0 - 1}{2}}$.
Taken together, as $n_j = n_1+1$ is possible, we have
\begin{align}
  \label{eq:nj-range}
  n_0 \le & n_1 \le \floor[\Big]{\frac{3 n_0 - 1}{2}} \nonumber\\
  n_0 \le & n_j \le \floor[\Big]{\frac{3 n_0 + 1}{2}}, \ \ j \ge 2.
\end{align}
Note that indeed, $\floor[\big]{\frac{3 n_0 + 1}{2}}$ is the length of the
auxiliary vector \code{subndex} in the Fortran code.

\bibliographystyle{chicago}
\bibliography{robustbase}

\end{document}