1%% $Id$
2\documentclass[a4paper,11pt]{article}
3\usepackage[T1]{fontenc}
4\usepackage[utf8]{inputenc}
5\usepackage{a4wide}
6
7%% Setup for the vignette system:
%\VignetteIndexEntry{Introduction to the pls Package}
9
10
11%%%
12%%% Local definitions:
13%%%
14%% Layout:
15%\setcounter{secnumdepth}{0}
16\pagestyle{headings}
17\renewcommand{\bfdefault}{b}
18
19\let\epsilon\varepsilon		% I prefer the \varepsilon variant.
20\newcommand\mat\mathrm          % Matrices
21%% Vectors: redefine \vec as bold italic instead of an arrow accent:
22\let\vec\relax\DeclareMathAlphabet{\vec}{OML}{cmm}{bx}{it}
23
24%% Ron's definitions:
25\newcommand{\bX}{\mbox{\boldmath{$X$}}}
26\newcommand{\bY}{\mbox{\boldmath{$Y$}}}
27\newcommand{\bB}{\mbox{\boldmath{$B$}}}
28\newcommand{\bT}{\mbox{\boldmath{$T$}}}
29\newcommand{\bP}{\mbox{\boldmath{$P$}}}
30\newcommand{\bU}{\mbox{\boldmath{$U$}}}
31\newcommand{\bD}{\mbox{\boldmath{$D$}}}
32\newcommand{\bV}{\mbox{\boldmath{$V$}}}
33\newcommand{\bA}{\mbox{\boldmath{$A$}}}
34\newcommand{\bE}{\mbox{\boldmath{$E$}}}
35\newcommand{\bF}{\mbox{\boldmath{$F$}}}
36\newcommand{\bQ}{\mbox{\boldmath{$Q$}}}
37\newcommand{\bS}{\mbox{\boldmath{$S$}}}
38\newcommand{\bW}{\mbox{\boldmath{$W$}}}
39\newcommand{\bR}{\mbox{\boldmath{$R$}}}
40
41%% From jss.cls (slightly modified):
42\let\code=\texttt
43\let\proglang=\textsf
44\newcommand{\pkg}[1]{{\normalfont\fontseries{b}\selectfont #1}}
45\newcommand{\email}[1]{{\normalfont\texttt{#1}}}
46
47
48%%%%%%%%%%%%%%%%
49\begin{document}
50
51%%%%%%%%%%%%%%%%
52\title{Introduction to the \pkg{pls} Package}
53\author{
54  Bjørn-Helge Mevik\\ %\thanks for footnotes
55  University Center for Information Technology, University of Oslo\\
56  Norway\\
57\and
58  Ron Wehrens\\
59  Biometris, Wageningen University \& Research\\
60  The Netherlands\\
61}
62\maketitle
63
64%% Setup of Sweave:
65\SweaveOpts{width=5,height=5}
66\setkeys{Gin}{width=5in}
67<<echo=FALSE,results=hide>>=
68pdf.options(pointsize=10)
69options(digits = 4)
70@
71
72
73\begin{abstract}
74The \pkg{pls} package implements Principal Component Regression (PCR) and
75Partial Least Squares Regression (PLSR) in
\proglang{R}, and is freely available from the
CRAN website, licensed under the GNU General Public License (GPL).
78
79The user interface is modelled after the traditional formula
80interface, as exemplified by \code{lm}.  This was done so that people
81used to \proglang{R} would not have to learn yet another interface, and also
82because we believe the formula interface is a good way of working
83interactively with models.  It thus has methods for generic functions like
84\code{predict}, \code{update} and \code{coef}.  It also has more specialised
85functions like \code{scores}, \code{loadings} and \code{RMSEP}, and a
flexible cross-validation system.  Visual inspection and assessment are
important in chemometrics, and the \pkg{pls} package has a number of plot
functions for plotting scores, loadings, predictions, coefficients and RMSEP
89estimates.
90
91The package implements PCR and several algorithms for PLSR.  The design is
92modular, so that it should be easy to use the underlying algorithms in other
93functions.  It is our hope that the package will serve well both for
94interactive data analysis and as a building block for other functions or
95packages using PLSR or PCR.
96
97We will here describe the package and how it is used for data analysis, as
98well as how it can be used as a part of other packages.  Also included is a
99section about formulas and data frames, for people not used to the
100\proglang{R} modelling idioms.
101\end{abstract}
102
103
104\section{Introduction}\label{sec:introduction}
105%%%%%%%%%%%%%%%%
106This vignette is meant as an introduction to the \pkg{pls} package.  It is
107based on the paper `The pls Package: Principal Component and Partial Least
108Squares Regression in R', published in
109\emph{Journal of Statistical Software}~\cite{MevWeh:plsJSS}.
110
The PLSR methodology is briefly described in Section~\ref{sec:theory}.
112Section~\ref{sec:example-session} presents an example session, to get an
113overview of the package.  In Section~\ref{sec:formula-data-frame} we
114describe formulas and data frames (as they are used in \pkg{pls}).  Users
115familiar with formulas and data frames in \proglang{R} can skip this section
116on first reading.  Fitting of models is described in
117Section~\ref{sec:fitting}, and cross-validatory choice of components is
118discussed in Section~\ref{sec:cross-validation}.  Next, inspecting and
119plotting models is described (Section~\ref{sec:inspecting}), followed by a
120section on predicting future observations (Section~\ref{sec:predicting}).
121Finally, Section~\ref{sec:advanced} covers more advanced topics such as
122parallel computing, setting options, using the underlying functions directly,
123and implementation details.
124
125
126\section{Theory}\label{sec:theory}
127%%%%%%%%%%%%%%%%
128Multivariate regression methods like Principal Component Regression
129(PCR) and Partial Least Squares Regression (PLSR) enjoy large
130popularity in a wide range of fields, including the natural
sciences. The main reason is that they are designed for the situation
in which there are many, possibly correlated, predictor
variables, and relatively few samples---a situation that is common,
134especially in chemistry where developments in spectroscopy since the
135seventies have revolutionised chemical analysis. In fact, the origin
136of PLSR lies in chemistry (see, e.g.,~\cite{Wold2001,Martens2001}). The field
137of \emph{near-infrared} (NIR) spectroscopy,
with its highly overlapping lines and difficult-to-interpret
overtones, would not have existed but for a method to obtain
140quantitative information from the spectra.
141
Other fields have also benefited greatly from multivariate regression
methods like PLSR and PCR. In medicinal chemistry, for example, one
often wants to derive molecular properties from the molecular
structure. Most of these Quantitative Structure-Activity Relations
146(QSAR, and also Quantitative Structure-Property Relations, QSPR), and
in particular, Comparative Molecular
Field Analysis (CoMFA)~\cite{Cramer1988}, use PLSR. Other applications
149range from statistical process control~\cite{Kresta1991} to tumour
150classification~\cite{Nguyen2002} to spatial analysis in brain
151images~\cite{McIntosh1996} to marketing~\cite{Fornell1982}.
152
153In the usual multiple linear regression (MLR) context, the
154least-squares solution for
155\begin{equation}
156\bY = \bX\bB + \mathcal{E}
157\end{equation}
158is given by
159\begin{equation}
160\bB = (\bX^T \bX)^{-1} \bX^T \bY
161\label{eq:lsq}
162\end{equation}
163The problem often is that $\bX^T \bX$ is singular, either
164because the number of variables (columns) in $\bX$ exceeds the number
165of objects (rows), or because of collinearities. Both PCR and
166PLSR circumvent this by decomposing $\bX$ into orthogonal scores $\bT$ and
167loadings $\bP$
168\begin{equation}
\bX = \bT \bP^T
170\end{equation}
171and regressing $\bY$ not on $\bX$ itself but on the first $a$ columns
172of the scores $\bT$. In PCR, the scores are given by the
173left singular vectors of $\bX$, multiplied with the corresponding
174singular values, and the loadings are the right singular vectors of
175$\bX$. This, however, only takes into account
176information about $\bX$, and therefore may be suboptimal for
177prediction purposes. PLSR aims to incorporate information on both $\bX$
178and $\bY$ in the definition of the scores and loadings. In fact, for
179one specific version of PLSR, called SIMPLS~\cite{Jong1993a}, it can be
shown that the scores and loadings are chosen in such a way as to
describe as much as possible of the {\em covariance} between $\bX$ and
$\bY$, whereas PCR concentrates on the {\em variance} of $\bX$. Other
183PLSR algorithms give identical results to SIMPLS in the case of one
184$\bY$ variable, but deviate slightly for the multivariate $\bY$ case;
185the differences are not likely to be important in practice.
186
187\subsection{Algorithms}
188In PCR, we approximate the $\bX$ matrix by the first $a$ Principal
189Components (PCs), usually obtained from the singular value
190decomposition (SVD):
191\[
192\bX = \tilde{\bX}_{(a)} + \mathcal{E}_X
193    = (\bU_{(a)} \bD_{(a)} ) \bV^T_{(a)} + \mathcal{E}_X
194    = \bT_{(a)} \bP_{(a)}^T + \mathcal{E}_X
195\]
196Next, we regress $\bY$ on the scores, which leads to regression
197coefficients
198\[
199\bB = \bP (\bT^T \bT)^{-1} \bT^T \bY
200    = \bV \bD^{-1} \bU^T \bY
201\]
202where the subscripts $a$ have been dropped for clarity.
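
To make these two steps concrete, the following is a minimal
\proglang{R} sketch of PCR via the SVD.  It is illustrative only: it
assumes already centred matrices \code{X} and \code{Y}, and the name
\code{pcrSketch} is ours (the actual fit function in \pkg{pls} is
\code{svdpc.fit}, see Section~\ref{sec:call-fit-func}).
<<eval=FALSE>>=
## Illustrative sketch of PCR via the SVD, for centred X and Y:
pcrSketch <- function(X, Y, a) {
    S <- svd(X)
    Ua <- S$u[, 1:a, drop = FALSE]
    Da <- S$d[1:a]
    Va <- S$v[, 1:a, drop = FALSE]
    TT <- Ua %*% diag(Da, nrow = a)       # scores T = U D
    B  <- Va %*% (crossprod(Ua, Y) / Da)  # B = V D^{-1} U^T Y
    list(scores = TT, loadings = Va, coefficients = B)
}
@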
203
204For PLSR, the components, called Latent Variables (LVs) in this
205context, are obtained iteratively. One starts with the
206SVD of the crossproduct matrix $\bS = \bX^T \bY$, thereby including
207information on variation in both $\bX$ and $\bY$, and on the
208correlation between them. The first left and right singular vectors,
209$w$ and $q$, are used as weight vectors for $\bX$ and $\bY$,
210respectively, to obtain scores $t$ and $u$:
211\begin{equation}
212t = \bX w = \bE w
213\end{equation}
214\begin{equation}
215u = \bY q = \bF q
216\end{equation}
217where $\bE$ and $\bF$ are initialised as $\bX$ and $\bY$,
218respectively. The X scores $t$ are often normalised:
219\begin{equation}
t = t / \sqrt{t^T t}
221\end{equation}
222The Y scores $u$ are not actually necessary in the regression but are often
223saved for interpretation purposes. Next, X and Y loadings are
224obtained by regressing against the {\em same} vector $t$:
225\begin{equation}
226\label{eq:plspt}
227p = \bE^T t
228\end{equation}
229\begin{equation}
230\label{eq:plsqt}
231q = \bF^T t
232\end{equation}
233Finally, the data matrices are `deflated': the information related
234to this latent variable, in the form of the outer products $t p^T$ and
235$t q^T$, is subtracted from the (current) data matrices $\bE$ and $\bF$.
236\begin{equation}
237\bE_{n+1} = \bE_n - t p^T
238\end{equation}
239\begin{equation}
240\bF_{n+1} = \bF_n - t q^T
241\end{equation}
242
243The estimation of the next component then can start from the SVD of
244the crossproduct matrix $\bE_{n+1}^T\bF_{n+1}$. After every iteration,
245vectors $w$, $t$, $p$ and $q$ are saved as columns in matrices $\bW$,
$\bT$, $\bP$ and $\bQ$, respectively. One complication is that columns
of matrix $\bW$ cannot be compared directly: they are derived from
successively deflated matrices $\bE$ and $\bF$. It has been shown that
249an alternative way to represent the weights, in such a way that all
250columns relate to the original $\bX$ matrix, is given by
251\begin{equation}
252\bR = \bW (\bP^T \bW)^{-1}
253\end{equation}
254
255Now, we are in the same position as in the PCR case: instead of
256regressing $\bY$ on $\bX$, we use scores $\bT$ to calculate the
257regression coefficients, and later convert these back to the
258realm of the original variables by pre-multiplying with matrix $\bR$
259(since $\bT = \bX \bR$):
260\[
261\bB = \bR (\bT^T \bT)^{-1} \bT^T \bY
262    = \bR \bT^T \bY
263    = \bR \bQ^T
264\]
265Again, here, only the first $a$ components are used. How many
266components are optimal has to be determined, usually by
267cross-validation.
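
For readers who prefer code to formulas, the following is a minimal
\proglang{R} sketch of the algorithm described above.  It is
illustrative only: it assumes already centred matrices \code{X} and
\code{Y}, uses \code{Fm} for the matrix called $\bF$ in the text, and
the name \code{plsSketch} is ours (the actual implementations in
\pkg{pls} are described in Section~\ref{sec:call-fit-func}).
<<eval=FALSE>>=
## Illustrative sketch of the orthogonal scores PLSR algorithm,
## for centred X and Y:
plsSketch <- function(X, Y, a) {
    E <- X; Fm <- Y                      # E and F in the text
    W <- P <- matrix(0, ncol(X), a)
    Q <- matrix(0, ncol(Y), a)
    TT <- matrix(0, nrow(X), a)
    for (n in 1:a) {
        w  <- svd(crossprod(E, Fm), nu = 1, nv = 0)$u  # first left singular vector
        tt <- E %*% w                    # X scores
        tt <- tt / sqrt(sum(tt^2))       # normalise the scores
        p  <- crossprod(E, tt)           # X loadings
        q  <- crossprod(Fm, tt)          # Y loadings
        E  <- E  - tcrossprod(tt, p)     # deflate X
        Fm <- Fm - tcrossprod(tt, q)     # deflate Y
        W[, n] <- w; P[, n] <- p; Q[, n] <- q; TT[, n] <- tt
    }
    R <- W %*% solve(crossprod(P, W))    # R = W (P^T W)^{-1}
    B <- tcrossprod(R, Q)                # B = R Q^T, since T^T T = I here
    list(scores = TT, loadings = P, coefficients = B)
}
@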
268
Many alternative formulations can be found in the literature. It has been
270shown, for instance, that only one of $\bX$ and $\bY$ needs to be
271deflated; alternatively, one can directly deflate the crossproduct
272matrix $\bS$ (as is done in SIMPLS, for example). Moreover, there are
273many equivalent ways of scaling. In the example above, the scores $t$
274have been normalised, but one can also choose to introduce
275normalisation at another point in the algorithm. Unfortunately, this
276can make it difficult to directly compare the scores and loadings of
277different PLSR implementations.
278
279\subsection{On the use of PLSR and PCR}
280In theory, PLSR should have an advantage over PCR. One could imagine a
281situation where a minor component in $\bX$ is highly correlated with
282$\bY$; not selecting enough components would then lead to very bad
283predictions. In PLSR, such a component would be automatically present
284in the first LV. In practice, however, there is hardly any difference
285between the use of PLSR and PCR; in most situations, the methods
286achieve similar prediction accuracies, although PLSR usually needs
287fewer latent variables than PCR. Put the other way around: with the
288same number of latent variables, PLSR will cover more of the variation
in $\bY$ and PCR will cover more of $\bX$. In turn, both behave very
similarly to ridge regression~\cite{Frank1993}.
291
292It can also be shown that both PCR and PLSR behave
as shrinkage methods~\cite{TESL}, although in some cases PLSR seems to
increase the variance of individual regression coefficients, which is one
possible explanation of why PLSR is not always better than PCR.
296
297
298\section{Example session}\label{sec:example-session}
299%%%%%%%%%%%%%%%%
300In this section we will walk through an example session, to get an overview
301of the package.
302
303To be able to use the package, one first has to load it:
304<<>>=
305library(pls)
306@
This prints a message saying that the package has been attached, and that
it implements a function \code{loadings} that masks a function of
the same name in package \pkg{stats}.  (The output of the commands has in
some cases been suppressed to save space.)
311
312Three example data sets are included in \pkg{pls}:
313\begin{description}
314\item[\code{yarn}] A data set with 28 near-infrared spectra (\code{NIR}) of PET yarns,
315  measured at 268 wavelengths, as predictors, and density as response
316  (\code{density})~\cite{SwiWeiWijBuy_StrConRobMulCalMod}.  The data set also
317  includes a logical variable \code{train} which can be used to split the
318  data into a training data set of size 21 and test data set of size 7.  See
319  \code{?yarn} for details.
\item[\code{oliveoil}] A data set with 5 quality measurements
  (\code{chemical}) and 6 sensory panel variables (\code{sensory}) made
322  on 16 olive oil samples~\cite{Mas_etal_HandChemQualB}.  See
323  \code{?oliveoil} for details.
324\item[\code{gasoline}] A data set consisting of octane number
325  (\code{octane}) and NIR spectra (\code{NIR}) of 60 gasoline
326  samples~\cite{Kal:2DatNIR}.  Each NIR spectrum consists of 401 diffuse
327  reflectance measurements from 900 to 1700 nm.  See \code{?gasoline} for
328  details.
329\end{description}
330These will be used in the examples that follow.  To use the data sets, they
331must first be loaded:
332<<>>=
333data(yarn)
334data(oliveoil)
335data(gasoline)
336@
For the rest of this vignette, it will be assumed that the package and the data
338sets have been loaded as above.  Also, all examples are run with
339\code{options(digits = 4)}.
340
341\begin{figure}
342  \begin{center}
343<<fig=TRUE,echo=FALSE,height=1.5>>=
344par(mar = c(2, 4, 0, 1) + 0.1)
345matplot(t(gasoline$NIR), type = "l", lty = 1, ylab = "log(1/R)", xaxt = "n")
346ind <- pretty(seq(from = 900, to = 1700, by = 2))
347ind <- ind[ind >= 900 & ind <= 1700]
348ind <- (ind - 898) / 2
349axis(1, ind, colnames(gasoline$NIR)[ind])
350@
351  \caption{Gasoline NIR spectra}\label{fig:NIR}
352  \end{center}
353\end{figure}
354
355In this section, we will do a PLSR on the \code{gasoline} data to illustrate
356the use of \pkg{pls}.  The spectra are shown in Figure~\ref{fig:NIR}.  We
357first divide the data set into train and test data sets:
358<<>>=
359gasTrain <- gasoline[1:50,]
360gasTest <- gasoline[51:60,]
361@
362A typical way of fitting a PLSR model is
363<<>>=
364gas1 <- plsr(octane ~ NIR, ncomp = 10, data = gasTrain, validation = "LOO")
365@
366This fits a model with 10 components, and includes \emph{leave-one-out} (LOO)
367cross-validated predictions~\cite{LacMic:EstERRDA}.  We can get an overview
368of the fit and validation results with the \code{summary} method:
369<<>>=
370summary(gas1)
371@
372The validation results here are \emph{Root Mean Squared Error of Prediction}
373(RMSEP).  There are two cross-validation estimates: \code{CV} is the
374ordinary CV estimate, and \code{adjCV} is a bias-corrected CV
estimate~\cite{MevCed:MSEPest}.  (For a LOO CV, there is virtually no
difference.)
377
378It is often simpler to judge the RMSEPs by plotting them:
379<<eval=FALSE>>=
380plot(RMSEP(gas1), legendpos = "topright")
381@
382%
383\begin{figure}
384  \begin{center}
385<<fig=TRUE,echo=FALSE,height=2.5>>=
386par(mar = c(4, 4, 2.5, 1) + 0.1)
387plot(RMSEP(gas1), legendpos = "topright")
388@
389  \caption{Cross-validated RMSEP curves for the \code{gasoline} data}\label{fig:RMSEPplsr}
390  \end{center}
391\end{figure}
392
393This plots the estimated RMSEPs as functions of the number of components
394(Figure~\ref{fig:RMSEPplsr}).  The \code{legendpos} argument adds a legend at
395the indicated position.  Two components seem to be enough.  This gives an
396RMSEP of
397\Sexpr{format(drop(RMSEP(gas1, "CV", ncomp = 2, intercept = FALSE)$val), digits = 3)}. %$
398As mentioned in the introduction, the main practical difference between PCR
399and PLSR is that PCR often needs more components than PLSR to achieve the
400same prediction error.  On this data set, PCR would need three components to
401achieve the same RMSEP.
402
403Once the number of components has been chosen, one can inspect different
404aspects of the fit by plotting predictions, scores, loadings, etc.
405The default plot is a prediction plot:
406<<eval=FALSE>>=
407plot(gas1, ncomp = 2, asp = 1, line = TRUE)
408@
409%
410\begin{figure}
411  \begin{center}
412\setkeys{Gin}{width=3.5in}
413<<fig=TRUE,echo=FALSE,width=3.5,height=3.7>>=
414par(mar = c(4, 4, 2.5, 1) + 0.1)
415plot(gas1, ncomp = 2, asp = 1, line = TRUE)
416@
417\setkeys{Gin}{width=5in}
418  \caption{Cross-validated predictions for the \code{gasoline}
419  data}\label{fig:cvpreds}
420  \end{center}
421\end{figure}
422
423This shows the cross-validated predictions with two components versus
424measured values (Figure~\ref{fig:cvpreds}).  We have chosen an aspect ratio
425of 1, and to draw a target line.  The points follow the target line quite
426nicely, and there is no indication of a curvature or other anomalies.
427
428Other plots can be selected with the argument \code{plottype}:
429<<eval=FALSE>>=
430plot(gas1, plottype = "scores", comps = 1:3)
431@
432%
433\begin{figure}
434  \begin{center}
435<<echo=FALSE,fig=TRUE>>=
436plot(gas1, plottype = "scores", comps = 1:3)
437@
438  \caption{Score plot for the \code{gasoline} data}\label{fig:scores}
439  \end{center}
440\end{figure}
441
442\begin{figure}
443  \begin{center}
444<<echo=FALSE,fig=TRUE,height=2.5>>=
445par(mar = c(4, 4, 0.3, 1) + 0.1)
446plot(gas1, "loadings", comps = 1:2, legendpos = "topleft",
447     labels = "numbers", xlab = "nm")
448abline(h = 0)
449@
450  \caption{Loading plot for the \code{gasoline} data}\label{fig:loadings}
451  \end{center}
452\end{figure}
This gives a pairwise plot of the score values for the first three
454components (Figure~\ref{fig:scores}).  Score plots are often used to look for
455patterns, groups or outliers in the data.  (For instance, plotting the two
456first components for a model built on the \code{yarn} dataset clearly
457indicates the experimental design of that data.)  In this example, there is
458no clear indication of grouping or outliers.  The numbers in parentheses
459after the component labels are the relative amount of X variance
460explained by each component.  The explained variances can be extracted
461explicitly with
462<<>>=
463explvar(gas1)
464@
465
466The loading plot (Figure~\ref{fig:loadings}) is much
467used for interpretation purposes, for instance to look for known spectral
468peaks or profiles:
469<<eval=FALSE>>=
470plot(gas1, "loadings", comps = 1:2, legendpos = "topleft",
471     labels = "numbers", xlab = "nm")
472abline(h = 0)
473@
474%
475The \code{labels = "numbers"} argument makes the plot function try to
476interpret the variable names as numbers, and use them as $x$ axis labels.
477
478A fitted model is often used to predict the response values of new
479observations.  The following predicts the responses for the ten observations
480in \code{gasTest}, using two components:
481<<>>=
482predict(gas1, ncomp = 2, newdata = gasTest)
483@
484Because we know the true response values for these samples, we can calculate
485the test set RMSEP:
486<<>>=
487RMSEP(gas1, newdata = gasTest)
488@
489For two components, we get
490\Sexpr{format(drop(RMSEP(gas1, ncomp = 2, intercept = FALSE, newdata = gasTest)$val), digits = 3)}, %$
491which is quite close to the cross-validated estimate above
492(\Sexpr{format(drop(RMSEP(gas1, "CV", ncomp = 2, intercept = FALSE)$val), digits = 3)}). %$
493
494
495\goodbreak
496\section{Formulas and data frames}\label{sec:formula-data-frame}
497%%%%%%%%%%%%%%%%
The \pkg{pls} package has a formula interface that, in most ways, works
like the formula interface of \proglang{R}'s standard \code{lm} function.
500This section gives a short description of formulas and data frames as they
501apply to \pkg{pls}.  More information on formulas can be found in the
502\code{lm} help file, in Chapter~11 of `An Introduction to R', and in
Chapter~2 of `The White Book'~\cite{R:Chambers+Hastie:1992}.  Anyone
wanting to understand how \proglang{R} works with formulas is strongly
advised to read these.
506
507
508\subsection{Formulas}
509%%%%%%%%%%%%%%%%
510A \emph{formula} consists of a \emph{left hand side} (lhs), a tilde
511(\code{\textasciitilde}), and a \emph{right hand side} (rhs).  The lhs
512consists of a single term, representing the response(s).  The rhs consists of
513one or more terms separated by \code{+}, representing the regressor(s).  For
514instance, in the formula \code{a \textasciitilde\ b + c + d}, \code{a} is the
515response, and \code{b}, \code{c}, and \code{d} are the regressors.  The
516intercept is handled automatically, and need not be specified in the formula.
517
518Each term represents a matrix, a numeric vector or a factor (a
519factor should not be used as the response).  If the response term is a
520matrix, a multi-response model is fit.
521In \pkg{pls}, the right hand side quite often consists of a
522single term, representing a matrix regressor: \code{y \textasciitilde\ X}.
523
524It is also possible to specify transformations of the variables.  For
525instance, \code{log(y) \textasciitilde\ msc(Z)} specifies a regression of the
526logarithm of \code{y} onto \code{Z} after \code{Z} has been transformed by
527\emph{Multiplicative Scatter (or Signal) Correction} (MSC)~\cite{Geladi1985},
528a pre-treatment that is very common in infrared spectroscopy.  If the
529transformations contain symbols that are interpreted in the formula handling,
530e.g., \code{+}, \code{*} or \verb|^|, the terms should be protected with the
531\code{I()} function, like this: \code{y \textasciitilde\ x1 + I(x2 + x3)}.  This specifies
532\emph{two} regressors: \code{x1}, and the sum of \code{x2} and \code{x3}.
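
As a small, hypothetical illustration (assuming a data frame
\code{mydata} containing \code{y}, \code{x1}, \code{x2}, \code{x3} and
a matrix \code{Z}), such formulas could be used like this:
<<eval=FALSE>>=
## Hypothetical examples; mydata is assumed to contain y, x1, x2, x3 and Z:
fit1 <- plsr(log(y) ~ msc(Z), ncomp = 5, data = mydata)      # transformed terms
fit2 <- plsr(y ~ x1 + I(x2 + x3), ncomp = 2, data = mydata)  # protected sum
@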
533
534
535\subsection{Data frames}
536%%%%%%%%%%%%%%%%
537The fit functions first look for the specified variables in a supplied
538data frame, and it is advisable to collect all variables there.  This
539makes it easier to know what data has been used for fitting, to keep
540different variants of the data around, and to predict new data.
541
542To create a data frame, one can use the \code{data.frame} function: if
543\code{v1}, \code{v2} and \code{v3} are factors or numeric vectors,
544\code{mydata <- data.frame(y = v1, a = v2, b = v3)} will result in a data
545frame with variables named \code{y}, \code{a} and \code{b}.
546
547PLSR and PCR are often used with a matrix as the single predictor term
548(especially when one is working with spectroscopic data).  Also,
549multi-response models require a matrix as the response term.  If \code{Z} is
550a matrix, it has to be protected by the `protect function' \code{I()} in
551calls to \code{data.frame}: \code{mydata <- data.frame(..., Z = I(Z))}.
552Otherwise, it will be split into separate variables for each column, and
553there will be no variable called \code{Z} in the data frame, so we cannot
554use \code{Z} in the formula.
555One can also add the matrix to an existing data frame:
556\begin{verbatim}
557> mydata <- data.frame(...)
558> mydata$Z <- Z
559\end{verbatim}
560%$
561This will also prevent \code{Z} from being split into separate variables.
562Finally, one can use \code{cbind} to combine vectors and matrices into
563matrices on the fly in the formula.  This is most useful for the response, e.g.,
564\code{cbind(y1, y2) \textasciitilde\ X}.
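
Putting these pieces together, a typical setup might look as follows
(a sketch, assuming a response vector \code{y} and a predictor matrix
\code{Z} already exist in the workspace):
<<eval=FALSE>>=
## Sketch: y is a response vector, Z a matrix of predictors
mydata <- data.frame(y = y, Z = I(Z))  # I() keeps Z as a single matrix variable
fit <- plsr(y ~ Z, ncomp = 5, data = mydata)
@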
565
566Variables in a data frame can be accessed with the \code{\$} operator, e.g.,
567\code{mydata\$y}.  However, the \pkg{pls} functions access the variables
568automatically, so the user should never use \code{\$} in formulas.
569
570
571\section{Fitting models}\label{sec:fitting}
572%%%%%%%%%%%%%%%%
573The main functions for fitting models are \code{pcr} and \code{plsr}.  (They
574are simply wrappers for the function \code{mvr}, selecting the appropriate
575fit algorithm).  We will use \code{plsr} in the examples in this section,
576but everything could have been done with \code{pcr} (or \code{mvr}).
577
578In its simplest form, the function call for fitting models is
579\code{plsr(formula, ncomp, data)} (where \code{plsr} can be substituted with
580\code{pcr} or \code{mvr}).  The argument \code{formula} is a formula as
581described above, \code{ncomp} is the number of components one wishes to fit,
582and \code{data} is the data frame containing the variables to use in the
583model.  The function returns a fitted model (an object of class
584\code{"mvr"}) which can be inspected (Section~\ref{sec:inspecting}) or used
585for predicting new observations (Section~\ref{sec:predicting}).  For
586instance:
587<<>>=
588dens1 <- plsr(density ~ NIR, ncomp = 5, data = yarn)
589@
590If the response term of the formula is a matrix, a multi-response model is
591fit, e.g.,
592<<>>=
593dim(oliveoil$sensory)
594plsr(sensory ~ chemical, data = oliveoil)
595@
596(As we see, the \code{print} method simply tells us what type of model this
597is, and how the fit function was called.)
598
The argument \code{ncomp} is optional.  If it is missing, the maximal
possible number of components is used.  Also \code{data} is optional, and
if it is missing, the variables specified in the formula are searched for in
the global environment (the user's workspace).  Usually, it is preferable to
603keep the variables in data frames, but it can sometimes be convenient to
604have them in the global environment.  If the variables reside in a data
605frame, e.g.\ \code{yarn}, \emph{do not} be tempted to use formulas like
606\code{yarn\$density \textasciitilde\ yarn\$NIR}!  Use \code{density
607  \textasciitilde\ NIR} and specify the
608data frame with \code{data = yarn} as above.
609
610There are facilities for working interactively with models.  To use only
611part of the samples in a data set, for instance the first 20, one can use
612arguments \code{subset = 1:20} or \code{data = yarn[1:20,]}.
613Also, if one wants to try different alternatives of the model, one can use
614the function \code{update}.  For instance
615<<>>=
616trainind <- which(yarn$train == TRUE)
617dens2 <- update(dens1, subset = trainind)
618@
619will refit the model \code{dens1} using only the observations which are
620marked as \code{TRUE} in \code{yarn\$train}, and
621<<>>=
622dens3 <- update(dens1, ncomp = 10)
623@
624will change the number of components to 10.  Other arguments, such as
625\code{formula}, can also be changed with \code{update}.  This can save a bit
626of typing when working interactively with models (but it doesn't save
627computing time; the model is refitted each time).
628In general, the reader is referred to `The White
629Book'~\cite{R:Chambers+Hastie:1992} or `An Introduction to R' for more
630information about fitting and working with models in \proglang{R}.
631
632Missing data can sometimes be a problem.  The PLSR and PCR algorithms
633currently implemented in \pkg{pls} do not handle missing values
634intrinsically, so observations with missing values must be removed.  This
635can be done with the \code{na.action} argument.  With \code{na.action =
636na.omit} (the default), any observation with missing values will be removed
637from the model completely.  With \code{na.action = na.exclude}, they will be
638removed from the fitting process, but included as \code{NA}s in the
639residuals and fitted values.  If you want an explicit error when there are
640missing values in the data, use \code{na.action = na.fail}.  The default
641\code{na.action} can be set with \code{options()}, e.g.,
642\code{options(na.action = quote(na.fail))}.
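
For instance, to keep incomplete observations visible as \code{NA}s in
the fitted values and residuals, one could fit a model like this (a
sketch only; the example data sets in \pkg{pls} contain no missing
values):
<<eval=FALSE>>=
## Sketch: observations with missing values are dropped from the fit,
## but kept as NAs in the residuals and fitted values
dens.na <- plsr(density ~ NIR, ncomp = 5, data = yarn, na.action = na.exclude)
@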
643
644Standardisation and other pre-treatments of predictor variables are often
called for.  In \pkg{pls}, the predictor variables are always centered, as a
646part of the fit algorithm.  Scaling can be requested with the \code{scale}
647argument.  If \code{scale} is \code{TRUE}, each variable is standardised by
648dividing it by its standard deviation, and if \code{scale} is a numeric
649vector, each variable is divided by the corresponding number.
650For instance, this will fit a model with standardised chemical
651measurements:
652<<>>=
653olive1 <- plsr(sensory ~ chemical, scale = TRUE, data = oliveoil)
654@
655
656As mentioned earlier, MSC~\cite{Geladi1985} is implemented in \pkg{pls} as a
657function \code{msc} that can be used in formulas:
658<<>>=
659gas2 <- plsr(octane ~ msc(NIR), ncomp = 10, data = gasTrain)
660@
661This scatter corrects \code{NIR} prior to the fitting, and arranges for new
662spectra to be automatically scatter corrected (using the same reference
663spectrum as when fitting) in \code{predict}:
664<<eval=FALSE>>=
665predict(gas2, ncomp = 3, newdata = gasTest)
666@
667
668There are other arguments that can be given in the fit call:
669\code{validation} is for selecting validation, and \code{...} is for sending
670arguments to the underlying functions, notably the cross-validation function
671\code{mvrCv}.  For the other arguments, see \code{?mvr}.
672
673
674\section{Choosing the number of components with cross-validation}\label{sec:cross-validation}
675%%%%%%%%%%%%%%%%
676Cross-validation, commonly used to determine the optimal number of
677components to take into account, is controlled by the \code{validation}
678argument in the modelling functions (\code{mvr}, \code{plsr} and
679\code{pcr}). The default value is \code{"none"}. Supplying a value of
680\code{"CV"} or \code{"LOO"} will cause the modelling procedure to call
681\code{mvrCv} to perform cross-validation; \code{"LOO"} provides
682leave-one-out cross-validation, whereas \code{"CV"} divides the data into
segments. The default is to use ten randomly selected segments, but
segments of consecutive objects or interleaved segments (sometimes
referred to as `Venetian blinds') are also possible through the
argument \code{segment.type}.  One can also specify the segments explicitly
687with the argument \code{segments}; see \code{?mvrCv} for details.
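
For instance, the following sketch (not run here) would use ten
consecutive segments instead of ten random ones:
<<eval=FALSE>>=
## Sketch: cross-validation with 10 consecutive segments
gas.cvcons <- plsr(octane ~ NIR, ncomp = 10, data = gasTrain,
                   validation = "CV", segments = 10,
                   segment.type = "consecutive")
@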
688
689When validation is performed in this way, the
690model will contain an element comprising information on the out-of-bag
predictions (in the form of predicted values, as well as MSEP and $R^2$
values). As a reference, the MSEP using no components at all is
calculated as well. The validation results can be visualised using the
694\code{plottype = "validation"} argument of the standard plotting
695function. An example is shown in Figure~\ref{fig:RMSEPplsr} for the
696gasoline data; typically, one would select a number of components
697after which the cross-validation error does not show a significant
698decrease.
699
700The decision on how many components to retain will to some extent
701always be subjective. However, especially when building large numbers
702of models (e.g., in simulation studies), it can be crucial to have a
703consistent strategy on how to choose the ``optimal'' number of
704components. Two such strategies have been implemented in function
705\code{selectNcomp}. The first is based on the so-called one-sigma
706heuristic~\cite{TESL2013} and consists of choosing the model with fewest
707components that is still less than one standard error away from the
708overall best model. The second strategy employs a permutation
709approach, and basically tests whether adding a new component is
beneficial at all~\cite{Voet1994}. It is implemented backwards, again
taking the global minimum in the cross-validation curve as a starting
point, and assessing models with fewer and fewer components: as long
713as no significant deterioration in performance is found (by default on
714the $\alpha = 0.01$ level), the algorithm
715continues to remove components. Applying the function is quite
716straightforward:
717<<eval=FALSE>>=
ncomp.onesigma <- selectNcomp(gas1, method = "onesigma", plot = TRUE,
                              ylim = c(.18, .6))
ncomp.permut <- selectNcomp(gas1, method = "randomization", plot = TRUE,
                            ylim = c(.18, .6))
722@
723This leads to the plots in Figure~\ref{fig:NComp} -- note that
724graphical arguments can be supplied to customize the plots. In both cases,
the global minimum of the cross-validation curve is indicated with gray
726dotted lines, and the suggestion for the optimal number of components
727with a vertical blue dashed line. The left plot shows the width of the
728one-sigma intervals on which the suggestion is based; the right plot
729indicates which models have been assessed by the permutation approach
730through the large blue circles. The two criteria do not always agree
731(as in this case) but usually are quite close.
732\begin{figure}[tb]
733\centering
734\setkeys{Gin}{width=\textwidth}
735<<fig=TRUE,echo=FALSE,height=4.5,width=10>>=
736par(mfrow = c(1,2))
737ncomp.onesigma <- selectNcomp(gas1, "onesigma", plot = TRUE,
738                              ylim = c(.18, .6))
739ncomp.permut <- selectNcomp(gas1, "randomization", plot = TRUE,
740                            ylim = c(.18, .6))
741@
742\caption{The two strategies for suggesting optimal model dimensions:
743  the left plot shows the one-sigma strategy, the right plot the
744  permutation strategy.}
745\label{fig:NComp}
746\end{figure}
747
748When a pre-treatment that depends on the composition of the training set is
749applied, the cross-validation procedure as described above is not optimal,
750in the sense that the cross-validation errors are biased downward.  As long
751as the only purpose is to select the optimal number of components, this bias
752may not be very important, but it is not too difficult to avoid it. The
753modelling functions have an argument \code{scale} that can be used for
754auto-scaling per segment.  However, more elaborate methods such as MSC need
755explicit handling per segment. For this, the function \code{crossval} is
756available. It takes an \code{mvr} object and performs the cross-validation
757as it should be done: applying the pre-treatment for each segment. The
758results can be shown in a plot (which looks very similar to
759Figure~\ref{fig:RMSEPplsr}) or summarised in numbers.
760<<>>=
761gas2.cv <- crossval(gas2, segments = 10)
762plot(MSEP(gas2.cv), legendpos="topright")
763summary(gas2.cv, what = "validation")
764@
765Applying MSC in this case leads to nearly identical cross-validation
766estimates of prediction error.
767
768When the scaling does not depend on the division of the data into
769segments (e.g., log-scaling), functions \code{crossval} and
770\code{mvrCv} give the same results; however, \code{crossval} is much
771slower.
772
773Cross-validation can be computationally demanding (especially when using the
774function \code{crossval}).  Therefore, both \code{mvrCv} and \code{crossval}
775can perform the calculations in parallel on a multi-core machine or on
776several machines.  How to do this is described in
777Section~\ref{sec:parallel-cv}.
778
779
780\section{Inspecting fitted models}\label{sec:inspecting}
781%%%%%%%%%%%%%%%%
782A closer look at the fitted model may reveal interesting agreements or
783disagreements with what is known about the relations between X and
784Y. Several functions are implemented in \pkg{pls} for plotting,
785extracting and summarising model components.
786
787
788\subsection{Plotting}
789%%%%%%%%%%%%%%%%
One can access all plotting functions through the \code{plottype}
791argument of the \code{plot} method for \code{mvr} objects.  This is simply a wrapper
792function calling the actual plot functions; the latter are available
793to the user as well.
794
795The default plot is a prediction plot (\code{predplot}), showing predicted
796versus measured values.  Test set predictions are used if a test set is
797supplied with the \code{newdata} argument.  Otherwise, if the model was
798built using cross-validation, the cross-validated predictions are used,
799otherwise the predictions for the training set.  This can be overridden with
800the \code{which} argument.  An example of this type of plot can be seen in
Figure~\ref{fig:cvpreds}. The optional argument \code{ncomp} indicates how
many components should be included in the prediction.
803
804To assess how many components are optimal, a validation plot
(\code{validationplot}) can be
used, such as the one shown in Figure~\ref{fig:RMSEPplsr}; this shows a
807measure of prediction performance (either RMSEP,
808MSEP, or $R^2$) against the number of components. Usually, one takes the
809first local minimum rather than the absolute minimum in the curve, to
810avoid over-fitting.
811
812The regression coefficients can be visualised using
813\code{plottype = "coef"} in the \code{plot} method, or directly through
814function \code{coefplot}. This allows simultaneous plotting of the
815regression vectors for several different numbers of components at
once. The regression vectors for the \code{gasoline} data set
are shown in Figure~\ref{fig:gascoefs}, generated with the command
818<<eval=FALSE>>=
819plot(gas1, plottype = "coef", ncomp=1:3, legendpos = "bottomleft",
820     labels = "numbers", xlab = "nm")
821@
822\begin{figure}
823  \begin{center}
824<<fig=TRUE,echo=FALSE,height=3>>=
825par(mar = c(4, 4, 2.5, 1) + 0.1)
826plot(gas1, plottype = "coef", ncomp=1:3, legendpos = "bottomleft",
827     labels = "numbers", xlab = "nm")
828@
829  \caption{Regression coefficients for the \code{gasoline} data}\label{fig:gascoefs}
830  \end{center}
831\end{figure}
832Note that the coefficients for two components and three components are
833similar.  This is because the third component contributes little to the
834predictions.  The RMSEPs (see Figure~\ref{fig:RMSEPplsr}) and predictions
835(see Section~\ref{sec:predicting}) for two and three components are quite
836similar.
837
838Scores and loadings can be plotted using functions \code{scoreplot}
839(an example is shown in Figure~\ref{fig:scores})
840and \code{loadingplot} (in Figure~\ref{fig:loadings}),
841respectively. One can indicate the number of
842components with the \code{comps} argument; if more than two components
843are given, plotting the scores will give a pairs plot, otherwise a
844scatter plot. For \code{loadingplot}, the default is to use line plots.
845
846Finally, a `correlation loadings' plot (function \code{corrplot}, or
847\code{plottype = "correlation"} in \code{plot}) shows the
848correlations between each variable and the selected components (see
849Figure~\ref{fig:corrplot}). These
850plots are scatter plots of two sets of scores with concentric circles
851of radii given by \code{radii}. Each point corresponds to an X variable.
852The squared distance between the point and the origin equals the fraction
853of the variance of the variable explained by the components in the
854panel. The default values for \code{radii} correspond to 50\% and
855100\% explained variance, respectively.
856\begin{figure}
857  \begin{center}
858\setkeys{Gin}{width=3.5in}
859<<fig=TRUE,echo=FALSE,width=3.5,height=3.4>>=
860par(mar = c(4, 4, 0, 1) + 0.1)
861plot(gas1, plottype = "correlation")
862@
863\setkeys{Gin}{width=5in}
864  \caption{Correlation loadings plot for the \code{gasoline} data}\label{fig:corrplot}
865  \end{center}
866\end{figure}
867
868The plot functions accept most of the ordinary plot parameters, such as
869\code{col} and \code{pch}.
If the model has several responses, or if one selects more than one
model size, e.g.\ \code{ncomp = 4:6}, some plot functions (notably
prediction, validation and coefficient plots) divide the plot window
and show one plot for each combination of response and model size.
The number of rows and columns is chosen automatically, but can be
specified explicitly with arguments \code{nRows} and \code{nCols}.  If
there are more plots than fit the plot window, one will be asked to
press return to see the rest of the plots.
878
879
880\subsection{Extraction}
881%%%%%%%%%%%%%%%%
882Regression coefficients can be extracted using the generic function
883\code{coef}; the function takes several arguments, indicating the
884number of components to take into account, and whether the intercept is
885needed (default is \code{FALSE}).
886
887Scores and loadings can be extracted using functions \code{scores} and
888\code{loadings} for X, and \code{Yscores} and
889\code{Yloadings} for Y. These also return the percentage of variance
890explained as attributes. In PLSR, weights can be extracted using the function
891\code{loading.weights}. When applied to a PCR model, the function
892returns \code{NULL}.
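
For instance, applied to the \code{gas1} model from
Section~\ref{sec:example-session} (not run here):
<<eval=FALSE>>=
B <- coef(gas1, ncomp = 2, intercept = TRUE)  # coefficients, incl. the intercept
S <- scores(gas1)                             # X scores
L <- loadings(gas1)                           # X loadings
W <- loading.weights(gas1)                    # PLSR weights
@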
893
894Note that commands like \code{plot(scores(gas1))} are perfectly
895correct, and lead to exactly the same plots as using \code{scoreplot}.
896
897
898\subsection{Summaries}
899%%%%%%%%%%%%%%%%
The \code{print} method for an object of class \code{"mvr"} shows the
regression type used, indicates the form of validation employed (if
any), and shows the function call. The \code{summary} method gives
903more information: it also shows the amount of variance explained by
904the model (for all choices of $a$, the number of latent
variables). The \code{summary} method has an additional argument
(\code{what}) that can be used to focus on the training or the
validation phase.  The default is to print both types of information.
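
For instance, to see only the training results (explained variances)
for the \code{gas1} model (not run here):
<<eval=FALSE>>=
summary(gas1, what = "training")
@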
908
909
910\section{Predicting new observations}\label{sec:predicting}
911%%%%%%%%%%%%%%%%
912Fitted models are often used to predict future observations, and
913\pkg{pls} implements a \code{predict} method for PLSR and PCR models.  The
914most common way of calling this function is
915\code{predict(mymod, ncomp = myncomp, newdata = mynewdata)}, where
916\code{mymod} is a fitted model, \code{myncomp} specifies the model size(s) to
917use, and \code{mynewdata} is a data frame with new X observations.  The
918data frame can also contain response measurements for the new observations,
919which can be used to compare the predicted values to the measured ones, or to
920estimate the overall prediction ability of the model.  If \code{newdata} is
921missing, \code{predict} uses the data used to fit the model, i.e., it returns
922fitted values.
923
924If the argument \code{ncomp} is missing, \code{predict} returns predictions
925for models with 1 component, 2 components, $\ldots$, $A$ components, where
926$A$ is the number of components used when fitting the model.  Otherwise, the
927model size(s) listed in \code{ncomp} are used.  For instance, to get
928predictions from the model built in Section~\ref{sec:example-session}, with
929two and three components, one would use
930<<>>=
931predict(gas1, ncomp = 2:3, newdata = gasTest[1:5,])
932@
(We predict only the first five test observations, to save space.)  The
934predictions with two and three components are quite similar.  This could be
935expected, given that the regression vectors (Figure~\ref{fig:gascoefs})
936as well as the estimated RMSEPs for the two model sizes were similar.
937
938One can also specify explicitly which components to use when predicting.
939This is done by specifying the components in the argument \code{comps}.  (If
940both \code{ncomp} and \code{comps} are specified, \code{comps} takes
941precedence over \code{ncomp}.)  For instance, to get predictions from a
942model with only component 2, one can use
943<<>>=
944predict(gas1, comps = 2, newdata = gasTest[1:5,])
945@
946The results are different from the predictions with two components
947(i.e., components one and two) above.  (The intercept is always included in
948the predictions.  It can be removed by subtracting \code{mymod\$Ymeans}
949from the predicted values.)
950
951The \code{predict} method returns a three-dimensional array, in which the
952entry $(i,j,k)$ is the predicted value for observation $i$, response $j$ and
953model size $k$.  Note that singleton dimensions are not dropped, so
predicting five observations for a uni-response model with \code{ncomp = 3}
gives a $5 \times 1 \times 1$ array, not a vector of length five.  This is
956to make it easier to distinguish between predictions from models with one
957response and predictions with one model size.  (When using the \code{comps}
958argument, the last dimension is dropped, because the predictions are always
959from a single model.)  One can drop the singleton dimensions explicitly by
960using \code{drop(predict(...))}:
961<<>>=
962drop(predict(gas1, ncomp = 2:3, newdata = gasTest[1:5,]))
963@
964
965Missing values in \code{newdata} are propagated to \code{NA}s in the predicted
966response, by default.  This can be changed with the \code{na.action}
967argument.  See \code{?na.omit} for details.
968
The \code{newdata} does not have to be a data frame.  Recognising the fact
that the right hand side of PLSR and PCR formulas very often is a single
matrix term, the \code{predict} method allows one to use a matrix as
\code{newdata}, so instead of
973\begin{verbatim}
974newdataframe <- data.frame(X = newmatrix)
975predict(..., newdata = newdataframe)
976\end{verbatim}
977one can simply say
978\begin{verbatim}
979predict(..., newdata = newmatrix)
980\end{verbatim}
981However, there are a couple of caveats: First, this \emph{only} works in
982\code{predict}.  Other functions that take a \code{newdata} argument (such
983as \code{RMSEP}) must have a data frame (because they also need the response
984values).  Second, when \code{newdata} is a data frame, \code{predict} is
985able to perform more tests on the supplied data, such as the dimensions and
986types of variables.  Third, with the exception of scaling (specified with
987the \code{scale} argument when fitting the model), any transformations or
988coding of factors and interactions have to be performed manually if
989\code{newdata} is a matrix.
990
991It is often interesting to predict scores from new observations, instead of
992response values.  This can be done by specifying the argument \code{type =
993"scores"} in \code{predict}.  One will then get a matrix with the scores
994corresponding to the components specified in \code{comps} (\code{ncomp} is
995accepted as a synonym for \code{comps} when predicting scores).
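
For instance, the following sketch (not run here) returns the score
values of the test samples for the first two components:
<<eval=FALSE>>=
predict(gas1, comps = 1:2, type = "scores", newdata = gasTest)
@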
996
997Predictions can be plotted with the function \code{predplot}.  This function
998is generic, and can also be used for plotting predictions from other types
999of models, such as \code{lm}.  Typically, \code{predplot} is called like this:
1000<<eval=FALSE>>=
1001predplot(gas1, ncomp = 2, newdata = gasTest, asp = 1, line = TRUE)
1002@
1003%
1004\begin{figure}
1005  \begin{center}
1006\setkeys{Gin}{width=3.5in}
1007<<echo=FALSE,fig=TRUE,width=3.5,height=3.7>>=
1008par(mar = c(4, 4, 2.5, 1))
1009predplot(gas1, ncomp = 2, newdata = gasTest, asp = 1, line = TRUE)
1010@
1011\setkeys{Gin}{width=5in}
1012  \caption{Test set predictions}\label{fig:testPreds}
1013  \end{center}
1014\end{figure}
1015
1016This plots predicted (with 2 components) versus measured response values.
1017(Note that \code{newdata} must be a data frame with both X and Y
1018variables.)
1019
1020
1021\section{Further topics}\label{sec:advanced}
1022%%%%%%%%%%%%%%%%
1023This section presents a couple of slightly technical topics for more
1024advanced use of the package.
1025
1026
1027\subsection{Selecting fit algorithms}\label{sec:select-fit-alg}
1028%%%%%%%%%%%%%%%%
1029There are several PLSR algorithms, and the \pkg{pls} package currently
1030implements three of them: the kernel algorithm for tall matrices (many
1031observations, few variables)~\cite{DayMacGre:ImprPlsAlg}, the classic
orthogonal scores algorithm (also known as NIPALS)~\cite{MarNaes:MultCal}
1033and the SIMPLS algorithm~\cite{Jong1993a}.  The kernel and orthogonal
1034scores algorithms produce the same results (the kernel algorithm being the
1035fastest of them for most problems).  SIMPLS produces the same fit for
1036single-response models, but slightly different results for multi-response
1037models.  It is also usually faster than the NIPALS algorithm.
1038
1039The factory default is to use the kernel algorithm.  One can specify a
1040different algorithm with the \code{method} argument; i.e., \code{method =
1041"oscorespls"}.
1042
1043If one's personal taste of algorithms does not coincide with the defaults in
\pkg{pls}, it can be quite tedious (and error-prone) having to write e.g.\
1045\code{method = "oscorespls"} every time (even though it can be shortened to
1046e.g.\ \code{me = "o"} due to partial matching).  Therefore, the defaults can
1047be changed, with the function \code{pls.options}.  Called without arguments,
1048it returns the current settings as a named list:
1049<<>>=
1050pls.options()
1051@
1052The options specify the default fit algorithm of \code{mvr}, \code{plsr},
1053and \code{pcr}.  To return only a specific option, one can use
1054\code{pls.options("plsralg")}.  To change the default PLSR algorithm for the
1055rest of the session, one can use, e.g.
1056<<>>=
1057pls.options(plsralg = "oscorespls")
1058@
1059Note that changes to the options only last until \proglang{R} exits.  (Earlier
1060versions of \pkg{pls} stored the changes in the global environment so they
1061could be saved and restored, but current CRAN policies do not allow this.)
1062
1063
1064\subsection{Parallel cross-validation}\label{sec:parallel-cv}
1065%%%%%%%%%%%%%%%%
1066Cross-validation is a computationally demanding procedure.  A new model has to
1067be fitted for each segment.  The underlying fit functions have been optimised,
1068and the implementation of cross-validation that is used when specifying
1069the \code{validation} argument to \code{mvr} tries to avoid any unneeded
1070calculations (and house-keeping things like the formula handling, which can be
1071surprisingly expensive).  Even so, cross-validation can take a long time, for
1072models with large matrices, many components or many segments.
1073
By default, the cross-validation calculations in \pkg{pls} are performed
1075serially, on one CPU (core).  (In the following, we will use `CPU' to denote
1076both CPUs and cores.)
1077
1078Since version 2.14.0, \proglang{R} has shipped with a package \pkg{parallel}
1079for running calculations in parallel, on multi-CPU machines or on
1080several machines.  The \pkg{pls} package can use the facilities of
1081\pkg{parallel} to run the cross-validations in parallel.
1082
1083The \pkg{parallel} package has several ways of running calculations in
1084parallel, and not all of them are available on all systems.  Therefore, the
1085support in \pkg{pls} is quite general, so one can select the ways that work
1086well on the given system.
1087
1088To specify how to run calculations in parallel, one sets the option
1089\code{parallel} in \code{pls.options}.  After setting the option, one simply
runs cross-validations as before, and the calculations will be performed in
1091parallel.  This works both when using the \code{crossval} function and the
1092\code{validation} argument to \code{mvr}.  The parallel specification has
1093effect until it is changed.
1094
1095The default value for \code{parallel} is \code{NULL}, which specifies that the
1096calculations are done serially, using one CPU.  Specifying the value 1 has the
1097same effect.
1098
1099Specifying an integer $> 1$ makes the calculations use the function
1100\code{mclapply} with the given number as the number of CPUs to use.  Note:
1101\code{mclapply} depends on `forking' which does not exist on MS Windows, so
1102\code{mclapply} cannot be used there.
1103
1104Example:
1105\begin{verbatim}
1106pls.options(parallel = 4) # Use mclapply with 4 CPUs
1107gas1.cv <- plsr(octane ~ NIR, ncomp = 10, data = gasTrain, validation = "LOO")
1108\end{verbatim}
1109
1110The \code{parallel} option can also be specified as a cluster object created
by the \code{makeCluster} function from the package \pkg{parallel}.
1112Any following cross-validation will then be performed with the function
1113\code{parLapply} on that cluster.  Any valid cluster specification can be
1114used.  The user should stop the cluster with
1115\code{stopCluster(pls.options()\$parallel)} when it is no longer needed.
1116
1117\begin{verbatim}
1118library(parallel) # Needed for the makeCluster call
1119pls.options(parallel = makeCluster(4, type = "PSOCK")) # PSOCK cluster, 4 CPUs
1120gas1.cv <- plsr(octane ~ NIR, ncomp = 10, data = gasTrain, validation = "LOO")
1121## later:
1122stopCluster(pls.options()$parallel)
1123\end{verbatim}
1124
Several types of clusters are available: FORK uses forking, so starting the
cluster is very quick; however, it is not available on MS Windows.  PSOCK
1127starts \proglang{R} processes with the \code{Rscript} command, which is
slower, but is supported on MS Windows.  It can also start worker processes on
different machines (see \code{?makeCluster} for how).  MPI uses MPI to start and
1130communicate with processes.  This is the most flexible, but is often slower to
start up than the other types.  It also requires the packages \pkg{snow}
and \pkg{Rmpi} to be installed and working.  It is especially useful when
1133running batch jobs on a computing cluster, because MPI can interact with the
1134queue system on the cluster to find out which machines to use when the job
1135starts.
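
For instance, a FORK cluster with eight workers could be set up like
this (a sketch; not available on MS Windows):
\begin{verbatim}
pls.options(parallel = makeCluster(8, type = "FORK")) # forked workers
\end{verbatim}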
1136
1137Here is an example of running a batch job on a cluster using MPI:
1138
1139R script (myscript.R):
1140\begin{verbatim}
1141library(parallel) # for the makeCluster call
pls.options(parallel = makeCluster(16, type = "MPI")) # MPI cluster, 16 CPUs
1143gas1.cv <- plsr(octane ~ NIR, ncomp = 10, data = gasTrain, validation = "LOO")
1144## later:
1145save.image(file = "results.RData")
1146stopCluster(pls.options()$parallel)
1147mpi.exit() # stop Rmpi
1148\end{verbatim}
1149
1150To run the job:
1151\begin{verbatim}
1152mpirun -np 1 R --slave --file=myscript.R
1153\end{verbatim}
The details of how to run \code{mpirun} vary between the different MPI
implementations and how they interact with the queue system used (if any).
1156The above should work for OpenMPI or Intel MPI running under the Slurm queue
1157system.  In other situations, one might have to specify which machines to use
1158with, e.g., the \code{-host} or \code{-machinefile} switch.
1159
1160
1161\subsection{Package design}\label{sec:package-design}
1162%%%%%%%%%%%%%%%%
1163The \pkg{pls} package is designed such that an interface function \code{mvr}
1164handles the formula and data, and calls an underlying fit function (and
1165possibly a cross-validation function) to do the real work.  There are
1166several reasons for this design: it makes it easier to implement new
1167algorithms, one can easily skip the time-consuming formula and data handling
1168in computing-intensive applications (simulations, etc.), and it makes it
easier to use the \pkg{pls} package as a building block in other packages.
1170
1171The plotting facilities are implemented similarly: the \code{plot} method
1172simply calls the correct plot function based on the \code{plottype}
1173argument.  Here, however, the separate plot functions are meant to be
1174callable interactively, because some people like to use the generic
1175\code{plot} function, while others like to use separate functions for each
plot type.  There are also \code{plot} methods for some of the components of
fitted models that can be extracted with extractor functions, like score and
loading matrices.  Thus there are several ways to get some plots, e.g.:
1179\begin{verbatim}
1180plot(mymod, plottype = "scores", ...)
1181scoreplot(mymod, ...)
1182plot(scores(mymod), ...)
1183\end{verbatim}
1184
One example of a package that uses \pkg{pls} is \pkg{lspls}, available on
CRAN.  In that package, least squares (LS) fits are combined with PLS in a
regression procedure.
1187It calls the fit functions of \pkg{pls} directly, and also uses the plot
functions to construct score and loading plots.  There is also the
\pkg{plsgenomics} package, which includes a modified version of (an earlier
version of) the SIMPLS fit function \code{simpls.fit}.
1191
1192
1193\subsection{Calling fit functions directly}\label{sec:call-fit-func}
1194%%%%%%%%%%%%%%%%
1195The underlying fit functions are called \code{kernelpls.fit},
1196\code{oscorespls.fit}, and \code{simpls.fit} for the PLSR methods, and
1197\code{svdpc.fit} for the PCR method.  They all take arguments \code{X},
1198\code{Y}, \code{ncomp}, and \code{stripped}.  Arguments \code{X}, \code{Y},
1199and \code{ncomp} specify $\bX$ and $\bY$ (as matrices, not data
1200frames), and the number of components to fit, respectively.  The argument
1201\code{stripped} defaults to \code{FALSE}.  When it is \code{TRUE}, the
1202calculations are stripped down to the bare minimum required for returning
1203the $\bX$ means, $\bY$ means, and the regression coefficients.  This is used
1204to speed up cross-validation procedures.
1205
1206The fit functions can be called directly, for instance when one wants to
1207avoid the overhead of formula and data handling in repeated fits.  As an
example, this is how a simple leave-one-out cross-validation for a
uni-response model could be implemented, using the SIMPLS algorithm:
1210<<>>=
1211X <- gasTrain$NIR
1212Y <- gasTrain$octane
1213ncomp <- 5
1214cvPreds <- matrix(nrow = nrow(X), ncol = ncomp)
1215for (i in 1:nrow(X)) {
1216    fit <- simpls.fit(X[-i,], Y[-i], ncomp = ncomp, stripped = TRUE)
1217    cvPreds[i,] <- (X[i,] - fit$Xmeans) %*% drop(fit$coefficients) +
1218        fit$Ymeans
1219}
1220@
The RMSEPs of the cross-validated predictions are
1222<<>>=
1223sqrt(colMeans((cvPreds - Y)^2))
1224@
1225which can be seen to be the same as the (unadjusted) CV results for the
1226\code{gas1} model in Section~\ref{sec:example-session}.
1227
1228
1229\subsection{Formula handling in more detail}\label{sec:formula-handling}
1230%%%%%%%%%%%%%%%%
1231The handling of formulas and variables in the model fitting is very similar
1232to what happens in the function \code{lm}: The variables specified in the
1233formula are looked up in the data frame given in the \code{data} argument of
1234the fit function (\code{plsr}, \code{pcr} or \code{mvr}), or in the calling
environment if not found in the data frame.  Factors are coded into one or
more columns, depending on the number of levels, and on the contrasts
1237option.  All (possibly coded) variables are then collected in a numerical
1238model matrix.  This matrix is then handed to the underlying fit or
1239cross-validation functions.  A similar handling is used in the \code{predict}
1240method.
1241
1242The intercept is treated specially in \pkg{pls}.  After the model matrix has
1243been constructed, the intercept column is removed.  This ensures that any
factors are coded as if the intercept were present.  The underlying fit
1245functions then center the rest of the variables as a part of the fitting
1246process.  (This is intrinsic to the PLSR and PCR algorithms.)  The intercept
1247is handled separately.  A consequence of this is that explicitly specifying
1248formulas without the intercept (e.g., \code{y \textasciitilde\ a + b - 1})
will only change the coding of any factors; the intercept will
1250still be fitted.
1251
1252
1253%%%%%%%%%%%%%%%%
1254\bibliographystyle{plain}
1255\bibliography{pls-manual}
1256
1257
1258%%%%%%%%%%%%%%%%
1259\end{document}
1260