1%\VignetteIndexEntry{NMF: generating heatmaps}
2%\VignetteDepends{utils,NMF,RColorBrewer,knitr,bibtex}
3%\VignetteKeyword{aplot}
4%\VignetteCompiler{knitr}
5%\VignetteEngine{knitr::knitr}
6
7\documentclass[a4paper]{article}
8
9%\usepackage[OT1]{fontenc}
10\usepackage[colorlinks]{hyperref}
11\usepackage{a4wide}
12\usepackage{xspace}
13\usepackage[all]{hypcap} % for linking to the top of the figures or tables
14
15% add preamble from pkgmaker
16<<pkgmaker_preamble, echo=FALSE, results='asis'>>=
17pkgmaker::latex_preamble()
18@
19
20\newcommand{\nmfpack}{\pkgname{NMF}}
21\newcommand{\MATLAB}{MATLAB\textsuperscript{\textregistered}\xspace}
22\newcommand{\refeqn}[1]{(\ref{#1})}
23
24% REFERENCES
25\usepackage[citestyle=authoryear-icomp
26, doi=true
27, url=true
28, maxnames=1
29, maxbibnames=15
30, backref=true
31, backend=bibtex]{biblatex}
32\AtEveryCitekey{\clearfield{url}}
33<<bibliofile, echo=FALSE, results='asis'>>=
34pkgmaker::latex_bibliography('NMF')
35@
36\newcommand{\citet}[1]{\textcite{#1}}
37\renewcommand{\cite}[1]{\parencite{#1}}
38\DefineBibliographyStrings{english}{%
39    backrefpage  = {see p.}, % for single page number
40    backrefpages = {see pp.} % for multiple page numbers
41}
42%
43
44% boxed figures
45\usepackage{float}
46\floatstyle{boxed}
47\restylefloat{figure}
48
49\usepackage{array}
50\usepackage{tabularx}
51\usepackage{mathabx}
52
53\usepackage{url}
54\urlstyle{rm}
55
56% use cleveref for automatic reference label formatting
57\usepackage[capitalise, noabbrev]{cleveref}
58
59% define commands for notes
60\usepackage{todonotes}
61\newcommand{\nbnote}[1]{\ \bigskip\todo[inline, backgroundcolor=blue!20!white]{\scriptsize\textsf{\textbf{NB:} #1}}\ \\}
62
63% put table of contents on two columns
64\usepackage[toc]{multitoc}
65
66\setkeys{Gin}{width=0.95\textwidth}
67
68\begin{document}
69
70<<options, include=FALSE, verbose=TRUE>>=
71#options(prompt=' ')
72#options(continue=' ')
73set.seed(123456)
74library(NMF)
75@
76
77\title{Generating heatmaps for Nonnegative Matrix Factorization\\
78\small Package \nmfpack\ - Version \Sexpr{utils::packageVersion('NMF')}}
79\author{Renaud Gaujoux}
80
81\maketitle
82
83\begin{abstract}
84This vignette describes how to produce different informative heatmaps from NMF objects,
85such as returned by the function \code{nmf} in the \citeCRANpkg{NMF}.
86The main drawing engine is implemented by the function \code{aheatmap}, which is
87a highly enhanced modification of the function \code{pheatmap} from the \CRANpkg{pheatmap},
88and provides convenient and quick ways of producing high quality and customizable annotated heatmaps.
89Currently this function is part of the package \nmfpack, but may eventually
90compose a separate package on its own.
91\end{abstract}
92
93{\small \tableofcontents}
94
95\section{Preliminaries}
96
97\subsection{Quick reminder on NMF models}
98
99Given a nonnegative target matrix $X$ of dimension $n\times p$, NMF algorithms
100aim at finding a rank $k$ approximation of the form:
101$$
102X \approx W H,
103$$
104where $W$ and $H$ are nonnegative matrices of dimensions $n\times k$ and $k\times p$
105respectively.
106
107The matrix $W$ is the basis matrix, whose columns are the basis components.
108The matrix $H$ is the mixture coefficient or weight matrix, whose columns contain
109the contribution of each basis component to the corresponding column of $X$.
110We call the rows of $H$ the basis profiles.
111
112\subsection{Heatmaps for NMF}
113
114Because NMF objects essentially wrap up a pair of matrices, heatmaps are convenient
115to visualise the results of NMF runs.
116The package \nmfpack provides several specialised heatmap functions, designed to produce
117heatmaps with sensible default configurations according to the data being drawn.
118Being all based on a common drawing engine, they share almost identical interfaces
119and capabilities.
120The following specialised functions are currently implemented:
121
122\begin{description}
123\item[\code{basismap}] draws heatmaps of the basis matrix
124\item[\code{coefmap}] draws heatmaps of the mixture coefficient matrix
125\item[\code{consensusmap}] draws heatmaps of the consensus matrix, for results
126of multiple NMF runs.
127\end{description}
128
129\subsection{Heatmap engine}
130
131All the above functions eventually call a common heatmap engine, with
132different default parameters, chosen to be relevant for the given underlying data.
133The engine is implemented by the function \code{aheatmap}.
134Its development started as modification of the function \code{pheatmap} from
135the \pkgname{pheatmap} package.
136The initial objective was to improve and increase its capabilities, as well as
137defining a simplified interface, more consistent with the R core function \code{heatmap}.
138We eventually aim at providing a general, flexible, powerful and easy to use engine
139for drawing annotated heatmaps.
140
141The function \code{aheatmap} has many advantages compared to other heatmap functions
142such as \code{heatmap}, \code{gplots::heatmap2}, \code{heatmap.plus::heatmap.plus}
143, or even \code{pheatmap}:
144
145\begin{itemize}
146\item Annotations: unlimited number of annotation tracks can be added to
147\emph{both} columns and rows, with automated colouring for categorical and
148numeric variables.
149\item Compatibility with both base and grid graphics: the function can be
150directly called in drawing contexts such as grid, mfrow or layout.
151This is a feature many R users were looking for, and that was strictly
152impossible with base heatmaps.
153\item Legends: default automatic legend and colouring;
154\item Customisation: clustering methods, annotations, colours and legend can all
155be customised, even separately for rows and columns;
156\item Convenient interface: many arguments provide multiple ways of
157specifying their value(s), which speeds up developping/writing and reduce the
158amount of code required to generate customised plots (e.g. see
159\cref{sec:colour_spec}).
160\item Aesthetics: the heatmaps look globally cleaner, the image and text components
161are by default well proportioned relatively to each other, and all fit within
162the graphic device.
163\end{itemize}
164
165\subsection{Data and model}
166\label{sec:data}
167
168For the purpose of illustrating the use of each heatmap function, we generate a
169random target matrix, as well as some annotations or covariates:
170
171<<data>>=
172# random data that follow an 3-rank NMF model (with quite some noise: sd=2)
173X <- syntheticNMF(100, 3, 20, noise=2)
174
175# row annotations and covariates
176n <- nrow(X)
177d <- rnorm(n)
178e <- unlist(mapply(rep, c('X', 'Y', 'Z'), 10))
179e <- c(e, rep(NA, n-length(e)))
180rdata <- data.frame(Var=d, Type=e)
181
182# column annotations and covariates
183p <- ncol(X)
184a <- sample(c('alpha', 'beta', 'gamma'), p, replace=TRUE)
185# define covariates: true groups and some numeric variable
186c <- rnorm(p)
187# gather them in a data.frame
188covariates <- data.frame(a, X$pData, c)
189@
190
191%\SweaveOpts{fig.width=14,fig.height=7}
192<<figoptions, include=FALSE>>=
193library(knitr)
194opts_chunk$set(fig.width=14, fig.height=7)
195@
196Note that in the code above, the object \code{X} returned by \code{syntheticNMF} \emph{really is} a matrix object, but wrapped through the function \code{ExposedAttribute} object, which exposes its attributes via a more friendly and access controlled interface \code{\$}.
197Of particular interests are attributes \code{'pData'} and \code{'fData'}, which are lists that contain a factor named \code{'Group'} that indicates the true underlying clusters.
198These are respectively defined as each sample's most contrbuting basis component and the basis component to which each feature contributes the most.
199They are useful to annotate heatmaps and assess the ability of NMF methods to recover the true clusters.
200
201As an example, one can conveniently visualize the target matrix as a heatmap, with or without the relevant sample and feature annotations, using simple calls to the \code{aheatmap} function:
202<<heatmap_data>>=
203par(mfrow=c(1,2))
204aheatmap(X, annCol=covariates, annRow=X$fData)
205aheatmap(X)
206@
207
208Then, we fit an NMF model using multiple runs, that will be used throughtout this vignette to illustrate the use of NMF heatmaps:
209
210<<model, cache=TRUE>>=
211res <- nmf(X, 3, nrun=10)
212res
213@
214
215\nbnote{To keep the vignette simple, we always use the default NMF method
216(i.e. \code{'brunet'}), but all steps could be performed using a different method,
217or multiple methods in order to compare their perfromances.}
218
219\section{Mixture Coefficient matrix: \texttt{coefmap}}
220
221The coefficient matrix of the result can be plotted using the function
222\code{coefmap}.
223The default behaviour for multiple NMF runs is to add two annotation tracks that
224show the clusters obtained by the best fit and the hierarchical clustering of
225the consensus matrix\footnote{The hierarchical clustering is computed using the
226consensus matrix itself as a similarity measure, and average linkage. See
227\code{?consensushc}.}.
228In the legend, these tracks are named \emph{basis} and \emph{consensus} respectively.
229For single NMF run or NMF model objects, no consensus data are available, and
230only the clusters from the fit are displayed.
231
232<<coefmap_res, fig.keep='last'>>=
233opar <- par(mfrow=c(1,2))
234# coefmap from multiple run fit: includes a consensus track
235coefmap(res)
236# coefmap of a single run fit: no consensus track
237coefmap(minfit(res))
238par(opar)
239@
240
241\nbnote{Note how both heatmaps were drawn on the same plot, simply using the standard
242call to \code{par(mfrow=c(1,2)}.
243This is impossible to achieve with the R core function \code{heatmap}.
244See \cref{sec:aheatmap} for more details about compatibility with base
245and grid graphics.}
246
247By default:
248\begin{itemize}
249\item the rows are not ordered;
250\item the columns use the default ordering of \code{aheatmap}, but may easily be
251ordered according to the clusters defined by the dominant basis component for
252each column with \code{Colv="basis"}, or according to those implied by the
253consensus matrix, i.e. as in \code{consensusmap}, with \code{Colv="consensus"};
254\item each column is scaled to sum up to one;
255\item the color palette used is \code{'YlOrRd'} from the
256\citeCRANpkg{RColorBrewer}, with 50 breaks.
257\end{itemize}
258
259In term of arguments passed to the heatmap engine \code{aheatmap}, these default
260settings translate as:
261
262<<coefmap_default, eval=FALSE>>=
263Rowv = NA
264Colv = TRUE
265scale = 'c1'
266color = 'YlOrRd:50'
267annCol = predict(object) + predict(object, 'consensus')
268@
269
270If the ordering does not come from a hierarchical clustering (e.g., if
271\code{Colv='basis'}), then no dendrogram is displayed.
272The default behaviour of \code{aheatmap} can be obtained by setting arguments
273\code{Rowv=TRUE, Colv=TRUE, scale='none'}.
274
275\medskip
276The automatic annotation tracks can be hidden all together by setting argument
277\code{tracks=NA}, displayed separately by passing only one of the given names
278(e.g. \code{tracks=':basis'} or \code{tracks='basis:'} for the row or column respectively),
279and their legend names may be changed by
280specifying e.g. \code{tracks=c(Metagene=':basis', 'consensus')}.
281Beside this, they are handled by the heatmap engine function \code{aheatmap}
282and can be customised as any other annotation tracks -- that can be added via
283the same argument \code{annCol} (see \cref{sec:aheatmap} or \code{?aheatmap} for
284more details).
285
286<<coefmap_custom, fig.keep='last', tidy=FALSE>>=
287opar <- par(mfrow=c(1,2))
288# removing all automatic annotation tracks
289coefmap(res, tracks=NA)
290# customized plot
291coefmap(res, Colv = 'euclidean'
292	, main = "Metagene contributions in each sample", labCol = NULL
293	, annRow = list(Metagene=':basis'), annCol = list(':basis', Class=a, Index=c)
294	, annColors = list(Metagene='Set2')
295	, info = TRUE)
296par(opar)
297@
298
299\nbnote{The feature that allows to display some information about the fit
300at the bottom of the plot via argument \code{info=TRUE} is still experimental.
301It is helpful mostly when developing algorithms or doing an analysis, but
302would seldom be used in publications.}
303
304\section{Basis matrix: \texttt{basismap}}
305
306The basis matrix can be plotted using the function \code{basismap}.
307The default behaviour is to add an annotation track that shows for each row
308the dominant basis component.
309That is, for each row, the index of the basis component with the highest
310loading.
311
312This track can be disabled by setting \code{tracks=NA}, and extra
313row annotations can be added using the same argument \code{annRow}.
314
315<<basismap_res, fig.keep='last'>>=
316opar <- par(mfrow=c(1,2))
317# default plot
318basismap(res)
319# customized plot: only use row special annotation track.
320basismap(res, main="Metagenes", annRow=list(d, e), tracks=c(Metagene=':basis'))
321par(opar)
322@
323
324By default:
325\begin{itemize}
326\item the columns are not ordered;
327\item the rows are ordered by hierarchical clustering using default distance and
328linkage methods (\code{'eculidean'} and \code{'complete'});
329\item each row is scaled to sum up to one;
330\item the color palette used is \code{'YlOrRd'} from the
331\citeCRANpkg{RColorBrewer}, with 50 breaks.
332\end{itemize}
333
334In term of arguments passed to the heatmap engine \code{aheatmap}, these default
335settings translate as:
336
337<<basismap_default, eval=FALSE>>=
338Colv = NA
339scale = 'r1'
340color = 'YlOrRd:50'
341annRow = predict(object, 'features')
342@
343
344\section{Consensus matrix: \texttt{consensusmap}}
345
346When doing clustering with NMF, a common way of assessing the stability of the
347clusters obtained for a given rank is to consider the consensus matrix
348computed over multiple independent NMF runs, which is the average of the connectivity
349matrices of each separate run
350\footnote{Hence, stability here means robustness with regards to the initial starting point,
351and shall not be interpreted as in e.g. cross-validation/bootstrap analysis.
352However, one can argue that having very consistent clusters across runs somehow supports
353for a certain regularity or the presence of an underlying pattern in the data.}.
354This procedure is usually repeated over a certain range of factorization ranks,
355and the results are compared to identify which rank gives the best clusters,
356possibly in the light of some extra knowledge one could have about the samples
357(e.g. covariates).
358The functions \code{nmf} and \code{consensusmap} make it easy to implement
359this whole process.
360
361\nbnote{The consensus plots can also be generated for fits obtained from single NMF runs,
362in which case the consensus matrix simply reduces to a single connectivity matrix.
363This is a binary matrix (i.e. entries are either 0 or 1), that will always produce
364a bi-colour heatmap, and by default clear blocks for each cluster.}
365
366\subsection{Single fit}
367In section \cref{sec:data}, the NMF fit \code{res} was computed with argument
368\code{nrun=10}, and therefore contains the best fit over 10 runs, as well as
369the consensus matrix computed over all the runs
370\footnote{If one were interested in keeping the fits from all the runs, the
371function \code{nmf} should have been called with argument \code{.options='k'}.
372See section \emph{Options} in \code{?nmf}.
373The downstream hanlding of the result would remain identical.}.
374This can be ploted using the function \code{consensusmap}, which allows for the
375same kind of customization as the other NMF heatmap functions:
376
377<<consensusmap_res, fig.keep='last'>>=
378opar <- par(mfrow=c(1,2))
379# default plot
380consensusmap(res)
381# customized plot
382consensusmap(res, annCol=covariates, annColors=list(c='blue')
383		, labCol='sample ', main='Cluster stability'
384		, sub='Consensus matrix and all covariates')
385par(opar)
386@
387
388By default:
389\begin{itemize}
390\item the rows and columns of the consensus heatmap are symmetrically
391ordered by hierarchical clustering using the consensus matrix as a similarity
392measure and average linkage, and the associated dendrogram is displayed;
393\item the color palette used is the reverse of \code{'RdYlBu'} from the \citeCRANpkg{RColorBrewer}.
394\end{itemize}
395
396In term of arguments passed to the heatmap engine \code{aheatmap}, these default
397settings translate as:
398
399<<cmap_default, eval=FALSE>>=
400distfun = function(x) as.dist(1-x) # x being the consensus matrix
401hclustfun = 'average'
402Rowv = TRUE
403Colv = "Rowv"
404color = '-RdYlBu'
405@
406
407\subsection{Single method over a range of ranks}
408
409The function \code{nmf} accepts a range of value for the rank (argument \code{rank}),
410making it fit NMF models for each value in the given range
411\footnote{Before version 0.6, this feature was provided by the function \code{nmfEstimateRank}.
412From version 0.6, the function \code{nmf} accepts ranges of ranks, and internally
413calls the function \code{nmfEstimateRank} -- that remains exported and can still
414be called directly.
415See documentation \code{?nmfEstimateRank} for more details on the returned value.}:
416
417<<estimate, cache=TRUE>>=
418res2_7 <- nmf(X, 2:7, nrun=10, .options='v')
419class(res2_7)
420@
421
422The result \code{res2\_7} is an S3 object of class \code{'NMF.rank'},
423that contains -- amongst other data -- a list of the best fits obtained for each
424value of the rank in range $\ldbrack 2, 7\rdbrack]$.
425The method of \code{consensusmap} defined for class \code{'NMF.rank'},
426which plots all the consensus matrices on the same plot:
427
428<<consensusmap_estimate, fig.keep='last'>>=
429consensusmap(res2_7)
430@
431
432\nbnote{ The main title of each consensus heatmap can be customized by passing
433to argument \code{main} a character vector or a list whose elements specify each title.
434All other arguments are used in each internal call to consensusmap, and will
435therefore affect all the plots simultaneously.
436The layout can be specified via argument \code{layout} as a numeric vector
437giving the number of rows and columns in a \code{mfrow}-like way, or as a
438matrix that will be passed to R core function \code{layout}.
439See \code{?consensusmap} for more details and example code.
440}
441
442\subsection{Single rank over a range of methods}
443If one is interested in comparing methods, for a given factorization rank, then
444on can fit an NMF model for each method by providing the function \code{nmf} with
445a \code{list} in argument \code{method}:
446
447<<fit_methods, cache=TRUE>>=
448res_methods <- nmf(X, 3, list('lee', 'brunet', 'nsNMF'), nrun=10)
449class(res_methods)
450@
451
452The result \code{res\_methods} is an S4 object of class \code{NMFList}, which
453is essentially a named list, that contains each fits and the CPU time required
454by the whole computation.
455As previously, the sequence of consensus matrices is plotted with \code{consensusmap}:
456
457<<consensusmap_methods, fig.width=10, fig.height=7, fig.keep='last'>>=
458consensusmap(res_methods)
459@
460
461\section{Generic heatmap engine: \texttt{aheatmap}}
462\label{sec:aheatmap}
463
464This section still needs to be written, but many examples of annotated heatmaps can be found in the demos \code{'aheatmap'} and \code{'heatmaps'}:
465<<demo_hm, eval=FALSE>>=
466demo('aheatmap')
467# or
468demo('heatmaps')
469@
470
471These demos and the plots they generate can also be browsed online at \url{http://nmf.r-forge.r-project.org/_DEMOS.html}.
472
473\section{Session Info}
474<<sessionInfo, echo=FALSE, results='asis'>>=
475toLatex(sessionInfo())
476@
477
478\printbibliography[heading=bibintoc]
479
480
481\end{document}
482