1%\VignetteIndexEntry{NMF: generating heatmaps} 2%\VignetteDepends{utils,NMF,RColorBrewer,knitr,bibtex} 3%\VignetteKeyword{aplot} 4%\VignetteCompiler{knitr} 5%\VignetteEngine{knitr::knitr} 6 7\documentclass[a4paper]{article} 8 9%\usepackage[OT1]{fontenc} 10\usepackage[colorlinks]{hyperref} 11\usepackage{a4wide} 12\usepackage{xspace} 13\usepackage[all]{hypcap} % for linking to the top of the figures or tables 14 15% add preamble from pkgmaker 16<<pkgmaker_preamble, echo=FALSE, results='asis'>>= 17pkgmaker::latex_preamble() 18@ 19 20\newcommand{\nmfpack}{\pkgname{NMF}} 21\newcommand{\MATLAB}{MATLAB\textsuperscript{\textregistered}\xspace} 22\newcommand{\refeqn}[1]{(\ref{#1})} 23 24% REFERENCES 25\usepackage[citestyle=authoryear-icomp 26, doi=true 27, url=true 28, maxnames=1 29, maxbibnames=15 30, backref=true 31, backend=bibtex]{biblatex} 32\AtEveryCitekey{\clearfield{url}} 33<<bibliofile, echo=FALSE, results='asis'>>= 34pkgmaker::latex_bibliography('NMF') 35@ 36\newcommand{\citet}[1]{\textcite{#1}} 37\renewcommand{\cite}[1]{\parencite{#1}} 38\DefineBibliographyStrings{english}{% 39 backrefpage = {see p.}, % for single page number 40 backrefpages = {see pp.} % for multiple page numbers 41} 42% 43 44% boxed figures 45\usepackage{float} 46\floatstyle{boxed} 47\restylefloat{figure} 48 49\usepackage{array} 50\usepackage{tabularx} 51\usepackage{mathabx} 52 53\usepackage{url} 54\urlstyle{rm} 55 56% use cleveref for automatic reference label formatting 57\usepackage[capitalise, noabbrev]{cleveref} 58 59% define commands for notes 60\usepackage{todonotes} 61\newcommand{\nbnote}[1]{\ \bigskip\todo[inline, backgroundcolor=blue!20!white]{\scriptsize\textsf{\textbf{NB:} #1}}\ \\} 62 63% put table of contents on two columns 64\usepackage[toc]{multitoc} 65 66\setkeys{Gin}{width=0.95\textwidth} 67 68\begin{document} 69 70<<options, include=FALSE, verbose=TRUE>>= 71#options(prompt=' ') 72#options(continue=' ') 73set.seed(123456) 74library(NMF) 75@ 76 77\title{Generating heatmaps for Nonnegative Matrix Factorization\\ 78\small Package \nmfpack\ - Version \Sexpr{utils::packageVersion('NMF')}} 79\author{Renaud Gaujoux} 80 81\maketitle 82 83\begin{abstract} 84This vignette describes how to produce different informative heatmaps from NMF objects, 85such as returned by the function \code{nmf} in the \citeCRANpkg{NMF}. 86The main drawing engine is implemented by the function \code{aheatmap}, which is 87a highly enhanced modification of the function \code{pheatmap} from the \CRANpkg{pheatmap}, 88and provides convenient and quick ways of producing high quality and customizable annotated heatmaps. 89Currently this function is part of the package \nmfpack, but may eventually 90compose a separate package on its own. 91\end{abstract} 92 93{\small \tableofcontents} 94 95\section{Preliminaries} 96 97\subsection{Quick reminder on NMF models} 98 99Given a nonnegative target matrix $X$ of dimension $n\times p$, NMF algorithms 100aim at finding a rank $k$ approximation of the form: 101$$ 102X \approx W H, 103$$ 104where $W$ and $H$ are nonnegative matrices of dimensions $n\times k$ and $k\times p$ 105respectively. 106 107The matrix $W$ is the basis matrix, whose columns are the basis components. 108The matrix $H$ is the mixture coefficient or weight matrix, whose columns contain 109the contribution of each basis component to the corresponding column of $X$. 110We call the rows of $H$ the basis profiles. 111 112\subsection{Heatmaps for NMF} 113 114Because NMF objects essentially wrap up a pair of matrices, heatmaps are convenient 115to visualise the results of NMF runs. 116The package \nmfpack provides several specialised heatmap functions, designed to produce 117heatmaps with sensible default configurations according to the data being drawn. 118Being all based on a common drawing engine, they share almost identical interfaces 119and capabilities. 120The following specialised functions are currently implemented: 121 122\begin{description} 123\item[\code{basismap}] draws heatmaps of the basis matrix 124\item[\code{coefmap}] draws heatmaps of the mixture coefficient matrix 125\item[\code{consensusmap}] draws heatmaps of the consensus matrix, for results 126of multiple NMF runs. 127\end{description} 128 129\subsection{Heatmap engine} 130 131All the above functions eventually call a common heatmap engine, with 132different default parameters, chosen to be relevant for the given underlying data. 133The engine is implemented by the function \code{aheatmap}. 134Its development started as modification of the function \code{pheatmap} from 135the \pkgname{pheatmap} package. 136The initial objective was to improve and increase its capabilities, as well as 137defining a simplified interface, more consistent with the R core function \code{heatmap}. 138We eventually aim at providing a general, flexible, powerful and easy to use engine 139for drawing annotated heatmaps. 140 141The function \code{aheatmap} has many advantages compared to other heatmap functions 142such as \code{heatmap}, \code{gplots::heatmap2}, \code{heatmap.plus::heatmap.plus} 143, or even \code{pheatmap}: 144 145\begin{itemize} 146\item Annotations: unlimited number of annotation tracks can be added to 147\emph{both} columns and rows, with automated colouring for categorical and 148numeric variables. 149\item Compatibility with both base and grid graphics: the function can be 150directly called in drawing contexts such as grid, mfrow or layout. 151This is a feature many R users were looking for, and that was strictly 152impossible with base heatmaps. 153\item Legends: default automatic legend and colouring; 154\item Customisation: clustering methods, annotations, colours and legend can all 155be customised, even separately for rows and columns; 156\item Convenient interface: many arguments provide multiple ways of 157specifying their value(s), which speeds up developping/writing and reduce the 158amount of code required to generate customised plots (e.g. see 159\cref{sec:colour_spec}). 160\item Aesthetics: the heatmaps look globally cleaner, the image and text components 161are by default well proportioned relatively to each other, and all fit within 162the graphic device. 163\end{itemize} 164 165\subsection{Data and model} 166\label{sec:data} 167 168For the purpose of illustrating the use of each heatmap function, we generate a 169random target matrix, as well as some annotations or covariates: 170 171<<data>>= 172# random data that follow an 3-rank NMF model (with quite some noise: sd=2) 173X <- syntheticNMF(100, 3, 20, noise=2) 174 175# row annotations and covariates 176n <- nrow(X) 177d <- rnorm(n) 178e <- unlist(mapply(rep, c('X', 'Y', 'Z'), 10)) 179e <- c(e, rep(NA, n-length(e))) 180rdata <- data.frame(Var=d, Type=e) 181 182# column annotations and covariates 183p <- ncol(X) 184a <- sample(c('alpha', 'beta', 'gamma'), p, replace=TRUE) 185# define covariates: true groups and some numeric variable 186c <- rnorm(p) 187# gather them in a data.frame 188covariates <- data.frame(a, X$pData, c) 189@ 190 191%\SweaveOpts{fig.width=14,fig.height=7} 192<<figoptions, include=FALSE>>= 193library(knitr) 194opts_chunk$set(fig.width=14, fig.height=7) 195@ 196Note that in the code above, the object \code{X} returned by \code{syntheticNMF} \emph{really is} a matrix object, but wrapped through the function \code{ExposedAttribute} object, which exposes its attributes via a more friendly and access controlled interface \code{\$}. 197Of particular interests are attributes \code{'pData'} and \code{'fData'}, which are lists that contain a factor named \code{'Group'} that indicates the true underlying clusters. 198These are respectively defined as each sample's most contrbuting basis component and the basis component to which each feature contributes the most. 199They are useful to annotate heatmaps and assess the ability of NMF methods to recover the true clusters. 200 201As an example, one can conveniently visualize the target matrix as a heatmap, with or without the relevant sample and feature annotations, using simple calls to the \code{aheatmap} function: 202<<heatmap_data>>= 203par(mfrow=c(1,2)) 204aheatmap(X, annCol=covariates, annRow=X$fData) 205aheatmap(X) 206@ 207 208Then, we fit an NMF model using multiple runs, that will be used throughtout this vignette to illustrate the use of NMF heatmaps: 209 210<<model, cache=TRUE>>= 211res <- nmf(X, 3, nrun=10) 212res 213@ 214 215\nbnote{To keep the vignette simple, we always use the default NMF method 216(i.e. \code{'brunet'}), but all steps could be performed using a different method, 217or multiple methods in order to compare their perfromances.} 218 219\section{Mixture Coefficient matrix: \texttt{coefmap}} 220 221The coefficient matrix of the result can be plotted using the function 222\code{coefmap}. 223The default behaviour for multiple NMF runs is to add two annotation tracks that 224show the clusters obtained by the best fit and the hierarchical clustering of 225the consensus matrix\footnote{The hierarchical clustering is computed using the 226consensus matrix itself as a similarity measure, and average linkage. See 227\code{?consensushc}.}. 228In the legend, these tracks are named \emph{basis} and \emph{consensus} respectively. 229For single NMF run or NMF model objects, no consensus data are available, and 230only the clusters from the fit are displayed. 231 232<<coefmap_res, fig.keep='last'>>= 233opar <- par(mfrow=c(1,2)) 234# coefmap from multiple run fit: includes a consensus track 235coefmap(res) 236# coefmap of a single run fit: no consensus track 237coefmap(minfit(res)) 238par(opar) 239@ 240 241\nbnote{Note how both heatmaps were drawn on the same plot, simply using the standard 242call to \code{par(mfrow=c(1,2)}. 243This is impossible to achieve with the R core function \code{heatmap}. 244See \cref{sec:aheatmap} for more details about compatibility with base 245and grid graphics.} 246 247By default: 248\begin{itemize} 249\item the rows are not ordered; 250\item the columns use the default ordering of \code{aheatmap}, but may easily be 251ordered according to the clusters defined by the dominant basis component for 252each column with \code{Colv="basis"}, or according to those implied by the 253consensus matrix, i.e. as in \code{consensusmap}, with \code{Colv="consensus"}; 254\item each column is scaled to sum up to one; 255\item the color palette used is \code{'YlOrRd'} from the 256\citeCRANpkg{RColorBrewer}, with 50 breaks. 257\end{itemize} 258 259In term of arguments passed to the heatmap engine \code{aheatmap}, these default 260settings translate as: 261 262<<coefmap_default, eval=FALSE>>= 263Rowv = NA 264Colv = TRUE 265scale = 'c1' 266color = 'YlOrRd:50' 267annCol = predict(object) + predict(object, 'consensus') 268@ 269 270If the ordering does not come from a hierarchical clustering (e.g., if 271\code{Colv='basis'}), then no dendrogram is displayed. 272The default behaviour of \code{aheatmap} can be obtained by setting arguments 273\code{Rowv=TRUE, Colv=TRUE, scale='none'}. 274 275\medskip 276The automatic annotation tracks can be hidden all together by setting argument 277\code{tracks=NA}, displayed separately by passing only one of the given names 278(e.g. \code{tracks=':basis'} or \code{tracks='basis:'} for the row or column respectively), 279and their legend names may be changed by 280specifying e.g. \code{tracks=c(Metagene=':basis', 'consensus')}. 281Beside this, they are handled by the heatmap engine function \code{aheatmap} 282and can be customised as any other annotation tracks -- that can be added via 283the same argument \code{annCol} (see \cref{sec:aheatmap} or \code{?aheatmap} for 284more details). 285 286<<coefmap_custom, fig.keep='last', tidy=FALSE>>= 287opar <- par(mfrow=c(1,2)) 288# removing all automatic annotation tracks 289coefmap(res, tracks=NA) 290# customized plot 291coefmap(res, Colv = 'euclidean' 292 , main = "Metagene contributions in each sample", labCol = NULL 293 , annRow = list(Metagene=':basis'), annCol = list(':basis', Class=a, Index=c) 294 , annColors = list(Metagene='Set2') 295 , info = TRUE) 296par(opar) 297@ 298 299\nbnote{The feature that allows to display some information about the fit 300at the bottom of the plot via argument \code{info=TRUE} is still experimental. 301It is helpful mostly when developing algorithms or doing an analysis, but 302would seldom be used in publications.} 303 304\section{Basis matrix: \texttt{basismap}} 305 306The basis matrix can be plotted using the function \code{basismap}. 307The default behaviour is to add an annotation track that shows for each row 308the dominant basis component. 309That is, for each row, the index of the basis component with the highest 310loading. 311 312This track can be disabled by setting \code{tracks=NA}, and extra 313row annotations can be added using the same argument \code{annRow}. 314 315<<basismap_res, fig.keep='last'>>= 316opar <- par(mfrow=c(1,2)) 317# default plot 318basismap(res) 319# customized plot: only use row special annotation track. 320basismap(res, main="Metagenes", annRow=list(d, e), tracks=c(Metagene=':basis')) 321par(opar) 322@ 323 324By default: 325\begin{itemize} 326\item the columns are not ordered; 327\item the rows are ordered by hierarchical clustering using default distance and 328linkage methods (\code{'eculidean'} and \code{'complete'}); 329\item each row is scaled to sum up to one; 330\item the color palette used is \code{'YlOrRd'} from the 331\citeCRANpkg{RColorBrewer}, with 50 breaks. 332\end{itemize} 333 334In term of arguments passed to the heatmap engine \code{aheatmap}, these default 335settings translate as: 336 337<<basismap_default, eval=FALSE>>= 338Colv = NA 339scale = 'r1' 340color = 'YlOrRd:50' 341annRow = predict(object, 'features') 342@ 343 344\section{Consensus matrix: \texttt{consensusmap}} 345 346When doing clustering with NMF, a common way of assessing the stability of the 347clusters obtained for a given rank is to consider the consensus matrix 348computed over multiple independent NMF runs, which is the average of the connectivity 349matrices of each separate run 350\footnote{Hence, stability here means robustness with regards to the initial starting point, 351and shall not be interpreted as in e.g. cross-validation/bootstrap analysis. 352However, one can argue that having very consistent clusters across runs somehow supports 353for a certain regularity or the presence of an underlying pattern in the data.}. 354This procedure is usually repeated over a certain range of factorization ranks, 355and the results are compared to identify which rank gives the best clusters, 356possibly in the light of some extra knowledge one could have about the samples 357(e.g. covariates). 358The functions \code{nmf} and \code{consensusmap} make it easy to implement 359this whole process. 360 361\nbnote{The consensus plots can also be generated for fits obtained from single NMF runs, 362in which case the consensus matrix simply reduces to a single connectivity matrix. 363This is a binary matrix (i.e. entries are either 0 or 1), that will always produce 364a bi-colour heatmap, and by default clear blocks for each cluster.} 365 366\subsection{Single fit} 367In section \cref{sec:data}, the NMF fit \code{res} was computed with argument 368\code{nrun=10}, and therefore contains the best fit over 10 runs, as well as 369the consensus matrix computed over all the runs 370\footnote{If one were interested in keeping the fits from all the runs, the 371function \code{nmf} should have been called with argument \code{.options='k'}. 372See section \emph{Options} in \code{?nmf}. 373The downstream hanlding of the result would remain identical.}. 374This can be ploted using the function \code{consensusmap}, which allows for the 375same kind of customization as the other NMF heatmap functions: 376 377<<consensusmap_res, fig.keep='last'>>= 378opar <- par(mfrow=c(1,2)) 379# default plot 380consensusmap(res) 381# customized plot 382consensusmap(res, annCol=covariates, annColors=list(c='blue') 383 , labCol='sample ', main='Cluster stability' 384 , sub='Consensus matrix and all covariates') 385par(opar) 386@ 387 388By default: 389\begin{itemize} 390\item the rows and columns of the consensus heatmap are symmetrically 391ordered by hierarchical clustering using the consensus matrix as a similarity 392measure and average linkage, and the associated dendrogram is displayed; 393\item the color palette used is the reverse of \code{'RdYlBu'} from the \citeCRANpkg{RColorBrewer}. 394\end{itemize} 395 396In term of arguments passed to the heatmap engine \code{aheatmap}, these default 397settings translate as: 398 399<<cmap_default, eval=FALSE>>= 400distfun = function(x) as.dist(1-x) # x being the consensus matrix 401hclustfun = 'average' 402Rowv = TRUE 403Colv = "Rowv" 404color = '-RdYlBu' 405@ 406 407\subsection{Single method over a range of ranks} 408 409The function \code{nmf} accepts a range of value for the rank (argument \code{rank}), 410making it fit NMF models for each value in the given range 411\footnote{Before version 0.6, this feature was provided by the function \code{nmfEstimateRank}. 412From version 0.6, the function \code{nmf} accepts ranges of ranks, and internally 413calls the function \code{nmfEstimateRank} -- that remains exported and can still 414be called directly. 415See documentation \code{?nmfEstimateRank} for more details on the returned value.}: 416 417<<estimate, cache=TRUE>>= 418res2_7 <- nmf(X, 2:7, nrun=10, .options='v') 419class(res2_7) 420@ 421 422The result \code{res2\_7} is an S3 object of class \code{'NMF.rank'}, 423that contains -- amongst other data -- a list of the best fits obtained for each 424value of the rank in range $\ldbrack 2, 7\rdbrack]$. 425The method of \code{consensusmap} defined for class \code{'NMF.rank'}, 426which plots all the consensus matrices on the same plot: 427 428<<consensusmap_estimate, fig.keep='last'>>= 429consensusmap(res2_7) 430@ 431 432\nbnote{ The main title of each consensus heatmap can be customized by passing 433to argument \code{main} a character vector or a list whose elements specify each title. 434All other arguments are used in each internal call to consensusmap, and will 435therefore affect all the plots simultaneously. 436The layout can be specified via argument \code{layout} as a numeric vector 437giving the number of rows and columns in a \code{mfrow}-like way, or as a 438matrix that will be passed to R core function \code{layout}. 439See \code{?consensusmap} for more details and example code. 440} 441 442\subsection{Single rank over a range of methods} 443If one is interested in comparing methods, for a given factorization rank, then 444on can fit an NMF model for each method by providing the function \code{nmf} with 445a \code{list} in argument \code{method}: 446 447<<fit_methods, cache=TRUE>>= 448res_methods <- nmf(X, 3, list('lee', 'brunet', 'nsNMF'), nrun=10) 449class(res_methods) 450@ 451 452The result \code{res\_methods} is an S4 object of class \code{NMFList}, which 453is essentially a named list, that contains each fits and the CPU time required 454by the whole computation. 455As previously, the sequence of consensus matrices is plotted with \code{consensusmap}: 456 457<<consensusmap_methods, fig.width=10, fig.height=7, fig.keep='last'>>= 458consensusmap(res_methods) 459@ 460 461\section{Generic heatmap engine: \texttt{aheatmap}} 462\label{sec:aheatmap} 463 464This section still needs to be written, but many examples of annotated heatmaps can be found in the demos \code{'aheatmap'} and \code{'heatmaps'}: 465<<demo_hm, eval=FALSE>>= 466demo('aheatmap') 467# or 468demo('heatmaps') 469@ 470 471These demos and the plots they generate can also be browsed online at \url{http://nmf.r-forge.r-project.org/_DEMOS.html}. 472 473\section{Session Info} 474<<sessionInfo, echo=FALSE, results='asis'>>= 475toLatex(sessionInfo()) 476@ 477 478\printbibliography[heading=bibintoc] 479 480 481\end{document} 482