NMF/vignettes/heatmaps.Rnw

%\VignetteIndexEntry{NMF: generating heatmaps}
%\VignetteDepends{utils,NMF,RColorBrewer,knitr,bibtex}
%\VignetteKeyword{aplot}
%\VignetteCompiler{knitr}
%\VignetteEngine{knitr::knitr}

\documentclass[a4paper]{article}

%\usepackage[OT1]{fontenc}
\usepackage[colorlinks]{hyperref}
\usepackage{a4wide}
\usepackage{xspace}
\usepackage[all]{hypcap} % for linking to the top of the figures or tables

% add preamble from pkgmaker
<<pkgmaker_preamble, echo=FALSE, results='asis'>>=
pkgmaker::latex_preamble()
@

\newcommand{\nmfpack}{\pkgname{NMF}}
\newcommand{\MATLAB}{MATLAB\textsuperscript{\textregistered}\xspace}
\newcommand{\refeqn}[1]{(\ref{#1})}

% REFERENCES
\usepackage[citestyle=authoryear-icomp
, doi=true
, url=true
, maxnames=1
, maxbibnames=15
, backref=true
, backend=bibtex]{biblatex}
\AtEveryCitekey{\clearfield{url}}
<<bibliofile, echo=FALSE, results='asis'>>=
pkgmaker::latex_bibliography('NMF')
@
\newcommand{\citet}[1]{\textcite{#1}}
\renewcommand{\cite}[1]{\parencite{#1}}
\DefineBibliographyStrings{english}{%
    backrefpage  = {see p.}, % for single page number
    backrefpages = {see pp.} % for multiple page numbers
}
%

% boxed figures
\usepackage{float}
\floatstyle{boxed}
\restylefloat{figure}

\usepackage{array}
\usepackage{tabularx}
\usepackage{mathabx}

\usepackage{url}
\urlstyle{rm}

% use cleveref for automatic reference label formatting
\usepackage[capitalise, noabbrev]{cleveref}

% define commands for notes
\usepackage{todonotes}
\newcommand{\nbnote}[1]{\ \bigskip\todo[inline, backgroundcolor=blue!20!white]{\scriptsize\textsf{\textbf{NB:} #1}}\ \\}

% put table of contents on two columns
\usepackage[toc]{multitoc}

\setkeys{Gin}{width=0.95\textwidth}

\begin{document}

<<options, include=FALSE, verbose=TRUE>>=
#options(prompt=' ')
#options(continue=' ')
set.seed(123456)
library(NMF)
@

\title{Generating heatmaps for Nonnegative Matrix Factorization\\
\small Package \nmfpack\ - Version \Sexpr{utils::packageVersion('NMF')}}
\author{Renaud Gaujoux}

\maketitle

\begin{abstract}
This vignette describes how to produce different informative heatmaps from NMF objects,
such as returned by the function \code{nmf} in the \citeCRANpkg{NMF}.
The main drawing engine is implemented by the function \code{aheatmap}, which is
a highly enhanced modification of the function \code{pheatmap} from the \CRANpkg{pheatmap},
and provides convenient and quick ways of producing high quality and customizable annotated heatmaps.
Currently this function is part of the package \nmfpack, but may eventually
compose a separate package on its own.
\end{abstract}

{\small \tableofcontents}

\section{Preliminaries}

\subsection{Quick reminder on NMF models}

Given a nonnegative target matrix $X$ of dimension $n\times p$, NMF algorithms
aim at finding a rank $k$ approximation of the form:
$$
X \approx W H,
$$
where $W$ and $H$ are nonnegative matrices of dimensions $n\times k$ and $k\times p$
respectively.

The matrix $W$ is the basis matrix, whose columns are the basis components.
The matrix $H$ is the mixture coefficient or weight matrix, whose columns contain
the contribution of each basis component to the corresponding column of $X$.
We call the rows of $H$ the basis profiles.

\subsection{Heatmaps for NMF}

Because NMF objects essentially wrap up a pair of matrices, heatmaps are convenient
to visualise the results of NMF runs.
The package \nmfpack provides several specialised heatmap functions, designed to produce
heatmaps with sensible default configurations according to the data being drawn.
Being all based on a common drawing engine, they share almost identical interfaces
and capabilities.
The following specialised functions are currently implemented:

\begin{description}
\item[\code{basismap}] draws heatmaps of the basis matrix
\item[\code{coefmap}] draws heatmaps of the mixture coefficient matrix
\item[\code{consensusmap}] draws heatmaps of the consensus matrix, for results
of multiple NMF runs.
\end{description}

\subsection{Heatmap engine}

All the above functions eventually call a common heatmap engine, with
different default parameters, chosen to be relevant for the given underlying data.
The engine is implemented by the function \code{aheatmap}.
Its development started as modification of the function \code{pheatmap} from
the \pkgname{pheatmap} package.
The initial objective was to improve and increase its capabilities, as well as
defining a simplified interface, more consistent with the R core function \code{heatmap}.
We eventually aim at providing a general, flexible, powerful and easy to use engine
for drawing annotated heatmaps.

The function \code{aheatmap} has many advantages compared to other heatmap functions
such as \code{heatmap}, \code{gplots::heatmap2}, \code{heatmap.plus::heatmap.plus}
, or even \code{pheatmap}:

\begin{itemize}
\item Annotations: unlimited number of annotation tracks can be added to
\emph{both} columns and rows, with automated colouring for categorical and
numeric variables.
\item Compatibility with both base and grid graphics: the function can be
directly called in drawing contexts such as grid, mfrow or layout.
This is a feature many R users were looking for, and that was strictly
impossible with base heatmaps.
\item Legends: default automatic legend and colouring;
\item Customisation: clustering methods, annotations, colours and legend can all
be customised, even separately for rows and columns;
\item Convenient interface: many arguments provide multiple ways of
specifying their value(s), which speeds up developping/writing and reduce the
amount of code required to generate customised plots (e.g. see
\cref{sec:colour_spec}).
\item Aesthetics: the heatmaps look globally cleaner, the image and text components
are by default well proportioned relatively to each other, and all fit within
the graphic device.
\end{itemize}

\subsection{Data and model}
\label{sec:data}

For the purpose of illustrating the use of each heatmap function, we generate a
random target matrix, as well as some annotations or covariates:

<<data>>=
# random data that follow an 3-rank NMF model (with quite some noise: sd=2)
X <- syntheticNMF(100, 3, 20, noise=2)

# row annotations and covariates
n <- nrow(X)
d <- rnorm(n)
e <- unlist(mapply(rep, c('X', 'Y', 'Z'), 10))
e <- c(e, rep(NA, n-length(e)))
rdata <- data.frame(Var=d, Type=e)

# column annotations and covariates
p <- ncol(X)
a <- sample(c('alpha', 'beta', 'gamma'), p, replace=TRUE)
# define covariates: true groups and some numeric variable
c <- rnorm(p)
# gather them in a data.frame
covariates <- data.frame(a, X$pData, c)
@

%\SweaveOpts{fig.width=14,fig.height=7}
<<figoptions, include=FALSE>>=
library(knitr)
opts_chunk$set(fig.width=14, fig.height=7)
@
Note that in the code above, the object \code{X} returned by \code{syntheticNMF} \emph{really is} a matrix object, but wrapped through the function \code{ExposedAttribute} object, which exposes its attributes via a more friendly and access controlled interface \code{\$}.
Of particular interests are attributes \code{'pData'} and \code{'fData'}, which are lists that contain a factor named \code{'Group'} that indicates the true underlying clusters.
These are respectively defined as each sample's most contrbuting basis component and the basis component to which each feature contributes the most.
They are useful to annotate heatmaps and assess the ability of NMF methods to recover the true clusters.

As an example, one can conveniently visualize the target matrix as a heatmap, with or without the relevant sample and feature annotations, using simple calls to the \code{aheatmap} function:
<<heatmap_data>>=
par(mfrow=c(1,2))
aheatmap(X, annCol=covariates, annRow=X$fData)
aheatmap(X)
@

Then, we fit an NMF model using multiple runs, that will be used throughtout this vignette to illustrate the use of NMF heatmaps:

<<model, cache=TRUE>>=
res <- nmf(X, 3, nrun=10)
res
@

\nbnote{To keep the vignette simple, we always use the default NMF method
(i.e. \code{'brunet'}), but all steps could be performed using a different method,
or multiple methods in order to compare their perfromances.}

\section{Mixture Coefficient matrix: \texttt{coefmap}}

The coefficient matrix of the result can be plotted using the function
\code{coefmap}.
The default behaviour for multiple NMF runs is to add two annotation tracks that
show the clusters obtained by the best fit and the hierarchical clustering of
the consensus matrix\footnote{The hierarchical clustering is computed using the
consensus matrix itself as a similarity measure, and average linkage. See
\code{?consensushc}.}.
In the legend, these tracks are named \emph{basis} and \emph{consensus} respectively.
For single NMF run or NMF model objects, no consensus data are available, and
only the clusters from the fit are displayed.

<<coefmap_res, fig.keep='last'>>=
opar <- par(mfrow=c(1,2))
# coefmap from multiple run fit: includes a consensus track
coefmap(res)
# coefmap of a single run fit: no consensus track
coefmap(minfit(res))
par(opar)
@

\nbnote{Note how both heatmaps were drawn on the same plot, simply using the standard
call to \code{par(mfrow=c(1,2)}.
This is impossible to achieve with the R core function \code{heatmap}.
See \cref{sec:aheatmap} for more details about compatibility with base
and grid graphics.}

By default:
\begin{itemize}
\item the rows are not ordered;
\item the columns use the default ordering of \code{aheatmap}, but may easily be
ordered according to the clusters defined by the dominant basis component for
each column with \code{Colv="basis"}, or according to those implied by the
consensus matrix, i.e. as in \code{consensusmap}, with \code{Colv="consensus"};
\item each column is scaled to sum up to one;
\item the color palette used is \code{'YlOrRd'} from the
\citeCRANpkg{RColorBrewer}, with 50 breaks.
\end{itemize}

In term of arguments passed to the heatmap engine \code{aheatmap}, these default
settings translate as:

<<coefmap_default, eval=FALSE>>=
Rowv = NA
Colv = TRUE
scale = 'c1'
color = 'YlOrRd:50'
annCol = predict(object) + predict(object, 'consensus')
@

If the ordering does not come from a hierarchical clustering (e.g., if
\code{Colv='basis'}), then no dendrogram is displayed.
The default behaviour of \code{aheatmap} can be obtained by setting arguments
\code{Rowv=TRUE, Colv=TRUE, scale='none'}.

\medskip
The automatic annotation tracks can be hidden all together by setting argument
\code{tracks=NA}, displayed separately by passing only one of the given names
(e.g. \code{tracks=':basis'} or \code{tracks='basis:'} for the row or column respectively),
and their legend names may be changed by
specifying e.g. \code{tracks=c(Metagene=':basis', 'consensus')}.
Beside this, they are handled by the heatmap engine function \code{aheatmap}
and can be customised as any other annotation tracks -- that can be added via
the same argument \code{annCol} (see \cref{sec:aheatmap} or \code{?aheatmap} for
more details).

<<coefmap_custom, fig.keep='last', tidy=FALSE>>=
opar <- par(mfrow=c(1,2))
# removing all automatic annotation tracks
coefmap(res, tracks=NA)
# customized plot
coefmap(res, Colv = 'euclidean'
	, main = "Metagene contributions in each sample", labCol = NULL
	, annRow = list(Metagene=':basis'), annCol = list(':basis', Class=a, Index=c)
	, annColors = list(Metagene='Set2')
	, info = TRUE)
par(opar)
@

\nbnote{The feature that allows to display some information about the fit
at the bottom of the plot via argument \code{info=TRUE} is still experimental.
It is helpful mostly when developing algorithms or doing an analysis, but
would seldom be used in publications.}

\section{Basis matrix: \texttt{basismap}}

The basis matrix can be plotted using the function \code{basismap}.
The default behaviour is to add an annotation track that shows for each row
the dominant basis component.
That is, for each row, the index of the basis component with the highest
loading.

This track can be disabled by setting \code{tracks=NA}, and extra
row annotations can be added using the same argument \code{annRow}.

<<basismap_res, fig.keep='last'>>=
opar <- par(mfrow=c(1,2))
# default plot
basismap(res)
# customized plot: only use row special annotation track.
basismap(res, main="Metagenes", annRow=list(d, e), tracks=c(Metagene=':basis'))
par(opar)
@

By default:
\begin{itemize}
\item the columns are not ordered;
\item the rows are ordered by hierarchical clustering using default distance and
linkage methods (\code{'eculidean'} and \code{'complete'});
\item each row is scaled to sum up to one;
\item the color palette used is \code{'YlOrRd'} from the
\citeCRANpkg{RColorBrewer}, with 50 breaks.
\end{itemize}

In term of arguments passed to the heatmap engine \code{aheatmap}, these default
settings translate as:

<<basismap_default, eval=FALSE>>=
Colv = NA
scale = 'r1'
color = 'YlOrRd:50'
annRow = predict(object, 'features')
@

\section{Consensus matrix: \texttt{consensusmap}}

When doing clustering with NMF, a common way of assessing the stability of the
clusters obtained for a given rank is to consider the consensus matrix
computed over multiple independent NMF runs, which is the average of the connectivity
matrices of each separate run
\footnote{Hence, stability here means robustness with regards to the initial starting point,
and shall not be interpreted as in e.g. cross-validation/bootstrap analysis.
However, one can argue that having very consistent clusters across runs somehow supports
for a certain regularity or the presence of an underlying pattern in the data.}.
This procedure is usually repeated over a certain range of factorization ranks,
and the results are compared to identify which rank gives the best clusters,
possibly in the light of some extra knowledge one could have about the samples
(e.g. covariates).
The functions \code{nmf} and \code{consensusmap} make it easy to implement
this whole process.

\nbnote{The consensus plots can also be generated for fits obtained from single NMF runs,
in which case the consensus matrix simply reduces to a single connectivity matrix.
This is a binary matrix (i.e. entries are either 0 or 1), that will always produce
a bi-colour heatmap, and by default clear blocks for each cluster.}

\subsection{Single fit}
In section \cref{sec:data}, the NMF fit \code{res} was computed with argument
\code{nrun=10}, and therefore contains the best fit over 10 runs, as well as
the consensus matrix computed over all the runs
\footnote{If one were interested in keeping the fits from all the runs, the
function \code{nmf} should have been called with argument \code{.options='k'}.
See section \emph{Options} in \code{?nmf}.
The downstream hanlding of the result would remain identical.}.
This can be ploted using the function \code{consensusmap}, which allows for the
same kind of customization as the other NMF heatmap functions:

<<consensusmap_res, fig.keep='last'>>=
opar <- par(mfrow=c(1,2))
# default plot
consensusmap(res)
# customized plot
consensusmap(res, annCol=covariates, annColors=list(c='blue')
		, labCol='sample ', main='Cluster stability'
		, sub='Consensus matrix and all covariates')
par(opar)
@

By default:
\begin{itemize}
\item the rows and columns of the consensus heatmap are symmetrically
ordered by hierarchical clustering using the consensus matrix as a similarity
measure and average linkage, and the associated dendrogram is displayed;
\item the color palette used is the reverse of \code{'RdYlBu'} from the \citeCRANpkg{RColorBrewer}.
\end{itemize}

In term of arguments passed to the heatmap engine \code{aheatmap}, these default
settings translate as:

<<cmap_default, eval=FALSE>>=
distfun = function(x) as.dist(1-x) # x being the consensus matrix
hclustfun = 'average'
Rowv = TRUE
Colv = "Rowv"
color = '-RdYlBu'
@

\subsection{Single method over a range of ranks}

The function \code{nmf} accepts a range of value for the rank (argument \code{rank}),
making it fit NMF models for each value in the given range
\footnote{Before version 0.6, this feature was provided by the function \code{nmfEstimateRank}.
From version 0.6, the function \code{nmf} accepts ranges of ranks, and internally
calls the function \code{nmfEstimateRank} -- that remains exported and can still
be called directly.
See documentation \code{?nmfEstimateRank} for more details on the returned value.}:

<<estimate, cache=TRUE>>=
res2_7 <- nmf(X, 2:7, nrun=10, .options='v')
class(res2_7)
@

The result \code{res2\_7} is an S3 object of class \code{'NMF.rank'},
that contains -- amongst other data -- a list of the best fits obtained for each
value of the rank in range $\ldbrack 2, 7\rdbrack]$.
The method of \code{consensusmap} defined for class \code{'NMF.rank'},
which plots all the consensus matrices on the same plot:

<<consensusmap_estimate, fig.keep='last'>>=
consensusmap(res2_7)
@

\nbnote{ The main title of each consensus heatmap can be customized by passing
to argument \code{main} a character vector or a list whose elements specify each title.
All other arguments are used in each internal call to consensusmap, and will
therefore affect all the plots simultaneously.
The layout can be specified via argument \code{layout} as a numeric vector
giving the number of rows and columns in a \code{mfrow}-like way, or as a
matrix that will be passed to R core function \code{layout}.
See \code{?consensusmap} for more details and example code.
}

\subsection{Single rank over a range of methods}
If one is interested in comparing methods, for a given factorization rank, then
on can fit an NMF model for each method by providing the function \code{nmf} with
a \code{list} in argument \code{method}:

<<fit_methods, cache=TRUE>>=
res_methods <- nmf(X, 3, list('lee', 'brunet', 'nsNMF'), nrun=10)
class(res_methods)
@

The result \code{res\_methods} is an S4 object of class \code{NMFList}, which
is essentially a named list, that contains each fits and the CPU time required
by the whole computation.
As previously, the sequence of consensus matrices is plotted with \code{consensusmap}:

<<consensusmap_methods, fig.width=10, fig.height=7, fig.keep='last'>>=
consensusmap(res_methods)
@

\section{Generic heatmap engine: \texttt{aheatmap}}
\label{sec:aheatmap}

This section still needs to be written, but many examples of annotated heatmaps can be found in the demos \code{'aheatmap'} and \code{'heatmaps'}:
<<demo_hm, eval=FALSE>>=
demo('aheatmap')
# or
demo('heatmaps')
@

These demos and the plots they generate can also be browsed online at \url{http://nmf.r-forge.r-project.org/_DEMOS.html}.

\section{Session Info}
<<sessionInfo, echo=FALSE, results='asis'>>=
toLatex(sessionInfo())
@

\printbibliography[heading=bibintoc]


\end{document}