
% talk.tex
% > dvips -P pdf -ta4 talk -o talk.temp.ps
% > psnup -1 -m1cm -W128mm -H96mm -pa4 talk.temp.ps talk.handout.ps
% > psnup -2 -m1cm -b1cm -W128mm -H96mm -pa4 talk.temp.ps talk.handout.ps
%
% -*-latex-*-
\NeedsTeXFormat{LaTeX2e}
\documentclass[a4paper]{article}
%\usepackage{beamerarticle}
%\documentclass[compress,ignorenonframetext]{beamer}
\usepackage[english]{babel}
\usepackage{url}
\usepackage{color}
\usepackage[labelformat=empty]{caption}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{upquote}% necessary to get correct ` in verbatim environment and thereby a copy paste behaviour (http://tex.stackexchange.com/questions/63353/how-to-properly-display-backticks-in-verbatim-environment)
\usepackage{verbatim}
\usepackage{keystroke}
\usepackage{fancyvrb}
\usepackage{hyperref}
%% presentation mode
%% handout mode
%\mode<article>{
  \usepackage{graphics}
  \usepackage{pgf}
  \usepackage{xcolor}
  \renewcommand{\theenumi}{\alph{enumi}}
  \renewcommand{\labelenumi}{(\theenumi)}
  \renewcommand{\theenumii}{\Roman{enumii}}
  \renewcommand{\labelenumii}{\theenumii.}
  \newcommand{\frametitle}[1]{\subsubsection{#1}}
% }

%% \mode<presentation>{
%%   \beamertemplatenavigationsymbolsempty
%%   \useinnertheme{circles}
%%   \setbeamertemplate{frametitle}[default][center]
%%   \setbeamercovered{transparent}
%%   \setbeamertemplate{frametitle}[default][center]
%% }


% globals

\title{A short Tutorial on RNA Bioinformatics\newline
{\small The ViennaRNA Package and related Programs}}
\author{Ivo Hofacker, Ronny Lorenz, Dominik Steininger, and Sven Findei{\ss}}
\vspace{2cm}
%\institute[TBI]{Institute for Theoretical Chemistry\\ University Vienna}
%\titlegraphic{\centerline{\pgfuseimage{logo}}}
\date{\url{http://www.tbi.univie.ac.at/RNA/}\\[1ex]\today}

\newcommand{\TODO}[1]{{\textcolor{red}{* #1 *}}}


% pictures
\pgfdeclareimage[width=.5cm]{logo}{Figures/tbilogo}

% colors
\colorlet{darkgreen}{green!80!black}

% never indent paragraphs!
\setlength\parindent{0pt}

\DefineVerbatimEnvironment%
  {VerbatimExample}{Verbatim}
  {frame=single,label=Example,framerule=0.2mm,rulecolor=\color{blue}}

\DefineVerbatimEnvironment%
  {VerbatimTask}{Verbatim}
  {frame=single,label=Task,framerule=0.2mm,rulecolor=\color{red}}

%===
\begin{document}
%===

%===
%\frame{\titlepage}

\maketitle
\newpage

\tableofcontents
\newpage

%===
%%\begin{frame}
%%  \frametitle{Outline}
%%\tableofcontents
%%\setcounter{page}{1}
%%\end{frame}

%===
\section{RNA Web Services}
This tutorial aims to give a basic introduction to using the command line
programs in the ViennaRNA Package in a UNIX-like (Linux) environment.
Of course, some of you may ask ``Why are there no friendly graphical user
interfaces?'' Well, there are some, especially in the form of web
services.

If a few simple structure predictions are all you want to do, there are
several useful sites for RNA structure analysis available
on the web. Indeed, many of the tasks described below can be performed using
various web servers.
%\begin{frame}

%\frametitle{Useful Web Services}
\subsection{Useful Web Services}

\begin{itemize}
\item Michael Zuker's \texttt{mfold} server computes (sub)optimal structures
  and hybridization for DNA and RNA sequences with many options.\newline
  \href{http://mfold.rna.albany.edu/?q=mfold}{Mfold Website}
\item BiBiServ, several small services e.g.\ pseudo-knot prediction
  \texttt{pknotsRG}, bi-stable structures \texttt{paRNAss}, alignment
  \texttt{RNAforester}, visualization \texttt{RNAmovies}, suboptimal
  structures \texttt{RNAshapes}\newline
  \href{https://bibiserv.cebitec.uni-bielefeld.de/rna}{Bielefeld Bioinformatics Service}
\item The ViennaRNA Server offers web access to many tools of the ViennaRNA
Package, e.g. \texttt{RNAfold}, \texttt{RNAalifold}, \texttt{RNAinverse} and \texttt{RNAz}\newline
  \url{http://rna.tbi.univie.ac.at/}
\item several specialized servers such as \newline
\begin{itemize}
  \item \texttt{pfold} consensus structure prediction\newline
  \href{http://www.daimi.au.dk/~compbio/rnafold/}{pfold RNA fold server}
  \item \texttt{Sfold} stochastic suboptimals and siRNA design\newline
  \href{http://sfold.wadsworth.org/}{Sfold Webservices}
  \item \texttt{StrAl} progressive ncRNA alignment tool\newline
  \href{http://www.biophys.uni-duesseldorf.de/stral/}{StrAl Webservice}
\end{itemize}
\end{itemize}
%\end{frame}

Web servers are also a good starting point for novice users since they
provide a more intuitive interface. Moreover, the ViennaRNA Server will
return the equivalent command line invocation for each request, making the
transition from web services to locally installed software easier.

However, web servers are not ideal for analyzing many or very long
sequences, and they usually offer only a few frequently used tasks.
Much the same is true for point-and-click graphical interfaces.
Command line tools, on the other hand, are ideally suited for automating
repetitive tasks. They can even be combined in pipes to process the
results of one program with another, or they can be used in parallel, running
tens or hundreds of tasks simultaneously on a cluster of PCs.

You can try some of these web services in parallel to the exercises below.

\section{Get started}
\subsection{Typographical Conventions}
\begin{itemize}
  \item \texttt{Constant width font} is used for program names, variable
    names and other literal text like input and output in the terminal
    window.

  \item Lines starting with a \texttt{\$} within a literal text block
    are commands. You should type the text following the \texttt{\$} into
    your terminal window finishing by hitting the \Enter-key. (The
    \texttt{\$} signifies the command line prompt, which may look
    different on your system).

  \item All other lines within a literal text block are the output from
    the command you just typed.
\end{itemize}

\subsection{Data Files}
Data files containing the sequences used in the examples below are
shipped with this tutorial.

\subsection{Terminal, Command line and Editor}
%===
%\begin{frame}[fragile]
%\frametitle{Terminal and Command}

\begin{itemize}
  \item You can get a \textbf{terminal} by moving your
    mouse-pointer to an empty spot of your desktop, clicking the right
    mouse-button and choosing ``Open Terminal'' from the pull-down menu.
  \item You can \textbf{run commands} in the terminal by typing them next
    to the command line prompt (usually something like \$) followed by
    hitting the \Enter-key.

\begin{VerbatimExample}
$ date
Tue Jul  7 14:30:25 CEST 2015
\end{VerbatimExample}
  \item To get more information about a command, type \texttt{man} followed
    by the \textit{command-name} and hit the \Enter-key.
    Leave the man pages by pressing the \keystroke{q}-key.
\begin{VerbatimExample}
$ man date
\end{VerbatimExample}
  \item Redirect a command's input and output using the following special
    characters:\\
  `$|$' ties the \textit{stdout} of one command to the \textit{stdin} of the next\\
  `$<$' redirects a file to \textit{stdin}\\
  `$>$' redirects \textit{stdout} to a file\\
  Here, \textit{stdout} stands for \emph{standard output}, which you can
  normally see in the terminal. \textit{stdin} is its counterpart, the
  \emph{standard input}. The character `$|$' allows you to \emph{pipe}
  the standard output of one program directly as standard input into
  another program; hence, the programs are chained together.
\end{itemize}

  Below you'll find a list of some useful core commands available in any Linux
  terminal.

\begin{tabular}{ll}
  Command & Description\\\hline
  \texttt{pwd} & displays the path to the current working directory\\
  \texttt{cd} & changes the working directory (initially your ``HOME'')\\
  \texttt{ls} & lists files and directories in the current (or a specified) directory\\
  \texttt{mkdir} & creates a directory\\
  \texttt{rm} & removes a file (add option \texttt{-r} for deleting a folder)\\
  \texttt{less} & shows file(s) one page at a time\\
  \texttt{echo} & prints string(s) to standard output\\
  \texttt{wc} & prints the number of newlines, words, and bytes in a specified file
\end{tabular}

\noindent
For more information regarding these commands, append \texttt{--help} to the program call,
like this:
\begin{VerbatimExample}
$ rm --help
\end{VerbatimExample}

Try a few commands on your own, e.g.
\begin{VerbatimTask}
$ ls > file_list
$ less file_list
$ rm file_list
$ ls | less
\end{VerbatimTask}

Here the \textit{stdout} from the \texttt{ls} command was written to a file called \texttt{file\_list}.
The next command shows the content of \texttt{file\_list}. After quitting \texttt{less} by pressing the
\keystroke{q}-key, we removed the file. Finally, \texttt{ls $|$ less} pipes the output of \texttt{ls} into the
\texttt{less} program without writing it to a file.

\noindent
Now we create our working directory including subfolders and our first sequence file using
the commands we just learned. Keep in mind that a well-organized directory structure makes it
easy to find your data later.

First find out which directory you are in by typing
\begin{VerbatimTask}
$ pwd
\end{VerbatimTask}
It should look similar to
\begin{VerbatimExample}
/home/YOURUSER
\end{VerbatimExample}
To make sure that you are in your home directory, type (\textasciitilde{} is the shortcut for the \texttt{home}-directory)
\begin{VerbatimExample}
$ cd ~
\end{VerbatimExample}
Now create a new folder in your home directory
\begin{VerbatimTask}
$ mkdir -p ~/Tutorial/Data
$ cd ~/Tutorial/Data
$ echo ATGAAGATGA > BAZ.seq
\end{VerbatimTask}
\noindent
Here we created two new folders in our \texttt{HOME}, \texttt{Tutorial} and a subfolder called \texttt{Data}, then
we jumped to the \texttt{Data}-folder and wrote a short DNA sequence to the \texttt{BAZ.seq} file.

\noindent
For further processing we need an RNA sequence instead of a DNA sequence, so we replace every T
by a U using \texttt{sed} (the stream editor).
\begin{verbatim}
$ sed -i 's/T/U/g' BAZ.seq
\end{verbatim}
The program is called via \texttt{sed}; \texttt{-i} tells \texttt{sed} to edit the existing file in place (in this case \texttt{BAZ.seq}).
The expression \texttt{s/T/U/g} substitutes (\texttt{s}) T by U, and the trailing \texttt{g} tells \texttt{sed} to replace all occurrences
of T on each line (globally), not just the first one.

When we look at our file using \texttt{less} we should see our new sequence ``AUGAAGAUGA''
\begin{verbatim}
$ less BAZ.seq
\end{verbatim}
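
\noindent
The same substitution can also be performed inside a script. The following is a minimal Python sketch and not part of the \texttt{ViennaRNA Package}; the function name \texttt{dna\_to\_rna} is a hypothetical helper chosen for illustration.
\begin{verbatim}
def dna_to_rna(seq):
    """Replace every T by U, as sed 's/T/U/g' does.

    Lowercase t is converted to u as well (an addition of this sketch).
    """
    return seq.replace("T", "U").replace("t", "u")


# the sequence written to BAZ.seq above
print(dna_to_rna("ATGAAGATGA"))
\end{verbatim}
In contrast to \texttt{sed -i}, this sketch does not modify any file; it simply returns a new string.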
%\end{frame}
%===

\subsection{Installing Software from Source}
\label{sect:install}

Many bioinformatics programs are available only as source code that has to
be compiled and installed. We'll demonstrate the standard way to install
programs from source using the \texttt{ViennaRNA Package}.

%===
%\begin{frame}[fragile]
 \frametitle{Get the \texttt{ViennaRNA Package}}
You can either install a precompiled package for your operating system
(precompiled packages are available for several systems, such as Fedora, Arch Linux, Debian, Ubuntu,
and Windows) or compile the source code yourself. Here we compile the programs ourselves.
Have a look at the file \texttt{INSTALL} distributed with the \texttt{ViennaRNA Package}
for more details, or read the online documentation.

The steps for downloading and unpacking the source code are:

  \begin{enumerate}
  \item Go to your \texttt{Tutorial} folder and create a directory for the downloads
\begin{verbatim}
$ cd ..
$ mkdir downloads
$ cd downloads
\end{verbatim}
  \item Download the \texttt{ViennaRNA Package} from
    \url{http://www.tbi.univie.ac.at/RNA/index.html} and save it into the newly created
    directory.
  \item Unpack the gzipped tar archive by running the following command (replace 2.4.11 with the latest version number):
    \begin{verbatim}
       $ tar -zxf ViennaRNA-2.4.11.tar.gz
    \end{verbatim}
  \item List the contents of the directory
    \begin{verbatim}
       $ ls -F
         ViennaRNA-2.4.11/  ViennaRNA-2.4.11.tar.gz
    \end{verbatim}
  \end{enumerate}
%\end{frame}

%===
\subsection{Build the \texttt{ViennaRNA Package}}
%===
%\begin{frame}[fragile]
The installation location can be controlled through options to the
\texttt{configure} script. For example, to change the default installation
location to the directory \texttt{Progs/VRP} in your \texttt{\$HOME/Tutorial}
directory, use the \texttt{--prefix} option so that the build system knows
where to install the package.

\begin{enumerate}
\item To configure and build the package just run the following commands.
\begin{verbatim}
$ cd ViennaRNA-2.4.11
$ mkdir -p ~/Tutorial/Progs/VRP
$ ./configure --prefix=$HOME/Tutorial/Progs/VRP
$ make
$ make install
\end{verbatim}
You already know the \texttt{cd} and the \texttt{mkdir} commands. \texttt{./configure} checks
whether all dependencies are fulfilled and aborts if some major requirements are missing.
If everything is fine, it creates the \texttt{Makefile}, which \texttt{make} then uses to build the package;
\texttt{make install} finally copies the programs to the chosen installation directory.
\item To install the \texttt{ViennaRNA Package} system-wide (only
for people with superuser privileges, which we are NOT!) run
\begin{verbatim}
$ ./configure
$ make
$ make install
\end{verbatim}
\end{enumerate}
%\end{frame}
\noindent
You find the installed files in
\begin{enumerate}
\item \texttt{\$HOME/Tutorial/Progs/VRP/bin} (programs)
\item \texttt{\$HOME/Tutorial/Progs/VRP/share/ViennaRNA/bin} (Perl scripts)
\end{enumerate}
Wherever you installed the main programs of the \texttt{ViennaRNA Package},
make sure the path to the executables shows up in your \texttt{PATH}
environment variable. To check the contents of the \texttt{PATH} environment
variable simply run
\begin{verbatim}
$ echo $PATH
\end{verbatim}
For easier handling, we now copy all our binaries as well as the Perl scripts into one
common folder.
\begin{verbatim}
$ cd ~/Tutorial/Progs/
$ cp VRP/share/ViennaRNA/bin/* .
\end{verbatim}
Now you can show the contents of the folder using the command \texttt{ls}.


Also copy the binaries from the \texttt{VRP/bin} folder. In the next step we add the path
of the directory to the \texttt{PATH} environment variable (use \texttt{pwd} to display it) so we don't need to
write the whole path every time we call a program.
\begin{verbatim}
$ export PATH=${HOME}/Tutorial/Progs:${PATH}
\end{verbatim}
%
Note that this is only a temporary solution. If you want the path to be
permanently added, you need to add the line above to the configuration file of your
shell. Typically, \texttt{bash} is the default shell, in which case you add the export line
above to the file \texttt{.bashrc} in your home directory. To reload the contents of \texttt{.bashrc} type
\begin{verbatim}
$ source ~/.bashrc
\end{verbatim}
or close the current terminal and open it again. (Remember, this works only for the \texttt{bash} shell.)
%
To check that everything worked, find out which executable your shell actually uses.
%
\begin{verbatim}
  $ which RNAfold
\end{verbatim}
The path shown should point to \texttt{\$HOME/Tutorial/Progs/}. Finally, try to
get a brief description of a program, e.g.
%
\begin{verbatim}
$ RNAfold --help
\end{verbatim}
%
If this doesn't work re-read the steps described above more carefully.
%===

\subsection{What's in the \texttt{ViennaRNA Package}}
The core of the \texttt{ViennaRNA Package} is formed by a collection
of routines for the prediction and comparison of RNA secondary
structures. These routines can be accessed through stand-alone
programs, such as \texttt{RNAfold}, \texttt{RNAdistance}, etc., which
should be sufficient for most users. For those who wish to develop
their own programs a library which can be linked to your own code is
provided.

%===
%\begin{frame}[fragile]
  \frametitle{The base directory}

  \begin{itemize}
  \item Make a directory listing of \texttt{downloads/ViennaRNA-2.4.11/}
\begin{verbatim}
  $ ls -F ~/Tutorial/downloads/ViennaRNA-2.4.11/
\end{verbatim}
  \end{itemize}
  \begin{center}
    \footnotesize
    \begin{verbatim}
        aclocal.m4     config.sub*   INSTALL      man/           RNA-Tutorial/
        AUTHORS        configure*    install-sh*  misc/          src/
        CHANGELOG.md   configure.ac  interfaces/  missing*       tests/
        compile*       COPYING       license.txt  NEWS           THANKS
        config/        depcomp*      m4/          packaging/     ylwrap*
        config.guess*  doc/          Makefile.am  README.md
        config.h.in    examples/     Makefile.in  RNAlib2.pc.in
    \end{verbatim}%$
  \end{center}
You now see the contents of the \texttt{ViennaRNA-2.4.11} folder. Directories are marked
by a trailing ``/'', and ``*'' indicates executable files. The \texttt{Makefile} contains the rules to
compile the code; the source code is located within the \texttt{src/} directory. \texttt{configure}
accepts various options for the installation and creates the \texttt{Makefile}. \texttt{INSTALL}
covers installation instructions, and the \texttt{README.md} file contains general information about the
\texttt{ViennaRNA Package}.
%\end{frame}


%===
%\begin{frame}
  \frametitle{Which programs are available?}

  {\small
  \begin{tabular}{ll}
	RNA2Dfold & Compute coarse grained energy landscape of representative sample structures\\
	RNAaliduplex & Predict conserved RNA-RNA interactions between two alignments\\
	RNAalifold & Calculate secondary structures for a set of aligned RNA sequences\\
	RNAcofold & Calculate secondary structures of two RNAs with dimerization\\
	RNAdistance & Calculate distances between RNA secondary structures\\
	RNAduplex & Compute the structure upon hybridization of two RNA strands\\
	RNAeval & Evaluate free energy of RNA sequences with given secondary structure\\
	RNAfold & Calculate minimum free energy secondary structures and partition function of RNAs\\
	RNAheat & Calculate the specific heat (melting curve) of an RNA sequence\\
	RNAinverse & Find RNA sequences with given secondary structure (sequence design)\\
	RNALalifold & Calculate locally stable secondary structures for a set of aligned RNAs\\
	RNALfold & Calculate locally stable secondary structures of long RNAs\\
	RNApaln & RNA alignment based on sequence base pairing propensities\\
	RNApdist & Calculate distances between thermodynamic RNA secondary structures ensembles\\
	RNAparconv & Convert energy parameter files from ViennaRNA 1.8 to 2 format\\
	RNAPKplex & Predict RNA secondary structures including pseudoknots\\
	RNAplex & Find targets of a query RNA\\
	RNAplfold & Calculate average pair probabilities for locally stable secondary structures\\
	RNAplot & Draw and markup RNA secondary structures in PostScript, SVG, or GML\\
	RNApvmin & Find a vector of perturbation energies which may further be used to constrain folding\\
	RNAsnoop & Find targets of a query H/ACA snoRNA\\
	RNAsubopt & Calculate suboptimal secondary structures of RNAs\\
	RNAup & Calculate the thermodynamics of RNA-RNA interactions\\
	Kinfold & Simulate the stochastic folding kinetics of RNA sequences into secondary structures\\
	RNAforester \footnotemark & Compare RNA secondary structures via forest alignment\\
  \end{tabular}}
	\footnotetext{RNAforester is not developed by the TBI Vienna.}

%\end{frame}

%===
%\begin{frame}[fragile]
  \frametitle{Which Utilities are available?}

  {\small
  \begin{tabular}{ll}
  b2ct & Convert dot-bracket notation to Zuker's mfold \texttt{.ct} file format\\
  b2mt.pl & Convert dot-bracket notation to x y values\\
  cmount.pl & Generate a colored mountain plot\\
  coloraln.pl & Colorize an alirna.ps file\\
  colorrna.pl & Colorize a secondary structure with reliability annotation\\
  ct2db & Convert Zuker's mfold \texttt{.ct} file format to dot-bracket notation\\
  dpzoom.pl & Extract a portion of a dot plot\\
  mountain.pl & Generate a mountain plot\\
  popt & Extract Zuker's p-optimal folds from subopt output\\
  refold.pl & Refold using consensus structure as constraint\\
  relplot.pl & Add reliability information to an RNA secondary structure plot\\
  rotate\_ss.pl & Rotate the coordinates of an RNA secondary structure plot\\
  switch.pl & Design RNA sequences that exhibit two almost equally stable structures\\
  \end{tabular}}
\noindent
%\end{frame}

\vspace*{2ex}\noindent
All programs that are shipped with the \texttt{ViennaRNA Package} provide
some documentation in the form of ``man pages''. In UNIX-like environments, these
manual pages can be viewed using the \texttt{man} command after successfully
installing the \texttt{ViennaRNA Package}:
\begin{verbatim}
$ man RNAalifold
\end{verbatim}%$
Alternatively, an online version of the manual pages is available at\\
\url{https://www.tbi.univie.ac.at/RNA/documentation.html#programs}.
Note that the \texttt{MANPATH} environment variable needs to be updated
if the \texttt{ViennaRNA Package} has been installed in a non-standard path.

Helpful documentation is also included in the source folder of the \texttt{ViennaRNA Package}:\\
\texttt{\textasciitilde/Tutorial/downloads/ViennaRNA-2.4.11/doc/RNAlib-2.4.11.pdf}

Most Perl scripts carry embedded documentation that is displayed by typing
\begin{verbatim}
$ perldoc coloraln.pl
\end{verbatim}%$
in the folder where the script is located.
All scripts and programs give short usage instructions when called with
the \texttt{-h} command line option (e.g. \texttt{RNAalifold -h}).
%===

\subsection{The Input File Format}
RNA sequences come in a variety of formats. The sequence format used
throughout the ViennaRNA Package is very simple. A sequence
file contains one or more sequences. Each sequence either spans a single
line without any additional whitespace, or the file is \texttt{FASTA}
formatted. In the latter case, the sequence is preceded by a special header
line that starts with the `\texttt{>}' character followed by a sequence identifier.
This identifier, usually a unique name assigned to the sequence, will then be
used by the programs in the \texttt{ViennaRNA Package} as basename for any output
files. Please note that some of the programs do not support the \texttt{FASTA} format
yet. Furthermore, programs that require multiple input sequences, e.g. for
interaction prediction, may require them as separate lines, or in a concatenated
form on a single line with the delimiting character \texttt{\&}. Please read
the corresponding man pages and \texttt{--help} output to find out the actual
input format requirements.
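
\noindent
To make these rules concrete, the following minimal Python sketch reads sequences in the simple format described above. It is an illustration only and not the parser used by the programs of the \texttt{ViennaRNA Package}; the function name \texttt{read\_sequences} is hypothetical, the second example sequence is made up, and the sketch assumes that each sequence occupies exactly one line.
\begin{verbatim}
def read_sequences(lines):
    """Return a list of (identifier, strands) tuples.

    identifier comes from a preceding FASTA header (or is None),
    strands is the sequence line split at the '&' delimiter.
    """
    records = []
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip empty lines
        if line.startswith(">"):
            fields = line[1:].split()     # first word is the identifier
            current_id = fields[0] if fields else None
        else:
            records.append((current_id, line.split("&")))
            current_id = None             # a header applies to one sequence
    return records


# FASTA formatted single sequence
print(read_sequences([">test", "CUACGGCGCGGCGCCCUUGGCGA"]))
# two strands concatenated with '&' (made-up sequences)
print(read_sequences(["GCGCUUCG&CGAAGCGC"]))
\end{verbatim}
Whether a particular program accepts headers or the \texttt{\&} delimiter still has to be checked in its man page.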

%===
%\input{single_seq.tex}
\section{Structure Prediction on single Sequences}
\subsection{The Program \texttt{RNAfold}}
%===
Our first task will be to do a structure prediction using
\texttt{RNAfold}. This should get you familiar with the input and output
format as well as the graphical output produced.

\texttt{RNAfold} reads single RNA sequences, computes their minimum free energy
(\texttt{MFE}) structures, and prints each sequence together with its
\texttt{MFE} structure in dot-bracket notation and the corresponding free energy. This is the default mode if no
further command line parameters are provided. Please note that the \texttt{RNAfold}
program can either be used in \textit{interactive mode}, where the program expects
the input from \textit{stdin}, or in \textit{batch processing mode}, where
you provide the input sequences as text files.

To activate computation of the partition function for each sequence, the
\texttt{-p} option must be set. From the partition function
$$Q = \sum_{s \in \Omega} \exp(-E(s) / RT)$$
over the ensemble of all possible structures $\Omega$, with temperature $T$ and gas
constant $R$, \texttt{RNAfold} then computes the ensemble free energy $G = -RT \cdot \ln(Q)$,
and the frequency of the \texttt{MFE} structure $s_{mfe}$ within the ensemble
$$p = \exp(-E(s_{mfe}) / RT) / Q$$
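
\noindent
The relation between these quantities can be illustrated with a toy ensemble. The Python sketch below is not how \texttt{RNAfold} computes the partition function (it uses dynamic programming instead of enumerating structures); the list of energies is made up, and the values chosen for $R$ and $T$ ($37\,^\circ$C) are assumptions of this example.
\begin{verbatim}
import math

R = 0.0019872   # gas constant in kcal/(mol K), assumed value
T = 310.15      # 37 degrees Celsius in Kelvin, assumed temperature


def ensemble_properties(energies):
    """Return (Q, G, p_mfe) for a list of free energies in kcal/mol."""
    rt = R * T
    q = sum(math.exp(-e / rt) for e in energies)   # partition function
    g = -rt * math.log(q)                          # ensemble free energy
    p_mfe = math.exp(-min(energies) / rt) / q      # frequency of the MFE
    return q, g, p_mfe


# made-up energies of four structures; the lowest one is the MFE
q, g, p = ensemble_properties([-5.0, -4.2, -3.1, 0.0])
print("G = %.2f kcal/mol, p(mfe) = %.3f" % (g, p))
\end{verbatim}
The ensemble free energy can never be higher than the MFE, and the frequency of the MFE structure reaches $1$ only if the ensemble consists of a single structure.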

Furthermore, by default, the \texttt{-p} option also activates the computation
of base pairing probabilities $p_{ij}$. From this data, \texttt{RNAfold} then
determines the ensemble diversity
$$\langle d \rangle = \sum_{ij} p_{ij} \cdot (1 - p_{ij}),$$
i.e. the expected distance between any two secondary structures, as well as the
\texttt{centroid} structure, i.e. the structure $s_c$ with the least Boltzmann weighted
distance
$$d_\Omega(s_c) = \sum_{s \in \Omega} p(s) \, d(s_c, s)$$
to all other structures $s \in \Omega$.

Another useful structure representative one can determine from base pairing probabilities
$p_{ij}$ is the structure that exhibits the \textit{maximum expected accuracy (MEA)}. By
assuming the base pair probability is a good measure of correctness of a pair $(i,j)$, the
expected accuracy of a structure $s$ is
$$\text{EA}(s) = \sum_{\substack{(i,j) \in s}} 2\gamma p_{ij} + \sum_{\substack{i \\ \nexists (i,j) \in s}} q_i$$
with $q_i = 1 - \sum_j p_{ij}$ and weighting factor $\gamma$ that allows us to
weight paired against unpaired positions. \texttt{RNAfold} uses a dynamic programming
scheme similar to the \textit{Maximum Matching algorithm} of Ruth Nussinov to find the
structure $s$ that maximizes the above equation.
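
\noindent
The expected accuracy of a given structure can be evaluated directly from this formula. The following Python sketch only evaluates $\text{EA}(s)$ for a structure that is already known; it does not implement the dynamic programming search. The function name, the toy probabilities, and the default $\gamma = 1$ are assumptions made for illustration.
\begin{verbatim}
def expected_accuracy(pairs, probs, n, gamma=1.0):
    """Evaluate EA(s) for a structure given as a list of pairs (i, j).

    probs maps (i, j) with i < j (1-based) to the pair probability,
    n is the sequence length.
    """
    # probability of each position to be paired with any partner
    paired_prob = [0.0] * (n + 1)
    for (i, j), p in probs.items():
        paired_prob[i] += p
        paired_prob[j] += p
    paired_positions = {k for pair in pairs for k in pair}
    # first sum: pairs contained in the structure
    ea = sum(2.0 * gamma * probs.get(pair, 0.0) for pair in pairs)
    # second sum: q_i for all positions left unpaired in the structure
    ea += sum(1.0 - paired_prob[i]
              for i in range(1, n + 1) if i not in paired_positions)
    return ea


# toy example: length 4, a single possible pair (1, 4) with p = 0.8
toy = {(1, 4): 0.8}
print(expected_accuracy([(1, 4)], toy, 4))   # structure with the pair
print(expected_accuracy([], toy, 4))         # open chain
\end{verbatim}
In this toy example the structure containing the pair scores higher than the open chain; increasing $\gamma$ favours paired positions even more.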

The \texttt{RNAfold} program provides a large number of additional
computation modes that will be partly covered below. To get a full list of all
computation modes available, please consult the \texttt{RNAfold} man page or
the outputs of \texttt{RNAfold -h} and \texttt{RNAfold --detailed-help}.

\subsubsection{MFE structure of a single sequence}

\begin{enumerate}
\item Use a text editor (emacs, vi, nano, gedit) to prepare an input file by pasting the text
below and save it under the name \texttt{test.seq} in your \texttt{Data} folder.
\begin{verbatim}
> test
CUACGGCGCGGCGCCCUUGGCGA
\end{verbatim}
\item Compute the best (MFE) structure for this sequence using \textit{batch processing mode}
\begin{verbatim}
  $ RNAfold test.seq
  CUACGGCGCGGCGCCCUUGGCGA
  ...........((((...)))). ( -5.00)
 \end{verbatim}%$
\item or use the \textit{interactive mode} and redirect the content of \texttt{test.seq}
to \textit{stdin}
\begin{verbatim}
  $ RNAfold < test.seq
  CUACGGCGCGGCGCCCUUGGCGA
  ...........((((...)))). ( -5.00)
\end{verbatim}
\item alternatively, you could use the \textit{interactive mode} and manually enter the sequence
  as soon as \texttt{RNAfold} prompts for input
\begin{verbatim}
  $ RNAfold
  Input string (upper or lower case); @ to quit
  ....,....1....,....2....,....3....,....4....,....5....,....6....,....7....,....8
  CUACGGCGCGGCGCCCUUGGCGA
  length = 23

  CUACGGCGCGGCGCCCUUGGCGA
  ...........((((...)))).
   minimum free energy =  -5.00 kcal/mol
 \end{verbatim}%$
%\end{frame}
\end{enumerate}

All the above variants to compute the MFE and the corresponding structure result in identical
output, except for slight variations in the formatting when true \textit{interactive mode} is used.
The last line(s) of the text output contains the predicted MFE structure in \textit{dot-bracket notation}
and its free energy in \texttt{kcal/mol}. A dot in the dot-bracket notation represents an unpaired
position, while a base pair $(i, j)$ is represented by a pair of matching parentheses at positions
$i$ and $j$.
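
\noindent
The dot-bracket notation is easy to decode with a stack: every opening parenthesis is pushed, and every closing parenthesis is paired with the most recently pushed position. The following minimal Python sketch (the function name is hypothetical, not part of the package) lists the base pairs of the structure predicted above.
\begin{verbatim}
def dot_bracket_to_pairs(structure):
    """Return the sorted list of base pairs (i, j), 1-based, i < j."""
    stack = []
    pairs = []
    for pos, char in enumerate(structure, start=1):
        if char == "(":
            stack.append(pos)             # remember the opening position
        elif char == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % pos)
            pairs.append((stack.pop(), pos))
        elif char != ".":
            raise ValueError("unexpected character %r" % char)
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return sorted(pairs)


# MFE structure predicted for test.seq
print(dot_bracket_to_pairs("...........((((...))))."))
\end{verbatim}
For this structure the helix consists of the four pairs $(12,22)$, $(13,21)$, $(14,20)$, and $(15,19)$.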

If the input was \texttt{FASTA} formatted, i.e. the sequence was preceded by a header line
with sequence identifier, \texttt{RNAfold} creates a structure layout file named \texttt{test\_ss.ps},
where \texttt{test} is the sequence identifier as provided through the \texttt{FASTA} header.
In case the header was omitted the output file name simply is \texttt{rna.ps}.\\
Let's take a look at the output file with your favorite \texttt{PostScript} viewer, e.g. \texttt{gv}.\footnote{In contrast to
  bitmap based image files (such as GIF or JPEG), PostScript files contain resolution
  independent vector graphics, suitable for publication. They can be
  viewed on-screen using a PostScript viewer such as \texttt{gv} or
  \texttt{evince}.} Note the \& at the end of the following command line, which detaches
the program call and immediately starts the program in the background.
\begin{verbatim}
$ gv test_ss.ps &
\end{verbatim}
\noindent
Compare the dot-bracket notation to the PostScript
drawing shown in the file \texttt{test\_ss.ps}.

You can use the \texttt{-t} option to change the layout algorithm \texttt{RNAfold} uses
to produce the plot. The simplest layout is the \textit{radial} layout that can be chosen
with \texttt{-t 0}. Here, each nucleotide in a loop is equally spaced on its enclosing circle.
The more sophisticated \texttt{Naview} layout algorithm is used by default but may be explicitly
chosen through \texttt{-t 1}. A hidden feature can be found with \texttt{-t 2}, where \texttt{RNAfold}
creates a simple circular plot.

The calculation above does not tell us whether we can actually trust the predicted structure.
In fact, there may be many more possible structures that might be equally probable. To find
out about that, let's have a look at the equilibrium ensemble instead.

\subsubsection{Predicting equilibrium properties of the structure ensemble}

\begin{enumerate}
\item Run \texttt{RNAfold -p --MEA} to compute the partition function,
pair probabilities, centroid structure, and the maximum expected accuracy
(MEA) structure.
\item Have a look at the generated PostScript files \texttt{test\_ss.ps} and
\texttt{test\_dp.ps}
\begin{verbatim}
  $ RNAfold -p --MEA test.seq
  CUACGGCGCGGCGCCCUUGGCGA
  ...........((((...)))). ( -5.00)
  ....{,{{...||||...)}}}. [ -5.72]
  ....................... {  0.00 d=4.66}
  ......((...))((...))... {  2.90 MEA=14.79}
   frequency of mfe structure in ensemble 0.311796; ensemble diversity 6.36

\end{verbatim}
%\end{frame}
  \end{enumerate}
\noindent
Here the last four lines are new compared to the text output without the \texttt{-p --MEA}
options. The partition function is already a rough measure for the well-definedness of the \texttt{MFE}
structure. The third line shows a condensed representation of the pair probabilities of each
nucleotide, similar to the dot-bracket notation, followed by the ensemble free energy
($G = -RT \cdot \ln(Q)$) in \texttt{kcal/mol}. Here, the dot-bracket like notation consists
of additional characters that denote the pairing propensity for each nucleotide:
``\texttt{.}'' denotes bases that are essentially unpaired, ``\texttt{,}'' weakly paired bases, and
``\texttt{|}'' strongly paired bases without preference; ``\texttt{\{}'' and ``\texttt{\}}'' denote weakly ($>$33\%)
upstream and downstream paired bases, and ``\texttt{(}'' and ``\texttt{)}'' strongly ($>$66\%) upstream and
downstream paired bases, respectively.

The next two lines represent (i) the centroid structure
with its free energy and distance to the ensemble, and (ii) the MEA structure, its free
energy, and the actual accuracy. The very last line shows the frequency of the MFE structure in
the ensemble of secondary structures and the diversity of the ensemble as discussed above.

Note that the MFE structure is adopted with only 31\% probability. Moreover, the
diversity is very high for such a short sequence.
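
\noindent
As a quick plausibility check, the frequency of the MFE structure can be recovered from the two energies printed in the output. Since $Q = \exp(-G/RT)$, the definition of $p$ can be rewritten as
$$p = \frac{\exp(-E(s_{mfe})/RT)}{Q} = \exp\left(-\frac{E(s_{mfe}) - G}{RT}\right).$$
Assuming a folding temperature of $37\,^\circ$C ($T = 310.15$\,K) and $R \approx 1.987 \cdot 10^{-3}$\,kcal/(mol\,K), we get $RT \approx 0.616$\,kcal/mol and
$$p \approx \exp\left(-\frac{-5.00 - (-5.72)}{0.616}\right) \approx \exp(-1.17) \approx 0.31,$$
which agrees with the reported frequency of $0.311796$ up to the rounding of both energies to two decimal places.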

\subsubsection{Rotate the structure plot}

\includegraphics[width=.50\textwidth]{Figures/test_ss.eps}\\

To rotate the secondary structure plot that is generated by \texttt{RNAfold},
the \texttt{ViennaRNA Package} provides the Perl script utility \texttt{rotate\_ss.pl}.
Read the \texttt{perldoc} for this tool to learn how to control the rotation, and use
this information to bring your secondary structure into a vertical position.
\begin{verbatim}
$ perldoc rotate_ss.pl
\end{verbatim}%$


\subsubsection{The base pair probability dot plot}

\includegraphics[width=.50\textwidth]{Figures/test_dp.eps}\\
%\end{frame}

The ``dot plot'' (\texttt{test\_dp.ps}) shows the pair probabilities within
the equilibrium ensemble as an $n\times n$ matrix, and is an excellent way to
visualize structural alternatives. A square at row $i$ and column $j$
indicates a base pair. The area of a square in the upper right half of the
matrix is proportional to the probability of the base pair $(i,j)$ within the
equilibrium ensemble. The lower left half shows all pairs belonging to
the \texttt{MFE} structure. While the MFE consists of a single helix, several
different helices are visualized in the pair probabilities.

While a base pair probability dot-plot is quite handy to interpret for short
sequences, it quickly becomes confusing the longer the RNA sequence is. Still,
this is (currently) the only output of base pair probabilities for the \texttt{RNAfold}
program. Nevertheless, since the dot plot is a true \texttt{PostScript} file,
one can retrieve the individual base pair probabilities by parsing its textual
content.

\begin{enumerate}
\item Open the dot plot with your favorite text editor
\item Locate the lines that follow the scheme
\begin{verbatim}
  i j v ubox
\end{verbatim}
where $i$ and $j$ are integer values and $v$ is a floating point decimal
with values between $0$ and $1$. These are the data for the boxes drawn in
the upper triangle. The integer values $i$ and $j$ denote the nucleotide positions
while the value $v$ is the square root of the probability of base pair $(i,j)$.
Thus, the actual base pair probability is $p(i,j) = v^2$.

\end{enumerate}
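
The same extraction can be scripted. The following Python sketch is only an
illustration and not part of the \texttt{ViennaRNA Package}. The function name
\texttt{read\_pair\_probabilities} is our own choice, and the sketch assumes that each
probability entry occupies a line of its own in the form \texttt{i j v ubox}
described above.
\begin{verbatim}
import re

# One upper-triangle entry of the dot plot: "i j v ubox".
UBOX = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d*\.?\d+(?:[eE][+-]?\d+)?)\s+ubox\s*$")

def read_pair_probabilities(lines):
    """Return a dict {(i, j): p} with p = v * v for every 'i j v ubox' line."""
    probabilities = {}
    for line in lines:
        match = UBOX.match(line)
        if match:
            i, j = int(match.group(1)), int(match.group(2))
            v = float(match.group(3))      # v is the square root of p(i,j)
            probabilities[(i, j)] = v * v
    return probabilities
\end{verbatim}
The function accepts any iterable of text lines, so an open file handle for
\texttt{test\_dp.ps} can be passed directly. All other lines of the
\texttt{PostScript} file are ignored.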

\subsubsection{Mountain and Reliability plot}
Next, let's use the \texttt{relplot.pl} utility to annotate which parts of a
predicted MFE structure are well-defined and thus more reliable. Also let's use a real
example for a change and produce yet another representation of the predicted
structure, the \emph{mountain plot}.

\noindent
Fold the 5S rRNA sequence and visualize the structure. (The file \texttt{5S.seq} is shipped with the tutorial.)
\begin{verbatim}
  $ RNAfold -p 5S.seq
  $ mountain.pl 5S_dp.ps | xmgrace -pipe
  $ relplot.pl 5S_ss.ps 5S_dp.ps > 5S_rss.ps
\end{verbatim}

  \includegraphics[width=.45\textwidth]{Figures/5S_mt.eps}\hfill
  \includegraphics[trim=0cm 1.5cm 0cm 0cm, width=.50\textwidth]{Figures/5S_rot.eps}

A mountain plot is especially useful for long sequences where conventional
structure drawings become terribly cluttered.  It is an xy-diagram plotting
the number of base pairs enclosing a sequence position \textit{versus} the
position. The  \texttt{Perl} script \texttt{mountain.pl} transforms a dot
plot into the mountain plot coordinates which can be visualized with any
xy-plotting program, e.g. \texttt{xmgrace}.

The resulting plot shows three curves, two mountain plots derived from
the \texttt{MFE} structure (red) and the pairing probabilities (black) and
a positional entropy curve (green). Well-defined regions are identified by low
entropy. By superimposing several mountain plots, structures can easily
be compared.
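
For a single structure given in dot-bracket notation, the mountain
representation is easy to compute directly. The Python sketch below is an
illustration only. It is not the algorithm of \texttt{mountain.pl}, which works on
the dot plot, and the function name \texttt{mountain\_heights} is hypothetical. For each
position it counts the number of base pairs that enclose it.
\begin{verbatim}
def mountain_heights(structure):
    """Number of enclosing base pairs for every position of a dot-bracket string."""
    heights = []
    level = 0
    for symbol in structure:
        if symbol == "(":
            heights.append(level)   # the opening base is not enclosed by its own pair
            level += 1
        elif symbol == ")":
            level -= 1
            heights.append(level)   # the closing base is not enclosed by its own pair
        else:
            heights.append(level)
    return heights
\end{verbatim}
Plotting the returned values against the position index with any xy-plotting
program yields the mountain curve of that structure.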

The Perl script \texttt{relplot.pl} adds reliability
information to an RNA secondary structure plot in the form of color
annotation. The script computes a well-definedness measure we call
``positional entropy''
$$S(i) = -\sum_j p_{ij}\log(p_{ij})$$
and encodes it as color hue, ranging from red
(low entropy, well-defined) via green to blue and violet (high
entropy, ill-defined). In the example above two helices of the 5S RNA are
well-defined (red) and indeed predicted correctly, whereas the left arm is
disordered and not predicted quite correctly.
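
To make the definition concrete, the following Python sketch evaluates the
positional entropy from a table of pair probabilities. It is an illustration
under stated assumptions and not the code of \texttt{relplot.pl}: we assume that the
sum runs over all pairing partners of a position, that the probability of the
position being unpaired enters as one additional term, and that the logarithm
is taken to base 2. The function name is hypothetical.
\begin{verbatim}
import math

def positional_entropy(n, pair_probabilities):
    """Entropy per position; pair_probabilities maps (i, j), 1-based, to p_ij."""
    partners = [[] for _ in range(n + 1)]
    for (i, j), p in pair_probabilities.items():
        partners[i].append(p)
        partners[j].append(p)
    entropy = [0.0] * (n + 1)
    for i in range(1, n + 1):
        # Assumption: the unpaired state is treated as one more outcome.
        unpaired = max(0.0, 1.0 - sum(partners[i]))
        for p in partners[i] + [unpaired]:
            if p > 0.0:
                entropy[i] -= p * math.log2(p)
    return entropy[1:]
\end{verbatim}
A position that is paired with a single partner with probability 1, or that is
always unpaired, obtains entropy 0, while a position with several competing
partners obtains a high value.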

For the figure above we had to rotate and mirror the structure plot, e.g.
\begin{verbatim}
$ rotate_ss.pl -a 180 -m 5S_rss.ps > 5S_rot.ps
\end{verbatim}%$

\subsubsection{Batch job processing}
In most cases, one doesn't only want to predict the structure and equilibrium
probabilities for a single RNA sequence but a set of sequences. \texttt{RNAfold}
is perfectly suited for this task since it provides several different mechanisms
to support batch job processing. First, in \textit{interactive} mode, it only
stops processing input from \textit{stdin} if it is requested to do so. This means
that after processing one sequence, it will prompt for the input of the next
sequence. Entering the \texttt{@} character will forcefully abort processing.
In situations where the input is provided through input stream redirection,
it will end processing as soon as the stream is closed.

In contrast, in \textit{batch processing mode}, where one simply specifies
input files as so-called unnamed command line parameters, the number of input
sequences is more or less unlimited. You can specify as many input files as
your terminal emulator allows, and each input file may consist of arbitrarily
many sequences. However, please note that mixing \texttt{FASTA} and non-\texttt{FASTA}
input is not allowed and will most likely produce bogus output.

Assume you have four input files \texttt{file\_0.fa}, \texttt{file\_1.fa},
\texttt{file\_2.fa}, and \texttt{file\_3.fa}. Each file contains a set of RNA
sequences in \texttt{FASTA} format. Predicting secondary structures for all
sequences in all files with a single call to \texttt{RNAfold} and redirecting
the output to a file \texttt{all\_sequences\_output.fold} can be achieved
like this:
\begin{verbatim}
  $ RNAfold file_0.fa file_1.fa file_2.fa file_3.fa > all_sequences_output.fold
\end{verbatim}

The above call to \texttt{RNAfold} will open each of the files and process the
sequences sequentially. This, however, might take a long time, and the sequential
processing will most likely bore your multi-core workstation or laptop computer,
since only a single core is used for the computations while the others are idle.
If you happen to have more than a single CPU core and want to take advantage of
the available parallel processing power, you can use the \texttt{-j} option of
\texttt{RNAfold} to split the input into concurrent jobs.
\begin{verbatim}
  $ RNAfold -j file_*.fa > all_sequences_output.fold
\end{verbatim}
This command will use as many CPU cores as are available and, therefore, process
your input much faster. If you want to limit the number of concurrent jobs to
a particular number, say $2$, to leave the remaining cores available for other
tasks, you can append the number of jobs directly to the \texttt{-j} option:
\begin{verbatim}
  $ RNAfold -j2 file_*.fa > all_sequences_output.fold
\end{verbatim}
Note that there must not be any space between the \texttt{j} and the number
of jobs.

Now imagine what happens if you have a larger set of
sequences that are not stored in \texttt{FASTA} format. If you fed such
an input to \texttt{RNAfold}, it would happily process each of the sequences
but always overwrite the structure layout and dot plot files, since the default
names for these files are \texttt{rna.ps} and \texttt{dot.ps} for any sequence.
This is usually undesired behavior, and it is where the \texttt{--auto-id}
option of \texttt{RNAfold} comes in handy. This option flag forces \texttt{RNAfold} to automatically
create a sequence identifier for each input, thus using different file names for
each single output. The identifier that is created follows the form
\begin{verbatim}
  sequence_XXXX
\end{verbatim}
where \texttt{sequence} is a prefix, followed by the delimiting character \texttt{\_},
and an increasing 4-digit number \texttt{XXXX} starting at 0000. This feature is
even useful if the input is in \texttt{FASTA} format, but one wants to enforce
a novel naming scheme for the sequences. As soon as the \texttt{--auto-id} option
is set, \texttt{RNAfold} will ignore any id taken from existing \texttt{FASTA}
headers in the input files.

See also the man page of \texttt{RNAfold} to find out how to modify the prefix,
delimiting character, start number and number of digits.

\begin{enumerate}
\item Create an input file with many RNA sequences, each on a separate line, e.g.
\begin{verbatim}
  $ randseq -n 127 > many_files.seq
\end{verbatim}
\item Compute the MFE structure for each of the sequences and generate output
ids with numbers between $100$ and $226$ and prefix \texttt{test\_seq}
\begin{verbatim}
  $ RNAfold --auto-id --id-start=100 --id-prefix="test_seq" many_files.seq
\end{verbatim}
\end{enumerate}
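
The naming scheme can be mimicked in a few lines of Python, which helps to
predict the output file names of a batch run. This is a sketch, not code taken
from \texttt{RNAfold}: the function names are hypothetical, the defaults follow the
description above, and the \texttt{\_ss.ps} and \texttt{\_dp.ps} suffixes are assumed to be
appended to the id in the same way as for \texttt{5S\_ss.ps} and \texttt{5S\_dp.ps}.
\begin{verbatim}
def auto_ids(count, prefix="sequence", delimiter="_", start=0, digits=4):
    """Identifiers of the form prefix + delimiter + zero-padded running number."""
    return [f"{prefix}{delimiter}{number:0{digits}d}"
            for number in range(start, start + count)]

def output_files(identifier):
    """Assumed names of the structure layout and dot plot files for one id."""
    return identifier + "_ss.ps", identifier + "_dp.ps"
\end{verbatim}
With \texttt{count=127}, \texttt{prefix="test\_seq"} and \texttt{start=100} the list runs from
\texttt{test\_seq\_0100} to \texttt{test\_seq\_0226}, matching the exercise above.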

\subsubsection{Add constraints to the structure prediction}
For some scientific questions one requires additional constraints that must be
enforced when predicting secondary structures. For instance, one might have resolved
parts of the structure already and is simply interested in the optimal conformation
of the remaining part of the molecule. Another example would be that one already
knows that particular nucleotides cannot participate in any base pair, since they
are physically hindered from doing so. These types of constraints are termed \textit{hard}
constraints. They can enforce or prohibit particular conformations, and thus include
or omit structures with these features from the candidate ensemble.

Another type of constraint is the so-called \textit{soft} constraint, which enables one
to adjust the free energy contributions of particular conformations. For instance,
one could add a bonus energy if a particular (stretch of) nucleotides is left unpaired
to emulate the binding free energy of a single strand binding protein. The same can
be applied to base pairs, for instance one could add a penalizing energy term if a
particular base pair is formed to make it less likely.

The \texttt{RNAfold} program comes with comprehensive support for hard and soft
constraints and provides several convenience command line parameters to ease constraint
application.

The simplest hard constraint that can be applied is the maximum base pair span, i.e.
the maximum number of nucleotides a particular base pair may span. This constraint can
be applied with the \texttt{--maxBPspan} option followed by an integer number.
\begin{enumerate}
  \item Compute the secondary structure for the \texttt{5S.seq} input file
  \item Now limit the maximum base pair span to $50$ and compare both results
\begin{verbatim}
  $ RNAfold --maxBPspan 50 5S.seq
\end{verbatim}
\end{enumerate}

Now assume you already know parts of the structure and want to \textit{fill-in}
an optimal remaining part. You can do that by using the \texttt{-C} option
and adding an additional line in dot-bracket notation to the input (after the sequence)
that corresponds to the known structure:
\begin{enumerate}
\item Prepare the input file \texttt{hard\_const\_example.fa}
\begin{verbatim}
  >my_constrained_sequence
  GCCCUUGUCGAGAGGAACUCGAGACACCCACUACCCACUGAGGACUUUCG
  ..((((.....))))
\end{verbatim}
  Note that we left out the remainder of the input structure constraint, which will
  eventually be used to enforce a helix of 4 base pairs at the beginning of the sequence.
  You may also fill the remainder of the constraint with dots to silence any warnings issued
  by \texttt{RNAfold}.
\item Compute the MFE structure for the input
\begin{verbatim}
  $ RNAfold hard_const_example.fa
  >my_constrained_sequence
  GCCCUUGUCGAGAGGAACUCGAGACACCCACUACCCACUGAGGACUUUCG
  ........((((((...((((.................))))..)))))) ( -8.00)
\end{verbatim}
\item Now compute the MFE structure under the provided constraint
\begin{verbatim}
  $ RNAfold -C hard_const_example.fa
  >my_constrained_sequence
  GCCCUUGUCGAGAGGAACUCGAGACACCCACUACCCACUGAGGACUUUCG
  ..((((.....))))....(((((..((((........)).))..))))) ( -7.90)
\end{verbatim}
\item For historical reasons, the \texttt{-C} option alone only forbids any base pairs
that are incompatible with the constraint, rather than enforcing the constraint. Thus,
if you compute equilibrium probabilities, structures that are missing the small helix in
the beginning are still part of the ensemble. If you want to compute the pairing probabilities
upon forcing the small helix at the beginning, you can add the \texttt{--enforceConstraint} option:
\begin{verbatim}
  $ RNAfold -p -C --enforceConstraint hard_const_example.fa
  >my_constrained_sequence
  GCCCUUGUCGAGAGGAACUCGAGACACCCACUACCCACUGAGGACUUUCG
  ..((((.....))))....(((((..((((........)).))..))))) ( -7.90)
\end{verbatim}
  Have a look at the differences in ensemble free energy and base pair probabilities between
  the results obtained with and without the \texttt{--enforceConstraint} option.
\end{enumerate}

A more thorough alternative for providing constraints is to use the \texttt{--commands} option
and a corresponding \textit{commands file}. This allows one to specify constraints at the nucleotide
or base pair level and even to restrict a constraint to particular loop types. A commands file
is a simple multi-column text file with one constraint on each line. A line starts with a one- or
two-letter command, followed by multiple values that specify the addressed nucleotides, the loop
context restriction, and, for soft constraints, the strength of the constraint in kcal/mol.
The syntax is as follows:

{\footnotesize
\begin{verbatim}
F i 0 k   [TYPE] [ORIENTATION] # Force nucleotides i...i+k-1 to be paired
F i j k   [TYPE] # Force helix of size k starting with (i,j) to be formed
P i 0 k   [TYPE] # Prohibit nucleotides i...i+k-1 to be paired
P i j k   [TYPE] # Prohibit pairs (i,j),...,(i+k-1,j-k+1)
P i-j k-l [TYPE] # Prohibit pairing between two ranges
C i 0 k   [TYPE] # Nucleotides i,...,i+k-1 must appear in context TYPE
C i j k          # Remove pairs conflicting with (i,j),...,(i+k-1,j-k+1)
E i 0 k e        # Add pseudo-energy e to nucleotides i...i+k-1
E i j k e        # Add pseudo-energy e to pairs (i,j),...,(i+k-1,j-k+1)
\end{verbatim}
}
with
{\footnotesize
\begin{verbatim}
[TYPE]        = { E, H, I, i, M, m, A }
[ORIENTATION] = { U, D }
\end{verbatim}
}

\begin{enumerate}
\item Prepare a commands file \texttt{test.constraints} that forces the first 5 nucleotides to pair and the
following 3 nucleotides to stay unpaired as part of a multi-branch loop:
\begin{verbatim}
F 1 0 5
C 6 0 3 M
\end{verbatim}
\item Use the \texttt{randseq} program to generate multiple sequences and compute the MFE structure
for each under the constraints prepared earlier.
\begin{verbatim}
  $ randseq -n 20 | RNAfold --commands test.constraints
\end{verbatim}
Inspect the output to convince yourself that the commands have been applied.
\end{enumerate}

A couple of much more sophisticated constraints will be discussed below.

\subsubsection{SHAPE directed RNA folding}

In order to further improve the quality of secondary structure predictions, mapping experiments like
SHAPE (selective 2'-hydroxyl acylation analyzed by primer extension) can be used to experimentally determine
the pairing status for each nucleotide.
In addition to thermodynamics-based secondary structure predictions, \texttt{RNAfold} supports the incorporation of this additional
experimental data as soft constraints.

If you want to use SHAPE data to guide the folding process, please make sure that your experimental data is present in a text file,
where each line stores three whitespace-separated columns containing the position, the nucleotide abbreviation, and the normalized SHAPE reactivity for
a certain nucleotide.

\begin{verbatim}
     1 G 0.134
     2 C 0.044
     3 C 0.057
     4 G 0.114
     5 U 0.094
        ...
        ...
        ...
     71 C 0.035
     72 G 0.909
     73 C 0.224
     74 C 0.529
     75 A 1.475
\end{verbatim}%$

The second column, which holds the nucleotide abbreviation, is optional.
If it is present, the data will be used to perform a cross check against the provided input sequence.
Missing SHAPE reactivities for certain positions can be indicated by omitting the reactivity column or the whole line.
Negative reactivities will be treated as missing.
Once the SHAPE file is ready, it can be used to constrain folding:

\begin{verbatim}
$ RNAfold --shape=rna.shape --shapeMethod=D < rna.seq
\end{verbatim}%$

A small compilation of reference data taken from Hajdin et al. 2013 is available online at
\url{https://weeks.chem.unc.edu/data-files/ShapeKnots_DATA.zip}. However, the included
reference structures are only available in connect (.ct) format and require conversion into
dot-bracket notation to compare them against structures predicted with \texttt{RNAfold}.
Furthermore, the normalized \texttt{SHAPE} data is available as an Excel spreadsheet and
also requires some pre-processing to make it available for \texttt{RNAfold}.
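
The conversion from connect format to dot-bracket notation can be sketched in
Python as follows. This is an illustration under stated assumptions, not a tool
shipped with the \texttt{ViennaRNA Package}: the function name is hypothetical, and
we assume the common .ct layout with a header line that starts with the
sequence length, followed by one line per nucleotide holding the index, the
base, two neighbor indices, the pairing partner (0 if unpaired), and a
numbering column. Pseudoknots cannot be written with a single bracket type, so
the sketch stops with an error when it meets crossing base pairs.
\begin{verbatim}
def ct_to_dot_bracket(lines):
    """Convert the lines of a connect (.ct) file to (sequence, dot-bracket)."""
    rows = [line.split() for line in lines if line.strip()]
    n = int(rows[0][0])                    # header: length, then a title
    sequence, partner = [], {}
    for fields in rows[1:n + 1]:
        i = int(fields[0])
        sequence.append(fields[1])
        partner[i] = int(fields[4])        # 0 means unpaired
    structure, open_ends = [], []
    for i in range(1, n + 1):
        j = partner[i]
        if j == 0:
            structure.append(".")
        elif j > i:
            structure.append("(")
            open_ends.append(j)            # position where this pair must close
        else:
            if not open_ends or open_ends.pop() != i:
                raise ValueError("crossing base pairs (pseudoknot) found")
            structure.append(")")
    return "".join(sequence), "".join(structure)
\end{verbatim}
For reference structures that do contain pseudoknots, the crossing pairs have
to be removed or written with a second bracket type before the result can be
compared to an \texttt{RNAfold} prediction.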


\subsubsection{Adding ligand interactions}
RNA molecules are known to interact with other molecules, such as additional RNAs, proteins,
or other small ligand molecules. Some interactions with small ligands that take place in
loops of an RNA structure can be modeled in terms of soft constraints. However, to stay
compatible with the recursive decomposition scheme for secondary structures they are
limited to the unpaired nucleotides of hairpins and internal loops.

The \texttt{RNAlib} library of the \texttt{ViennaRNA Package} implements constraints in a very general
form. However, the available programs do not provide full
access to the implemented features. Nevertheless, \texttt{RNAfold} provides a convenience
option that makes it easy to include ligand binding to hairpin- or interior-loop-like aptamer
motifs. For that purpose, a user only needs to provide a motif and a binding free energy.

Consider the following example file \texttt{theo.fa} for a theophylline triggered
riboswitch with the sequence
\begin{verbatim}
  >theo-switch
  GGUGAUACCAGAUUUCGCGAAAAAUCCCUUGGCAGCACCUCGCACAUCUUGUUGUC
  UGAUUAUUGAUUUUUCGCGAAACCAUUUGAUCAUAUGACAAGAUUGAG
\end{verbatim}

The theophylline aptamer structure has been actively researched during the last two decades.
\begin{center}
\includegraphics[width=.75\textwidth]{Figures/theo_aptamer.eps}\\
\end{center}
Although the actual aptamer part (marked in blue) is not a simple interior loop, it
can still be modeled as such. It consists of two delimiting base pairs (G,C) at the
5' site, and another (G,C) at its 3' end. That is already enough to satisfy the requirements
for the \texttt{--motif} option of \texttt{RNAfold}. Together with the aptamer sequence
motif, the entire aptamer can be written down in dot-bracket form as

\begin{verbatim}
GAUACCAG&CCCUUGGCAGC
(...((((&)...)))...)
\end{verbatim}

Note that we separated the 5' and 3' part from each other using the \texttt{\&}
character. This enables us to omit the variable hairpin end of the aptamer from the
specification in our model.

The only ingredient that is still missing is the actual stabilizing energy contribution
induced by the ligand binding into the aptamer pocket. Several experimental and computational
studies have already determined dissociation constants for this system. Jenison et al. 1994,
for instance, determined a dissociation constant of $K_d = 0.32~\mu\mathrm{M}$ which, for the standard
reference concentration $c = 1~\mathrm{mol/L}$, can be translated into a binding free energy

$$\Delta G = RT \cdot \ln \frac{K_d}{c} \approx -9.22~kcal/mol$$
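
For reference, the arithmetic behind this value can be retraced as follows. We
assume $T = 37\,^\circ$C ($310.15$~K) and a gas constant of
$R \approx 1.987 \cdot 10^{-3}$~kcal/(mol~K); neither value is stated explicitly above.
$$RT \approx 1.987 \cdot 10^{-3} \cdot 310.15 \approx 0.6163~\mathrm{kcal/mol}$$
$$\ln \frac{K_d}{c} = \ln\left(0.32 \cdot 10^{-6}\right) \approx -14.955$$
$$\Delta G \approx 0.6163 \cdot (-14.955) \approx -9.22~\mathrm{kcal/mol}$$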

Finally, we can compute the MFE structure for our example sequence

\begin{verbatim}
  $ RNAfold -v --motif "GAUACCAG&CCCUUGGCAGC,(...((((&)...)))...),-9.22" theo.fa
\end{verbatim}

Compare the predicted MFE structure with and without modeling the ligand interaction.
You may also enable partition function computation to compute base pair probabilities,
the centroid structure and MEA structure to investigate the effect of ligand binding
on ensemble diversity.

\subsubsection{G-quadruplexes}
G-Quadruplexes are a common conformation found in G-rich sequences where four runs of
consecutive G's are separated by three short sequence stretches.

\begin{center}
\includegraphics[width=.5\textwidth]{Figures/gquad_pattern.eps}\\
\end{center}

They form local
self-enclosed stacks of G-quartets bound together through 8 Hoogsteen-Watson Crick bonds
and further stabilized by a metal ion (usually potassium).

\begin{center}
\includegraphics[width=.85\textwidth]{Figures/gquad.eps}\\
\end{center}

To
acknowledge the competition of regular secondary structure and G-quadruplex formation,
the \texttt{ViennaRNA Package} implements an extension to the default recursion scheme.
For that purpose, G-quadruplexes are simply considered a different type of substructure
that may be incorporated like any other substructure. The free energy of a particular
G-quadruplex at temperature $T$ is determined by a simple energy model

$$E(L, l_{tot}, T) = a(T) \cdot (L - 1) + b(T) \cdot \ln(l_{tot} - 2)$$

that only considers the number of stacked layers $L$ and the total size of the three
linker sequences $l_{tot} = l_1 + l_2 + l_3$ connecting the G runs. Linker sequence
and asymmetry effects as well as relative strand orientations (parallel, anti-parallel
or mixed) are entirely neglected in this model. The free energy parameters
$$a(T) = H_a + TS_a$$ and
$$b(T) = H_b + TS_b$$ have been determined from experimental UV-melting data taken
from Zhang et al. 2011, Biochemistry.

\texttt{RNAfold} allows one to activate the G-quadruplex implementation by simply
providing the \texttt{-g} switch. G-quadruplexes are then taken into account for
MFE and equilibrium probability computations.

\begin{verbatim}
  $ echo "GGCUGGUGAUUGGAAGGGAGGGAGGUGGCCAGCC" | RNAfold -g -p
  GGCUGGUGAUUGGAAGGGAGGGAGGUGGCCAGCC
  ((((((..........++.++..++.++)))))) (-21.39)
  ((((((..........(..........))))))) [-21.83]
  ((((((..........++.++..++.++)))))) {-21.39 d=0.04}
   frequency of mfe structure in ensemble 0.491118; ensemble diversity 0.08
\end{verbatim}
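
In the dot-bracket output above, the nucleotides that take part in the
G-quadruplex are marked with \texttt{+}. The following Python sketch reads the number
of layers $L$ and the three linker lengths from such a string and evaluates the
energy model given earlier. It is an illustration under stated assumptions: the
function names are hypothetical, the structure is assumed to contain exactly
one G-quadruplex, and the parameters $a$ and $b$ are not listed in this
tutorial, so they have to be supplied by the caller.
\begin{verbatim}
import math
import re

def gquad_geometry(structure):
    """Return (L, [l1, l2, l3]) for the single G-quadruplex marked by '+'."""
    first, last = structure.find("+"), structure.rfind("+")
    if first < 0:
        raise ValueError("no G-quadruplex annotation found")
    region = structure[first:last + 1]
    runs = re.findall(r"\++", region)        # the four G runs
    linkers = re.findall(r"[^+]+", region)   # the three linkers between them
    if len(runs) != 4 or len(set(len(run) for run in runs)) != 1:
        raise ValueError("expected four G runs of equal length")
    return len(runs[0]), [len(linker) for linker in linkers]

def gquad_energy(layers, linkers, a, b):
    """E = a * (L - 1) + b * ln(l_tot - 2), with a and b given by the caller."""
    return a * (layers - 1) + b * math.log(sum(linkers) - 2)
\end{verbatim}
For the MFE structure shown above, the sketch yields $L = 2$ and linker lengths
of 1, 2 and 1, i.e. $l_{tot} = 4$.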

The resulting structure layout and dot plot \texttt{PostScript} files depict the
predicted G-quadruplexes as hairpin-like loops with additional bonds between the
interacting G's, and green triangles where the color intensity encodes the G-quadruplex
probability, respectively. Have a closer look at the actual G-quadruplex probabilities
by opening the dot plot \texttt{PostScript} file with a text editor again.

\begin{center}
\includegraphics[width=.75\textwidth]{Figures/gquad_menon.eps}\\
\end{center}

A better drawing of the predicted G-quadruplex might look as follows

\begin{center}
\includegraphics[width=.5\textwidth]{Figures/gquad_menon_nice.eps}\\
\end{center}

Repeat the above analysis for other RNA sequences that might contain and form
a G-quadruplex, e.g. the human telomerase RNA component hTERC
\begin{verbatim}
  >hTERC
  AGAGAGUGACUCUCACGAGAGCCGCGAGAGUCAGCUUGGCCAAUCCGUGCGGUCGG
  CGGCCGCUCCCUUUAUAAGCCGACUCGCCCGGCAGCGCACCGGGUUGCGGAGGGUG
  GGCCUGGGAGGGGUGGUGGCCAUUUUUUGUCUAACCCUAACUGAGAAGGGCGUAGG
  CGCCGUGCUUUUGCUCCCCGCGCGCUGUUUUUCUCGCUGACUUUCAGCGGGCGGAA
  AAGCCUCGGCCUGCCGCCUUCCACCGUUCAUUCUAGAGCAAACAAAAAAUGUCAGC
  UGCUGGCCCGUUCGCCCCUCCCGGGGACCUGCGGCGGGUCGCCUGCCCAGCCCCCG
  AACCCCGCCUGGAGGCCGCGGUCGGCCCGGGGCUUCUCCGGAGGCACCCACUGCCA
  CCGCGAAGAGUUGGGCUCUGUCAGCCGCGGGUCUCUCGGGGGCGAGGGCGAGGUUC
  AGGCCUUUCAGGCCGCAGGAAGAGGAACGGAGCGAGUCCCCGCGCGCGGCGCGAUU
  CCCUGAGCUGUGGGACGUGCACCCAGGACUCGGCUCACACAUGC
\end{verbatim}

\subsubsection{Single strand binding (SSB) protein interaction}
Similar to the ligand interactions discussed above, a single strand binding
(SSB) protein might bind to consecutively unpaired sequence motifs. To model
such interactions the \texttt{ViennaRNA Package} implements yet another
extension to the folding grammar, termed \textit{unstructured domains}, that covers
all loop contexts a protein may bind to. This is in contrast to the ligand binding
example above, which uses the soft constraints implementation and is, therefore,
restricted to the unpaired nucleotides of hairpin and interior loops.

To make use of this implementation in \texttt{RNAfold} one has to resort
to \textit{commands files} again. Here, an unstructured domain (UD) can be easily
added using the following syntax
\begin{verbatim}
UD m e [LOOP]
\end{verbatim}
where \texttt{m} is the sequence motif the protein binds to in IUPAC format,
\texttt{e} is the binding free energy in kcal/mol, and the optional \texttt{LOOP}
specifier allows for restricting the binding to particular loop types, e.g.
\texttt{M} for multibranch loops, or \texttt{E} for the exterior loop. See the
syntax for commands files above for an overview of all loop types available.

As an example, consider the protein binding experiment taken from Forties and Bundschuh 2010,
Bioinformatics (\url{https://dx.doi.org/10.1093/bioinformatics/btp627}). Here, the authors
investigate a hypothetical unspecific RNA binding protein with a footprint of $6$~nt and
a binding energy of $\Delta G = -10$~kcal/mol at $1$~M. With $T = 37^\circ$C and
$$\Delta G = RT \cdot \ln \frac{K_d}{c}$$
this translates into a dissociation constant of
$$K_d = \exp(\Delta G / RT) = 8.983267433 \cdot 10^{-8}~\mathrm{M}.$$
Evaluating the same formula with $c$ set to the respective protein concentration, the binding
energies at $50$~nM, $100$~nM, $400$~nM, and $1~\mu$M are $0.36$~kcal/mol,
$-0.07$~kcal/mol, $-0.92$~kcal/mol, and $-1.49$~kcal/mol, respectively.
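
The numbers above can be retraced with a few lines of Python. This is a sketch
under stated assumptions: the gas constant $R = 1.98717 \cdot 10^{-3}$~kcal/(mol~K)
is our choice and is not given in the text, the function names are
hypothetical, and the file names other than \texttt{forties\_50nM.txt} are made up
for illustration.
\begin{verbatim}
import math

R = 1.98717e-3        # gas constant in kcal/(mol K); assumed value
T = 273.15 + 37.0     # 37 degrees Celsius in Kelvin

def dissociation_constant(delta_g, c_ref=1.0):
    """K_d in mol/L from the binding free energy at concentration c_ref."""
    return c_ref * math.exp(delta_g / (R * T))

def binding_energy(k_d, concentration):
    """Binding free energy in kcal/mol at the given concentration (mol/L)."""
    return R * T * math.log(k_d / concentration)

k_d = dissociation_constant(-10.0)
concentrations = [("50nM", 50e-9), ("100nM", 100e-9),
                  ("400nM", 400e-9), ("1uM", 1e-6)]
for label, concentration in concentrations:
    energy = binding_energy(k_d, concentration)
    # One line per commands file: hypothetical file name, then its content.
    print(f"forties_{label}.txt: UD NNNNNN {energy:.2f}")
\end{verbatim}
Each printed line names one file and the single \texttt{UD} command it would contain
for the corresponding protein concentration.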

The RNA sequence file \texttt{forties\_bundschuh.fa} for this experiment is
\begin{verbatim}
>forties_bundschuh
CGCUAUAAACCCCAAAAAAAAAAAAGGGGAAAAUAGCG
\end{verbatim}
which yields the following MFE structure
\begin{center}
\includegraphics[width=.5\textwidth]{Figures/forties_ss.eps}\\
\end{center}

To model the protein binding for this example with \texttt{RNAfold} we require
a commands file for each of the concentrations in question. Thus, one simply creates
text files with a single line content
\begin{verbatim}
UD NNNNNN e
\end{verbatim}
where \texttt{e} is the binding free energy at this specific protein concentration
as computed above. Note that we use \texttt{NNNNNN} as sequence motif that is
bound by the protein to acknowledge the unspecific interaction between protein and
RNA. Finally, \texttt{RNAfold} is executed to compute equilibrium base pairing and
per-nucleotide protein binding probabilities
\begin{verbatim}
  $ RNAfold -p --commands forties_50nM.txt forties_bundschuh.fa
\end{verbatim}
and the produced probability dot plot can be inspected.
\begin{center}
\includegraphics[width=.5\textwidth]{Figures/forties_50nM_dp.eps}\\
\end{center}
As you can see, the dot plot is augmented with an additional linear array of blue squares
along each side that depicts the probability that the respective nucleotide is bound
by the protein. Now, repeat the computations for different protein concentrations and
compare the probabilities computed with the unstructured domain feature of the
\texttt{ViennaRNA Package} with those in Fig. 3(a) of the publication.

Note that \texttt{RNAfold} allows for an unlimited number of different proteins
specified in the commands file. This easily allows one to model RNA-protein binding
interaction within a relatively complex solution of different competing proteins.

\subsubsection{Change other model settings}
\texttt{RNAfold} also allows for many other changes of the implemented Nearest Neighbor
model. For instance, you can explicitly prohibit $(G,U)$ pairs, change the temperature
that is used for evaluation of the free energy of particular loops, select a different
dangling-end energy model or load a different set of free energy parameters, e.g. for
DNA or parameters derived from computational optimizations.

See the man pages of \texttt{RNAfold} for a complete overview of all available options
and command line switches. Additional energy parameter collections are distributed together
with the \texttt{ViennaRNA Package} as part of the contents of the \texttt{misc/} directory,
and are typically installed in
\begin{verbatim}
  prefix/share/ViennaRNA
\end{verbatim}
where \texttt{prefix} is the path that was used as installation prefix, e.g.
\texttt{\$HOME/Tutorial/Progs/VRP} (used in this tutorial) or \texttt{/usr} when installed
globally using a package manager.


\subsection{The Program \texttt{RNAplot}}
You can manually add additional annotation to structure drawings using the \texttt{RNAplot}
program (for information see its \texttt{man} page). Here's a somewhat complicated example:

\begin{verbatim}
  $ RNAfold 5S.seq > 5S.fold
  $ RNAplot --pre "76 107 82 102 GREEN BFmark 44 49 0.8 0.8 0.8 Fomark \
    1 15 8 RED omark 80 cmark 80 -0.23 -1.2 (pos80) Label 90 95 BLUE Fomark" < 5S.fold
  $ gv 5S_ss.ps
\end{verbatim}%$
\begin{center}
\includegraphics[width=.75\textwidth]{Figures/5S_ss.eps}\\
\end{center}

\texttt{RNAplot} is a very useful tool to color structure layout plots. The \texttt{--pre} tag adds
PostScript code required to color distinct regions of your molecule. There are some predefined
statements with different options for annotations listed below:

\begin{tabular}{ll}
  \texttt{i cmark} & draw a circle around base i\\
  \texttt{i j c gmark} & draw base pair i,j with c counter examples in grey\\
  \texttt{i j lw rgb omark} & stroke segment i...j with line width lw and color (rgb)\\
  \texttt{i j rgb Fomark} & fill segment i...j with color (rgb)\\
  \texttt{i j k l rgb BFmark} & fill block between pairs i,j and k,l with color (rgb)\\
  \texttt{i dx dy (text) Label} & add a text label with offsets dx and dy relative to base i\\
\end{tabular}

Predefined color options are \texttt{BLACK, RED, GREEN, BLUE, WHITE}, but you can also
replace the color name with an RGB triple of values between 0 and 1 (e.g. \texttt{0.8 0.8 0.8} for light grey, as in the example above).\\

To simply add the annotation macros to the \texttt{PostScript} file without
any actual annotation you can use the following program call
\begin{verbatim}
  $ RNAplot --pre "" < 5S.fold
\end{verbatim}

If you now open the structure layout file \texttt{5S\_ss.ps} with a text editor
you'll see the additional macros for \texttt{cmark}, \texttt{omark}, etc.
along with a short synopsis on how to use them. Actual annotations can then be added
between the lines
\begin{verbatim}
% Start Annotations
\end{verbatim}
and
\begin{verbatim}
% End Annotations
\end{verbatim}
Here, you simply need to add the same string of commands you would provide through the
\texttt{--pre} option of \texttt{RNAplot}.


To see what exactly the alternative structures of our sequence are, we need
to predict \emph{suboptimal} structures.


\pagebreak[3]
\subsection{The Program \texttt{RNApvmin}}

The program \texttt{RNApvmin} reads an RNA sequence from \textit{stdin} and uses an iterative minimization
process to calculate a perturbation vector that minimizes the discrepancies
between predicted pairing probabilities and observed pairing probabilities
(deduced from given SHAPE reactivities).
The experimental SHAPE data has to be present in the file format described above.
The application will write the calculated vector of perturbation energies to \textit{stdout},
while the progress of the minimization process is written to \textit{stderr}.
The resulting perturbation vector can be interpreted directly and gives useful insights into the
discrepancies between thermodynamic prediction and experimentally determined pairing status.
In addition, the perturbation energies can be used to constrain folding with \texttt{RNAfold}:

\begin{verbatim}
$ RNApvmin rna.shape < rna.seq >vector.csv
$ RNAfold --shape=vector.csv --shapeMethod=W < rna.seq
\end{verbatim}%$

The perturbation vector file uses the same file format as the SHAPE data file.
Instead of SHAPE reactivities, the raw perturbation energies will be stored in the last column.
Since the energy model is only adjusted when necessary, the calculated perturbation energies may be used
for the interpretation of the secondary structure prediction, since they indicate
which positions require major energy model adjustments in order to yield a prediction
result close to the experimental data. High perturbation energies for just
a few nucleotides may indicate the occurrence of features, which are not explicitly
handled by the energy model, such as posttranscriptional modifications and
intermolecular interactions.

%===
\pagebreak[3]
\subsection{The Program \texttt{RNAsubopt}}
\texttt{RNAsubopt} calculates all suboptimal secondary structures within a
given energy range above the \texttt{MFE} structure. Be careful, the number
of structures returned grows exponentially with both sequence length and
energy range.

%\begin{frame}[fragile]
\frametitle{Suboptimal folding}
\begin{itemize}
\item Generate all suboptimal structures within a certain energy
range from the \texttt{MFE} specified by the \texttt{-e} option.
\begin{verbatim}
$ RNAsubopt -e 1 -s < test.seq
CUACGGCGCGGCGCCCUUGGCGA   -500    100
...........((((...)))).  -5.00
....((((...))))........  -4.80
(((.((((...))))..)))...  -4.20
...((.((.((...)).)).)).  -4.10
\end{verbatim}%$
%\end{frame}
\end{itemize}
\noindent
The text output shows an energy sorted list (option \texttt{-s}) of all
secondary structures within 1~kcal/mol of the \texttt{MFE}
structure. Our sequence actually has a ground state structure ($-5.00$~kcal/mol) and three
further structures within a 1~kcal/mol range.
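
To get a feeling for how close these alternatives are, one can compare
Boltzmann weights. The following is a back-of-the-envelope estimate that
assumes $T = 37\,^\circ$C, i.e. $RT \approx 0.616$~kcal/mol, a value that is not
stated in this section. Relative to the MFE structure with energy $E_{mfe}$, a
structure with free energy $E$ is populated with the ratio
$$\frac{p(E)}{p(E_{mfe})} = \exp\left(-\frac{E - E_{mfe}}{RT}\right).$$
For the three suboptimal structures in the listing above, which lie $0.2$, $0.8$
and $0.9$~kcal/mol above the first entry, this gives approximately
$$e^{-0.2/0.616} \approx 0.72, \qquad e^{-0.8/0.616} \approx 0.27, \qquad e^{-0.9/0.616} \approx 0.23.$$
The second-best structure is thus almost as populated as the MFE structure
itself.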

\texttt{MFE} folding alone gives no
indication that there are actually a number of plausible structures.
\texttt{RNAsubopt} does not plot structures itself; use the tool
\texttt{RNAplot} for that. Note that you CANNOT simply pipe the
output of \texttt{RNAsubopt} to \texttt{RNAplot} using
\begin{verbatim}
$ RNAsubopt < test.seq | RNAplot
\end{verbatim}
You need to manually create a file for each structure you want to plot. Here, for example, we created a new file named \texttt{suboptstructure.txt}:
\begin{verbatim}
> suboptstructure-4.20
CUACGGCGCGGCGCCCUUGGCGA
(((.((((...))))..)))...
\end{verbatim}
The FASTA header is optional, but useful (without it the output file will be named \texttt{rna.ps}).
The next two lines contain the sequence and the suboptimal structure you want to plot;
in this case we chose the structure with a folding energy of $-4.20$~kcal/mol.
Then plot it with
\begin{verbatim}
$ RNAplot < suboptstructure.txt
\end{verbatim}
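
\noindent
Creating these files by hand gets tedious for more than a few structures. The following
Python sketch builds one such record per structure from the \texttt{RNAsubopt} output
shown above. The function names and the \texttt{subopt1}, \texttt{subopt2}, \dots{} naming
scheme are hypothetical choices made here; only the three-line record layout is taken
from the example above.
\begin{verbatim}
def subopt_to_records(text):
    """Turn RNAsubopt text output into (filename, file content) pairs."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    sequence = rows[0][0]   # first row: the sequence, followed by energy info
    records = []
    for index, row in enumerate(rows[1:], start=1):
        name = "subopt%d" % index            # hypothetical naming scheme
        content = ">%s\n%s\n%s\n" % (name, sequence, row[0])
        records.append((name + ".txt", content))
    return records

def write_records(records):
    """Write every record to its own file in the current directory."""
    for filename, content in records:
        with open(filename, "w") as handle:
            handle.write(content)

output = """\
CUACGGCGCGGCGCCCUUGGCGA   -500    100
...........((((...)))).  -5.00
....((((...))))........  -4.80
(((.((((...))))..)))...  -4.20
...((.((.((...)).)).)).  -4.10
"""
for filename, content in subopt_to_records(output):
    print(filename)
\end{verbatim}
Calling \texttt{write\_records(subopt\_to\_records(output))} would then create the files,
each of which can be passed to \texttt{RNAplot} as shown above.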

Note that the number of suboptimal structures grows exponentially with
sequence length; this approach is therefore only tractable for
sequences with fewer than 100 nt. To keep the number of suboptimal
structures manageable, the option \texttt{--noLP} can be used, forcing
\texttt{RNAsubopt} to produce only structures without isolated base
pairs. While \texttt{RNAsubopt} produces \emph{all} structures within an
energy range, \texttt{mfold} produces only a few, hopefully representative,
structures. Try folding the sequence on the mfold
server at \\
\url{http://mfold.rna.albany.edu/?q=mfold}.\\

Sometimes you want to get information about unusual properties of the
Boltzmann ensemble (the set of all possible structures of an RNA, each weighted
by its Boltzmann factor) for which no specialized program exists. For example,
you may want to know the fraction of structures in the Boltzmann ensemble of a
bacterial mRNA in which the Shine-Dalgarno (SD) sequence is unpaired. If the SD
sequence is concealed by secondary structure, the translation efficiency is reduced.

In such cases you can resort to drawing a representative sample of
structures from the Boltzmann ensemble by using the option
\texttt{-p} of \texttt{RNAsubopt}. Now you can simply count how many structures
in the sample possess the feature you are looking for. This number divided by the
size of your sample gives you the desired fraction.\\

\noindent
The following example calculates the fraction of structures in the
ensemble that have bases 6 to 8 unpaired.

%===
%\begin{frame}[fragile]
\frametitle{Sampling the Boltzmann Ensemble}
\begin{enumerate}
\item Draw a sample of size 10,000 from the Boltzmann ensemble
\item Calculate the desired property with a perl script (the first line of
the sample file holds the sequence, so it is excluded from the count)
\end{enumerate}
\begin{verbatim}
$ RNAsubopt -p 10000 < test.seq > tt
$ perl -nle '$h++ if substr($_,5,3) eq "...";
      END {print $h/($.-1)}' tt
      0.392
\end{verbatim}
%\end{frame}
%$
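
\noindent
The same count can be written in Python, which may be easier to adapt to other
features. This is a sketch, not part of the ViennaRNA Package: the function name
\texttt{fraction\_unpaired} is made up here, and the four-structure sample is only
a toy input to show the counting.
\begin{verbatim}
def fraction_unpaired(text, first, last):
    """Fraction of structures with positions first..last (1-based) unpaired."""
    structures = [line.split()[0] for line in text.splitlines()
                  if line.strip() and set(line.split()[0]) <= set("().")]
    hits = sum(1 for s in structures if set(s[first - 1:last]) == {"."})
    return hits / len(structures)

# Toy sample: the sequence line is skipped because it is not dot-bracket.
sample = """\
CUACGGCGCGGCGCCCUUGGCGA
...........((((...)))).
....((((...))))........
(((.((((...))))..)))...
.......................
"""
print(fraction_unpaired(sample, 6, 8))
\end{verbatim}
For the sample file created above one would call
\texttt{fraction\_unpaired(open("tt").read(), 6, 8)}.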

\noindent
A far better way to calculate this property is to use \texttt{RNAfold -p}
to get the ensemble free energy, which is related to the
partition function via $F = -RT\ln(Q)$, for the unconstrained ($F_u$)
and the constrained case ($F_c$), where the three bases are not
allowed to form base pairs (use option \texttt{-C}), and evaluate $p_c
= \exp((F_u - F_c)/RT)$ to get the desired probability.\\

So let's do the calculation using \texttt{RNAfold}.
\begin{verbatim}
$ RNAfold -p

Input string (upper or lower case); @ to quit
....,....1....,....2....,....3....,....4....,....5....,....6....,....7....,....8
CUACGGCGCGGCGCCCUUGGCGA
length = 23
CUACGGCGCGGCGCCCUUGGCGA
...........((((...)))).
 minimum free energy =  -5.00 kcal/mol
....{,{{...||||...)}}}.
 free energy of ensemble =  -5.72 kcal/mol
....................... {  0.00 d=4.66}
 frequency of mfe structure in ensemble 0.311796; ensemble diversity 6.36
\end{verbatim}
\noindent
Now we have calculated the ensemble free energy over all structures ($F_u$);
in the next step we have to calculate it for the structures that satisfy a constraint ($F_c$).\\

\noindent The following notation is used to define the constraint:
\begin{enumerate}
\item $|$ : paired with another base
\item . : no constraint at all
\item x : base must not pair
\item $<$ : base $i$ is paired with a base $j<i$
\item $>$ : base $i$ is paired with a base $j>i$
\item matching brackets ( ): base $i$ pairs with base $j$
\end{enumerate}

\noindent So our constraint should look like this:
\begin{verbatim}
  .....xxx...............
\end{verbatim}
Next, call \texttt{RNAfold} with the following command and provide the
sequence and the constraint we just created.
\begin{verbatim}
$ RNAfold -p -C
\end{verbatim}
The output should look like this:
\begin{verbatim}
length = 23
CUACGGCGCGGCGCCCUUGGCGA
...........((((...)))).
 minimum free energy =  -5.00 kcal/mol
...........((((...)))).
 free energy of ensemble =  -5.14 kcal/mol
...........((((...)))). { -5.00 d=0.42}
 frequency of mfe structure in ensemble 0.792925; ensemble diversity 0.79
\end{verbatim}
\noindent
Afterwards, evaluate the desired probability according to the formula given before,
e.g. with a simple perl script.
\begin{verbatim}
$ perl -e 'print exp(-(5.72-5.14)/(0.00198*310.15))."\n"'
\end{verbatim}
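
\noindent
As a worked check, the two ensemble free energies reported by \texttt{RNAfold}
($F_u = -5.72$ and $F_c = -5.14$~kcal/mol) can be inserted directly into the formula,
using the same rounded values $R \approx 0.00198$~kcal/(mol\,K) and $T = 310.15$~K
as in the perl call:
\begin{equation*}
  p_c = \exp\left(\frac{F_u - F_c}{RT}\right)
      = \exp\left(\frac{-5.72 - (-5.14)}{0.00198 \cdot 310.15}\right)
      \approx \exp(-0.944) \approx 0.39
\end{equation*}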

You can see that there is a slight difference between the \texttt{RNAsubopt} run with 10,000
samples and the \texttt{RNAfold} run including all structures.

\pagebreak[3]
\section{RNA folding kinetics}
RNA folding kinetics describes the dynamical process by which an RNA molecule
approaches its unique folded, biologically active conformation (often
referred to as the native state), starting from an initial ensemble of
disordered conformations, e.g. the unfolded open chain. The key to
resolving the dynamical behavior of a folding RNA chain lies in
understanding how the molecule explores its astronomically
large free energy landscape, a rugged and complex hyper-surface established
by all the feasible base pairing patterns an RNA sequence can form. The
challenge is to understand how the interplay of formation and break-up of
base pairing interactions along the RNA chain can lead to an efficient
search of the energy landscape that reaches the native state of the
molecule on a biologically meaningful time scale.

\subsection{RNA2Dfold}
\texttt{RNA2Dfold} is a tool for computing the MFE structure, partition function and
representative sample structures of $\kappa$, $\lambda$ neighborhoods. It thereby projects the
high-dimensional energy landscape of an RNA into two dimensions. The program therefore
expects a sequence and two user-defined reference structures.
For each of the resulting distance classes, the MFE representative, the
Boltzmann probability and the Gibbs free energy are computed. Additionally,
representative suboptimal secondary structures from each partition can be calculated.

\begin{verbatim}
$ RNA2Dfold -p < 2dfold.inp > 2dfold.out
\end{verbatim}

The output file \texttt{2dfold.out} should look like the listing below; inspect it with \texttt{less}.
\begin{tiny}
\begin{verbatim}
CGUCAGCUGGGAUGCCAGCCUGCCCCGAAAGGGGCUUGGCGUUUUGGUUGUUGAUUCAACGAUCAC
((((((((((....)))))..(((((....))))).)))))...(((((((((...))))))))). (-30.40)
((((((((((....)))))..(((((....))))).)))))...(((((((((...))))))))). (-30.40) <ref 1>
.................................................................. (  0.00) <ref 2>
free energy of ensemble = -31.15 kcal/mol
k       l       P(neighborhood) P(MFE in neighborhood)  P(MFE in ensemble)      MFE     E_gibbs MFE-structure
0       24      0.29435909      1.00000000      0.29435892      -30.40  -30.40  ((((((((((....)))))..(((((....))))).)))))...(((((((((...))))))))).
1       23      0.17076902      0.47069889      0.08038083      -29.60  -30.06  ((((((((((....)))))..(((((....))))).)))))....((((((((...))))))))..
2       22      0.03575448      0.37731068      0.01349056      -28.50  -29.10  ((((.(((((....)))))..(((((....)))))..))))....((((((((...))))))))..
2       24      0.00531223      0.42621709      0.00226416      -27.40  -27.93  ((((((((((....))))...(((((....)))))))))))...(((((((((...))))))))).
3       21      0.00398349      0.29701636      0.00118316      -27.00  -27.75  .(((.(((((....)))))..(((((....)))))..))).....((((((((...))))))))..
3       23      0.00233909      0.26432372      0.00061828      -26.60  -27.42  ((((((((((....))))...(((((....)))))))))))....((((((((...))))))))..
[...]
\end{verbatim}
\end{tiny}
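
\noindent
The columns \texttt{k} and \texttt{l} are the base pair distances of a structure to the
first and second reference structure. A minimal Python sketch of this distance, assuming
it counts the base pairs that occur in exactly one of the two structures, reproduces the
values of the first two rows of the listing. The helper names \texttt{pair\_set} and
\texttt{bp\_distance} are chosen here for illustration.
\begin{verbatim}
def pair_set(structure):
    """Return the set of base pairs (i, j) of a dot-bracket string."""
    stack, pairs = [], set()
    for j, char in enumerate(structure):
        if char == "(":
            stack.append(j)
        elif char == ")":
            pairs.add((stack.pop(), j))
    return pairs

def bp_distance(a, b):
    """Number of base pairs present in exactly one of the two structures."""
    return len(pair_set(a) ^ pair_set(b))

ref1 = "((((((((((....)))))..(((((....))))).)))))...(((((((((...)))))))))."
ref2 = "." * len(ref1)   # open chain
row2 = "((((((((((....)))))..(((((....))))).)))))....((((((((...)))))))).."
print(bp_distance(ref1, ref1), bp_distance(ref1, ref2))   # k, l of row 1
print(bp_distance(row2, ref1), bp_distance(row2, ref2))   # k, l of row 2
\end{verbatim}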

For visualizing the output the ViennaRNA Package includes two scripts,
\texttt{2Dlandscape\_pf.gri} and \texttt{2Dlandscape\_mfe.gri}, located in \texttt{VRP/share/ViennaRNA/}.
\texttt{gri} (a language for scientific graphics programming) is needed to create a colored
PostScript plot. We use the partition function script to show the free energies of
the distance classes (graph below, left):

\begin{verbatim}
$ gri ../Progs/VRP/share/ViennaRNA/2Dlandscape_pf.gri 2dfold.out
\end{verbatim}

Compare the output file with the colored plot and determine the MFE minima with
corresponding distance classes. For easier comparison, the output file of \texttt{RNA2Dfold} can be
sorted with a simple sort command. For further information regarding sort, use the \texttt{--help} option.
\begin{verbatim}
$ sort -k6 -n 2dfold.out > sort.out
\end{verbatim}
Now we choose the structure with the lowest energy besides our start structure,
replace the open chain structure from our old input with that structure and repeat the steps above
with our new values:

\begin{itemize}
\item run \texttt{RNA2Dfold}
\item plot it using \texttt{2Dlandscape\_pf.gri}\\
\end{itemize}

The new projection (right graph) shows the two major local minima, which are separated by 39 bp (red dots in the figure below);
both are likely to be populated with high probability. The landscape gives an estimate of
the energy barrier separating the two minima (about -20 kcal/mol).\\
\noindent
The red dots mark the distance from the open chain to the MFE structure and the
distance from the second-best structure to the MFE, respectively. Note that the red dots were added to the image manually, so don't panic if you don't see them in your gri output.
\begin{center}
	\includegraphics[width=.45\textwidth]{Figures/2dfold_out_m.eps}\hfill
	\includegraphics[width=.45\textwidth]{Figures/2dfold_2_out_m.eps}
\end{center}

\subsection{barriers \& treekin}
The following assumes you have the barriers and treekin programs
installed. If not, the current release can be found at
\url{http://www.tbi.univie.ac.at/RNA/Barriers/}. Installation proceeds
as shown for the ViennaRNA Package in section \ref{sect:install}. One
problem that often occurs during treekin installation is the
dependency on the \texttt{blas} and \texttt{lapack} packages, which is not
carefully checked. For further information on the barriers
and treekin programs, see the website as well.

\frametitle{A short reminder on how to install/compile a program}
\begin{enumerate}
	\item Get the barriers source from \url{http://www.tbi.univie.ac.at/RNA/Barriers/}
	\item extract the archive and go to the directory
\begin{verbatim}
$ tar -xzf Barriers-1.5.2.tar.gz
$ cd Barriers-1.5.2
\end{verbatim}
	\item use the \texttt{--prefix} option to install in your \texttt{Progs} directory
\begin{verbatim}
$ ./configure --prefix=$HOME/Tutorial/Progs/barriers-1.5.2
\end{verbatim}
	\item make install
\begin{verbatim}
$ make
$ make install
\end{verbatim}
\end{enumerate}

Now barriers is ready to use. Apply the same steps to install treekin.
\texttt{Note:} Copy the barriers and treekin binaries to your \texttt{bin}
folder or add the path to your \texttt{PATH} variable.

%\begin{frame}[fragile]
\frametitle{Calculate the Barrier Tree}
\begin{verbatim}
$ echo UCCACGGCUGUUAGUGGAUAACGGC | RNAsubopt --noLP -s -e 10 > barseq.sub
$ barriers -G RNA-noLP --bsize --rates < barseq.sub > barseq.bar
\end{verbatim}%$
You can restrict the number of local minima using the \texttt{barriers}
command-line option \texttt{--max} followed by a number. The option \texttt{-G RNA-noLP}
tells barriers that the input consists of RNA secondary structures without isolated
base pairs. \texttt{--bsize} adds the size of the gradient basins and \texttt{--rates} tells
barriers to compute rates between macro states/basins for use with treekin. Another useful
option is \texttt{--minh}, which prints only minima with a barrier $> dE$. Look at the
output file with \texttt{less -S barseq.bar} and use the arrow keys to navigate.
\begin{small}
\begin{verbatim}
  UCCACGGCUGUUAGUGGAUAACGGC
1 (((((........))))).......  -6.90    0  10.00    115     0  -7.354207     23  -7.012023
2 ......(((((((.....)))))))  -6.80    1   9.30     32    58  -6.828221     38  -6.828218
3 (((...(((...)))))).......  -0.80    1   0.90      1    10  -0.800000      9  -1.075516
4 ....((..((((....)))).))..  -0.80    1   2.70      5    37  -0.973593     11  -0.996226
5 .........................   0.00    1   0.40      1    14  -0.000000     26  -0.612908
6 ......(((....((.....)))))   0.60    2   0.40      1    22   0.600000      3   0.573278
7 ......((((((....)))...)))   1.00    1   1.50      1    95   1.000000      2   0.948187
8 .((....((......)).....)).   1.40    1   0.30      1    30   1.400000      2   1.228342
\end{verbatim}

The first row holds the input sequence; the following rows list the local
minima in order of ascending energy. The meaning of the columns is as follows:
\begin{enumerate}
\item label (number) of the local minimum (1=MFE)
\item structure of the minimum
\item free energy of the minimum
\item label of deeper local minimum the current minimum merges with (note that the
  \texttt{MFE} has no deeper local minimum to merge with)
\item height of the energy barrier to the local minimum to merge with
\item numbers of structures in the basin we merge with
\item number of basin which we merge to
\item free energy of the basin
\item number of structures in this basin using gradient walk
\item gradient basin (consisting of all structures where gradientwalk ends in the minimum)
\end{enumerate}
\end{small}
%\end{frame}
\frametitle{Calculate The Barrier Tree}
\begin{center}
  \includegraphics[width=.5\textwidth]{Figures/tree.eps}
\end{center}
\texttt{barriers} produced two additional files: the \texttt{PostScript}
file \texttt{tree.eps}, which represents the basic information of the
\texttt{barseq.bar} file visually (look at the file with e.g. \texttt{gv tree.eps}),
and a text file \texttt{rates.out}, which holds the matrix of transition
rates between the local minima.
%\end{frame}

%\begin{frame}
\frametitle{Simulating the Folding Kinetics}
\noindent
The program \texttt{treekin} is used to simulate the evolution over time of the
population densities of local minima starting from an initial population
density distribution $p0$ (given on the command-line) and the
transition rate matrix in the file \texttt{rates.out}.
\begin{verbatim}
$ treekin -m I --p0 5=1 < barseq.bar | xmgrace -log x -nxy -
\end{verbatim}%$
\begin{center}
  \includegraphics[width=.5\textwidth]{Figures/FOO.eps}\hfill
  \includegraphics[width=.40\textwidth]{Figures/FOO_dp.eps}
\end{center}
The simulation starts with all the population density in the open chain
(local minimum 5, see \texttt{barseq.bar}). Over time the population density of this state decays
(yellow curve) and other local minima get populated. The simulation ends
with the population densities of the thermodynamic equilibrium in which the
MFE (black curve) and local minimum 2 (red curve) are the only ones
populated. (Look at the dot plot of the sequence created with \texttt{RNAsubopt}
and \texttt{RNAfold}!)
%\end{frame}
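
\noindent
The equilibrium reached at the end of the simulation can be estimated roughly from the
\texttt{barseq.bar} listing alone. The Python sketch below assumes that column 8 (free
energy of the basin) describes each macro state, considers only the eight minima listed
above, and uses $RT$ at 37\,$^\circ$C; these are assumptions made here for illustration,
and \texttt{treekin} itself works from the rate matrix instead.
\begin{verbatim}
import math

RT = 0.0019872 * 310.15   # kcal/mol at 37 degrees C

def equilibrium(free_energies):
    """Boltzmann-weighted equilibrium populations for a list of free energies."""
    weights = [math.exp(-f / RT) for f in free_energies]
    total = sum(weights)
    return [w / total for w in weights]

# Column 8 of the barseq.bar listing above, local minima 1 to 8.
basins = [-7.354207, -6.828221, -0.800000, -0.973593,
          -0.000000, 0.600000, 1.000000, 1.400000]
for label, p in enumerate(equilibrium(basins), start=1):
    print(label, round(p, 4))
\end{verbatim}
Under these assumptions roughly 70\% of the population ends up in the MFE basin and
roughly 30\% in the basin of local minimum 2, while all other minima stay far below 1\%.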

%===
\pagebreak[4]
\section{Sequence Design}
\subsection{The Program \texttt{RNAinverse}}
\texttt{RNAinverse} searches for sequences folding into a predefined
structure, thereby inverting the folding algorithm. Input consists of the
target structures (in dot-bracket notation) and a starting sequence,
which is optional.\\

Lower case characters in the start sequence indicate fixed positions,
i.e. they can be used to add sequence constraints. '\texttt{N}'s in the
starting sequence will be replaced by a random nucleotide.
For each search the best sequence found and its Hamming distance to the
start sequence are printed to \textit{stdout}. If the search was
unsuccessful, a structure distance to the target is appended.\\

By default the program stops as soon as it finds a sequence that has the
target as MFE structure. The option \texttt{-Fp} switches
\texttt{RNAinverse} to the partition function mode where the probability of
the target structure $\exp(-E(S)/RT)/Q$ is maximized.  This tends to produce
sequences with a more well-defined structure.
This probability is written in brackets after the found sequence and Hamming
distance. With the option \texttt{-R} you can specify how often the search
should be repeated.

%===
%\begin{frame}[fragile]
  \frametitle{Sequence Design}

\begin{enumerate}
\item Prepare an input file \texttt{inv.in} containing the target
structure and sequence constraints
\begin{verbatim}
  (((.(((....))).)))
  NNNgNNNNNNNNNNaNNN
\end{verbatim}
\item Design sequences using RNAinverse
\begin{verbatim}
$ RNAinverse < inv.in
        GGUgUUGGAUCCGAaACC    5

$ RNAinverse -R5 -Fp < inv.in
        GGUgUGAACCCUCGaACC    5
        GGCgCCCUUUUGGGaGCC   12  (0.967418)
        CUCgAUCUCACGAUaGGG    6
        GGCgCCCGAAAGGGaGCC   13  (0.967548)
        GUUgAGCCCAUGCUaAGC    6
        GGCgCCCUUAUGGGaGCC   10  (0.967418)
        CGGgUGUUGUGACAaCCG    5
        GCGgGUCGAAAGGCaCGC   12  (0.925482)
        GCCgUAUCCGGGUGaGGC    6
        GGCgCCCUUUUGGGaGCC   13  (0.967418)

\end{verbatim}
\end{enumerate}
\noindent
The output consists of the calculated sequence and the number of mutations
needed to get the MFE structure from the start sequence (start sequence not shown).
Additionally, with partition function folding (\texttt{-Fp}) set, the second
output line of each run is a further refinement, so that the ensemble prefers the MFE
structure and folds into your given structure with a distinct probability, shown in brackets.\\
%\end{frame}
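
\noindent
When many repeats are requested, the design with the highest target probability can be
picked automatically. The following Python sketch parses output in the layout shown
above (sequence, Hamming distance, optional probability in brackets). The function name
\texttt{best\_design} is made up here, and the regular expression is only an assumption
about this layout, not a documented format.
\begin{verbatim}
import re

def best_design(text):
    """Return (sequence, probability) of the most probable design."""
    designs = []
    for line in text.splitlines():
        match = re.match(r"\s*([ACGUacgu]+)\s+\d+\s+\(([\d.]+)\)", line)
        if match:  # only partition function lines carry a probability
            designs.append((match.group(1), float(match.group(2))))
    return max(designs, key=lambda design: design[1])

output = """\
        GGUgUGAACCCUCGaACC    5
        GGCgCCCUUUUGGGaGCC   12  (0.967418)
        CUCgAUCUCACGAUaGGG    6
        GGCgCCCGAAAGGGaGCC   13  (0.967548)
        GUUgAGCCCAUGCUaAGC    6
        GGCgCCCUUAUGGGaGCC   10  (0.967418)
        CGGgUGUUGUGACAaCCG    5
        GCGgGUCGAAAGGCaCGC   12  (0.925482)
        GCCgUAUCCGGGUGaGGC    6
        GGCgCCCUUUUGGGaGCC   13  (0.967418)
"""
print(best_design(output))
\end{verbatim}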

Another useful program for inverse folding is \texttt{RNA Designer}, see
\url{http://www.rnasoft.ca/}. RNA Designer takes a secondary structure
description as input and returns an RNA strand that is likely to fold into the
given secondary structure.

The \texttt{sequence design application} of the \texttt{ViennaRNA Design Webservices},
see \url{http://nibiru.tbi.univie.ac.at/rnadesign/index.html}, uses a different approach,
allowing for more than one secondary structure as input. For more details, read the online
documentation and the next section of this tutorial.

%\TODO{install package with RNA.pm}
\subsection{switch.pl}
The \texttt{switch.pl} script can be used to design bi-stable structures,
i.e. structures with two almost equally good foldings. For two given structures
there are always a lot of sequences compatible with both structures. If both
structures are reasonably stable you can find sequences where both target
structures have almost equal energy and all other structures have much higher energies.
Combined with RNAsubopt, barriers and treekin, this is a very useful tool for
designing RNA switches. \\

The input requires two structures in dot-bracket notation;
optionally you can add a sequence. It is also possible to calculate the
switching function at two different temperatures with the options \texttt{-T} and \texttt{-T2}.

\frametitle{Designing a Switch}
\noindent Now we try to create an RNA switch using \texttt{switch.pl}.
First we create our input file, then we invoke the program using ten optimization runs
(\texttt{-n 10}) and do not allow lonely pairs. The result is written to \texttt{switch.out}.
\begin{verbatim}
switch.in
     ((((((((......))))))))....((((((((.......))))))))
     ((((((((((((((((((........)))))))))))))))))).....

$ switch.pl -n 10 --noLP < switch.in > switch.out
\end{verbatim}

\texttt{switch.out} should look similar to this. The first block represents our
bi-stable structures in random order; the second block shows the resulting sequences ordered by
their score.
\begin{verbatim}
$ less switch.out

GGGUGGACGUUUCGGUCCAUCCUUACGGACUGGGGCGUUUACCUAGUCC   0.9656
CAUUUGGCUUGUGUGUCGAAUGGCCCCGGUACGUAGGCUAAAUGUACCG   1.2319
GGGGGGUGCGUUCACACCCCUCAUUUGGUGUGGAUGUGCUUUCUACACU   1.1554
[...]
the resulting sequences are:
CAUUUGGCUUGUGUGUCGAAUGGCCCCGGUACGUAGGCUAAAUGUACCG   1.2319
GGGGGGUGCGUUCACACCCCUCAUUUGGUGUGGAUGUGCUUUCUACACU   1.1554
CGGGUUGUAACUGGAUAGCCUGGAAACUGUUUGGUUGUAAUCCGAACAG   1.0956
[...]
\end{verbatim}
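
\noindent
A designed sequence must be compatible with both target structures, i.e. every base pair
of either structure must be formed by A-U, G-C or G-U. The Python sketch below checks
this for the first sequence of the second block; the function name
\texttt{is\_compatible} is chosen here for illustration and is not part of \texttt{switch.pl}.
\begin{verbatim}
PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}

def is_compatible(sequence, structure):
    """True if every base pair of the structure can be formed by the sequence."""
    stack = []
    for j, char in enumerate(structure):
        if char == "(":
            stack.append(j)
        elif char == ")":
            i = stack.pop()
            if (sequence[i] + sequence[j]).upper() not in PAIRS:
                return False
    return True

struct1 = "((((((((......))))))))....((((((((.......))))))))"
struct2 = "((((((((((((((((((........)))))))))))))))))).....  ".strip()
design = "CAUUUGGCUUGUGUGUCGAAUGGCCCCGGUACGUAGGCUAAAUGUACCG"
print(is_compatible(design, struct1), is_compatible(design, struct2))
\end{verbatim}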

Given all 10 suggestions in our \texttt{switch.out}, we select the one with the best score
with some command line tools to use it as an \texttt{RNAsubopt} input file and build up the barriers tree.
\begin{verbatim}
$ tail -10 switch.out | awk '{print($1)}'  | head -n 1 > subopt.in
$ RNAsubopt --noLP -s -e 25 < subopt.in > subopt.out
$ barriers -G RNA-noLP --bsize --rates --minh 2 --max 30 < subopt.out > barriers.out
\end{verbatim}

\texttt{tail -10} extracts the last 10 lines from the \texttt{switch.out} file and pipes them into
an \texttt{awk} script. The function \texttt{print(\$1)} echoes only the first column, and this
is piped into the \texttt{head} program, where the first line, which holds the best-scoring
sequence, is taken and written into \texttt{subopt.in}. Then \texttt{RNAsubopt} is called
to process our sequence and write the output to another file, which is the input for the
barriers calculation.\\

Below you find an example of the barrier tree calculation above, done with the right settings (connected root)
on the left side and with a wrong \texttt{RNAsubopt -e} value on the right. Keep in mind that
\texttt{switch.pl} performs a stochastic search, so the output sequences differ from run to run:
many sequences fit the two structures, and \texttt{switch.pl} finds a new one
every time. Simply try it to see for yourself.\\

\includegraphics[width=.27\textheight]{Figures/switch_barriertree.eps}\hfill
\includegraphics[trim=-1.5cm 0cm 0cm 0cm,width=.27\textheight]{Figures/switch_barriertree_e13.eps}
\begin{tiny}
left: Barriers tree as it should look like, all branches connected to the main root
right: disconnected tree due to a too low \texttt{energy range (-e)} parameter set in \texttt{RNAsubopt}.
\end{tiny}\\

Be careful to set the range \texttt{-e} high enough, otherwise we get a problem when calculating
the kinetics using treekin. Every branch should be connected to the main root of the tree.
Try \texttt{-e 20} and \texttt{-e 30} to see the
difference in the trees and choose the optimal value. By using \texttt{--max 30} we
shorten our tree to focus only on the lowest minima.
We then select a branch preferably outside of the two main branches, here branch 30 (may differ from your own
calculation). Look at the barrier tree to find the best branch to start and replace \texttt{30} by the
branch you would choose. Now use treekin to plot concentration kinetics and think about the graph you
just created.
\begin{verbatim}
$ treekin -m I --p0 30=1  < barriers.out > treekin.out
$ xmgrace -log x -nxy treekin.out
\end{verbatim}
The graph could look like the one below. Remember that every time you use \texttt{switch.pl} it can give you different
sequences, so the output varies too. Here is the one from the example.
\begin{center}
\includegraphics[trim=0cm 1.5cm 0cm -2.5cm, width=.45\textheight]{Figures/switch_treekin.eps}\\
\end{center}
%===
\pagebreak[3]
\section{RNA-RNA Interactions}
A common problem is the prediction of binding sites between two RNAs, as in
the case of miRNA-mRNA interactions. The following tools of the \texttt{ViennaRNA Package}
can be used to calculate base pairing probabilities.

\subsection{The Program \texttt{RNAcofold}}
\texttt{RNAcofold} works much like \texttt{RNAfold} but uses
two RNA sequences as input which are then allowed to form a dimer
structure. In the input the two RNA sequences should be concatenated using
the `\texttt{\&}' character as separator.
As in \texttt{RNAfold} the \texttt{-p} option can be used to compute
partition function and base pairing probabilities.\\

Since dimer formation is concentration dependent, \texttt{RNAcofold}
can be used to compute equilibrium concentrations for all five monomer
and (homo/hetero)-dimer species, given input concentrations for the
monomers (see the \texttt{man} page for details).
%===
%\begin{frame}[fragile]
  \frametitle{Two Sequences one Structure}

  \begin{enumerate}
  \item Prepare a sequence file (\texttt{t.seq}) for input that looks like this
\begin{verbatim}
>t
GCGCUUCGCCGCGCGCC&GCGCUUCGCCGCGCGCA
\end{verbatim}
  \item Compute the \texttt{MFE} and the ensemble properties
  \item Look at the generated PostScript files \texttt{t\_ss.ps}
  and \texttt{t\_dp.ps}
  \end{enumerate}
\begin{verbatim}
$ RNAcofold -p < t.seq
>t
GCGCUUCGCCGCGCGCC&GCGCUUCGCCGCGCGCA
((((..((..((((...&))))..))..))))... (-17.70)
((((..{(,.((((,,.&))))..}),.)))),,. [-18.26]
frequency of mfe structure in ensemble 0.401754 , delta G binding= -3.95
\end{verbatim}%$
%\end{frame}


%===
%\begin{frame}
  \frametitle{Secondary Structure Plot and Dot Plot}
  \includegraphics[width=.25\textheight]{Figures/t_ss.eps}\hfill
  \includegraphics[width=.5\textwidth]{Figures/t_dp.eps}\\
%\end{frame}
In the dot plot a cross marks the chain break between the two concatenated sequences.

\subsection{Concentration Dependency}
 Cofolding is an intermolecular process, therefore whether
 duplex formation will actually occur is concentration dependent.
 Trivially, if one of the molecules is not present, no dimers are going to
 be formed. The partition functions of the molecules give us the
 equilibrium constants:
 \begin{equation*}
   K_{AB} = \frac{[AB]}{[A][B]} = \frac{Z_{AB}}{Z_AZ_B}
 \end{equation*}
 With these and mass conservation, the equilibrium concentrations of
 homodimers, heterodimers and monomers can be computed as a function
 of the start concentrations of the two molecules.
 This is most easily done by creating a file with the
 initial concentrations of molecules $A$ and $B$ in two columns:\\
 \begin{eqnarray*}
   [a_1]([mol/l])  & [b_1]([mol/l])\cr
   [a_2]([mol/l])  & [b_2]([mol/l])\cr
       \vdots & \cr
   [a_n]([mol/l])  & [b_n]([mol/l])
 \end{eqnarray*}

%\begin{frame}
  \frametitle{Concentration Dependency}
  \begin{enumerate}
  \item Prepare a concentration file for input with this little perl script
\begin{verbatim}
$ perl -e '$c=1e-07; do {print "$c\t$c\n"; $c*=1.71;} while $c<0.2' > concfile
\end{verbatim}
\noindent
This script creates a file displaying values from 1e-07 to just below 0.2, with 1.71-fold steps
in between. For convenience, concentration of molecule A is the same as concentration of
molecule B in each row. This will facilitate visualization of the results.
  \item Compute the \texttt{MFE}, the ensemble properties
    and the concentration dependency of hybridization.
\begin{verbatim}
$ RNAcofold -f concfile < t.seq > cofold.out
\end{verbatim}
  \item Look at the generated output with
\begin{verbatim}
$ less cofold.out
\end{verbatim}
  \end{enumerate}
\begin{scriptsize}
\begin{verbatim}
[...]
Free Energies:
AB              AA              BB              A               B
-18.261023      -17.562553      -18.274376      -7.017902       -7.290237
Initial concentrations          relative Equilibrium concentrations
A                B               AB              AA              BB              A               B
1e-07           1e-07           0.00003         0.00002         0.00002         0.49994         0.49993
[...]
\end{verbatim}
\end{scriptsize}
% \end{frame}

\noindent
The five different free energies were printed out first, followed
by a list of all the equilibrium concentrations, where the first two columns denote the initial
(absolute) concentrations of molecules $A$ and $B$, respectively. The next
five columns denote the equilibrium concentrations of dimers and monomers,
relative to the total particle number. (Hence, the concentrations don't add
up to one, except in the case where no dimers are built -- if you want to
know the fraction of particles in a dimer, you have to take the relative
dimer concentrations times 2).\\
Since relative concentrations of species depend on two independent values -
initial concentration of A as well as initial concentration of B - it is not
trivial to visualize the results. For this reason we used the same concentration
for A and for B. Another possibility would be to keep the initial concentration of
one molecule constant.
As an example we show the following plot for
\texttt{t.seq}.
%\begin{frame}[fragile]
Now we use some command-line tools to render our plot. We use \texttt{tail -n +11} to
show all lines starting with line 11 (lines 1-10 are cut) and pipe them into an \texttt{awk} command, which
prints every column but the first.
The result is then piped to \texttt{xmgrace}.
With \texttt{-log x -nxy -} we tell it to plot the x axis on a logarithmic scale and to read the data in X Y1 Y2 ... format.

\begin{verbatim}
$ tail -n +11 cofold.out | awk '{print $2, $3, $4, $5, $6, $7}' | xmgrace -log x -nxy -
\end{verbatim}

\frametitle{Concentration Dependency plot}
\begin{center}
\begin{figure}[h]
\includegraphics[trim= 0cm 0cm 0cm -2.2cm,width=.70\textwidth]{Figures/tconcdep.eps}\hfill
\end{figure}
\end{center}
  $\Delta G_{\text{binding}}=-5.01$ kcal/mol
\begin{verbatim}
  sequences:GCGCUUCGCCGCGCGCG&GCGCUUCGCCGCGCGCG
\end{verbatim}
%\end{frame}
Since the two sequences are almost identical, the monomer and homo-dimer
concentrations behave very similarly.
In this example, at a concentration of about 1 mmol/l, 50\% of the molecules
are still in monomer form.
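
\noindent
The link between the free energies printed by \texttt{RNAcofold} and the equilibrium
constants can be checked numerically. The Python sketch below uses the five values from
the \texttt{cofold.out} listing above and assumes $K = \exp(-\Delta G_{\text{binding}}/RT)$
with a 1~mol/l standard state. It only treats the dilute limit, in which nearly all
molecules are monomers; the full calculation solves the mass conservation equations
instead. The function names are chosen here for illustration.
\begin{verbatim}
import math

RT = 0.0019872 * 310.15   # kcal/mol at 37 degrees C

# Ensemble free energies (kcal/mol) from the cofold.out listing above.
F = {"AB": -18.261023, "AA": -17.562553, "BB": -18.274376,
     "A": -7.017902, "B": -7.290237}

def binding_energy(dimer):
    """Free energy of binding: dimer minus the two monomers it consists of."""
    return F[dimer] - F[dimer[0]] - F[dimer[1]]

def equilibrium_constant(dimer):
    """K = exp(-dG_binding / RT), assuming a 1 mol/l standard state."""
    return math.exp(-binding_energy(dimer) / RT)

a0 = b0 = 1e-7   # initial concentrations in mol/l (first row of concfile)
# Dilute limit: almost everything is monomeric, so [A] ~ a0 and [B] ~ b0.
ab = equilibrium_constant("AB") * a0 * b0
print(round(binding_energy("AB"), 2), ab / (a0 + b0))
\end{verbatim}
The binding energy agrees with the \texttt{delta G binding} value of $-3.95$ printed by
\texttt{RNAcofold -p}, and the relative heterodimer concentration comes out at about
$3 \cdot 10^{-5}$, in line with the first row of the concentration table.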
%===

\pagebreak[2]
\subsection{Finding potential binding sites with \texttt{RNAduplex}}

If the sequences are very long (many kb), \texttt{RNAcofold} is too slow to be useful.
The \texttt{RNAduplex} program is a fast alternative that works by
predicting \emph{only} intermolecular base pairs. It's almost as fast
as simple sequence alignment, but much more accurate than a \texttt{BLAST}
search.

The example below searches the 3' UTR of an mRNA for a miRNA binding site.

%\begin{frame}[fragile]
  \frametitle{Binding site prediction with \tt RNAduplex}
  The file \texttt{duplex.seq} contains the 3'UTR of NM\_024615 and the
  microRNA mir-145.
  \begin{small}
\begin{verbatim}
$ RNAduplex < duplex.seq
>NM_024615
>hsa-miR-145
.(((((.(((...((((((((((.&)))))))))))))))))).  34,57  :   1,19  (-21.90)
\end{verbatim}%$
  \end{small}
  The most favorable binding has an interaction energy of -21.90 kcal/mol and pairs
  positions 34-57 of the UTR with positions 1-19 of the miRNA.\\
%\end{frame}
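
\noindent
For many hits it is convenient to read the result line programmatically. The Python
sketch below splits the line shown above into structure, the two intervals and the
energy, and checks that each half of the structure is as long as its interval. The
regular expression reflects only the layout of this one example and the function name
\texttt{parse\_duplex} is made up here.
\begin{verbatim}
import re

def parse_duplex(line):
    """Split an RNAduplex result line into structure, intervals and energy."""
    match = re.match(
        r"(\S+)\s+(\d+),(\d+)\s+:\s+(\d+),(\d+)\s+\(\s*(-?[\d.]+)\)", line)
    structure = match.group(1)
    t_from, t_to, q_from, q_to = (int(match.group(k)) for k in range(2, 6))
    return structure, (t_from, t_to), (q_from, q_to), float(match.group(6))

line = ".(((((.(((...((((((((((.&)))))))))))))))))).  34,57  :   1,19  (-21.90)"
structure, target, query, energy = parse_duplex(line)
left, right = structure.split("&")
print(target, len(left), query, len(right), energy)
\end{verbatim}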

\texttt{RNAduplex} can also produce alternative binding sites, e.g. running
\texttt{RNAduplex -e 10} would list all binding sites within 10 kcal/mol of
the best one.

Since \texttt{RNAduplex} forms only intermolecular pairs, it neglects the
competition between intramolecular folding and hybridization. Thus, it is
recommended to use \texttt{RNAduplex} as a pre-filter and analyse good
\texttt{RNAduplex} hits additionally with \texttt{RNAcofold} or
\texttt{RNAup}. Using the example above, running \texttt{RNAup} will yield:

%\begin{frame}[fragile]
%  \frametitle{Binding site prediction with \tt RNAduplex}
  \begin{small}
\begin{verbatim}
$ RNAup -b < duplex.seq

>NM_024615
>hsa-miR-145
(((((((&)))))))  50,56  :   1,7   (-8.41 = -9.50 + 0.69 + 0.40)
GCUGGAU&GUCCAGU
RNAup output in file: hsa-miR-145_NM_024615_w25_u1.out
\end{verbatim}%$
  \end{small}
  The free energy of the duplex is -9.50 kcal/mol and shows a discrepancy to the
  structure and energy value computed by \texttt{RNAduplex} (differences may arise from
  the fact that RNAup computes partition functions rather than optimal structures).
  However, the total free energy of binding is less favorable (-8.41 kcal/mol), since it
  includes the energetic penalty for opening the binding site on the mRNA
  (0.69 kcal/mol) and miRNA (0.40 kcal/mol). The \texttt{-b} option includes the
  probability of unpaired regions in both RNAs.
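
  \noindent
  The decomposition printed by \texttt{RNAup} can be written out as a sum. The symbol
  names below are chosen here for illustration and are not taken from the
  \texttt{RNAup} documentation:
  \begin{equation*}
    \Delta G_{\text{total}}
      = \Delta G_{\text{duplex}} + \Delta G_{\text{open}}^{\text{mRNA}} + \Delta G_{\text{open}}^{\text{miRNA}}
      = -9.50 + 0.69 + 0.40 = -8.41 \text{ kcal/mol}
  \end{equation*}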

%\end{frame}

You can also run \texttt{RNAcofold} on the example to see the complete
structure after hybridization (neither \texttt{RNAduplex} nor
\texttt{RNAup} produces structure drawings). Note, however, that the input
format for \texttt{RNAcofold} is different: an input file suitable for
\texttt{RNAcofold} has to be created from the \texttt{duplex.seq} file first (use any text editor).\\


As a more difficult example, let's look at the interaction of the bacterial small RNA RybB and its
target mRNA ompN. First we'll try predicting the binding site using \texttt{RNAduplex}:
\begin{small}
\begin{verbatim}
$ RNAduplex < RybB.seq
>RybB
>ompN
.((((..((((((.(((....((((((((..(((((.((..((.((....((((..(((((((((((..((((((&
.))))))..))))))).)))).....))))....)).)).)).))).))..))))........))))..))).)))))).)))).
 5,79  :  80,164 (-34.60)
\end{verbatim}%$
\end{small}

Note that the predicted structure spans almost the full length of the RybB small RNA.
Compare the predicted interaction to the structures predicted for RybB and ompN alone, and ask
yourself whether the predicted interaction is indeed plausible.

Below, the structure of ompN is shown on the left and that of RybB on the right. The respective binding regions
predicted by \texttt{RNAduplex} are marked in red.

\begin{center}
  \includegraphics[width=0.65\linewidth]{Figures/ompN_ss.eps}
  \includegraphics[width=0.50\linewidth]{Figures/RybB_ss.eps}
\end{center}
\begin{tiny}
\begin{verbatim}
GCCAC-----TGCTTTTCTTTGATGTCCCCATTTT-GTGGA-------GC-CCATCAACCCCGCCATTTCGGTT---CAAG-GTTGGTGGGTTTTTT
 |||      ||||  |||||| |||    ||||| ||||        || ||| || ||  ||    ||||     |||| ||  |||  |||||| -40.30
AGGTCAAACAACGGC-AGAAACAATATT--TAAAGTCGCCGCACACGACGCGGTCGTCGGT-CGTCTCGGCCCTACTGTTCACGGTTATGAAAAGAAACC-3'
\end{verbatim}
\end{tiny}
\noindent
Compare the \texttt{RNAduplex} prediction with the interaction predicted by \texttt{RNAcofold}, \texttt{RNAup} and the handcrafted prediction you see above.

\begin{center}
  \includegraphics[width=0.7\linewidth]{Figures/OmpN_cofold.eps}
\end{center}

\pagebreak[3]
%===
%\input{alifold}
\section{Consensus Structure Prediction}
Sequence co-variations are a direct consequence of RNA base pairing
rules and can be deduced from alignments. RNA helices normally contain
only 6 out of the 16 possible
combinations: the Watson-Crick pairs \textsf{GC}, \textsf{CG},
\textsf{AU}, \textsf{UA}, and the somewhat weaker wobble pairs
\textsf{GU} and \textsf{UG}. Mutations in helical regions therefore
have to be correlated. In particular we often find ``compensatory
mutations'' where a mutation on one side of the helix is compensated
by a second mutation on the other side, e.g.\ a
\textsf{C}$\cdot$\textsf{G} pair changes into a
\textsf{U}$\cdot$\textsf{A} pair. Mutations where only one pairing
partner changes (such as \textsf{C}$\cdot$\textsf{G} to
\textsf{U}$\cdot$\textsf{G}) are termed ``consistent mutations''.

\subsection{The Program \texttt{RNAalifold}}
\texttt{RNAalifold} generalizes the folding algorithm for sequence
alignments, treating the entire alignment as a single ``generalized
sequence''.  To assign an energy to a structure on such a generalized
sequence, the energy is simply averaged over all sequences in the
alignment. This average energy is augmented by a covariance term, that
assigns a bonus or penalty to every possible base pair $(i,j)$ based
on the sequence variation in columns $i$ and $j$ of the alignment.\\

Compensatory mutations are a strong indication of structural
conservation, while consistent mutations provide a weaker signal. The
covariance term used by \texttt{RNAalifold} therefore assigns a bonus
of 1 kcal/mol to each consistent and 2 kcal/mol to each compensatory
mutation. Sequences that cannot form a standard base pair incur a
penalty of $-1$ kcal/mol. Thus, for every possible consensus pair
between two columns $i$ and $j$ of the alignment a covariance score
$C_{ij}$ is computed by counting the fraction of sequence pairs
exhibiting consistent and compensatory mutations, as well as the
fraction of sequences that are inconsistent with the pair. The weight
of the covariance term relative to the normal energy function, as well
as the penalty for inconsistent mutations can be changed via command
line parameters.\\
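
The description above can be condensed into a formula. The following is a sketch that is consistent with the text; the symbols $\Pi$, $d$, $q_{ij}$, $B_{ij}$ and $\phi$ are notation assumed here, and the exact normalisation and default weights used by the program may differ. Let $N$ be the number of sequences, let $\Pi^{\alpha}_{ij}=1$ if sequence $\alpha$ can form a standard base pair between columns $i$ and $j$ (and $0$ otherwise), and let $d^{\alpha\beta}_{ij}\in\{0,1,2\}$ count in how many of the two columns the sequences $\alpha$ and $\beta$ differ. Then
\begin{align*}
  C_{ij} &= \frac{1}{\binom{N}{2}} \sum_{\alpha<\beta} d^{\alpha\beta}_{ij}\,\Pi^{\alpha}_{ij}\,\Pi^{\beta}_{ij},\\
  q_{ij} &= 1-\frac{1}{N}\sum_{\alpha}\Pi^{\alpha}_{ij},\\
  B_{ij} &= C_{ij} - \phi\, q_{ij}.
\end{align*}
A sequence pair with a consistent mutation contributes $d=1$ and one with a compensatory mutation contributes $d=2$; $q_{ij}$ is the fraction of sequences that cannot form the pair, and $\phi$ weights the penalty for such inconsistent sequences.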

Apart from the covariance term, the folding algorithm in
\texttt{RNAalifold} is essentially the same as for single sequence
folding. In particular, folding an alignment containing just one
sequence will give the same result as single sequence folding using
\texttt{RNAfold}. For $N$ sequences of length $n$ the required CPU
time scales as $\mathcal{O}(N\cdot n^2 + n^3)$ while memory
requirements grow as the square of the sequence length. Thus
\texttt{RNAalifold} is in general faster than folding each sequence
individually. The main advantage, however, is that the accuracy of
consensus structure predictions is generally much higher than for
single sequence folding, where typically only between 40\% and 70\% of
the base pairs are predicted correctly.\\
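
To get a feeling for these numbers, consider a rough operation count that ignores all constant factors (so it is only an order-of-magnitude illustration, and it assumes that single sequence folding scales as $\mathcal{O}(n^3)$ per sequence). For $N=4$ sequences in an alignment of length $n=78$:
\begin{align*}
  \text{alignment folding:}\quad & N\cdot n^2 + n^3 = 4\cdot 6084 + 474552 = 498888,\\
  \text{individual folding:}\quad & N\cdot n^3 = 4\cdot 474552 = 1898208.
\end{align*}
In this estimate the consensus fold needs roughly a quarter of the work, and the advantage grows with the number of sequences.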

Apart from prediction of \texttt{MFE} structures \texttt{RNAalifold}
also implements an algorithm to compute the partition function over
all possible (consensus) structures and the thermodynamic equilibrium
probability for each possible pair. These base pairing probabilities
are useful to see structural alternatives, and to distinguish well
defined regions, where the predicted structure is most likely correct,
from ambiguous regions.\\
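
In formulas, and using notation introduced here for illustration ($E(S)$ for the energy of a structure $S$, $R$ for the gas constant and $T$ for the absolute temperature), these quantities are commonly written as
\begin{align*}
  Z &= \sum_{S} e^{-E(S)/RT},\\
  p_{ij} &= \frac{1}{Z}\sum_{S \ni (i,j)} e^{-E(S)/RT},\\
  G &= -RT\ln Z,
\end{align*}
where the second sum runs over all structures that contain the pair $(i,j)$ and $G$ is the free energy of the ensemble. The frequency of any single structure $S$ in the ensemble is then $e^{-E(S)/RT}/Z = e^{-(E(S)-G)/RT}$.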

As a first example we'll produce a consensus structure prediction for
the following four tRNA sequences.
\begin{verbatim}
$ cat four.seq
\end{verbatim}
\vspace*{-3ex}
\begin{scriptsize}
\begin{verbatim}
>M10740 Yeast-PHE
GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUUUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCA
>K00349 Drosophila-PHE
GCCGAAAUAGCUCAGUUGGGAGAGCGUUAGACUGAAGAUCUAAAGGUCCCCGGUUCAAUCCCGGGUUUCGGCA
>K00283 Halobacterium volcanii Lys-tRNA-1
GGGCCGGUAGCUCAUUUAGGCAGAGCGUCUGACUCUUAAUCAGACGGUCGCGUGUUCGAAUCGCGUCCGGCCCA
>AF346993
CAGAGUGUAGCUUAACACAAAGCACCCAACUUACACUUAGGAGAUUUCAACUUAACUUGACCGCUCUGA
\end{verbatim}%$
\end{scriptsize}

\vspace*{3ex}\noindent
\texttt{RNAalifold} uses aligned sequences as input. Thus, our first
step will be to align the sequences. We use \texttt{clustalw2} in this
example, since it's one of the most widely used alignment programs and
has been shown to work well on structural RNAs. Other alignment
programs can be used (including programs that attempt to do structural
alignment of RNAs), but the resulting multiple sequence alignment must
be in \texttt{Clustal} format. Get \texttt{clustalw2} from \url{http://www.clustal.org/clustal2} and install it as you did the other packages.

%===
%\begin{frame}[fragile]
 \frametitle{Consensus Structure from related Sequences}

  \begin{enumerate}
  \item Prepare a sequence file (use file \texttt{four.seq} and copy it to your working directory)
  \item Align the sequences
  \item Compute the consensus structure from the alignment
  \item Inspect the output files \texttt{alifold.out}, \texttt{alirna.ps},
    \texttt{alidot.ps}
  \item For comparison fold the sequences individually using \texttt{RNAfold}
  \end{enumerate}
\begin{verbatim}
$ clustalw2 four.seq > four.out
\end{verbatim}
\vspace*{-5ex}
\texttt{Clustalw2} creates two more output files, \texttt{four.aln} and \texttt{four.dnd}. For \texttt{RNAalifold} you need the \texttt{.aln} file.
\begin{verbatim}
$ RNAalifold -p four.aln
$ RNAfold -p < four.seq
\end{verbatim}%$
\vspace*{-3ex}
\texttt{RNAalifold} output:
\begin{scriptsize}
\begin{verbatim}
__GCCGAUGUAGCUCAGUUGGG_AGAGCGCCAGACUGAAAAUCAGAAGGUCCCGUGUUCAAUCCACGGAUCCGGCA__
..(((((((..((((.........)))).(((((.......))))).....(((((.......))))))))))))...
 minimum free energy = -15.12 kcal/mol (-13.70 +  -1.43)
..(((((({..((((.........)))).(((((.......))))).....(((((.......)))))}))))))...
 free energy of ensemble = -15.75 kcal/mol
 frequency of mfe structure in ensemble 0.361603
..(((((((..((((.........)))).(((((.......))))).....(((((.......))))))))))))... -15.20 {-13.70 +  -1.50}
\end{verbatim}
\end{scriptsize}
\texttt{RNAfold} output:
\begin{scriptsize}
\begin{verbatim}
>M10740 Yeast-PHE
GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUUUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCA
((((((((........((((.((((((..((((...........))))..))))))..))))..)))))))). (-21.60)
((((((({...,,.{,((((.((((((..((((...........))))..))))))..))))),)))))))). [-23.20]
((((((((.........(((.((((((..((((...........))))..))))))..)))...)))))))). {-20.00 d=9.63}
 frequency of mfe structure in ensemble 0.0744065; ensemble diversity 15.35
>K00349 Drosophila-PHE
[...]
\end{verbatim}
\end{scriptsize}
%\end{frame}

\noindent
The output contains a consensus sequence and the consensus structure
in dot-bracket notation. The consensus structure has an energy of
$-15.12$~kcal/mol, which in turn consists of the average free energy of the
structure $-13.70$~kcal/mol and the covariance term $-1.43$~kcal/mol. The
strongly negative covariance term shows that there must be a fair number of
consistent and compensatory mutations, but in contrast to the average free
energy it's not meaningful in the biophysical sense.
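
The numbers in the \texttt{RNAalifold} output can be verified with a short calculation; this is a sketch that assumes the default folding temperature of $37\,^{\circ}\mathrm{C}$, i.e.\ $RT \approx 0.616$~kcal/mol. The two terms in parentheses add up to the reported minimum free energy,
\[
  -13.70 + (-1.43) = -15.13 \approx -15.12\;\text{kcal/mol},
\]
where the difference in the last digit presumably stems from rounding of the printed terms. The reported frequency of the MFE structure follows from the MFE and the free energy of the ensemble:
\[
  \exp\!\left(-\frac{-15.12-(-15.75)}{0.616}\right) = \exp\!\left(-\frac{0.63}{0.616}\right) \approx 0.36 .
\]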

Compare the predicted consensus structure with the structures predicted for
the individual sequences using \texttt{RNAfold}. How often is the correct
``clover-leaf'' shape predicted?

For better visualization, a structure-annotated alignment or a color-annotated structure drawing can be
generated by using the \texttt{--aln} and \texttt{--color} options of \texttt{RNAalifold}.
\begin{verbatim}
$ RNAalifold --color --aln four.aln
$ gv aln.ps &
$ gv alirna.ps &
\end{verbatim}%$
%===
%\begin{frame}
  \frametitle{\texttt{RNAalifold} Output Files}
  \begin{center}
\begin{scriptsize}
\begin{verbatim}
4 sequence; length of alignment 78
alifold output
    6    72  0  99.8%   0.007 GC:2    GU:1    AU:1
   33    43  0  98.9%   0.033 GC:2    GU:1    AU:1
   31    45  0  99.0%   0.030 CG:3    UA:1
   15    25  0  98.9%   0.045 CG:3    UA:1
    5    73  1  99.7%   0.008 CG:2    GC:1
   13    27  0  99.1%   0.042 CG:4
   14    26  0  99.1%   0.042 UA:4
    4    74  1  99.5%   0.015 CG:3
[...]
\end{verbatim}
\end{scriptsize}
\end{center}

%  \centerline{\includegraphics[width=.6\textwidth]{Figures/aout.eps}}

\vspace*{1ex}
  \includegraphics[width=.48\textwidth]{Figures/alirna.eps}
  \includegraphics[width=.50\textwidth]{Figures/alidot.eps}
%\end{frame}

\noindent
The last output file produced by \texttt{RNAalifold -p}, named
\texttt{alifold.out}, is a plain text file with detailed information
on all plausible base pairs sorted by the likelihood of the pair.  In
the example above we see that the pair $(6,72)$ has no inconsistent
sequences, is predicted almost with probability 1, and occurs as a
\textsf{GC} pair in two sequences, a \textsf{GU} pair in one, and an
\textsf{AU} pair in another.
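
As a small worked example, the covariance contribution of this pair can be estimated by hand from the bonus values given earlier (1 for a consistent and 2 for a compensatory mutation). This is a sketch that averages over all sequence pairs, as described in the text, and ignores any additional weighting applied by the program. With two \textsf{GC}, one \textsf{GU} and one \textsf{AU} sequence there are $\binom{4}{2}=6$ sequence pairs:
\begin{center}
\begin{tabular}{lcc}
  sequence pair & number of such pairs & score per pair\\
  \hline
  GC / GC & 1 & 0\\
  GC / GU & 2 & 1\\
  GC / AU & 2 & 2\\
  GU / AU & 1 & 1\\
\end{tabular}
\end{center}
\[
  C_{6,72} \approx \frac{1\cdot 0 + 2\cdot 1 + 2\cdot 2 + 1\cdot 1}{6} = \frac{7}{6} \approx 1.17 .
\]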

\noindent
\texttt{RNAalifold} automatically produces a drawing of the consensus
structure in Postscript format and writes it to the file
``\texttt{alirna.ps}''.  In the structure graph consistent and compensatory
mutations are marked by a circle around the variable base(s),
i.e. pairs where one pairing partner is encircled exhibit consistent
mutations, whereas pairs supported by compensatory mutations have both
bases marked. Pairs that cannot be formed by some of the sequences are
shown gray instead of black.
In the example given, many pairs show such inconsistencies. This is because
one of the sequences (AF346993) is not aligned well by \texttt{clustalw}.

Note that subsequent calls to \texttt{RNAalifold} will overwrite any
existing output \texttt{alirna.ps} (\texttt{alidot.ps},
\texttt{alifold.out}) files in the current directory. Be sure to
rename any files you want to keep.

\subsubsection{Structure predictions for the individual sequences}

The consensus structure computed by \texttt{RNAalifold} will contain only
pairs that can be formed by most of the sequences. The structures of the
individual sequences will typically have additional base pairs that are not
part of the consensus structure. Moreover, ncRNA may exhibit a highly
conserved core structure while other regions are more variable. It may
therefore be desirable to produce structure predictions for one particular
sequence, while still using covariance information from other sequences.

This can be accomplished by first computing the consensus structure for all
sequences using \texttt{RNAalifold}, then folding individual sequences using
\texttt{RNAfold -C\,} with the consensus structure as a constraint. In
constraint folding mode, \texttt{RNAfold -C\,} allows only those base pairs to form
that are compatible with the constraint structure. The resulting
structure typically contains most of the constraint (the consensus
structure) plus some additional pairs that are specific to this sequence.

%\begin{frame}
\frametitle{Refolding Individual Sequences}
The \texttt{refold.pl} script (find it in the \texttt{Progs} folder) removes gaps and maps the consensus structure to
each individual sequence.
\begin{scriptsize}
\begin{small}
\begin{verbatim}
$ RNAalifold  RNaseP.aln > RNaseP.alifold
$ gv alirna.ps
$ refold.pl RNaseP.aln RNaseP.alifold | head -3 > RNaseP.cfold
$ RNAfold -C --noLP < RNaseP.cfold > RNaseP.refold
$ gv E-coli_ss.ps
\end{verbatim} %$
\end{small}
\end{scriptsize}
%\end{frame}

If you compare the refolded structure (\verb+E-coli_ss.ps+) with the
structure you get by simply folding the \emph{E.~coli} sequence in the
\texttt{RNaseP.seq} file (\verb+RNAfold --noLP+), you find a clear
rearrangement.

In cases where constrained folding results in a structure that is very
different from the consensus, or if the energy from constrained
folding is much worse than from unconstrained folding, this may
indicate that the sequence in question does not really share a common
structure with the rest of the alignment or is misaligned. One should
then either remove or re-align that sequence and recompute the
consensus structure.

Note that since RNase P forms sizable pseudo-knots, a perfect
prediction is impossible in this case.

%%% This will come back when Teresa fixed Milli's script to highlight
%%% similarities/differences of n given structures.

%%\begin{frame}
%\frametitle{Refolded structure of E.~coli RNase P}
%
%\begin{figure}[h]
%  \includegraphics[width=0.6\textwidth,clip=true,trim=0cm 2cm 0cm 2cm]{Figures/ec_RNaseP.ps}
%  \caption{Correct base pairs pairing in all tree structures are
%    indicated in yellow, correct pairs added by refolding are colored
%    orange. Incorrect base pairs which are only paired in the
%    alifolded and refolded structures but not in the native structure
%    are painted blue.}
%\end{figure}
%%%\TODO{colors created with sequenzvergleich.pl}
%
%
%%\end{frame}
%
%Comparing the structures with the \emph{E.~coli} reference
%structure we find that 55 basepairs are predicted correctly in all three structures.
%After running \texttt{refold} three additional basepairs were predicted right.


%===

\section{Structural Alignments}

\subsection{Manually correcting Alignments}
As the tRNA example above demonstrates, sequence alignments are often
unsuitable as a basis for determining consensus structures. As a last
resort, one may always try manually correcting an alignment. Sequence
editors that are structure-aware may help in this task.
In particular the SARSE \url{http://sarse.kvl.dk/} editor, and the
\texttt{ralee-mode} for emacs
\url{http://personalpages.manchester.ac.uk/staff/sam.griffiths-jones/software/ralee/}
are useful.

After downloading the \texttt{ralee}-files extract them and put them
in a folder called \texttt{\textasciitilde/Tutorial/Progs/ralee}. Now
read the \texttt{00README} file and follow the instructions. If you
don't find an ``.emacs'' file in your home directory execute the
following command to copy it from the Data directory.\\

\begin{verbatim}
$ cp Data/dot.emacs ~/
\end{verbatim}

Next try correcting the \texttt{ClustalW} generated alignment \texttt{(four.aln)} from the
example above. For this we first have to convert it to the Stockholm format.
Fortunately the formats are similar. Make a copy of the file, then add the
correct header line and the consensus structure from \texttt{RNAalifold}:
\begin{verbatim}
$ cp four.aln four.stk
$ emacs four.stk
  .....
$ cat four.stk
\end{verbatim}
The final alignment should look like:
\begin{tiny}
\begin{verbatim}
# STOCKHOLM 1.0

K00349          --GCCGAAAUAGCUCAGUUGGG-AGAGCGUUAGACUGAAGAUCUAAAGGUCCCCGGUUCAAUCCCGGGUUUCGGCA--
K00283          GGGCCG--GUAGCUCAUUUAGGCAGAGCGUCUGACUCUUAAUCAGACGGUCGCGUGUUCGAAUC--GCGUCCGGCCCA
M10740          --GCGGAUUUAGCUCAGUUGGG-AGAGCGCCAGACUGAAGAUUUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCA--
AF346993        --CAGAGUGUAGCUUAAC---ACAAAGCACCCAACUUACACUUAGGAGAUUUCAACUUAACUUGACCGCUCUGA----
#=GC SS_cons    ..(((((((..((((.........)))).(((((.......))))).....(((((.......))))))))))))...
//
\end{verbatim}%$
\end{tiny}
Now use the functions under the edit menu to improve the alignment; the
coloring by structure should help to highlight misaligned positions.

\subsection{Automatic structural alignments}

Next, we'll compute alignments using two structural alignment programs:
\texttt{LocARNA} and \texttt{T-Coffee}. \texttt{LocARNA} is an implementation
of the Sankoff algorithm for simultaneous folding and alignment (i.e. it
will generate both alignment and consensus structure). \texttt{T-Coffee} uses a
progressive alignment algorithm.

Download \texttt{LocARNA} from \url{http://www.bioinf.uni-freiburg.de/Software/LocARNA/},
extract and install it in your \texttt{Progs} folder and, if necessary, add it to your
\texttt{PATH} variable or copy it into the corresponding directory.

Both programs can read the FASTA file \texttt{four.seq}.
\begin{verbatim}
$ mlocarna --alifold-consensus-dp four.seq
[...]
M10740             GCGGAUUUAGCUCAGUUGGG-AGAGCGCCAGACUGAAGAUUUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCA
K00349             GCCGAAAUAGCUCAGUUGGG-AGAGCGUUAGACUGAAGAUCUAAAGGUCCCCGGUUCAAUCCCGGGUUUCGGCA
K00283             GGGCCGGUAGCUCAUUUAGGCAGAGCGUCUGACUCUUAAUCAGACGGUCGCGUGUUCGAAUCGCGUCCGGCCCA
AF346993           CAGAGUGUAGCUUAAC---ACAAAGCACCCAACUUACACUUAGGAGAUU-UCAACUUAA-CUUGACCGCUCUGA
alifold            (((((((..((((.........)))).(((((.......))))).....(((((.......)))))))))))).
        (-52.53 = -21.58 + -30.95)
\end{verbatim}%$

\frametitle{Install T-Coffee}

Get \texttt{T-Coffee} from its GitHub page,
\url{https://github.com/cbcrg/tcoffee}. The \texttt{README.md} there gives detailed
instructions on how to download and install the software.

Go to the \texttt{downloads} directory and use the provided installer
by typing
\begin{verbatim}
$ cd Tutorial/downloads
$ git clone git@github.com:cbcrg/tcoffee.git tcoffee
$ cd tcoffee/compile/
$ make t_coffee
$ cp t_coffee ~/Tutorial/Progs/
\end{verbatim}

Afterwards align \texttt{four.seq} using \texttt{t\_coffee} and compare the output with
the one given by LocARNA.
\begin{verbatim}
$ t_coffee four.seq > t_coffee.out

 [t_coffee.out]
 CLUSTAL FORMAT for T-COFFEE 20150925_14:18 [http://www.tcoffee.org] [MODE:  ],
 CPU=0.00 sec, SCORE=739, Nseq=4, Len=74

 M10740          GCGGAUUUAGCUCAGUU-GGGAGAGCGCCAGACUGAAGAUUUGGAGGUCC
 K00349          GCCGAAAUAGCUCAGUU-GGGAGAGCGUUAGACUGAAGAUCUAAAGGUCC
 K00283          GGGCCGGUAGCUCAUUUAGGCAGAGCGUCUGACUCUUAAUCAGACGGUCG
 AF346993        CAGAGUGUAGCUUAAC---ACAAAGCACCCAACUUACACUUAGGAGAUUU
                        ***** *       * ***     ***     *     * *

 M10740          UGUGUUCGAUCCACAGAAUUCGCA
 K00349          CCGGUUCAAUCCCGGGUUUCGGCA
 K00283          CGUGUUCGAAUCGCGUCCGGCCCA
 AF346993        CAACUUAACUUGACCG--CUCUGA
                     **                 *
\end{verbatim}

Use RNAalifold to predict structures for all your alignments
(ClustalW, handcrafted, T-Coffee, and LocARNA) and compare them. The
handcrafted and LocARNA alignments should be essentially perfect.

Other interesting approaches to structural alignment include
\texttt{CMfinder}, \texttt{dynalign}, and \texttt{stemloc}.


%% % talk.tex
%% % > dvips -P pdf -ta4 talk -o talk.temp.ps
%% % > psnup -1 -m1cm -W128mm -H96mm -pa4 talk.temp.ps talk.handout.ps
%% % > psnup -2 -m1cm -b1cm -W128mm -H96mm -pa4 talk.temp.ps talk.handout.ps
%% %
%% % -*-latex-*-
%% \NeedsTeXFormat{LaTeX2e}
%% \documentclass[a4paper]{article}
%% \usepackage{beamerarticle}
%% %\documentclass[compress,ignorenonframetext]{beamer}
%% \usepackage[english]{babel}
%% \usepackage{url}
%% %\usepackage{amsmath,amssymbols}

%% % presentation mode
%% % handout mode
%% \mode<article>{
%%   \usepackage{graphics}
%%   \usepackage{pgf}
%%   \usepackage{xcolor}
%%   \renewcommand{\theenumi}{\alph{enumi}}
%%   \renewcommand{\labelenumi}{(\theenumi)}
%%   \renewcommand{\theenumii}{\Roman{enumii}}
%%   \renewcommand{\labelenumii}{\theenumii.}
%%   \renewcommand{\labelenumii}{\theenumii.}
%%  }

%% \mode<presentation>{
%%   \beamertemplatenavigationsymbolsempty
%%   \useinnertheme{circles}
%%   \setbeamertemplate{frametitle}[default][center]
%%   \setbeamercovered{transparent}
%%   \setbeamertemplate{frametitle}[default][center]
%% }


%% % globals
%% \def\EPS{./Figures}
%% \title{RNA gene finding tutorial}
%% \author{Stefan Washietl, Ivo Hofacker, Stephan Bernhart, Andreas Gruber}
%% \institute[TBI]{Institute for Theoretical Chemistry\\ University Vienna}
%% \titlegraphic{\centerline{\pgfuseimage{logo}}}
%% \date{\tiny{\url{http://www.tbi.univie.ac.at/}}\\[1ex]\today}

%% % pictures
%% \pgfdeclareimage[width=.5cm]{logo}{\EPS/tbilogo}

%% % colors
%% \colorlet{darkgreen}{green!80!black}

%% \begin{document}

%\frame{\titlepage}
\section{Noncoding RNA gene prediction}

Prediction of ncRNAs is still a challenging problem in bioinformatics.
Unlike protein coding genes, ncRNAs do not have any statistically
significant features in primary sequences that could be used for reliable
prediction. A large class of ncRNAs, however, depends on a defined secondary
structure for its function. As a consequence, evolutionarily conserved
secondary structures can be used as a characteristic signal to detect ncRNAs.
All currently available programs for \emph{de novo} prediction make use of
this principle and are therefore, by construction, limited to structured
RNAs.

%\begin{frame}
  \frametitle{Programs to predict structural RNAs}
  \begin{itemize}
    \item \texttt{\textbf{QRNA}} (Eddy \& Rivas, 2001)
    \item \texttt{ddbRNA} (di Bernardo, Down \& Hubbard, 2003)
    \item \texttt{MSARi} (Coventry, Kleitman \& Berger, 2004)
    \item \texttt{\textbf{AlifoldZ}} (Washietl \& Hofacker, 2004)
    \item \texttt{\textbf{RNAz}} (Washietl, Hofacker \& Stadler, 2005)
    \item \texttt{EvoFold} (Pedersen et al., 2006)
  \end{itemize}
%\end{frame}

\subsection{QRNA}
\texttt{QRNA} analyzes pairwise alignments for characteristic patterns of
evolution. An alignment is scored by three probabilistic models: (i)
position independent, (ii) coding, (iii) RNA. The independent and the
coding models are pair hidden Markov models. The RNA model is a pair
stochastic context-free grammar.  First, it calculates the \emph{likelihood},
i.e.\ the probability that, given a model, the alignment is observed. Second, it
calculates the \emph{posterior probability} that, given an alignment, it
has been generated by one of the three models. The posterior probabilities
are compared to the position independent background model and a ``winner''
is found.
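
The two probabilities are related by Bayes' rule. As a sketch in notation chosen here ($D$ for the alignment, $M$ for one of the three models, $P(M)$ for the weight assigned to each model beforehand):
\[
  P(M \mid D) = \frac{P(D \mid M)\,P(M)}{\sum_{M'} P(D \mid M')\,P(M')} .
\]
Comparing a model to the position independent background model then amounts to a log-odds score such as $\log_2 \bigl[ P(D \mid \text{RNA}) / P(D \mid \text{independent}) \bigr]$; a positive value means that the alignment is explained better by the RNA model than by the background.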

\texttt{QRNA} reads pairwise alignments in MFASTA format (i.e.\ FASTA format
  with gaps).

%\begin{frame}
\frametitle{Three competing models in QRNA}

\centerline{\includegraphics[width=0.5\textwidth,clip=]{Figures/qrna.eps}}
%\end{frame}

%\begin{frame}[fragile]
\frametitle{Installing and basic usage of QRNA}

\begin{itemize}
%% \item get the tarball from \url{ftp://ftp.genetics.wustl.edu/pub/eddy/software} and untar it
\item Use the files in \texttt{qrna-2.0.3d.tar.gz} located in the
  \texttt{Data/programs}-folder shipped with the tutorial
\item don't forget to set the \texttt{QRNADB} environment variable\\
  (e.g. \verb+export QRNADB=$HOME/Tutorial/Data/programs/qrna-2.0.3d/lib/+)
  and add it to your \texttt{.bashrc}
\item follow the instructions in the \texttt{INSTALL} document and make the binaries
\item create the directory
  \texttt{\textasciitilde/Tutorial/Progs/qrna} and move the binaries
  located in the \texttt{src/} sub-directory into this folder and add it to your
  \texttt{.bashrc}\\
  (e.g. \verb+export PATH=${HOME}/Tutorial/Progs:${PATH}:${HOME}/Tutorial/Progs/qrna+)
\item first read the help text (option \texttt{-h}).
\item for advanced use of \texttt{QRNA} read the
\texttt{userguide.pdf} shipped with the package (in the \texttt{documentation} folder)
\item \texttt{-a} tells \texttt{QRNA} to print the alignment
\end{itemize}
\small
\begin{verbatim}
$ eqrna -h
$ eqrna -a Data/qrna/tRNA.fa
$ eqrna -a Data/qrna/coding.fa
\end{verbatim}

%\end{frame}
\begin{verbatim}
[...]
Divergence time (variable): 0.214132 0.208107 0.203995
[alignment ID = 72.37 MUT = 23.68 GAP = 3.95]

length alignment: 76 (id=72.37) (mut=23.68) (gap=3.95)
posX: 0-75 [0-72](73) -- (0.18 0.30 0.36 0.16)
posY: 0-75 [0-75](76) -- (0.14 0.34 0.37 0.14)


	         DA0780 GGGCTCGTAGCTCAGCT.GGAAGAGCGCGGCGTTTGCAACGCCGAGGCCT
	         DA0940 GGGCCGGTAGCTCAGCCTGGGAGAGCGTCGGCTTTGCAAGCCGAAGGCCC

	         DA0780 GGGGTTCAAATCCCCACGGGTCCA..
	         DA0940 CGGGTTCGAATCCCGGCCGGTCCACC
[..]
\end{verbatim}
\normalsize

\subsection{AlifoldZ}
\texttt{AlifoldZ} is based on an old hypothesis: functional RNAs are
thermodynamically more stable than expected by chance. This hypothesis can
be statistically tested by calculating $z$-scores: Calculate the MFE $m$ of
the native RNA and the mean $\mu$ and standard deviation $\sigma$ of the
background distribution of a large number of random (shuffled) RNAs.  The
normalized $z$-score $z=(m-\mu)/\sigma$ expresses how many standard
deviations the native RNA is more stable than random sequences.
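
As a worked example with made-up numbers (the values below are purely illustrative and not taken from a real RNA), suppose $K$ shuffled sequences with MFEs $m_1,\dots,m_K$ are used to estimate
\begin{align*}
  \mu &= \frac{1}{K}\sum_{k=1}^{K} m_k, &
  \sigma &= \sqrt{\frac{1}{K-1}\sum_{k=1}^{K}\left(m_k-\mu\right)^2}.
\end{align*}
If the native RNA has $m=-30.0$~kcal/mol and the shuffled sequences give $\mu=-22.0$~kcal/mol and $\sigma=4.0$~kcal/mol, then
\[
  z = \frac{m-\mu}{\sigma} = \frac{-30.0-(-22.0)}{4.0} = -2.0 ,
\]
i.e.\ the native sequence is two standard deviations more stable than the shuffled background. Negative $z$-scores therefore indicate higher than expected stability.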

Unfortunately, most ncRNAs are not significantly more stable than the
background. See for example the distribution of $z$-scores of some tRNAs,
where the overlap of real (green bars) and shuffled (dashed line) tRNAs
is relatively high.

%\begin{frame}
\frametitle{z-score distribution of tRNAs}
\begin{figure}
\centering
\includegraphics[width=0.6\textwidth,clip=]{Figures/trna-histo.eps}\\
The overlap of real (bars) and shuffled (dashed line) tRNAs is relatively high.
\end{figure}

%\end{frame}
\noindent
\texttt{AlifoldZ} calculates $z$-scores for consensus structures folded by
\texttt{RNAalifold}. This significantly improves the detection performance compared
to single sequence folding.

%\begin{frame}
\frametitle{z-score distribution of tRNA consensus folds}
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth,clip=]{Figures/histos.eps}\\
The separation of real and shuffled tRNAs becomes evident with more sequences in the alignment.
\end{figure}
%\end{frame}

%\begin{frame}[fragile]
\frametitle{Installation and basic usage of AlifoldZ}

\begin{itemize}
%%\item Available at: \small{\url{http://www.tbi.univie.ac.at/papers/SUPPLEMENTS/Alifoldz/}}
\item Use the tarball \texttt{alifoldz\_adopted.tar.gz} located in the
  \texttt{Data/programs/}-folder shipped with the tutorial
\item Copy the files into your \texttt{Progs} directory (It's just one
  single Perl script which needs \texttt{RNAfold} and
  \texttt{RNAalifold} and an important perl module located in the Math
  subdirectory)\\
  \verb+ $ cp -r alifoldz.pl Math/ ~/Tutorial/Progs/+
\item add the perl module to your \texttt{PERL5LIB} variable in the
  \texttt{.bashrc}\\
  \verb+ $ export PERL5LIB=$HOME/Tutorial/Progs/:$PERL5LIB+
\item test the tool
\end{itemize}
\begin{verbatim}
$ alifoldz.pl -h
$ alifoldz.pl < Data/alifoldz/miRNA.aln
$ alifoldz.pl -w 120 -x 100 < Data/alifoldz/unknown.aln
\end{verbatim}

\normalsize

%\end{frame}

\subsection{RNAz}
\TODO{New version by Someone who loves RNAz. This part of the tutorial
  is based on the RNAz 1.0 version which is obsolete quite a while
  already!!!}\\

\texttt{AlifoldZ} has some shortcomings that limit its usefulness in
practice: The $z$-scores are not deterministic, i.e. you get a
different score each time you run \texttt{AlifoldZ}. To get stable
$z$-scores you need to sample a large number of random alignments
which is computationally expensive. Moreover, \texttt{AlifoldZ} is
extremely sensitive to alignment errors.

The program \texttt{RNAz} overcomes these problems by using a different
approach to assess a multiple sequence alignment for significant RNA
structures. It is based on two key innovations: (i) The structure
conservation index (SCI) to measure structural conservation in an alignment
and (ii) $z$-scores that are calculated by regression without sampling.
Both measures are combined to an overall score that is used to classify an
alignment as ``structured RNA'' or ``other''.

%\begin{frame}
\frametitle{The structure conservation index}

\centerline{\includegraphics[width=0.8\textwidth,clip=]{Figures/sci.eps}}

\begin{itemize}

\item The structure conservation index is an easy way to normalize an
  \texttt{RNAalifold} consensus MFE.

\end{itemize}
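
\noindent
In formula form, and as a sketch using symbols chosen here, let $E_A$ be the consensus MFE of the alignment computed by \texttt{RNAalifold} and $\bar{E}$ the mean of the MFEs of the individual sequences folded on their own. The structure conservation index is then usually written as
\[
  \mathrm{SCI} = \frac{E_A}{\bar{E}} .
\]
For illustration with made-up values, $E_A=-18.0$~kcal/mol and $\bar{E}=-20.0$~kcal/mol give $\mathrm{SCI}=0.9$. Values close to 0 indicate that no common structure is found, while values close to 1 (or above, when compensatory mutations add a covariance bonus) indicate a well conserved structure.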

%\end{frame}

%\begin{frame}

  \frametitle{z-score regression}

  \begin{itemize}
  \item The mean $\mu$ and standard deviation $\sigma$ of random samples of
    a given sequence are functions of the length and the base composition:
    $$\mu,\sigma = f\left(\mathrm{length},\frac{GC}{AT},\frac{G}{C},\frac{A}{T}\right)$$
  \item It is therefore possible to \textit{calculate} $z$-scores by
    solving this 5-dimensional regression problem.
  \end{itemize}
%\end{frame}

%\begin{frame}
  \frametitle{SVM Classification}

  \centerline{\includegraphics[width=0.5\textwidth,angle=-90,clip=]{Figures/contour.eps}}

  \begin{itemize}
  \item A support vector machine learning algorithm is used to classify an
    alignment based on $z$-score and structure conservation index.
  \end{itemize}

%\end{frame}

%\begin{frame}[fragile]

\frametitle{Installation of RNAz}

Installation is done according to the instructions used by the \texttt{ViennaRNA Package}.
Just use the \texttt{--prefix} option as mentioned earlier and add the PATH to \texttt{.bashrc}
\begin{itemize}
\item RNAz is available at: \url{http://www.tbi.univie.ac.at/~wash/RNAz}
\item Package includes the core program \texttt{RNAz} in ISO \texttt{C},
a set of helper programs in Perl, and an extensive manual.
\end{itemize}

%\end{frame}

%\begin{frame}[fragile]

  \frametitle{Basic usage of RNAz}
\TODO{where to get examples from (RNAz install package) - commands work with v2 but text needs to be adapted}
  \begin{itemize}
  \item \texttt{RNAz} reads one or more multiple sequence alignments in \texttt{clustalw} or
    MAF format.
  \end{itemize}

\small
\begin{verbatim}
$ RNAz --help
$ RNAz tRNA.aln
$ RNAz --both-strands --predict-strand tRNA.maf
\end{verbatim}
\normalsize

%\end{frame}

%\begin{frame}[fragile]

  \frametitle{Advanced usage of RNAz}

  \begin{itemize}
  \item \texttt{RNAz} is limited to a maximum alignment length of 400 columns and a
    maximum number of 6 sequences. To process larger alignments, a set of
    Perl helper scripts is used.
  \item Selecting one or more subsets of sequences from an alignment with
    more than 6 sequences:
  \end{itemize}

  \small
\begin{verbatim}
$ rnazSelectSeqs.pl miRNA.maf |RNAz
$ rnazSelectSeqs.pl --num-seqs=4 --num-samples=3 miRNA.maf |RNAz
\end{verbatim}
  \normalsize
  \begin{itemize}
    \item Scoring long alignments in overlapping windows:
  \end{itemize}
  \small
\begin{verbatim}
$ rnazWindow.pl --window=120 --slide=40 unknown.aln \
                | RNAz --both-strands
\end{verbatim}
  \normalsize
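
\noindent
The number of windows generated this way can be estimated by hand. As a sketch (with $L$, $w$ and $s$ introduced here for the alignment length, the window size and the step size, and ignoring how the script treats a shorter final window), an alignment of $L=400$ columns yields
\[
  \left\lfloor \frac{L-w}{s} \right\rfloor + 1 = \left\lfloor \frac{400-120}{40} \right\rfloor + 1 = 8
\]
windows, starting at columns $1, 41, 81, \dots, 281$. With \texttt{--both-strands} each window is scored on both strands.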

%\end{frame}

\subsection{Large scale screens}

The \texttt{RNAz} package provides a set of Perl scripts that implement a
complete analysis pipeline suitable for medium to large scale screens of
genomic data.

%\begin{frame}

  \frametitle{General procedure}

  \begin{enumerate}
    \item Obtain or create multiple sequence alignments in MAF format
    \item Run through the \texttt{RNAz} pipeline:
  \end{enumerate}

  \centerline{\includegraphics[width=0.6\textwidth,clip=]{Figures/flowchart.eps}}

%\end{frame}

%\begin{frame}
\frametitle{Examples in this tutorial}

\begin{enumerate}

\item Align the Epstein-Barr virus genome (Acc.no: NC\_007605) to two related
  primate viruses (Acc.nos: NC\_004367, NC\_006146) using \texttt{multiz}
  and run the alignment through the \texttt{RNAz} pipeline.
\TODO{where do these data come from? file from NCBI differs from those hidden in genefinding/rnaz/herpes}

\item Analyze a snoRNA cluster in the human genome for conserved RNA
  structures: download pre-computed alignments from the UCSC genome browser
  and run them through the \texttt{RNAz} pipeline.

\end{enumerate}


%\end{frame}
%\begin{frame}[fragile]

  \frametitle{Example a: Preparation of data}
  \begin{itemize}
  \item \texttt{multiz} and \texttt{blastz} are available here:
    \url{http://www.bx.psu.edu/miller_lab/}
  \item Before compiling, edit the \texttt{multiz} \texttt{Makefile} so that
    compiler warnings are no longer treated as errors. Replace
\begin{verbatim}
CFLAGS = -Wall -Wextra -Werror
\end{verbatim}
with
\begin{verbatim}
CFLAGS = -Wall -Wextra #-Werror
\end{verbatim}
and then run \texttt{make} to compile both programs.
  \item Download the viral genomes in FASTA format and reformat the header
    strictly according to the rules given in the \texttt{multiz} documentation
    (\url{http://www.bx.psu.edu/miller_lab/dist/tba_howto.pdf}), e.g.:
  \end{itemize}
\TODO{dont understand what to do}
\small
\begin{verbatim}
>NC_007605:genome:1:+:149696
AGAATTCGTCTTGCTCTATTCACCCTTACTTTTCTTCTTGCCCGTTCTCTTTCTTAGTAT
GAATCCAGTATGCCTGCCTGTAATTGTTGCGCCCTACCTCTTTTGGCTGGCGGCTATTGC
CGCCTCGTGTTTCACGGCCTCAGTTAGTACCGTTGTGACCGCCACCGGCTTGGCCCTCTC
ACTTCTACTCTTGGCAGCAGTGGCCAGCTCATATGCCGCTGCACAAAGGAAACTGCTGAC
ACCGGTGACAGTGCTTACTGCGGTTGTCACTTGTGAGTACACACGCACCATTTACAATGC
ATGATGTTCGTGAGATTGATCTGTCTCTAACAGTTCACTTCCTCTGCTTTTCTCCTCAGT
CTTTGCAATTTGCCTAACATGGAGGATTGAGGACCCACCTTTTAATTCTCTTCTGTTTGC
[...]
\end{verbatim}
\normalsize
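
A minimal sketch of how such a header could be generated is given below. It
assumes that the header follows the pattern of the example above (name,
\texttt{genome}, start position 1, strand \texttt{+}, sequence length) and that
the downloaded file contains a single sequence. The file name \texttt{raw.fa}
and its dummy content are hypothetical; the \texttt{multiz} documentation
remains the authoritative source for the header rules.

\small
\begin{verbatim}
# Hypothetical stand-in for the downloaded FASTA file, created here
# only so that the sketch runs on its own
printf '>NC_007605 dummy record\nACGTACGT\nACGT\n' > raw.fa

name=NC_007605
# Sequence length = number of characters on all non-header lines
len=$(grep -v '^>' raw.fa | tr -d '\n\r ' | wc -c | tr -d ' ')
# Write the new header followed by the unchanged sequence lines;
# the output file is named exactly like the sequence
{ echo ">$name:genome:1:+:$len"; grep -v '^>' raw.fa; } > "$name"
head -n 1 "$name"
\end{verbatim}
\normalsize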
%\end{frame}
%\begin{frame}[fragile]
  \frametitle{Example a: Aligning viral genomes}

  \begin{itemize}
  \item To obtain a multiple alignment, a phylogenetic tree and the following
    three steps are necessary:
    \begin{enumerate}
      \item Run \texttt{blastz} on all pairs of sequences.
      \item Combine the \texttt{blastz} results into multiple sequence alignments.
      \item Project the raw alignments onto a reference sequence.
    \end{enumerate}
    \end{enumerate}
  \item The corresponding commands:
  \end{itemize}
\small
\begin{verbatim}
$ all_bz - "((NC_007605 NC_006146) NC_004367)" | bash
$ tba "((NC_007605 NC_006146) NC_004367)" \
        *.sing.maf raw-tba.maf
$ maf_project raw-tba.maf NC_007605 > final.maf
\end{verbatim}
\normalsize
\begin{itemize}
\item Note: The tree is given in a NEWICK-like format with blanks instead
  of commas. The sequence data files must have exactly the same names as
  used in this tree and in the FASTA headers.
\end{itemize}
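
If the tree is available as a standard NEWICK string, the blank-separated
form described in the note above can be derived as sketched below. The input
string is a hypothetical example, and the check for the sequence files merely
illustrates the naming requirement.

\small
\begin{verbatim}
# Hypothetical input: the tree in standard NEWICK notation
newick='((NC_007605,NC_006146),NC_004367);'
# Replace commas by blanks and drop the trailing semicolon
tree=$(printf '%s' "$newick" | tr ',' ' ' | tr -d ';')
echo "$tree"

# Every leaf name must also be the name of a sequence file
for name in $(printf '%s' "$tree" | tr -d '()'); do
  [ -f "$name" ] || echo "missing sequence file: $name"
done
\end{verbatim}
\normalsize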


%\end{frame}

%\begin{frame}[fragile]

  \frametitle{Example a: Running the pipeline I}

  \begin{itemize}
  \item First, the alignments are filtered and sliced into overlapping windows:
  \end{itemize}
\small
\begin{verbatim}
$ rnazWindow.pl < final.maf > windows.maf
\end{verbatim}
\normalsize
  \begin{itemize}
  \item \texttt{RNAz} is run on these windows:
  \end{itemize}
\small
\begin{verbatim}
$ RNAz --both-strands --show-gaps --cutoff=0.5 windows.maf \
        > rnaz.out
\end{verbatim}
\normalsize
\begin{itemize}
  \item Overlapping hits are combined into ``loci'' and visualized as web pages:
  \end{itemize}
\small
\begin{verbatim}
$ rnazCluster.pl --html rnaz.out > results.dat
\end{verbatim}
\normalsize

%\end{frame}

%\begin{frame}[fragile]

 \frametitle{Example a: Running the pipeline II}

\begin{itemize}
  \item The predicted hits are compared with available annotation of the genome:
  \end{itemize}
\small
\begin{verbatim}
$ rnazAnnotate.pl --bed annotation.bed results.dat \
          > results_annotated.dat
\end{verbatim}
\normalsize
\begin{itemize}
  \item The results file is formatted as an HTML overview page:
  \end{itemize}
\small
\begin{verbatim}
$ rnazIndex.pl --html results_annotated.dat \
          > results/index.html
\end{verbatim}
\normalsize
%\end{frame}


%\begin{frame}[fragile]

  \frametitle{Example a: Statistics on the results}

\begin{itemize}
\item \texttt{rnazIndex.pl} can be used to generate a BED-formatted
  annotation file, which can be analyzed using \texttt{rnazBEDstats.pl}
  (after sorting, in case the input alignments were unsorted):
\end{itemize}

\small
\begin{verbatim}
$ rnazIndex.pl --bed results.dat | \
          rnazBEDsort.pl | rnazBEDstats.pl
\end{verbatim}
\normalsize
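
What sorting a BED file means can be sketched with standard tools: the lines
are ordered by sequence name (column 1) and then numerically by start
position (column 2). The file \texttt{hits.bed} and its three entries are
hypothetical, and \texttt{rnazBEDsort.pl} may differ in detail from this
plain \texttt{sort} call.

\small
\begin{verbatim}
# Hypothetical, unsorted BED entries (tab-separated):
# sequence name, start, end, locus name
printf 'chr11\t300\t420\tlocus3\n'  > hits.bed
printf 'chr11\t100\t220\tlocus1\n' >> hits.bed
printf 'chr11\t180\t300\tlocus2\n' >> hits.bed

# Sort by sequence name, then numerically by start position
sorted=$(sort -k1,1 -k2,2n hits.bed)
echo "$sorted"
\end{verbatim}
\normalsize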

\begin{itemize}
\item \texttt{rnazFilter.pl} can be used to filter the results by different
  criteria. In this case it gives us all loci with $P>0.9$:
\end{itemize}

\small
\begin{verbatim}
$ rnazFilter.pl "P>0.9" results.dat | \
          rnazIndex.pl --bed | \
          rnazBEDsort.pl | rnazBEDstats.pl
\end{verbatim}
\normalsize

\begin{itemize}
\item To estimate the number of (statistical) false positives, one can
  repeat the complete screen with randomized alignments:
\end{itemize}

\small
\begin{verbatim}
$ rnazRandomizeAln.pl final.maf > random.maf
\end{verbatim}
\normalsize
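
The randomized file \texttt{random.maf} is then processed with the same
pipeline steps and the same cutoff as \texttt{final.maf}. As a rough sketch,
and using notation that is introduced here only for illustration, let
$N_{\mathrm{native}}$ be the number of loci found in the original alignments
and $N_{\mathrm{random}}$ the number found in the randomized ones. The
fraction of native hits expected to be false positives can then be estimated
as
\[
  \widehat{\mathrm{FDR}} \approx \frac{N_{\mathrm{random}}}{N_{\mathrm{native}}} .
\]
This estimate assumes that the randomization removes the structural signal
while preserving the other properties of the alignments that influence the
score.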

%\end{frame}


%\begin{frame}[fragile]

  \frametitle{Example b: Obtaining pre-computed alignments from UCSC}

  \begin{itemize}
  \item Go to the UCSC genome browser (\url{http://genome.ucsc.edu}) and open
    ``Tables''. Download the ``multiz17'' alignments in MAF format for the
    region chr11:93103000-93108000.
  \end{itemize}

  \centerline{\includegraphics[width=\textwidth,clip=]{Figures/table-browser.eps}}

%\end{frame}

%\begin{frame}[fragile]

  \frametitle{Example b: Running the pipeline}

  \begin{itemize}
  \item The Perl scripts are run in the same order as in Example a:
  \end{itemize}
\small
\begin{verbatim}
$ rnazWindow.pl --min-seqs=4 region.maf > windows.maf
$ RNAz --both-strands --show-gaps --cutoff=0.5 windows.maf \
          > rnaz.out
$ rnazCluster.pl --html rnaz.out > results.dat
$ rnazAnnotate.pl --bed annotation.bed results.dat \
          > results_annotated.dat
$ rnazIndex.pl --html results_annotated.dat \
          > results/index.html
\end{verbatim}
\normalsize
  \begin{itemize}
  \item The results can be exported as a UCSC BED file, which can be displayed
    in the genome browser:
  \end{itemize}
\small
\begin{verbatim}
$ rnazIndex.pl --bed --ucsc results.dat > prediction.bed
\end{verbatim}
\normalsize

%\end{frame}

%\begin{frame}[fragile]

  \frametitle{Example b: Visualizing the results on the genome browser}

  \begin{itemize}
  \item Upload the BED file as ``Custom Track''\dots
  \end{itemize}

  \centerline{\includegraphics[width=0.8\textwidth,clip=]{Figures/table-browser.eps}}

  \begin{itemize}
  \item \dots and have a look at the results:
  \end{itemize}

  \centerline{\includegraphics[width=\textwidth,clip=]{Figures/snos.eps}}


%\end{frame}

%\end{document}

%\renewcommand{\refname}{\Large Literature}
%\nocite{*}
%\bibliographystyle{unsrt}
%\bibliography{./talk}

%===
\end{document}
%===

% LocalWords:  ATGAAGATGA gzipped RNAalifold RNAs dimerization RNAdistance pdf
% LocalWords:  RNAduplex RNAeval RNAfold RNAheat RNAinverse RNALfold RNApaln ta
% LocalWords:  RNApdist RNAplfold RNAplot RNAsubopt stdout stdin wc cd pwd txt
% LocalWords:  MANPATH online RNAcofold Zukers mfold cmount coloraln alirna ps
% LocalWords:  colorrna dpzoom databanks readseq popt Zuker's subopt relplot ss
% LocalWords:  MFE mol PostScript dp definedness xy xmgrace ij nt noLP tRNAs gv
% LocalWords:  alifold alidot gsview RNASubopt mRNA Dalgarno AUGC Fp hetero kb
% LocalWords:  GC UA UG tRNA clustalw Clustal kcal miRNA microRNA mir pos intra
% LocalWords:  pre executables LaTeX basename fasta GIF JPEG inv rRNA Cofolding
% LocalWords:  mmol minima treekin ncRNA coli RNase pknotsRG paRNAss RNAmovies
% LocalWords:  RNAforester RNAshapes pfold suboptimals siRNA foldings SARSE aln
% LocalWords:  ralee emacs QRNA ddbRNA di MSARi Kleitman AlifoldZ EvoFold al un
% LocalWords:  Pedersen Untar MFASTA SVM ClustalW MAF Acc multiz snoRNA UCSC de
% LocalWords:  blastz NEWICK rnazIndex rnazBEDstats RNAzfilter chr RybB ompN et
% LocalWords:  RNAup stral LocARNA StrAl CMfinder dynalign stemloc Noncoding
% LocalWords:  novo