Below, you\textquotesingle{}ll find sections that introduce the most common notations for sequence and structure data, specifications of bioinformatics sequence and structure file formats, and the various output file formats produced by our library.
2
3
4\begin{DoxyItemize}
5\item \mbox{\hyperlink{rna_structure_notations}{RNA Structure Notations}} describes the different notations and representations of RNA secondary structures
6\item \mbox{\hyperlink{file_formats}{File Formats}} gives an overview of the file formats compatible with our library
7\item \mbox{\hyperlink{plots}{Plotting}} shows the different (Post\+Script) plotting functions for RNA secondary structures, feature probabilities, and multiple sequence alignments
8\end{DoxyItemize}\hypertarget{rna_structure_notations}{}\doxysection{RNA Structure Notations}\label{rna_structure_notations}
9\hypertarget{rna_structure_notations_sec_structure_representations}{}\doxysubsection{Representations of Secondary Structures}\label{rna_structure_notations_sec_structure_representations}
The standard representation of a secondary structure in our library is the \mbox{\hyperlink{rna_structure_notations_dot-bracket-notation}{Dot-\/\+Bracket Notation (a.\+k.\+a. Dot-\/\+Parenthesis Notation)}}, where matching brackets symbolize base pairs and unpaired bases are shown as dots. Based on that notation, more elaborate representations have been developed to include additional information, such as the loop context a nucleotide belongs to, and to annotate pseudo-\/knots.\hypertarget{rna_structure_notations_dot-bracket-notation}{}\doxysubsubsection{Dot-\/\+Bracket Notation (a.\+k.\+a. Dot-\/\+Parenthesis Notation)}\label{rna_structure_notations_dot-bracket-notation}
The Dot-\/\+Bracket notation, introduced early in the history of the Vienna\+RNA Package, denotes base pairs by matching pairs of parentheses {\ttfamily ()} and unpaired nucleotides by dots {\ttfamily .}.

As a simple example, consider a helix of size 4 enclosing a hairpin of size 4. In dot-\/bracket notation, this is annotated as

{\ttfamily ((((....))))}

{\bfseries{Extended Dot-\/\+Bracket Notation}}

A more generalized version of the original Dot-\/\+Bracket notation may use additional pairs of brackets, such as {\ttfamily $<$$>$}, {\ttfamily \{\}}, and {\ttfamily \mbox{[}\mbox{]}}, and matching pairs of uppercase/lowercase letters. This allows for annotating pseudo-\/knots, since different pairs of brackets are not required to be nested.

The following annotations of a simple structure with two crossing helices of size 4 are equivalent\+:
22
23{\ttfamily $<$$<$$<$$<$\mbox{[}\mbox{[}\mbox{[}\mbox{[}....$>$$>$$>$$>$\mbox{]}\mbox{]}\mbox{]}\mbox{]}}~\newline
24 {\ttfamily ((((AAAA....))))aaaa}~\newline
25 {\ttfamily AAAA\{\{\{\{....aaaa\}\}\}\}} \begin{DoxySeeAlso}{See also}
26\mbox{\hyperlink{group__struct__utils__dot__bracket_ga55c4783060a1464f862f858d5599c9e1}{vrna\+\_\+db\+\_\+pack()}}, \mbox{\hyperlink{group__struct__utils__dot__bracket_ga6490adff857d84ce06e6f379ae3a4512}{vrna\+\_\+db\+\_\+unpack()}}, \mbox{\hyperlink{group__struct__utils__dot__bracket_gafd1304f5a86e2e3f1425e725cde44fa2}{vrna\+\_\+db\+\_\+flatten()}}, \mbox{\hyperlink{group__struct__utils__dot__bracket_ga690425199c8b71545e7196e3af1436f8}{vrna\+\_\+db\+\_\+flatten\+\_\+to()}}, \mbox{\hyperlink{group__struct__utils__dot__bracket_gaf9ecd0d7877fecdbb0292e24f40283d5}{vrna\+\_\+db\+\_\+from\+\_\+ptable()}}, \mbox{\hyperlink{group__struct__utils__dot__bracket_ga6a51a36b9245d0bac868c5cd172b9611}{vrna\+\_\+db\+\_\+from\+\_\+plist()}}, \mbox{\hyperlink{group__struct__utils__dot__bracket_ga45360c09fb6d04d96e42dcccbb66015b}{vrna\+\_\+db\+\_\+to\+\_\+element\+\_\+string()}}, \mbox{\hyperlink{group__struct__utils__dot__bracket_ga97dbebaa3fc49524cf5afa338a6c52ee}{vrna\+\_\+db\+\_\+pk\+\_\+remove()}}
27\end{DoxySeeAlso}
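The plain notation can be decoded with a simple stack. The following C sketch is an illustration only and not the library\textquotesingle{}s implementation; the function name {\ttfamily db\_to\_pair\_table} and the 1-\/based table layout (entry 0 holds the length, 0 marks an unpaired position) are assumptions made for this example. It handles only {\ttfamily ()} and treats every other character as unpaired.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of the library API).
 * Returns a newly allocated table pt of n + 1 entries (caller frees):
 *   pt[0] = n, pt[i] = j if position i pairs with j (1-based), else 0.
 * Returns NULL on unbalanced parentheses or allocation failure. */
int *db_to_pair_table(const char *db)
{
  size_t n     = strlen(db);
  int   *pt    = calloc(n + 1, sizeof *pt);
  int   *stack = malloc((n + 1) * sizeof *stack);
  size_t top   = 0;

  if (!pt || !stack) {
    free(pt);
    free(stack);
    return NULL;
  }

  pt[0] = (int)n;
  for (size_t i = 1; i <= n; i++) {
    char c = db[i - 1];
    if (c == '(') {
      stack[top++] = (int)i;          /* remember the opening position */
    } else if (c == ')') {
      if (top == 0) {                 /* closing bracket without partner */
        free(pt);
        free(stack);
        return NULL;
      }
      int j = stack[--top];
      pt[i] = j;
      pt[j] = (int)i;
    }
    /* any other character is treated as unpaired and stays 0 */
  }

  free(stack);
  if (top != 0) {                     /* unmatched opening brackets remain */
    free(pt);
    return NULL;
  }
  return pt;
}
```

For the helix example {\ttfamily ((((....))))}, position 1 is paired with position 12 and positions 5 to 8 remain unpaired.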
28\hypertarget{rna_structure_notations_wuss-notation}{}\doxysubsubsection{Washington University Secondary Structure (\+WUSS) notation}\label{rna_structure_notations_wuss-notation}
The WUSS notation is frequently used for consensus secondary structures in \mbox{\hyperlink{file_formats_msa-formats-stockholm}{Stockholm 1.\+0 format}}.
30
31This notation allows for a fine-\/grained annotation of base pairs and unpaired nucleotides, including pseudo-\/knots. Below, you\textquotesingle{}ll find a list of secondary structure elements and their corresponding WUSS annotation (See also the infernal user guide at \href{http://eddylab.org/infernal/Userguide.pdf}{\texttt{ http\+://eddylab.\+org/infernal/\+Userguide.\+pdf}})
32\begin{DoxyItemize}
33\item {\bfseries{Base pairs}}~\newline
 Nested base pairs are annotated by matching pairs of the symbols {\ttfamily $<$$>$}, {\ttfamily ()}, {\ttfamily \{\}}, and {\ttfamily \mbox{[}\mbox{]}}. Each of these bracket types has its own special meaning. However, when used as input to our programs, e.\+g. as a structure constraint, these details are usually ignored. Furthermore, base pairs that constitute a pseudo-\/knot are denoted by letters from the Latin alphabet and are, unless stated otherwise, ignored entirely in our programs.
35\item {\bfseries{Hairpin loops}}~\newline
36 Unpaired nucleotides that constitute the hairpin loop are indicated by underscores, {\ttfamily \+\_\+}.
37
38Example\+: {\ttfamily $<$$<$$<$$<$$<$\+\_\+\+\_\+\+\_\+\+\_\+\+\_\+$>$$>$$>$$>$$>$}
39\item {\bfseries{Bulges and interior loops}}~\newline
40 Residues that constitute a bulge or interior loop are denoted by dashes, {\ttfamily -\/}.
41
42Example\+: {\ttfamily (((-\/-\/$<$$<$\+\_\+\+\_\+\+\_\+\+\_\+\+\_\+$>$$>$-\/)))}
43\item {\bfseries{Multibranch loops}}~\newline
44 Unpaired nucleotides in multibranch loops are indicated by commas {\ttfamily ,}.
45
46Example\+: {\ttfamily (((,,$<$$<$\+\_\+\+\_\+\+\_\+\+\_\+\+\_\+$>$$>$,$<$$<$\+\_\+\+\_\+\+\_\+\+\_\+$>$$>$)))}
47\item {\bfseries{External residues}}~\newline
 Single-\/stranded nucleotides in the exterior loop, i.\+e. those not enclosed by any base pair, are denoted by colons, {\ttfamily \+:}.
49
50Example\+: {\ttfamily $<$$<$$<$\+\_\+\+\_\+\+\_\+\+\_\+$>$$>$$>$\+::\+:}
51\item {\bfseries{Insertions}}~\newline
52 In cases where an alignment represents the consensus with a known structure, insertions relative to the known structure are denoted by periods, {\ttfamily .}. Regions where local structural alignment was invoked, leaving regions of both target and query sequence unaligned, are indicated by tildes, {\ttfamily $\sim$}. \begin{DoxyNote}{Note}
53These symbols only appear in alignments of a known (query) structure annotation to a target sequence of unknown structure.
54\end{DoxyNote}
55
56\item {\bfseries{Pseudo-\/knots}}~\newline
57 The WUSS notation allows for annotation of pseudo-\/knots using pairs of upper-\/case/lower-\/case letters. \begin{DoxyNote}{Note}
Our programs and library functions usually ignore pseudo-\/knots entirely, treating them as unpaired nucleotides, unless stated otherwise.
59\end{DoxyNote}
60Example\+: {\ttfamily $<$$<$$<$\+\_\+\+AAA\+\_\+\+\_\+\+\_\+$>$$>$$>$aaa}
61\end{DoxyItemize}
62
63\begin{DoxySeeAlso}{See also}
64\mbox{\hyperlink{group__struct__utils__wuss_ga02ca70cffb2d864f7b2d95d92218bae0}{vrna\+\_\+db\+\_\+from\+\_\+\+WUSS()}}
65\end{DoxySeeAlso}
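The symbol classes listed above suggest a simple way of flattening a WUSS string into plain dot-\/bracket notation. The C sketch below is a hedged illustration and not the implementation behind the library function referenced above; the name {\ttfamily wuss\_to\_db} is hypothetical. It assumes that all four bracket types become {\ttfamily ()}, and that every other symbol, including pseudo-\/knot letters and the alignment-\/specific {\ttfamily .} and {\ttfamily $\sim$} characters, becomes an unpaired dot.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of the library API).
 * Returns a newly allocated dot-bracket string (caller frees),
 * or NULL if memory allocation fails. */
char *wuss_to_db(const char *wuss)
{
  size_t n  = strlen(wuss);
  char  *db = malloc(n + 1);

  if (!db)
    return NULL;

  for (size_t i = 0; i < n; i++) {
    switch (wuss[i]) {
      case '<': case '(': case '{': case '[':
        db[i] = '(';                  /* any opening bracket type */
        break;
      case '>': case ')': case '}': case ']':
        db[i] = ')';                  /* any closing bracket type */
        break;
      default:
        db[i] = '.';                  /* loops, external residues, pseudo-knot letters */
        break;
    }
  }
  db[n] = '\0';
  return db;
}
```

Applied to the pseudo-\/knot example above, the letters {\ttfamily AAA} and {\ttfamily aaa} are mapped to dots, which matches the stated default of treating pseudo-\/knotted positions as unpaired.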
66\hypertarget{rna_structure_notations_shapes-notation}{}\doxysubsubsection{Abstract Shapes}\label{rna_structure_notations_shapes-notation}
Abstract Shapes, introduced by Giegerich et al. in 2004 \cite{giegerich:2004}, collapse the secondary structure while retaining the nestedness of helices and hairpin loops.

The representation abstracts from individual base pairs and their corresponding location in the sequence.
70
Below is a description of what each shape level retains, together with the result for the following example structure\+: \begin{DoxyVerb}CGUCUUAAACUCAUCACCGUGUGGAGCUGCGACCCUUCCCUAGAUUCGAAGACGAG
((((((...(((..(((...))))))...(((..((.....))..)))))))))..
\end{DoxyVerb}
74 \DoxyHorRuler{0}
75
76
77\tabulinesep=1mm
78\begin{longtabu}spread 0pt [c]{*{3}{|X[-1]}|}
79\hline
80\PBS\centering \cellcolor{\tableheadbgcolor}\textbf{ Shape Level   }&\PBS\centering \cellcolor{\tableheadbgcolor}\textbf{ Description   }&\PBS\centering \cellcolor{\tableheadbgcolor}\textbf{ Result    }\\\cline{1-3}
81\endfirsthead
82\hline
83\endfoot
84\hline
85\PBS\centering \cellcolor{\tableheadbgcolor}\textbf{ Shape Level   }&\PBS\centering \cellcolor{\tableheadbgcolor}\textbf{ Description   }&\PBS\centering \cellcolor{\tableheadbgcolor}\textbf{ Result    }\\\cline{1-3}
86\endhead
1   &Most accurate -\/ all loops and all unpaired   &{\ttfamily \mbox{[}\+\_\+\mbox{[}\+\_\+\mbox{[}\mbox{]}\mbox{]}\+\_\+\mbox{[}\+\_\+\mbox{[}\mbox{]}\+\_\+\mbox{]}\mbox{]}\+\_\+}    \\\cline{1-3}
2   &Nesting pattern for all loop types and unpaired regions in external loop and multiloop   &{\ttfamily \mbox{[}\mbox{[}\+\_\+\mbox{[}\mbox{]}\mbox{]}\mbox{[}\+\_\+\mbox{[}\mbox{]}\+\_\+\mbox{]}\mbox{]}}    \\\cline{1-3}
3   &Nesting pattern for all loop types but no unpaired regions   &{\ttfamily \mbox{[}\mbox{[}\mbox{[}\mbox{]}\mbox{]}\mbox{[}\mbox{[}\mbox{]}\mbox{]}\mbox{]}}    \\\cline{1-3}
4   &Helix nesting pattern in external loop and multiloop   &{\ttfamily \mbox{[}\mbox{[}\mbox{]}\mbox{[}\mbox{[}\mbox{]}\mbox{]}\mbox{]}}    \\\cline{1-3}
5   &Most abstract -\/ helix nesting pattern and no unpaired regions   &{\ttfamily \mbox{[}\mbox{[}\mbox{]}\mbox{[}\mbox{]}\mbox{]}}   \\\cline{1-3}
92\end{longtabu}
93
94
95\begin{DoxyNote}{Note}
Our implementations also provide the special Shape Level 0, which does not collapse any structural features but simply converts base pairs and unpaired nucleotides into their corresponding set of symbols for abstract shapes.
97\end{DoxyNote}
98\begin{DoxySeeAlso}{See also}
99\mbox{\hyperlink{group__struct__utils__abstract__shapes_gafca0add98ede22bf2c22608878c61b22}{vrna\+\_\+abstract\+\_\+shapes()}}, \mbox{\hyperlink{group__struct__utils__abstract__shapes_ga2fd59087e1c4e3d460e5823ba6d693b4}{vrna\+\_\+abstract\+\_\+shapes\+\_\+pt()}}
100\end{DoxySeeAlso}
101\hypertarget{rna_structure_notations_sec_structure_representations_tree}{}\doxysubsubsection{Tree Representations of Secondary Structures}\label{rna_structure_notations_sec_structure_representations_tree}
Secondary structures can be readily represented as trees, where internal nodes represent base pairs, and leaves represent unpaired nucleotides. The dot-\/bracket structure string already is a tree represented by a string of parentheses (base pairs) and dots for the leaf nodes (unpaired nucleotides).

Alternatively, one may find representations with two types of node labels, {\ttfamily P} for paired and {\ttfamily U} for unpaired; a dot is then replaced by {\ttfamily (U)}, and each closing bracket is assigned an additional identifier {\ttfamily P}. We call this the expanded notation. In \cite{fontana:1993b} a condensed representation of the secondary structure is proposed, the so-\/called homeomorphically irreducible tree (HIT) representation. Here a stack is represented as a single pair of matching brackets labeled {\ttfamily P} and weighted by the number of base pairs. Correspondingly, a contiguous stretch of unpaired bases is shown as one pair of matching brackets labeled {\ttfamily U} and weighted by its length. Generally, any string consisting of matching brackets and identifiers is equivalent to a plane tree with as many different types of nodes as there are identifiers.

Bruce Shapiro proposed a coarse grained representation \cite{shapiro:1988}, which does not retain the full information of the secondary structure. He represents the different structure elements by single matching brackets and labels them as
107
108
109\begin{DoxyItemize}
110\item {\ttfamily H} (hairpin loop),
111\item {\ttfamily I} (interior loop),
112\item {\ttfamily B} (bulge),
113\item {\ttfamily M} (multi-\/loop), and
114\item {\ttfamily S} (stack).
115\end{DoxyItemize}
116
117We extend his alphabet by an extra letter for external elements {\ttfamily E}. Again these identifiers may be followed by a weight corresponding to the number of unpaired bases or base pairs in the structure element. All tree representations (except for the dot-\/bracket form) can be encapsulated into a virtual root (labeled {\ttfamily R}).
118
119The following example illustrates the different linear tree representations used by the package\+:
120
121Consider the secondary structure represented by the dot-\/bracket string (full tree) {\ttfamily .((..(((...)))..((..)))).} which is the most convenient condensed notation used by our programs and library functions.
122
123Then, the following tree representations are equivalent\+:
124
125
126\begin{DoxyItemize}
127\item Expanded tree\+:~\newline
128 {\ttfamily ((U)(((U)(U)((((U)(U)(U)P)P)P)(U)(U)(((U)(U)P)P)P)P)(U)R)}
129\item HIT representation (Fontana et al. 1993 \cite{fontana:1993b})\+:~\newline
130 {\ttfamily ((U1)((U2)((U3)P3)(U2)((U2)P2)P2)(U1)R)}
131\item Coarse Grained \mbox{\hyperlink{structTree}{Tree}} Representation (Shapiro 1988 \cite{shapiro:1988})\+:
132\begin{DoxyItemize}
133\item Short (with root node {\ttfamily R}, without stem nodes {\ttfamily S})\+:~\newline
134 {\ttfamily ((H)((H)M)R)}
135\item Full (with root node {\ttfamily R})\+:~\newline
136 {\ttfamily (((((H)S)((H)S)M)S)R)}
137\item Extended (with root node {\ttfamily R}, with external nodes {\ttfamily E})\+:~\newline
138 {\ttfamily ((((((H)S)((H)S)M)S)E)R)}
139\item Weighted (with root node {\ttfamily R}, with external nodes {\ttfamily E})\+:~\newline
140 {\ttfamily ((((((H3)S3)((H2)S2)M4)S2)E2)R)}
141\end{DoxyItemize}
142\end{DoxyItemize}
143
The Expanded tree is rather clumsy and mostly included for the sake of completeness. The different versions of Coarse Grained \mbox{\hyperlink{structTree}{Tree}} Representations are variations of Shapiro\textquotesingle{}s linear tree notation.
145
146For the output of aligned structures from string editing, different representations are needed, where we put the label on both sides. The above examples for tree representations would then look like\+:
147
\begin{DoxyVerb}*  a) (UU)(P(P(P(P(UU)(UU)(P(P(P(UU)(UU)(UU)P)P)P)(UU)(UU)(P(P(UU)(U...
*  b) (UU)(P2(P2(U2U2)(P2(U3U3)P3)(U2U2)(P2(U2U2)P2)P2)(UU)P2)(UU)
*  c) (B(M(HH)(HH)M)B)
*     (S(B(S(M(S(HH)S)(S(HH)S)M)S)B)S)
*     (E(S(B(S(M(S(HH)S)(S(HH)S)M)S)B)S)E)
*  d) (R(E2(S2(B1(S2(M4(S3(H3)S3)((H2)S2)M4)S2)B1)S2)E2)R)
*  \end{DoxyVerb}
155
156
157Aligned structures additionally contain the gap character {\ttfamily \+\_\+}. \begin{DoxySeeAlso}{See also}
158\mbox{\hyperlink{group__struct__utils__tree_ga56551ab7da64933a7230d29430f40cfe}{vrna\+\_\+db\+\_\+to\+\_\+tree\+\_\+string()}}, \mbox{\hyperlink{group__struct__utils__tree_gaa31da26a3f582ddc35a84ff1b9c0a2b0}{vrna\+\_\+tree\+\_\+string\+\_\+unweight()}}, \mbox{\hyperlink{group__struct__utils__tree_ga99d280319a7fd3f87e9f0d8c44520774}{vrna\+\_\+tree\+\_\+string\+\_\+to\+\_\+db()}}
159\end{DoxySeeAlso}
160\hypertarget{rna_structure_notations_structure_notations_examples}{}\doxysubsection{Examples for Structure Parsing and Conversion}\label{rna_structure_notations_structure_notations_examples}
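As a hedged illustration of the expanded notation described above, the following C sketch rewrites a plain dot-\/bracket string by replacing each dot with {\ttfamily (U)}, appending {\ttfamily P} before each closing bracket, and wrapping the result in a virtual root {\ttfamily R}. The function name {\ttfamily db\_to\_expanded} is hypothetical, and the code is a stand-in written for this example rather than the library routine; it does not validate that the input is balanced.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of the library API).
 * Returns a newly allocated expanded tree string with root node R
 * (caller frees), or NULL if memory allocation fails. */
char *db_to_expanded(const char *db)
{
  size_t n   = strlen(db);
  /* each input character yields at most 3 output characters,
   * plus "(" in front, "R)" at the end, and the terminating NUL */
  char  *out = malloc(3 * n + 4);
  size_t k   = 0;

  if (!out)
    return NULL;

  out[k++] = '(';                     /* open the virtual root */
  for (size_t i = 0; i < n; i++) {
    switch (db[i]) {
      case '(':
        out[k++] = '(';
        break;
      case ')':
        out[k++] = 'P';               /* label the closing bracket */
        out[k++] = ')';
        break;
      default:                        /* unpaired nucleotide */
        out[k++] = '(';
        out[k++] = 'U';
        out[k++] = ')';
        break;
    }
  }
  out[k++] = 'R';                     /* close the virtual root */
  out[k++] = ')';
  out[k]   = '\0';
  return out;
}
```

For the structure {\ttfamily .((..(((...)))..((..)))).} this reproduces the expanded tree string shown in the list of tree representations.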
161\hypertarget{rna_structure_notations_structure_notations_api}{}\doxysubsection{Structure Parsing and Conversion API}\label{rna_structure_notations_structure_notations_api}
Several functions are provided for parsing structures and converting them to different representations.

\begin{DoxyVerb}char  *expand_Full(const char *structure)
\end{DoxyVerb}
 Converts the full structure from bracket notation to the expanded notation including root.

\begin{DoxyVerb}char *b2HIT (const char *structure)
\end{DoxyVerb}
 Converts the full structure from bracket notation to the HIT notation including root.

\begin{DoxyVerb}char *b2C (const char *structure)
\end{DoxyVerb}
 Converts the full structure from bracket notation to a coarse grained notation using the \textquotesingle{}H\textquotesingle{} \textquotesingle{}B\textquotesingle{} \textquotesingle{}I\textquotesingle{} \textquotesingle{}M\textquotesingle{} and \textquotesingle{}R\textquotesingle{} identifiers.

\begin{DoxyVerb}char *b2Shapiro (const char *structure)
\end{DoxyVerb}
 Converts the full structure from bracket notation to the {\itshape weighted} coarse grained notation using the \textquotesingle{}H\textquotesingle{} \textquotesingle{}B\textquotesingle{} \textquotesingle{}I\textquotesingle{} \textquotesingle{}M\textquotesingle{} \textquotesingle{}S\textquotesingle{} \textquotesingle{}E\textquotesingle{} and \textquotesingle{}R\textquotesingle{} identifiers.

\begin{DoxyVerb}char  *expand_Shapiro (const char *coarse);
\end{DoxyVerb}
 Inserts missing \textquotesingle{}S\textquotesingle{} identifiers in unweighted coarse grained structures as obtained from \mbox{\hyperlink{group__struct__utils__deprecated_ga9c80d92391f2833549a8b6dac92233f0}{b2\+C()}}.

\begin{DoxyVerb}char *add_root (const char *structure)
\end{DoxyVerb}
 Adds a root to an un-\/rooted tree in any notation except bracket notation.

\begin{DoxyVerb}char  *unexpand_Full (const char *ffull)
\end{DoxyVerb}
 Restores the bracket notation from an expanded full or HIT tree, that is, any tree using only the identifiers \textquotesingle{}U\textquotesingle{} \textquotesingle{}P\textquotesingle{} and \textquotesingle{}R\textquotesingle{}.

\begin{DoxyVerb}char  *unweight (const char *wcoarse)
\end{DoxyVerb}
 Strips weights from any weighted tree.

\begin{DoxyVerb}void   unexpand_aligned_F (char *align[2])
\end{DoxyVerb}
 Converts two aligned structures in expanded notation.

\begin{DoxyVerb}void   parse_structure (const char *structure)
\end{DoxyVerb}
 Collects statistics on the structure elements of the full structure in bracket notation.
203
204\begin{DoxySeeAlso}{See also}
205\mbox{\hyperlink{RNAstruct_8h}{RNAstruct.\+h}} for prototypes and more detailed description
206\end{DoxySeeAlso}
207\hypertarget{file_formats}{}\doxysection{File Formats}\label{file_formats}
208\hypertarget{file_formats_msa-formats}{}\doxysubsection{File formats for Multiple Sequence Alignments (\+MSA)}\label{file_formats_msa-formats}
209\hypertarget{file_formats_msa-formats-clustal}{}\doxysubsubsection{Clustal\+W format}\label{file_formats_msa-formats-clustal}
The {\itshape ClustalW} format is a relatively simple text file format containing a single multiple sequence alignment of DNA, RNA, or protein sequences. It was first used as an output format for the {\itshape clustalw} programs, but nowadays it may also be generated by various other sequence alignment tools. The specification is straightforward\+:
211
212
213\begin{DoxyItemize}
214\item The first line starts with the words\begin{DoxyVerb}CLUSTAL W \end{DoxyVerb}
215 or \begin{DoxyVerb}CLUSTALW \end{DoxyVerb}
216
\item The header is followed by at least one empty line
\item Finally, one or more blocks of sequence data follow, where consecutive blocks are separated by at least one empty line
\end{DoxyItemize}Each line in a block of sequence data consists of the sequence name followed by the sequence symbols, separated by at least one whitespace character. Usually, the length of a sequence in one block does not exceed 60 symbols. Optionally, an additional whitespace separated cumulative residue count may follow the sequence symbols. Optionally, a block may be followed by a line depicting the degree of conservation of the respective alignment columns.
220
221\begin{DoxyNote}{Note}
222Sequence names and the sequences must not contain whitespace characters! Allowed gap symbols are the hyphen {\itshape }(\char`\"{}-\/\char`\"{}), and dot {\itshape }(\char`\"{}.\char`\"{}).
223\end{DoxyNote}
224\begin{DoxyWarning}{Warning}
225Please note that many programs that output this format tend to truncate the sequence names to a limited number of characters, for instance the first 15 characters. This can destroy the uniqueness of identifiers in your MSA.
226\end{DoxyWarning}
227Here is an example alignment in ClustalW format\+:
228\begin{DoxyVerbInclude}
CLUSTAL W (1.83) multiple sequence alignment


AL031296.1/85969-86120      CUGCCUCACAACGUUUGUGCCUCAGUUACCCGUAGAUGUAGUGAGGGUAACAAUACUUAC
AANU01225121.1/438-603      CUGCCUCACAACAUUUGUGCCUCAGUUACUCAUAGAUGUAGUGAGGGUGACAAUACUUAC
AAWR02037329.1/29294-29150  ---CUCGACACCACU---GCCUCGGUUACCCAUCGGUGCAGUGCGGGUAGUAGUACCAAU

AL031296.1/85969-86120      UCUCGUUGGUGAUAAGGAACAGCU
AANU01225121.1/438-603      UCUCGUUGGUGAUAAGGAACAGCU
AAWR02037329.1/29294-29150  GCUAAUUAGUUGUGAGGACCAACU
239\end{DoxyVerbInclude}
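The line structure described above can be recognized with very little code. The following C sketch is an assumption-\/laden illustration and not the parser used by the library; the names {\ttfamily clustal\_is\_header} and {\ttfamily clustal\_parse\_line} and the fixed buffer sizes are hypothetical. It assumes that conservation lines start with whitespace and can therefore be skipped, and it ignores the optional residue count.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the line starts with one of the
 * two header variants named in the specification, 0 otherwise. */
int clustal_is_header(const char *line)
{
  return strncmp(line, "CLUSTAL W", 9) == 0 ||
         strncmp(line, "CLUSTALW", 8) == 0;
}

/* Hypothetical helper: splits one sequence line into name and sequence.
 * Returns 1 on success, 0 for empty lines, lines starting with
 * whitespace (assumed to be conservation lines), or malformed lines. */
int clustal_parse_line(const char *line, char name[128], char seq[256])
{
  if (line[0] == '\0' || line[0] == ' ' || line[0] == '\t' || line[0] == '\n')
    return 0;

  /* name and sequence are separated by at least one whitespace character */
  return sscanf(line, "%127s %255s", name, seq) == 2;
}
```

Because an alignment may span several blocks, a caller would append the sequence part of each parsed line to the entry with the same name.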
240\hypertarget{file_formats_msa-formats-stockholm}{}\doxysubsubsection{Stockholm 1.\+0 format}\label{file_formats_msa-formats-stockholm}
241Here is an example alignment in Stockholm 1.\+0 format\+:
242\begin{DoxyVerbInclude}
# STOCKHOLM 1.0

#=GF AC   RF01293
#=GF ID   ACA59
#=GF DE   Small nucleolar RNA ACA59
#=GF AU   Wilkinson A
#=GF SE   Predicted; WAR; Wilkinson A
#=GF SS   Predicted; WAR; Wilkinson A
#=GF GA   43.00
#=GF TC   44.90
#=GF NC   40.30
#=GF TP   Gene; snRNA; snoRNA; HACA-box;
#=GF BM   cmbuild -F CM SEED
#=GF CB   cmcalibrate --mpi CM
#=GF SM   cmsearch --cpu 4 --verbose --nohmmonly -E 1000 -Z 549862.597050 CM SEQDB
#=GF DR   snoRNABase; ACA59;
#=GF DR   SO; 0001263; ncRNA_gene;
#=GF DR   GO; 0006396; RNA processing;
#=GF DR   GO; 0005730; nucleolus;
#=GF RN   [1]
#=GF RM   15199136
#=GF RT   Human box H/ACA pseudouridylation guide RNA machinery.
#=GF RA   Kiss AM, Jady BE, Bertrand E, Kiss T
#=GF RL   Mol Cell Biol. 2004;24:5797-5807.
#=GF WK   Small_nucleolar_RNA
#=GF SQ   3


AL031296.1/85969-86120     CUGCCUCACAACGUUUGUGCCUCAGUUACCCGUAGAUGUAGUGAGGGUAACAAUACUUACUCUCGUUGGUGAUAAGGAACAGCU
AANU01225121.1/438-603     CUGCCUCACAACAUUUGUGCCUCAGUUACUCAUAGAUGUAGUGAGGGUGACAAUACUUACUCUCGUUGGUGAUAAGGAACAGCU
AAWR02037329.1/29294-29150 ---CUCGACACCACU---GCCUCGGUUACCCAUCGGUGCAGUGCGGGUAGUAGUACCAAUGCUAAUUAGUUGUGAGGACCAACU
#=GC SS_cons               -----((((,<<<<<<<<<___________>>>>>>>>>,,,,<<<<<<<______>>>>>>>,,,,,))))::::::::::::
#=GC RF                    CUGCcccaCAaCacuuguGCCUCaGUUACcCauagguGuAGUGaGgGuggcAaUACccaCcCucgUUgGuggUaAGGAaCAgCU
//
277\end{DoxyVerbInclude}
278
279
280\begin{DoxySeeAlso}{See also}
281\mbox{\hyperlink{rna_structure_notations_wuss-notation}{Washington University Secondary Structure (WUSS) notation}} on legal characters for the consensus secondary structure line {\itshape SS\+\_\+cons} and their interpretation
282\end{DoxySeeAlso}
283\hypertarget{file_formats_msa-formats-fasta}{}\doxysubsubsection{FASTA (\+Pearson) format}\label{file_formats_msa-formats-fasta}
284\begin{DoxyNote}{Note}
285Sequence names must not contain whitespace characters. Otherwise, the parts after the first whitespace will be dropped. The only allowed gap character is the hyphen {\itshape }(\char`\"{}-\/\char`\"{}).
286\end{DoxyNote}
287Here is an example alignment in FASTA format\+:
288\begin{DoxyVerbInclude}
>AL031296.1/85969-86120
CUGCCUCACAACGUUUGUGCCUCAGUUACCCGUAGAUGUAGUGAGGGUAACAAUACUUAC
UCUCGUUGGUGAUAAGGAACAGCU
>AANU01225121.1/438-603
CUGCCUCACAACAUUUGUGCCUCAGUUACUCAUAGAUGUAGUGAGGGUGACAAUACUUAC
UCUCGUUGGUGAUAAGGAACAGCU
>AAWR02037329.1/29294-29150
---CUCGACACCACU---GCCUCGGUUACCCAUCGGUGCAGUGCGGGUAGUAGUACCAAU
GCUAAUUAGUUGUGAGGACCAACU
298\end{DoxyVerbInclude}
299\hypertarget{file_formats_msa-formats-maf}{}\doxysubsubsection{MAF format}\label{file_formats_msa-formats-maf}
300The multiple alignment format (MAF) is usually used to store multiple alignments on DNA level between entire genomes. It consists of independent blocks of aligned sequences which are annotated by their genomic location. Consequently, an MAF formatted MSA file may contain multiple records. MAF files start with a line \begin{DoxyVerb}##maf
301\end{DoxyVerb}
302 which is optionally extended by whitespace delimited key=value pairs. Lines starting with the character (\char`\"{}\#\char`\"{}) are considered comments and usually ignored.
303
304A MAF block starts with character (\char`\"{}a\char`\"{}) at the beginning of a line, optionally followed by whitespace delimited key=value pairs. The next lines start with character (\char`\"{}s\char`\"{}) and contain sequence information of the form \begin{DoxyVerb}s src start size strand srcSize sequence
305\end{DoxyVerb}
306 where
307\begin{DoxyItemize}
308\item {\itshape src} is the name of the sequence source
309\item {\itshape start} is the start of the aligned region within the source (0-\/based)
310\item {\itshape size} is the length of the aligned region without gap characters
311\item {\itshape strand} is either (\char`\"{}+\char`\"{}) or (\char`\"{}-\/\char`\"{}), depicting the location of the aligned region relative to the source
312\item {\itshape src\+Size} is the size of the entire sequence source, e.\+g. the full chromosome
313\item {\itshape sequence} is the aligned sequence including gaps depicted by the hyphen (\char`\"{}-\/\char`\"{})
\end{DoxyItemize}Here is an example alignment in MAF format (taken from the \href{https://cgwb.nci.nih.gov/FAQ/FAQformat.html\#format5}{\texttt{ UCSC Genome browser website}})\+:
315\begin{DoxyVerbInclude}
##maf version=1 scoring=tba.v8
# tba.v8 (((human chimp) baboon) (mouse rat))
# multiz.v7
# maf_project.v5 _tba_right.maf3 mouse _tba_C
# single_cov2.v4 single_cov2 /dev/stdin

a score=23262.0
s hg16.chr7    27578828 38 + 158545518 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s panTro1.chr6 28741140 38 + 161576975 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s baboon         116834 38 +   4622798 AAA-GGGAATGTTAACCAAATGA---GTTGTCTCTTATGGTG
s mm4.chr6     53215344 38 + 151104725 -AATGGGAATGTTAAGCAAACGA---ATTGTCTCTCAGTGTG
s rn3.chr4     81344243 40 + 187371129 -AA-GGGGATGCTAAGCCAATGAGTTGTTGTCTCTCAATGTG

a score=5062.0
s hg16.chr7    27699739 6 + 158545518 TAAAGA
s panTro1.chr6 28862317 6 + 161576975 TAAAGA
s baboon         241163 6 +   4622798 TAAAGA
s mm4.chr6     53303881 6 + 151104725 TAAAGA
s rn3.chr4     81444246 6 + 187371129 taagga

a score=6636.0
s hg16.chr7    27707221 13 + 158545518 gcagctgaaaaca
s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
s baboon         249182 13 +   4622798 gcagctgaaaaca
s mm4.chr6     53310102 13 + 151104725 ACAGCTGAAAATA

342\end{DoxyVerbInclude}
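The six fields of a sequence line map directly onto a small record. The C sketch below is an illustration under stated assumptions and not the library\textquotesingle{}s MAF reader; the type {\ttfamily maf\_seq\_line}, the functions {\ttfamily maf\_parse\_s\_line} and {\ttfamily maf\_count\_residues}, and the fixed buffer sizes are hypothetical. It parses a single {\ttfamily s} line and offers a consistency check of the {\itshape size} field against the number of non-\/gap symbols.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical record for one "s" line of a MAF block */
typedef struct {
  char src[128];      /* name of the sequence source */
  long start;         /* 0-based start of the aligned region */
  long size;          /* length of the aligned region without gaps */
  char strand;        /* '+' or '-' */
  long src_size;      /* size of the entire sequence source */
  char seq[1024];     /* aligned sequence including '-' gaps */
} maf_seq_line;

/* Returns 1 if the line is a well-formed "s" line, 0 otherwise */
int maf_parse_s_line(const char *line, maf_seq_line *rec)
{
  if (line[0] != 's' || !isspace((unsigned char)line[1]))
    return 0;

  int n = sscanf(line, "s %127s %ld %ld %c %ld %1023s",
                 rec->src, &rec->start, &rec->size,
                 &rec->strand, &rec->src_size, rec->seq);

  return n == 6 && (rec->strand == '+' || rec->strand == '-');
}

/* Counts all symbols of an aligned sequence that are not gaps */
long maf_count_residues(const char *seq)
{
  long count = 0;

  for (; *seq; seq++)
    if (*seq != '-')
      count++;

  return count;
}
```

For the first line of the example above, the aligned sequence has 42 columns, 4 of which are gaps, so the residue count agrees with the {\itshape size} value of 38.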
343\hypertarget{file_formats_constraint-formats}{}\doxysubsection{File formats to manipulate the RNA folding grammar}\label{file_formats_constraint-formats}
344\hypertarget{file_formats_constraint-formats-file}{}\doxysubsubsection{Command Files}\label{file_formats_constraint-formats-file}
345The RNAlib and many programs of the Vienna\+RNA Package can parse and apply data from so-\/called command files. These commands may refer to structure constraints or even extensions of the RNA folding grammar (such as \mbox{\hyperlink{group__domains__up}{Unstructured Domains}}). Commands are given as a line of whitespace delimited data fields. The syntax we use extends the constraint definitions used in the \href{http://mfold.rna.albany.edu/?q=mfold}{\texttt{ mfold}} / \href{http://mfold.rna.albany.edu/?q=DINAMelt/software}{\texttt{ UNAfold}} software, where each line begins with a command character followed by a set of positions.~\newline
However, we introduce several new commands and allow for an optional loop type context specifier in the form of a sequence of characters, as well as an orientation flag that enables one to force a nucleotide to pair upstream or downstream.\hypertarget{file_formats_constraint_commands}{}\doxyparagraph{Constraint commands}\label{file_formats_constraint_commands}
347The following set of commands is recognized\+:
348\begin{DoxyItemize}
349\item {\ttfamily F} $ \ldots $ Force
350\item {\ttfamily P} $ \ldots $ Prohibit
351\item {\ttfamily C} $ \ldots $ Conflicts/\+Context dependency
352\item {\ttfamily A} $ \ldots $ Allow (for non-\/canonical pairs)
353\item {\ttfamily E} $ \ldots $ Soft constraints for unpaired position(s), or base pair(s)
\end{DoxyItemize}\hypertarget{file_formats_domain_commands}{}\doxyparagraph{RNA folding grammar extensions}\label{file_formats_domain_commands}
355
356\begin{DoxyItemize}
357\item {\ttfamily UD} $ \ldots $ Add ligand binding using the \mbox{\hyperlink{group__domains__up}{Unstructured Domains}} feature
358\end{DoxyItemize}\hypertarget{file_formats_command_file_loop_types}{}\doxyparagraph{Specification of the loop type context}\label{file_formats_command_file_loop_types}
359The optional loop type context specifier {\ttfamily }\mbox{[}LOOP\mbox{]} may be a combination of the following\+:
360\begin{DoxyItemize}
361\item {\ttfamily E} $ \ldots $ Exterior loop
362\item {\ttfamily H} $ \ldots $ Hairpin loop
363\item {\ttfamily I} $ \ldots $ Interior loop
364\item {\ttfamily M} $ \ldots $ Multibranch loop
365\item {\ttfamily A} $ \ldots $ All loops
366\end{DoxyItemize}
367
368For structure constraints, we additionally allow one to address base pairs enclosed by a particular kind of loop, which results in the specifier {\ttfamily }\mbox{[}WHERE\mbox{]} which consists of {\ttfamily }\mbox{[}LOOP\mbox{]} plus the following character\+:
369\begin{DoxyItemize}
370\item {\ttfamily i} $ \ldots $ enclosed pair of an Interior loop
371\item {\ttfamily m} $ \ldots $ enclosed pair of a Multibranch loop
372\end{DoxyItemize}
373
374If no {\ttfamily }\mbox{[}LOOP\mbox{]} or {\ttfamily }\mbox{[}WHERE\mbox{]} flags are set, all contexts are considered (equivalent to {\ttfamily A} )\hypertarget{file_formats_const_file_orientation}{}\doxyparagraph{Controlling the orientation of base pairing}\label{file_formats_const_file_orientation}
375For particular nucleotides that are forced to pair, the following {\ttfamily }\mbox{[}ORIENTATION\mbox{]} flags may be used\+:
376\begin{DoxyItemize}
377\item {\ttfamily U} $ \ldots $ Upstream
378\item {\ttfamily D} $ \ldots $ Downstream
379\end{DoxyItemize}
380
381If no {\ttfamily }\mbox{[}ORIENTATION\mbox{]} flag is set, both directions are considered.\hypertarget{file_formats_const_file_seq_coords}{}\doxyparagraph{Sequence coordinates}\label{file_formats_const_file_seq_coords}
Sequence positions of nucleotides/base pairs are 1-\/based and consist of three positions $ i $, $ j $, and $ k $. Alternatively, four positions may be provided as a pair of two position ranges $ [i:j] $, and $ [k:l] $ using the \textquotesingle{}-\/\textquotesingle{} sign as delimiter within each range, i.\+e. $ i-j $, and $ k-l $.\hypertarget{file_formats_const_file_syntax}{}\doxyparagraph{Valid constraint commands}\label{file_formats_const_file_syntax}
383Below are resulting general cases that are considered {\itshape valid} constraints\+:
384
385
386\begin{DoxyEnumerate}
387\item {\bfseries{\char`\"{}\+Forcing a range of nucleotide positions to be paired\char`\"{}}}\+:~\newline
388 Syntax\+:
389\begin{DoxyCode}{0}
390\DoxyCodeLine{F i 0 k [WHERE] [ORIENTATION] }
391
392\end{DoxyCode}
393~\newline
394 Description\+:~\newline
395 Enforces the set of $ k $ consecutive nucleotides starting at position $ i $ to be paired. The optional loop type specifier {\ttfamily }\mbox{[}WHERE\mbox{]} allows to force them to appear as closing/enclosed pairs of certain types of loops.
396\item {\bfseries{\char`\"{}\+Forcing a set of consecutive base pairs to form\char`\"{}}}\+:~\newline
397 Syntax\+:\begin{DoxyVerb}F i j k [WHERE] \end{DoxyVerb}
398~\newline
399 Description\+:~\newline
400 Enforces the base pairs $ (i,j), \ldots, (i+(k-1), j-(k-1)) $ to form. The optional loop type specifier {\ttfamily }\mbox{[}WHERE\mbox{]} allows to specify in which loop context the base pair must appear.
401\item {\bfseries{\char`\"{}\+Prohibiting a range of nucleotide positions to be paired\char`\"{}}}\+:~\newline
402 Syntax\+:\begin{DoxyVerb}P i 0 k [WHERE] \end{DoxyVerb}
403~\newline
404 Description\+:~\newline
 Prohibits the $ k $ consecutive nucleotides starting at position $ i $ from participating in base pairing, i.\+e. makes these positions unpaired. The optional loop type specifier {\ttfamily }\mbox{[}WHERE\mbox{]} forces the nucleotides to appear within loops of specific types.
\item {\bfseries{\char`\"{}\+Prohibiting a set of consecutive base pairs from forming\char`\"{}}}\+:~\newline
407 Syntax\+:\begin{DoxyVerb}P i j k [WHERE] \end{DoxyVerb}
408~\newline
409 Description\+:~\newline
 Prohibits the base pairs $ (i,j), \ldots, (i+(k-1), j-(k-1)) $ from forming. The optional loop type specifier {\ttfamily }\mbox{[}WHERE\mbox{]} specifies the loop types in which they may not be the closing or an enclosed pair.
411\item {\bfseries{\char`\"{}\+Prohibiting two ranges of nucleotides to pair with each other\char`\"{}}}\+:~\newline
412 Syntax\+:\begin{DoxyVerb}P i-j k-l [WHERE] \end{DoxyVerb}
413 Description\+:~\newline
 Prohibits any nucleotide $ p \in [i:j] $ from pairing with any nucleotide $ q \in [k:l] $. The optional loop type specifier {\ttfamily }\mbox{[}WHERE\mbox{]} specifies the loop types in which such pairs may not be the closing or an enclosed pair.
415\item {\bfseries{\char`\"{}\+Enforce a loop context for a range of nucleotide positions\char`\"{}}}\+:~\newline
416 Syntax\+:\begin{DoxyVerb}C i 0 k [WHERE] \end{DoxyVerb}
417 Description\+:~\newline
 This command enforces nucleotides to be unpaired, similar to {\itshape prohibiting} nucleotides to be paired as described above. It also marks the corresponding nucleotides as unpaired; however, the {\ttfamily }\mbox{[}WHERE\mbox{]} flag can be used to enforce the specific loop types the nucleotides must appear in.
419\item {\bfseries{\char`\"{}\+Remove pairs that conflict with a set of consecutive base pairs\char`\"{}}}\+:~\newline
420 Syntax\+:\begin{DoxyVerb}C i j k \end{DoxyVerb}
421~\newline
422 Description\+:~\newline
423 Remove all base pairs that conflict with a set of consecutive base pairs $ (i,j), \ldots, (i+(k-1), j-(k-1)) $. Two base pairs $ (i,j) $ and $ (p,q) $ conflict with each other if $ i < p < j < q $, or $ p < i < q < j $.
424\item {\bfseries{\char`\"{}\+Allow a set of consecutive (non-\/canonical) base pairs to form\char`\"{}}}\+:~\newline
425 Syntax\+:
426\begin{DoxyCode}{0}
427\DoxyCodeLine{A i j k [WHERE] }
428
429\end{DoxyCode}
430~\newline
431 Description\+:~\newline
 This command enables the formation of the consecutive base pairs $ (i,j), \ldots, (i+(k-1), j-(k-1)) $, regardless of whether they are {\itshape canonical} or {\itshape non-\/canonical}. In contrast to the above {\ttfamily F} and {\ttfamily C} commands, which remove conflicting base pairs, the {\ttfamily A} command does not. Therefore, it may be used to allow {\itshape non-\/canonical} base pair interactions. Since the RNAlib does not contain free energy contributions $ E_{ij} $ for non-\/canonical base pairs $ (i,j) $, they are scored as the {\itshape maximum} of similar, known contributions. In terms of a {\itshape Nussinov}-\/like scoring function, the free energy of non-\/canonical base pairs is therefore estimated as \[ E_{ij} = \min \left[ \max_{(i,k) \in \{GC, CG, AU, UA, GU, UG\}} E_{ik}, \max_{(k,j) \in \{GC, CG, AU, UA, GU, UG\}} E_{kj} \right]. \] The optional loop type specifier {\ttfamily }\mbox{[}WHERE\mbox{]} specifies the loop context in which the base pairs may appear.
433\item {\bfseries{\char`\"{}\+Apply pseudo free energy to a range of unpaired nucleotide positions\char`\"{}}}\+:~\newline
434 Syntax\+:
435\begin{DoxyCode}{0}
436\DoxyCodeLine{E i 0 k e }
437
438\end{DoxyCode}
439~\newline
440 Description\+:~\newline
 Use this command to apply a pseudo free energy of $ e $ to the set of $ k $ consecutive nucleotides starting at position $ i $. The pseudo free energy is applied only if these nucleotides are considered unpaired in the recursions or evaluations, and is expected to be given in kcal/mol.
442\item {\bfseries{\char`\"{}\+Apply pseudo free energy to a set of consecutive base pairs\char`\"{}}}\+:~\newline
 Syntax\+:
444\begin{DoxyCode}{0}
445\DoxyCodeLine{E i j k e }
446
447\end{DoxyCode}
448~\newline
 Description\+:~\newline
 Use this command to apply a pseudo free energy of $ e $ to the set of base pairs $ (i,j), \ldots, (i+(k-1), j-(k-1)) $. Energies are expected to be given in kcal/mol.
\end{DoxyEnumerate}\hypertarget{file_formats_domains_syntax}{}\doxyparagraph{Valid domain extension commands}\label{file_formats_domains_syntax}
451
452\begin{DoxyEnumerate}
453\item {\bfseries{\char`\"{}\+Add ligand binding to unpaired motif (a.\+k.\+a. unstructured domains)\char`\"{}}}\+:~\newline
454 Syntax\+:
455\begin{DoxyCode}{0}
456\DoxyCodeLine{UD m e [LOOP] }
457
458\end{DoxyCode}
459~\newline
460 Description\+:~\newline
461 Add ligand binding to unpaired sequence motif $ m $ (given in IUPAC format, capital letters) with binding energy $ e $ in particular loop type(s).~\newline
462 Example\+:
463\begin{DoxyCode}{0}
464\DoxyCodeLine{UD  AAA   -\/5.0    A}
465
466\end{DoxyCode}
467~\newline
 The above example applies a binding free energy of $ -5 $ kcal/mol for the motif {\ttfamily AAA}, which may be present in all loop types.
469\end{DoxyEnumerate}\hypertarget{plots}{}\doxysection{Plotting}\label{plots}
470Create Plots of Secondary Structures, Feature Motifs, and Sequence Alignments\hypertarget{plots_utils_ss}{}\doxysubsection{Producing secondary structure graphs}\label{plots_utils_ss}
471\begin{DoxyVerb}int PS_rna_plot ( char *string,
472                  char *structure,
473                  char *file)
474\end{DoxyVerb}
 Produce a secondary structure graph in Post\+Script and write it to the file named by {\ttfamily file}.
476
477\begin{DoxyVerb}int PS_rna_plot_a (
478            char *string,
479            char *structure,
480            char *file,
481            char *pre,
482            char *post)
483\end{DoxyVerb}
 Produce a secondary structure graph in Post\+Script, including additional annotation macros, and write it to the file named by {\ttfamily file}.
485
486\begin{DoxyVerb}int gmlRNA (char *string,
487            char *structure,
488            char *ssfile,
489            char option)
490\end{DoxyVerb}
491 Produce a secondary structure graph in Graph Meta Language (gml) and write it to a file.
492
493\begin{DoxyVerb}int ssv_rna_plot (char *string,
494                  char *structure,
495                  char *ssfile)
496\end{DoxyVerb}
497 Produce a secondary structure graph in SStruct\+View format.
498
499\begin{DoxyVerb}int svg_rna_plot (char *string,
500                  char *structure,
501                  char *ssfile)
502\end{DoxyVerb}
503 Produce a secondary structure plot in SVG format and write it to a file.
504
505\begin{DoxyVerb}int xrna_plot ( char *string,
506                char *structure,
507                char *ssfile)
508\end{DoxyVerb}
509 Produce a secondary structure plot for further editing in XRNA.
510
511\begin{DoxyVerb}int rna_plot_type
512\end{DoxyVerb}
513 Switch for changing the secondary structure layout algorithm.
514
Two low-\/level functions provide direct access to the graph layout algorithms\+:
516
517\begin{DoxyVerb}int simple_xy_coordinates ( short *pair_table,
518                            float *X,
519                            float *Y)
520\end{DoxyVerb}
 Calculate nucleotide coordinates for a secondary structure plot the {\itshape simple way}.
522
523\begin{DoxyVerb}int naview_xy_coordinates ( short *pair_table,
524                            float *X,
525                            float *Y)
526\end{DoxyVerb}
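 Calculate nucleotide coordinates for a secondary structure plot using the {\itshape Naview} layout algorithm.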
528
529\begin{DoxySeeAlso}{See also}
530\mbox{\hyperlink{PS__dot_8h}{PS\+\_\+dot.\+h}} and naview.\+h for more detailed descriptions.
531\end{DoxySeeAlso}
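The two low-\/level layout functions above take a {\ttfamily pair\_table} rather than a dot-\/bracket string. The following is a minimal, self-\/contained sketch of that encoding, assuming the usual convention that element 0 holds the structure length and element $ i $ holds the pairing partner of position $ i $, or 0 if position $ i $ is unpaired. The function name {\ttfamily make\_pair\_table\_sketch} is hypothetical and only handles plain {\ttfamily ()} brackets and dots; it is not the library\textquotesingle{}s own conversion routine.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: build a 1-based pair table from a dot-bracket string.
 * Assumed layout: pt[0] = length, pt[i] = partner of i, or 0 if unpaired.
 * Returns NULL on unbalanced brackets, overlong input, or allocation
 * failure. The caller owns the returned array and must free() it.
 */
short *make_pair_table_sketch(const char *structure)
{
  assert(structure != NULL);
  size_t n = strlen(structure);
  if (n > 32767)
    return NULL; /* positions would not fit into a short */

  short *pt    = calloc(n + 1, sizeof *pt);       /* zero = unpaired */
  short *stack = malloc((n + 1) * sizeof *stack); /* open bracket positions */
  size_t top   = 0;
  if (pt == NULL || stack == NULL) {
    free(pt);
    free(stack);
    return NULL;
  }

  pt[0] = (short)n;
  for (size_t pos = 1; pos <= n; pos++) {
    char c = structure[pos - 1];
    if (c == '(') {
      stack[top++] = (short)pos;
    } else if (c == ')') {
      if (top == 0) { /* closing bracket without a partner */
        free(pt);
        free(stack);
        return NULL;
      }
      short open = stack[--top];
      pt[open] = (short)pos;
      pt[pos]  = open;
    }
  }
  free(stack);

  if (top != 0) { /* unmatched opening brackets remain */
    free(pt);
    return NULL;
  }
  return pt;
}
```

For the structure {\ttfamily ((((....))))}, the resulting table has length entry 12, pairs position 1 with 12 and position 4 with 9, and leaves positions 5 to 8 at 0. A table of this shape could then be passed as the {\ttfamily pair\_table} argument of the layout functions.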
532\hypertarget{plots_utils_dot}{}\doxysubsection{Producing (colored) dot plots for base pair probabilities}\label{plots_utils_dot}
533\begin{DoxyVerb}int PS_color_dot_plot ( char *string,
534                        cpair *pi,
535                        char *filename)
536\end{DoxyVerb}
537
538
539\begin{DoxyVerb}int PS_color_dot_plot_turn (char *seq,
540                            cpair *pi,
541                            char *filename,
542                            int winSize)
543\end{DoxyVerb}
544
545
546\begin{DoxyVerb}int PS_dot_plot_list (char *seq,
547                      char *filename,
548                      plist *pl,
549                      plist *mf,
550                      char *comment)
551\end{DoxyVerb}
 Produce a Post\+Script dot plot from two pair lists.
553
554\begin{DoxyVerb}int PS_dot_plot_turn (char *seq,
555                      struct plist *pl,
556                      char *filename,
557                      int winSize)
558\end{DoxyVerb}
559
560
561\begin{DoxySeeAlso}{See also}
562\mbox{\hyperlink{PS__dot_8h}{PS\+\_\+dot.\+h}} for more detailed descriptions.
563\end{DoxySeeAlso}
564\hypertarget{plots_utils_aln}{}\doxysubsection{Producing (colored) alignments}\label{plots_utils_aln}
565\begin{DoxyVerb}int PS_color_aln (
566            const char *structure,
567            const char *filename,
568            const char *seqs[],
569            const char *names[])
570\end{DoxyVerb}
571 Produce Post\+Script sequence alignment color-\/annotated by consensus structure.