The sections listed below introduce the most common notations for sequence and structure data, the specifications of bioinformatics sequence and structure file formats, and the various output file formats produced by our library.


\begin{DoxyItemize}
\item \mbox{\hyperlink{rna_structure_notations}{RNA Structure Notations}} describes the different notations and representations of RNA secondary structures
\item \mbox{\hyperlink{file_formats}{File Formats}} gives an overview of the file formats compatible with our library
\item \mbox{\hyperlink{plots}{Plotting}} shows the different (Post\+Script) plotting functions for RNA secondary structures, feature probabilities, and multiple sequence alignments
\end{DoxyItemize}\hypertarget{rna_structure_notations}{}\doxysection{RNA Structure Notations}\label{rna_structure_notations}
\hypertarget{rna_structure_notations_sec_structure_representations}{}\doxysubsection{Representations of Secondary Structures}\label{rna_structure_notations_sec_structure_representations}
The standard representation of a secondary structure in our library is the \mbox{\hyperlink{rna_structure_notations_dot-bracket-notation}{Dot-\/\+Bracket Notation (a.\+k.\+a. Dot-\/\+Parenthesis Notation)}}, where matching brackets symbolize base pairs and unpaired bases are shown as dots. Based on that notation, more elaborate representations have been developed to include additional information, such as the loop context a nucleotide belongs to, and to annotate pseudo-\/knots.\hypertarget{rna_structure_notations_dot-bracket-notation}{}\doxysubsubsection{Dot-\/\+Bracket Notation (a.\+k.\+a. Dot-\/\+Parenthesis Notation)}\label{rna_structure_notations_dot-bracket-notation}
The Dot-\/\+Bracket notation, introduced in the early days of the Vienna\+RNA Package, denotes base pairs by matching pairs of parentheses {\ttfamily ()} and unpaired nucleotides by dots {\ttfamily .}.

As a simple example, consider a helix of size 4 enclosing a hairpin loop of size 4. In dot-\/bracket notation, this is annotated as

{\ttfamily ((((....))))}

{\bfseries{Extended Dot-\/\+Bracket Notation}}

A more generalized version of the original Dot-\/\+Bracket notation may use additional pairs of brackets, such as {\ttfamily $<$$>$}, {\ttfamily \{\}}, and {\ttfamily \mbox{[}\mbox{]}}, and matching pairs of uppercase/lowercase letters. This allows for annotating pseudo-\/knots, since different pairs of brackets are not required to be nested.

The following annotations of a simple structure with two crossing helices of size 4 are equivalent\+:

{\ttfamily $<$$<$$<$$<$\mbox{[}\mbox{[}\mbox{[}\mbox{[}....$>$$>$$>$$>$\mbox{]}\mbox{]}\mbox{]}\mbox{]}}~\newline
 {\ttfamily ((((AAAA....))))aaaa}~\newline
 {\ttfamily AAAA\{\{\{\{....aaaa\}\}\}\}} \begin{DoxySeeAlso}{See also}
\mbox{\hyperlink{group__struct__utils__dot__bracket_ga55c4783060a1464f862f858d5599c9e1}{vrna\+\_\+db\+\_\+pack()}}, \mbox{\hyperlink{group__struct__utils__dot__bracket_ga6490adff857d84ce06e6f379ae3a4512}{vrna\+\_\+db\+\_\+unpack()}}, \mbox{\hyperlink{group__struct__utils__dot__bracket_gafd1304f5a86e2e3f1425e725cde44fa2}{vrna\+\_\+db\+\_\+flatten()}}, \mbox{\hyperlink{group__struct__utils__dot__bracket_ga690425199c8b71545e7196e3af1436f8}{vrna\+\_\+db\+\_\+flatten\+\_\+to()}}, \mbox{\hyperlink{group__struct__utils__dot__bracket_gaf9ecd0d7877fecdbb0292e24f40283d5}{vrna\+\_\+db\+\_\+from\+\_\+ptable()}}, \mbox{\hyperlink{group__struct__utils__dot__bracket_ga6a51a36b9245d0bac868c5cd172b9611}{vrna\+\_\+db\+\_\+from\+\_\+plist()}}, \mbox{\hyperlink{group__struct__utils__dot__bracket_ga45360c09fb6d04d96e42dcccbb66015b}{vrna\+\_\+db\+\_\+to\+\_\+element\+\_\+string()}}, \mbox{\hyperlink{group__struct__utils__dot__bracket_ga97dbebaa3fc49524cf5afa338a6c52ee}{vrna\+\_\+db\+\_\+pk\+\_\+remove()}}
\end{DoxySeeAlso}
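The equivalence of the three extended dot-\/bracket annotations above can be checked with a short, self-contained Python sketch. It is only an illustration and not part of the library: the helper name {\ttfamily pairs\_from\_extended\_db()} is hypothetical, and the sketch assumes that every bracket type and every letter is matched independently of the others, with uppercase letters opening and lowercase letters closing a pair.

```python
def pairs_from_extended_db(structure):
    """Return the set of base pairs (i, j), 0-based with i < j.

    Hypothetical helper for illustration only; it is not a library function.
    """
    closers = {")": "(", ">": "<", "}": "{", "]": "["}
    stacks = {}  # one stack of open positions per bracket type / letter
    pairs = set()
    for pos, ch in enumerate(structure):
        if ch == ".":
            continue  # unpaired nucleotide
        if ch in "(<{[" or ch.isupper():
            stacks.setdefault(ch, []).append(pos)
        elif ch in closers or ch.islower():
            # a lowercase letter closes the matching uppercase letter
            opener = closers.get(ch, ch.upper())
            if not stacks.get(opener):
                raise ValueError(f"unmatched {ch!r} at position {pos}")
            pairs.add((stacks[opener].pop(), pos))
        else:
            raise ValueError(f"unexpected symbol {ch!r} at position {pos}")
    if any(stacks.values()):
        raise ValueError("unmatched opening symbol")
    return pairs


variants = [
    "<<<<[[[[....>>>>]]]]",
    "((((AAAA....))))aaaa",
    "AAAA{{{{....aaaa}}}}",
]
pair_sets = [pairs_from_extended_db(v) for v in variants]
print(pair_sets[0] == pair_sets[1] == pair_sets[2])
```

Because each symbol type keeps its own stack, the two helices may cross without confusing the matching, which is exactly what the plain dot-\/bracket notation cannot express.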
\hypertarget{rna_structure_notations_wuss-notation}{}\doxysubsubsection{Washington University Secondary Structure (\+WUSS) notation}\label{rna_structure_notations_wuss-notation}
The WUSS notation is frequently used for consensus secondary structures in \mbox{\hyperlink{file_formats_msa-formats-stockholm}{Stockholm 1.\+0 format}}.

This notation allows for a fine-\/grained annotation of base pairs and unpaired nucleotides, including pseudo-\/knots. Below, you\textquotesingle{}ll find a list of secondary structure elements and their corresponding WUSS annotation (see also the Infernal user guide at \href{http://eddylab.org/infernal/Userguide.pdf}{\texttt{ http\+://eddylab.\+org/infernal/\+Userguide.\+pdf}}).
\begin{DoxyItemize}
\item {\bfseries{Base pairs}}~\newline
 Nested base pairs are annotated by matching pairs of the symbols {\ttfamily $<$$>$}, {\ttfamily ()}, {\ttfamily \{\}}, and {\ttfamily \mbox{[}\mbox{]}}. Each of these bracket types has its own special meaning; however, when used as input for our programs, e.\+g. as a structure constraint, these details are usually ignored. Furthermore, base pairs that constitute a pseudo-\/knot are denoted by letters from the Latin alphabet and are, unless stated otherwise, ignored entirely in our programs.
\item {\bfseries{Hairpin loops}}~\newline
 Unpaired nucleotides that constitute the hairpin loop are indicated by underscores, {\ttfamily \+\_\+}.

Example\+: {\ttfamily $<$$<$$<$$<$$<$\+\_\+\+\_\+\+\_\+\+\_\+\+\_\+$>$$>$$>$$>$$>$}
\item {\bfseries{Bulges and interior loops}}~\newline
 Residues that constitute a bulge or interior loop are denoted by dashes, {\ttfamily -\/}.

Example\+: {\ttfamily (((-\/-\/$<$$<$\+\_\+\+\_\+\+\_\+\+\_\+\+\_\+$>$$>$-\/)))}
\item {\bfseries{Multibranch loops}}~\newline
 Unpaired nucleotides in multibranch loops are indicated by commas {\ttfamily ,}.

Example\+: {\ttfamily (((,,$<$$<$\+\_\+\+\_\+\+\_\+\+\_\+\+\_\+$>$$>$,$<$$<$\+\_\+\+\_\+\+\_\+\+\_\+$>$$>$)))}
\item {\bfseries{External residues}}~\newline
 Single stranded nucleotides in the exterior loop, i.\+e. those not enclosed by any base pair, are denoted by colons, {\ttfamily \+:}.

Example\+: {\ttfamily $<$$<$$<$\+\_\+\+\_\+\+\_\+\+\_\+$>$$>$$>$\+::\+:}
\item {\bfseries{Insertions}}~\newline
 In cases where an alignment represents the consensus with a known structure, insertions relative to the known structure are denoted by periods, {\ttfamily .}. Regions where local structural alignment was invoked, leaving regions of both target and query sequence unaligned, are indicated by tildes, {\ttfamily $\sim$}. \begin{DoxyNote}{Note}
These symbols only appear in alignments of a known (query) structure annotation to a target sequence of unknown structure.
\end{DoxyNote}

\item {\bfseries{Pseudo-\/knots}}~\newline
 The WUSS notation allows for annotation of pseudo-\/knots using pairs of upper-\/case/lower-\/case letters. \begin{DoxyNote}{Note}
Our programs and library functions usually ignore pseudo-\/knots entirely, treating them as unpaired nucleotides, unless stated otherwise.
\end{DoxyNote}
Example\+: {\ttfamily $<$$<$$<$\+\_\+\+AAA\+\_\+\+\_\+\+\_\+$>$$>$$>$aaa}
\end{DoxyItemize}

\begin{DoxySeeAlso}{See also}
\mbox{\hyperlink{group__struct__utils__wuss_ga02ca70cffb2d864f7b2d95d92218bae0}{vrna\+\_\+db\+\_\+from\+\_\+\+WUSS()}}
\end{DoxySeeAlso}
\hypertarget{rna_structure_notations_shapes-notation}{}\doxysubsubsection{Abstract Shapes}\label{rna_structure_notations_shapes-notation}
Abstract Shapes, introduced by Giegerich et al. in 2004 \cite{giegerich:2004}, collapse the secondary structure while retaining the nestedness of helices and hairpin loops.

The abstract shapes representation abstracts the structure from individual base pairs and their corresponding location in the sequence, while retaining the inherent nestedness of helices and hairpin loops.

Below is a description of what each shape level retains, together with the result for the following example sequence and structure\+: \begin{DoxyVerb}CGUCUUAAACUCAUCACCGUGUGGAGCUGCGACCCUUCCCUAGAUUCGAAGACGAG
((((((...(((..(((...))))))...(((..((.....))..)))))))))..
\end{DoxyVerb}
 \DoxyHorRuler{0}


\tabulinesep=1mm
\begin{longtabu}spread 0pt [c]{*{3}{|X[-1]}|}
\hline
\PBS\centering \cellcolor{\tableheadbgcolor}\textbf{ Shape Level }&\PBS\centering \cellcolor{\tableheadbgcolor}\textbf{ Description }&\PBS\centering \cellcolor{\tableheadbgcolor}\textbf{ Result }\\\cline{1-3}
\endfirsthead
\hline
\endfoot
\hline
\PBS\centering \cellcolor{\tableheadbgcolor}\textbf{ Shape Level }&\PBS\centering \cellcolor{\tableheadbgcolor}\textbf{ Description }&\PBS\centering \cellcolor{\tableheadbgcolor}\textbf{ Result }\\\cline{1-3}
\endhead
1 &Most accurate -\/ all loops and all unpaired regions &{\ttfamily \mbox{[}\+\_\+\mbox{[}\+\_\+\mbox{[}\mbox{]}\mbox{]}\+\_\+\mbox{[}\+\_\+\mbox{[}\mbox{]}\+\_\+\mbox{]}\mbox{]}\+\_\+} \\\cline{1-3}
2 &Nesting pattern for all loop types and unpaired regions in external loop and multiloop &{\ttfamily \mbox{[}\mbox{[}\+\_\+\mbox{[}\mbox{]}\mbox{]}\mbox{[}\+\_\+\mbox{[}\mbox{]}\+\_\+\mbox{]}\mbox{]}} \\\cline{1-3}
3 &Nesting pattern for all loop types but no unpaired regions &{\ttfamily \mbox{[}\mbox{[}\mbox{[}\mbox{]}\mbox{]}\mbox{[}\mbox{[}\mbox{]}\mbox{]}\mbox{]}} \\\cline{1-3}
4 &Helix nesting pattern in external loop and multiloop &{\ttfamily \mbox{[}\mbox{[}\mbox{]}\mbox{[}\mbox{[}\mbox{]}\mbox{]}\mbox{]}} \\\cline{1-3}
5 &Most abstract -\/ helix nesting pattern and no unpaired regions &{\ttfamily \mbox{[}\mbox{[}\mbox{]}\mbox{[}\mbox{]}\mbox{]}} \\\cline{1-3}
\end{longtabu}


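To make the most abstract level more concrete, the following Python sketch derives a level 5 shape string from a plain dot-\/bracket structure. It is a simplified illustration and not the implementation behind {\ttfamily vrna\_abstract\_shapes()}: the helper name {\ttfamily shape\_level5()} is hypothetical, and the sketch assumes that level 5 drops all unpaired regions and merges helices that are separated only by bulges or interior loops, so that only hairpins and branching points remain visible.

```python
def shape_level5(structure):
    """Return a level 5 shape string for a nested dot-bracket structure.

    Hypothetical helper for illustration; assumes level 5 keeps only the
    nesting of hairpins and branching (multibranch/exterior) loops.
    """
    root = []        # base pairs directly inside the exterior loop
    stack = [root]   # stack[-1] collects the pairs enclosed by the open pair
    for ch in structure:
        if ch == "(":
            node = []
            stack[-1].append(node)
            stack.append(node)
        elif ch == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced structure")
            stack.pop()
        elif ch != ".":
            raise ValueError(f"unexpected symbol {ch!r}")
    if len(stack) != 1:
        raise ValueError("unbalanced structure")

    def render(node):
        # A pair enclosing exactly one other pair (stack, bulge, or interior
        # loop) is merged with the pair it encloses.
        while len(node) == 1:
            node = node[0]
        return "[" + "".join(render(child) for child in node) + "]"

    return "".join(render(child) for child in root)


# The example structure from the text, assembled piecewise for readability.
structure = "".join([
    "((((((", "...", "(((", "..", "(((", "...", "))))))", "...",
    "(((", "..", "((", ".....", "))", "..", ")))", "))))))", "..",
])
print(shape_level5(structure))
```

For the example structure this reproduces the level 5 entry of the table; the lower levels additionally keep unpaired regions or loop types and would need more bookkeeping than this sketch provides.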
\begin{DoxyNote}{Note}
Our implementations also provide the special Shape Level 0, which does not collapse any structural features but simply converts base pairs and unpaired nucleotides into their corresponding set of symbols for abstract shapes.
\end{DoxyNote}
\begin{DoxySeeAlso}{See also}
\mbox{\hyperlink{group__struct__utils__abstract__shapes_gafca0add98ede22bf2c22608878c61b22}{vrna\+\_\+abstract\+\_\+shapes()}}, \mbox{\hyperlink{group__struct__utils__abstract__shapes_ga2fd59087e1c4e3d460e5823ba6d693b4}{vrna\+\_\+abstract\+\_\+shapes\+\_\+pt()}}
\end{DoxySeeAlso}
\hypertarget{rna_structure_notations_sec_structure_representations_tree}{}\doxysubsubsection{Tree Representations of Secondary Structures}\label{rna_structure_notations_sec_structure_representations_tree}
Secondary structures can be readily represented as trees, where internal nodes represent base pairs, and leaves represent unpaired nucleotides. The dot-\/bracket structure string already is a tree represented by a string of parentheses (base pairs) and dots for the leaf nodes (unpaired nucleotides).

Alternatively, one may find representations with two types of node labels, {\ttfamily P} for paired and {\ttfamily U} for unpaired; a dot is then replaced by {\ttfamily (U)}, and each closing bracket is assigned an additional identifier {\ttfamily P}. We call this the expanded notation. In \cite{fontana:1993b} a condensed representation of the secondary structure is proposed, the so-\/called homeomorphically irreducible tree (HIT) representation. Here a stack is represented as a single pair of matching brackets labeled {\ttfamily P} and weighted by the number of base pairs. Correspondingly, a contiguous stretch of unpaired bases is shown as one pair of matching brackets labeled {\ttfamily U} and weighted by its length. Generally, any string consisting of matching brackets and identifiers is equivalent to a plane tree with as many different types of nodes as there are identifiers.
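The two rewriting rules just described can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the library code: the helper names {\ttfamily expanded\_notation()} and {\ttfamily hit\_notation()} are hypothetical, both add the virtual root {\ttfamily R}, and only plain nested dot-\/bracket input is handled.

```python
def expanded_notation(structure):
    """Expanded notation: '.' becomes '(U)', every ')' gets the label 'P'."""
    table = {".": "(U)", "(": "(", ")": "P)"}
    return "(" + "".join(table[ch] for ch in structure) + "R)"


def pair_table(structure):
    """0-based pair table; -1 marks unpaired positions."""
    table = [-1] * len(structure)
    stack = []
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            partner = stack.pop()
            table[pos], table[partner] = partner, pos
    return table


def hit_notation(structure):
    """HIT notation: stacks and unpaired stretches collapsed into weighted nodes."""
    table = pair_table(structure)

    def walk(i, j):
        out = []
        while i <= j:
            if table[i] < 0:
                run = 0  # length of the contiguous unpaired stretch
                while i <= j and table[i] < 0:
                    run += 1
                    i += 1
                out.append(f"(U{run})")
            else:
                a, b, weight = i, table[i], 1
                # extend the stack while the next inner positions pair up
                while a + 1 < b - 1 and table[a + 1] == b - 1:
                    a, b, weight = a + 1, b - 1, weight + 1
                out.append("(" + walk(a + 1, b - 1) + f"P{weight})")
                i = table[i] + 1
        return "".join(out)

    return "(" + walk(0, len(structure) - 1) + "R)"


example = ".((..(((...)))..((..))))."
print(expanded_notation(example))
print(hit_notation(example))
```

The structure used here is the same one that serves as the running example for the tree representations later in this section.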

Bruce Shapiro proposed a coarse grained representation \cite{shapiro:1988}, which does not retain the full information of the secondary structure. He represents the different structure elements by single matching brackets and labels them as


\begin{DoxyItemize}
\item {\ttfamily H} (hairpin loop),
\item {\ttfamily I} (interior loop),
\item {\ttfamily B} (bulge),
\item {\ttfamily M} (multi-\/loop), and
\item {\ttfamily S} (stack).
\end{DoxyItemize}

We extend his alphabet by an extra letter for external elements, {\ttfamily E}. Again, these identifiers may be followed by a weight corresponding to the number of unpaired bases or base pairs in the structure element. All tree representations (except for the dot-\/bracket form) can be encapsulated into a virtual root (labeled {\ttfamily R}).

The following example illustrates the different linear tree representations used by the package\+:

Consider the secondary structure represented by the dot-\/bracket string (full tree) {\ttfamily .((..(((...)))..((..)))).} which is the most convenient condensed notation used by our programs and library functions.

Then, the following tree representations are equivalent\+:


\begin{DoxyItemize}
\item Expanded tree\+:~\newline
 {\ttfamily ((U)(((U)(U)((((U)(U)(U)P)P)P)(U)(U)(((U)(U)P)P)P)P)(U)R)}
\item HIT representation (Fontana et al.
1993 \cite{fontana:1993b})\+:~\newline
 {\ttfamily ((U1)((U2)((U3)P3)(U2)((U2)P2)P2)(U1)R)}
\item Coarse Grained \mbox{\hyperlink{structTree}{Tree}} Representation (Shapiro 1988 \cite{shapiro:1988})\+:
\begin{DoxyItemize}
\item Short (with root node {\ttfamily R}, without stem nodes {\ttfamily S})\+:~\newline
 {\ttfamily ((H)((H)M)R)}
\item Full (with root node {\ttfamily R})\+:~\newline
 {\ttfamily (((((H)S)((H)S)M)S)R)}
\item Extended (with root node {\ttfamily R}, with external nodes {\ttfamily E})\+:~\newline
 {\ttfamily ((((((H)S)((H)S)M)S)E)R)}
\item Weighted (with root node {\ttfamily R}, with external nodes {\ttfamily E})\+:~\newline
 {\ttfamily ((((((H3)S3)((H2)S2)M4)S2)E2)R)}
\end{DoxyItemize}
\end{DoxyItemize}

The Expanded tree is rather clumsy and mostly included for the sake of completeness. The different versions of Coarse Grained \mbox{\hyperlink{structTree}{Tree}} Representations are variations of Shapiro\textquotesingle{}s linear tree notation.

For the output of aligned structures from string editing, different representations are needed, where we put the label on both sides. The above examples for tree representations would then look like\+:

\begin{DoxyVerb}a) (UU)(P(P(P(P(UU)(UU)(P(P(P(UU)(UU)(UU)P)P)P)(UU)(UU)(P(P(UU)(U...
b) (UU)(P2(P2(U2U2)(P2(U3U3)P3)(U2U2)(P2(U2U2)P2)P2)(UU)P2)(UU)
c) (B(M(HH)(HH)M)B)
   (S(B(S(M(S(HH)S)(S(HH)S)M)S)B)S)
   (E(S(B(S(M(S(HH)S)(S(HH)S)M)S)B)S)E)
d) (R(E2(S2(B1(S2(M4(S3(H3)S3)((H2)S2)M4)S2)B1)S2)E2)R)
\end{DoxyVerb}


Aligned structures additionally contain the gap character {\ttfamily \+\_\+}.
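Since every linear tree string with labels on the closing side is just a bracketed plane tree, it can be read back into a nested data structure with a small stack-\/based parser. The Python sketch below is illustrative only: the names {\ttfamily parse\_tree()} and {\ttfamily leaf\_labels()} are hypothetical, the node layout {\ttfamily (label, weight, children)} is an arbitrary choice, and the two-\/sided labels and gap characters of aligned output are deliberately not handled.

```python
import re

# Either an opening bracket, or a label with optional weight and its ")".
TOKEN = re.compile(r"\(|([A-Z])(\d*)\)")


def parse_tree(text):
    """Parse a linear tree string into nested (label, weight, children) tuples."""
    stack = [[]]   # stack[-1] collects the children of the currently open node
    consumed = 0
    for match in TOKEN.finditer(text):
        if match.start() != consumed:
            raise ValueError(f"unexpected text at position {consumed}")
        consumed = match.end()
        if match.group() == "(":
            stack.append([])
            continue
        if len(stack) == 1:
            raise ValueError("unbalanced tree string")
        children = stack.pop()
        label, digits = match.group(1), match.group(2)
        weight = int(digits) if digits else None
        stack[-1].append((label, weight, children))
    if consumed != len(text) or len(stack) != 1:
        raise ValueError("malformed tree string")
    return stack[0]


def leaf_labels(node):
    """Collect the labels of all leaves below a node, left to right."""
    label, _, children = node
    if not children:
        return [label]
    return [leaf for child in children for leaf in leaf_labels(child)]


root = parse_tree("((((((H3)S3)((H2)S2)M4)S2)E2)R)")[0]
print(root[0], leaf_labels(root))
```

Here the weighted coarse grained example from the list above yields a root node {\ttfamily R} whose only leaves are the two hairpin nodes.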
\begin{DoxySeeAlso}{See also}
\mbox{\hyperlink{group__struct__utils__tree_ga56551ab7da64933a7230d29430f40cfe}{vrna\+\_\+db\+\_\+to\+\_\+tree\+\_\+string()}}, \mbox{\hyperlink{group__struct__utils__tree_gaa31da26a3f582ddc35a84ff1b9c0a2b0}{vrna\+\_\+tree\+\_\+string\+\_\+unweight()}}, \mbox{\hyperlink{group__struct__utils__tree_ga99d280319a7fd3f87e9f0d8c44520774}{vrna\+\_\+tree\+\_\+string\+\_\+to\+\_\+db()}}
\end{DoxySeeAlso}
\hypertarget{rna_structure_notations_structure_notations_examples}{}\doxysubsection{Examples for Structure Parsing and Conversion}\label{rna_structure_notations_structure_notations_examples}
\hypertarget{rna_structure_notations_structure_notations_api}{}\doxysubsection{Structure Parsing and Conversion API}\label{rna_structure_notations_structure_notations_api}
Several functions are provided for parsing structures and converting them to different representations.

\begin{DoxyVerb}char *expand_Full(const char *structure)
\end{DoxyVerb}
 Converts the full structure from bracket notation to the expanded notation including root.

\begin{DoxyVerb}char *b2HIT (const char *structure)
\end{DoxyVerb}
 Converts the full structure from bracket notation to the HIT notation including root.

\begin{DoxyVerb}char *b2C (const char *structure)
\end{DoxyVerb}
 Converts the full structure from bracket notation to a coarse grained notation using the \textquotesingle{}H\textquotesingle{} \textquotesingle{}B\textquotesingle{} \textquotesingle{}I\textquotesingle{} \textquotesingle{}M\textquotesingle{} and \textquotesingle{}R\textquotesingle{} identifiers.

\begin{DoxyVerb}char *b2Shapiro (const char *structure)
\end{DoxyVerb}
 Converts the full structure from bracket notation to the {\itshape weighted} coarse grained notation using the \textquotesingle{}H\textquotesingle{} \textquotesingle{}B\textquotesingle{} \textquotesingle{}I\textquotesingle{} \textquotesingle{}M\textquotesingle{} \textquotesingle{}S\textquotesingle{} \textquotesingle{}E\textquotesingle{} and \textquotesingle{}R\textquotesingle{} identifiers.

\begin{DoxyVerb}char *expand_Shapiro (const char *coarse);
\end{DoxyVerb}
 Inserts missing \textquotesingle{}S\textquotesingle{} identifiers in unweighted coarse grained structures as obtained from \mbox{\hyperlink{group__struct__utils__deprecated_ga9c80d92391f2833549a8b6dac92233f0}{b2\+C()}}.

\begin{DoxyVerb}char *add_root (const char *structure)
\end{DoxyVerb}
 Adds a root to an un-\/rooted tree in any notation except bracket notation.

\begin{DoxyVerb}char *unexpand_Full (const char *ffull)
\end{DoxyVerb}
 Restores the bracket notation from an expanded full or HIT tree, that is any tree using only identifiers \textquotesingle{}U\textquotesingle{} \textquotesingle{}P\textquotesingle{} and \textquotesingle{}R\textquotesingle{}.

\begin{DoxyVerb}char *unweight (const char *wcoarse)
\end{DoxyVerb}
 Strips weights from any weighted tree.

\begin{DoxyVerb}void unexpand_aligned_F (char *align[2])
\end{DoxyVerb}
 Converts two aligned structures in expanded notation.

\begin{DoxyVerb}void parse_structure (const char *structure)
\end{DoxyVerb}
 Collects statistics on the structure elements of the full structure in bracket notation.
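To illustrate what two of these conversions do conceptually, the Python sketch below strips weights from a weighted tree and restores the bracket notation from an expanded or HIT tree. It does not call the library and is not its implementation: the helper names {\ttfamily strip\_weights()} and {\ttfamily tree\_to\_dot\_bracket()} are hypothetical stand-\/ins that mimic the documented behaviour of {\ttfamily unweight()} and {\ttfamily unexpand\_Full()}, and malformed input is not validated.

```python
import re


def strip_weights(tree):
    """Remove all numeric weights from a linear tree string."""
    return re.sub(r"\d+", "", tree)


def tree_to_dot_bracket(tree):
    """Restore dot-bracket notation from a tree using only U, P and R labels.

    A missing weight counts as 1, so expanded and HIT trees are both accepted.
    """
    buffers = [[]]  # buffers[-1] collects the text produced inside the open node
    for match in re.finditer(r"\(|([UPR])(\d*)\)", tree):
        if match.group() == "(":
            buffers.append([])
            continue
        inner = "".join(buffers.pop())
        label = match.group(1)
        weight = int(match.group(2) or 1)
        if label == "U":
            piece = "." * weight                           # unpaired stretch
        elif label == "P":
            piece = "(" * weight + inner + ")" * weight    # stack of pairs
        else:
            piece = inner                                  # virtual root R
        buffers[-1].append(piece)
    return "".join(buffers[0])


hit = "((U1)((U2)((U3)P3)(U2)((U2)P2)P2)(U1)R)"
expanded = "((U)(((U)(U)((((U)(U)(U)P)P)P)(U)(U)(((U)(U)P)P)P)P)(U)R)"
print(tree_to_dot_bracket(hit))
print(tree_to_dot_bracket(expanded))
print(strip_weights("((((((H3)S3)((H2)S2)M4)S2)E2)R)"))
```

Both tree strings are taken from the examples earlier in this section and map back to the same dot-\/bracket structure, while stripping the weights from the weighted coarse grained tree gives its extended form.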

\begin{DoxySeeAlso}{See also}
\mbox{\hyperlink{RNAstruct_8h}{RNAstruct.\+h}} for prototypes and a more detailed description
\end{DoxySeeAlso}
\hypertarget{file_formats}{}\doxysection{File Formats}\label{file_formats}
\hypertarget{file_formats_msa-formats}{}\doxysubsection{File formats for Multiple Sequence Alignments (\+MSA)}\label{file_formats_msa-formats}
\hypertarget{file_formats_msa-formats-clustal}{}\doxysubsubsection{Clustal\+W format}\label{file_formats_msa-formats-clustal}
The {\itshape ClustalW} format is a relatively simple text file containing a single multiple sequence alignment of DNA, RNA, or protein sequences. It was first used as an output format for the {\itshape clustalw} programs, but nowadays it may also be generated by various other sequence alignment tools. The specification is straightforward\+:


\begin{DoxyItemize}
\item The first line starts with the words\begin{DoxyVerb}CLUSTAL W \end{DoxyVerb}
 or \begin{DoxyVerb}CLUSTALW \end{DoxyVerb}

\item After the above header there is at least one empty line
\item Finally, one or more blocks of sequence data follow, where each block is separated by at least one empty line
\end{DoxyItemize}Each line in a block of sequence data consists of the sequence name followed by the sequence symbols, separated by at least one whitespace character. Usually, the length of a sequence in one block does not exceed 60 symbols. Optionally, an additional whitespace separated cumulative residue count may follow the sequence symbols. Optionally, a block may be followed by a line depicting the degree of conservation of the respective alignment columns.

\begin{DoxyNote}{Note}
Sequence names and the sequences must not contain whitespace characters! Allowed gap symbols are the hyphen {\itshape }(\char`\"{}-\/\char`\"{}), and dot {\itshape }(\char`\"{}.\char`\"{}).
\end{DoxyNote}
\begin{DoxyWarning}{Warning}
Please note that many programs that output this format tend to truncate the sequence names to a limited number of characters, for instance the first 15 characters. This can destroy the uniqueness of identifiers in your MSA.
\end{DoxyWarning}
Here is an example alignment in ClustalW format\+:
\begin{DoxyVerbInclude}
CLUSTAL W (1.83) multiple sequence alignment


AL031296.1/85969-86120      CUGCCUCACAACGUUUGUGCCUCAGUUACCCGUAGAUGUAGUGAGGGUAACAAUACUUAC
AANU01225121.1/438-603      CUGCCUCACAACAUUUGUGCCUCAGUUACUCAUAGAUGUAGUGAGGGUGACAAUACUUAC
AAWR02037329.1/29294-29150  ---CUCGACACCACU---GCCUCGGUUACCCAUCGGUGCAGUGCGGGUAGUAGUACCAAU

AL031296.1/85969-86120      UCUCGUUGGUGAUAAGGAACAGCU
AANU01225121.1/438-603      UCUCGUUGGUGAUAAGGAACAGCU
AAWR02037329.1/29294-29150  GCUAAUUAGUUGUGAGGACCAACU
\end{DoxyVerbInclude}
\hypertarget{file_formats_msa-formats-stockholm}{}\doxysubsubsection{Stockholm 1.\+0 format}\label{file_formats_msa-formats-stockholm}
Here is an example alignment in Stockholm 1.\+0 format\+:
\begin{DoxyVerbInclude}
# STOCKHOLM 1.0

#=GF AC RF01293
#=GF ID ACA59
#=GF DE Small nucleolar RNA ACA59
#=GF AU Wilkinson A
#=GF SE Predicted; WAR; Wilkinson A
#=GF SS Predicted; WAR; Wilkinson A
#=GF GA 43.00
#=GF TC 44.90
#=GF NC 40.30
#=GF TP Gene; snRNA; snoRNA; HACA-box;
#=GF BM cmbuild -F CM SEED
#=GF CB cmcalibrate --mpi CM
#=GF SM cmsearch --cpu 4 --verbose --nohmmonly -E 1000 -Z 549862.597050 CM SEQDB
#=GF DR snoRNABase; ACA59;
#=GF DR SO; 0001263; ncRNA_gene;
#=GF DR GO; 0006396; RNA processing;
#=GF DR GO; 0005730; nucleolus;
#=GF RN [1]
#=GF RM 15199136
#=GF RT Human box H/ACA pseudouridylation guide RNA machinery.
#=GF RA Kiss AM, Jady BE, Bertrand E, Kiss T
#=GF RL Mol Cell Biol. 2004;24:5797-5807.
#=GF WK Small_nucleolar_RNA
#=GF SQ 3


AL031296.1/85969-86120      CUGCCUCACAACGUUUGUGCCUCAGUUACCCGUAGAUGUAGUGAGGGUAACAAUACUUACUCUCGUUGGUGAUAAGGAACAGCU
AANU01225121.1/438-603      CUGCCUCACAACAUUUGUGCCUCAGUUACUCAUAGAUGUAGUGAGGGUGACAAUACUUACUCUCGUUGGUGAUAAGGAACAGCU
AAWR02037329.1/29294-29150  ---CUCGACACCACU---GCCUCGGUUACCCAUCGGUGCAGUGCGGGUAGUAGUACCAAUGCUAAUUAGUUGUGAGGACCAACU
#=GC SS_cons                -----((((,<<<<<<<<<___________>>>>>>>>>,,,,<<<<<<<______>>>>>>>,,,,,))))::::::::::::
#=GC RF                     CUGCcccaCAaCacuuguGCCUCaGUUACcCauagguGuAGUGaGgGuggcAaUACccaCcCucgUUgGuggUaAGGAaCAgCU
//
\end{DoxyVerbInclude}


\begin{DoxySeeAlso}{See also}
\mbox{\hyperlink{rna_structure_notations_wuss-notation}{Washington University Secondary Structure (WUSS) notation}} on legal characters for the consensus secondary structure line {\itshape SS\+\_\+cons} and their interpretation
\end{DoxySeeAlso}
\hypertarget{file_formats_msa-formats-fasta}{}\doxysubsubsection{FASTA (\+Pearson) format}\label{file_formats_msa-formats-fasta}
\begin{DoxyNote}{Note}
Sequence names must not contain whitespace characters. Otherwise, the parts after the first whitespace will be dropped. The only allowed gap character is the hyphen {\itshape }(\char`\"{}-\/\char`\"{}).
\end{DoxyNote}
Here is an example alignment in FASTA format\+:
\begin{DoxyVerbInclude}
>AL031296.1/85969-86120
CUGCCUCACAACGUUUGUGCCUCAGUUACCCGUAGAUGUAGUGAGGGUAACAAUACUUAC
UCUCGUUGGUGAUAAGGAACAGCU
>AANU01225121.1/438-603
CUGCCUCACAACAUUUGUGCCUCAGUUACUCAUAGAUGUAGUGAGGGUGACAAUACUUAC
UCUCGUUGGUGAUAAGGAACAGCU
>AAWR02037329.1/29294-29150
---CUCGACACCACU---GCCUCGGUUACCCAUCGGUGCAGUGCGGGUAGUAGUACCAAU
GCUAAUUAGUUGUGAGGACCAACU
\end{DoxyVerbInclude}
\hypertarget{file_formats_msa-formats-maf}{}\doxysubsubsection{MAF format}\label{file_formats_msa-formats-maf}
The multiple alignment format (MAF) is usually used to store multiple alignments at the DNA level between entire genomes.
It consists of independent blocks of aligned sequences which are annotated by their genomic location. Consequently, an MAF formatted MSA file may contain multiple records. MAF files start with a line \begin{DoxyVerb}##maf
\end{DoxyVerb}
 which is optionally extended by whitespace delimited key=value pairs. Lines starting with the character (\char`\"{}\#\char`\"{}) are considered comments and usually ignored.

A MAF block starts with character (\char`\"{}a\char`\"{}) at the beginning of a line, optionally followed by whitespace delimited key=value pairs. The next lines start with character (\char`\"{}s\char`\"{}) and contain sequence information of the form \begin{DoxyVerb}s src start size strand srcSize sequence
\end{DoxyVerb}
 where
\begin{DoxyItemize}
\item {\itshape src} is the name of the sequence source
\item {\itshape start} is the start of the aligned region within the source (0-\/based)
\item {\itshape size} is the length of the aligned region without gap characters
\item {\itshape strand} is either (\char`\"{}+\char`\"{}) or (\char`\"{}-\/\char`\"{}), depicting the location of the aligned region relative to the source
\item {\itshape src\+Size} is the size of the entire sequence source, e.\+g.
the full chromosome
\item {\itshape sequence} is the aligned sequence including gaps depicted by the hyphen (\char`\"{}-\/\char`\"{})
\end{DoxyItemize}Here is an example alignment in MAF format (taken directly from the \href{https://cgwb.nci.nih.gov/FAQ/FAQformat.html\#format5}{\texttt{ UCSC Genome browser website}})\+:
\begin{DoxyVerbInclude}
##maf version=1 scoring=tba.v8
# tba.v8 (((human chimp) baboon) (mouse rat))
# multiz.v7
# maf_project.v5 _tba_right.maf3 mouse _tba_C
# single_cov2.v4 single_cov2 /dev/stdin

a score=23262.0
s hg16.chr7    27578828 38 + 158545518 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s panTro1.chr6 28741140 38 + 161576975 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s baboon         116834 38 +   4622798 AAA-GGGAATGTTAACCAAATGA---GTTGTCTCTTATGGTG
s mm4.chr6     53215344 38 + 151104725 -AATGGGAATGTTAAGCAAACGA---ATTGTCTCTCAGTGTG
s rn3.chr4     81344243 40 + 187371129 -AA-GGGGATGCTAAGCCAATGAGTTGTTGTCTCTCAATGTG

a score=5062.0
s hg16.chr7    27699739  6 + 158545518 TAAAGA
s panTro1.chr6 28862317  6 + 161576975 TAAAGA
s baboon         241163  6 +   4622798 TAAAGA
s mm4.chr6     53303881  6 + 151104725 TAAAGA
s rn3.chr4     81444246  6 + 187371129 taagga

a score=6636.0
s hg16.chr7    27707221 13 + 158545518 gcagctgaaaaca
s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
s baboon         249182 13 +   4622798 gcagctgaaaaca
s mm4.chr6     53310102 13 + 151104725 ACAGCTGAAAATA

\end{DoxyVerbInclude}
\hypertarget{file_formats_constraint-formats}{}\doxysubsection{File formats to manipulate the RNA folding grammar}\label{file_formats_constraint-formats}
\hypertarget{file_formats_constraint-formats-file}{}\doxysubsubsection{Command Files}\label{file_formats_constraint-formats-file}
The RNAlib and many programs of the Vienna\+RNA Package can parse and apply data from so-\/called command files.
These commands may refer to structure constraints or even extensions of the RNA folding grammar (such as \mbox{\hyperlink{group__domains__up}{Unstructured Domains}}). Commands are given as a line of whitespace delimited data fields. The syntax we use extends the constraint definitions used in the \href{http://mfold.rna.albany.edu/?q=mfold}{\texttt{ mfold}} / \href{http://mfold.rna.albany.edu/?q=DINAMelt/software}{\texttt{ UNAfold}} software, where each line begins with a command character followed by a set of positions.~\newline
However, we introduce several new commands and allow for an optional loop type context specifier in the form of a sequence of characters, as well as an orientation flag that enables one to force a nucleotide to pair upstream or downstream.\hypertarget{file_formats_constraint_commands}{}\doxyparagraph{Constraint commands}\label{file_formats_constraint_commands}
The following set of commands is recognized\+:
\begin{DoxyItemize}
\item {\ttfamily F} $ \ldots $ Force
\item {\ttfamily P} $ \ldots $ Prohibit
\item {\ttfamily C} $ \ldots $ Conflicts/\+Context dependency
\item {\ttfamily A} $ \ldots $ Allow (for non-\/canonical pairs)
\item {\ttfamily E} $ \ldots $ Soft constraints for unpaired position(s), or base pair(s)
\end{DoxyItemize}\hypertarget{file_formats_domain_commands}{}\doxyparagraph{RNA folding grammar extensions}\label{file_formats_domain_commands}

\begin{DoxyItemize}
\item {\ttfamily UD} $ \ldots $ Add ligand binding using the \mbox{\hyperlink{group__domains__up}{Unstructured Domains}} feature
\end{DoxyItemize}\hypertarget{file_formats_command_file_loop_types}{}\doxyparagraph{Specification of the loop type context}\label{file_formats_command_file_loop_types}
The optional loop type context specifier {\ttfamily \mbox{[}LOOP\mbox{]}} may be a combination of the following\+:
\begin{DoxyItemize}
\item {\ttfamily E} $ \ldots $ Exterior loop
\item {\ttfamily H} $ \ldots $ Hairpin loop
\item
{\ttfamily I} $ \ldots $ Interior loop
\item {\ttfamily M} $ \ldots $ Multibranch loop
\item {\ttfamily A} $ \ldots $ All loops
\end{DoxyItemize}

For structure constraints, we additionally allow one to address base pairs enclosed by a particular kind of loop. This results in the specifier {\ttfamily \mbox{[}WHERE\mbox{]}}, which consists of {\ttfamily \mbox{[}LOOP\mbox{]}} plus the following characters\+:
\begin{DoxyItemize}
\item {\ttfamily i} $ \ldots $ enclosed pair of an Interior loop
\item {\ttfamily m} $ \ldots $ enclosed pair of a Multibranch loop
\end{DoxyItemize}

If no {\ttfamily \mbox{[}LOOP\mbox{]}} or {\ttfamily \mbox{[}WHERE\mbox{]}} flags are set, all contexts are considered (equivalent to {\ttfamily A}).\hypertarget{file_formats_const_file_orientation}{}\doxyparagraph{Controlling the orientation of base pairing}\label{file_formats_const_file_orientation}
For particular nucleotides that are forced to pair, the following {\ttfamily \mbox{[}ORIENTATION\mbox{]}} flags may be used\+:
\begin{DoxyItemize}
\item {\ttfamily U} $ \ldots $ Upstream
\item {\ttfamily D} $ \ldots $ Downstream
\end{DoxyItemize}

If no {\ttfamily \mbox{[}ORIENTATION\mbox{]}} flag is set, both directions are considered.\hypertarget{file_formats_const_file_seq_coords}{}\doxyparagraph{Sequence coordinates}\label{file_formats_const_file_seq_coords}
Sequence positions of nucleotides/base pairs are $ 1 $-\/based and consist of three positions $ i $, $ j $, and $ k $. Alternatively, four positions may be provided as a pair of two position ranges $ [i:j] $, and $ [k:l] $ using the \textquotesingle{}-\/\textquotesingle{} sign as delimiter within each range, i.\+e.
$ i-j $, and $ k-l $.\hypertarget{file_formats_const_file_syntax}{}\doxyparagraph{Valid constraint commands}\label{file_formats_const_file_syntax}
Below are the general cases that are considered {\itshape valid} constraints\+:


\begin{DoxyEnumerate}
\item {\bfseries{\char`\"{}\+Forcing a range of nucleotide positions to be paired\char`\"{}}}\+:~\newline
 Syntax\+:
\begin{DoxyCode}{0}
\DoxyCodeLine{F i 0 k [WHERE] [ORIENTATION] }

\end{DoxyCode}
~\newline
 Description\+:~\newline
 Enforces the set of $ k $ consecutive nucleotides starting at position $ i $ to be paired. The optional loop type specifier {\ttfamily \mbox{[}WHERE\mbox{]}} allows one to force them to appear as closing/enclosed pairs of certain types of loops.
\item {\bfseries{\char`\"{}\+Forcing a set of consecutive base pairs to form\char`\"{}}}\+:~\newline
 Syntax\+:\begin{DoxyVerb}F i j k [WHERE] \end{DoxyVerb}
~\newline
 Description\+:~\newline
 Enforces the base pairs $ (i,j), \ldots, (i+(k-1), j-(k-1)) $ to form. The optional loop type specifier {\ttfamily \mbox{[}WHERE\mbox{]}} allows one to specify in which loop context the base pair must appear.
\item {\bfseries{\char`\"{}\+Prohibiting a range of nucleotide positions to be paired\char`\"{}}}\+:~\newline
 Syntax\+:\begin{DoxyVerb}P i 0 k [WHERE] \end{DoxyVerb}
~\newline
 Description\+:~\newline
 Prohibits a set of $ k $ consecutive nucleotides, starting at position $ i $, from participating in base pairing, i.\+e. makes these positions unpaired. The optional loop type specifier {\ttfamily \mbox{[}WHERE\mbox{]}} allows one to force the nucleotides to appear within loops of specific types.
\item {\bfseries{\char`\"{}\+Prohibiting a set of consecutive base pairs to form\char`\"{}}}\+:~\newline
 Syntax\+:\begin{DoxyVerb}P i j k [WHERE] \end{DoxyVerb}
~\newline
 Description\+:~\newline
 Prohibits the base pairs $ (i,j), \ldots, (i+(k-1), j-(k-1)) $ from forming.
The optional loop type specifier {\ttfamily \mbox{[}WHERE\mbox{]}} restricts the prohibition to pairs that close, or are enclosed by, loops of the given type.
\item {\bfseries{\char`\"{}\+Prohibiting two ranges of nucleotides to pair with each other\char`\"{}}}\+:~\newline
 Syntax\+:\begin{DoxyVerb}P i-j k-l [WHERE] \end{DoxyVerb}
~\newline
 Description\+:~\newline
 Prohibits any nucleotide $ p \in [i:j] $ from pairing with any other nucleotide $ q \in [k:l] $. The optional loop type specifier {\ttfamily \mbox{[}WHERE\mbox{]}} restricts the prohibition to pairs that close, or are enclosed by, loops of the given type.
\item {\bfseries{\char`\"{}\+Enforce a loop context for a range of nucleotide positions\char`\"{}}}\+:~\newline
 Syntax\+:\begin{DoxyVerb}C i 0 k [WHERE] \end{DoxyVerb}
~\newline
 Description\+:~\newline
 This command enforces nucleotides to be unpaired, similar to {\itshape prohibiting} nucleotides to be paired, as described above. It also marks the corresponding nucleotides as unpaired. In addition, the {\ttfamily \mbox{[}WHERE\mbox{]}} flag can be used to enforce specific loop types the nucleotides must appear in.
\item {\bfseries{\char`\"{}\+Remove pairs that conflict with a set of consecutive base pairs\char`\"{}}}\+:~\newline
 Syntax\+:\begin{DoxyVerb}C i j k \end{DoxyVerb}
~\newline
 Description\+:~\newline
 Removes all base pairs that conflict with the set of consecutive base pairs $ (i,j), \ldots, (i+(k-1), j-(k-1)) $. Two base pairs $ (i,j) $ and $ (p,q) $ conflict with each other if $ i < p < j < q $, or $ p < i < q < j $.
\item {\bfseries{\char`\"{}\+Allow a set of consecutive (non-\/canonical) base pairs to form\char`\"{}}}\+:~\newline
 Syntax\+:
\begin{DoxyCode}{0}
\DoxyCodeLine{A i j k [WHERE] }

\end{DoxyCode}
~\newline
 Description\+:~\newline
 This command enables the formation of the consecutive base pairs $ (i,j), \ldots, (i+(k-1), j-(k-1)) $, no matter whether they are {\itshape canonical} or {\itshape non-\/canonical}. In contrast to the above {\ttfamily F} and {\ttfamily C} commands, which remove conflicting base pairs, the {\ttfamily A} command does not. Therefore, it may be used to allow {\itshape non-\/canonical} base pair interactions. Since RNAlib does not contain free energy contributions $ E_{ij} $ for non-\/canonical base pairs $ (i,j) $, they are scored as the {\itshape maximum} of similar, known contributions. In terms of a {\itshape Nussinov}-\/like scoring function, the free energy of non-\/canonical base pairs is therefore estimated as \[ E_{ij} = \min \left[ \max_{(i,k) \in \{GC, CG, AU, UA, GU, UG\}} E_{ik}, \max_{(k,j) \in \{GC, CG, AU, UA, GU, UG\}} E_{kj} \right]. \] The optional loop type specifier {\ttfamily \mbox{[}WHERE\mbox{]}} specifies the loop context in which the base pairs may appear.
\item {\bfseries{\char`\"{}\+Apply pseudo free energy to a range of unpaired nucleotide positions\char`\"{}}}\+:~\newline
 Syntax\+:
\begin{DoxyCode}{0}
\DoxyCodeLine{E i 0 k e }

\end{DoxyCode}
~\newline
 Description\+:~\newline
 Use this command to apply a pseudo free energy of $ e $ to the set of $ k $ consecutive nucleotides starting at position $ i $. The pseudo free energy is applied only if these nucleotides are considered unpaired in the recursions or evaluations, and is expected to be given in $ \mathrm{kcal/mol} $.
\item {\bfseries{\char`\"{}\+Apply pseudo free energy to a set of consecutive base pairs\char`\"{}}}\+:~\newline
 Syntax\+:
\begin{DoxyCode}{0}
\DoxyCodeLine{E i j k e }

\end{DoxyCode}
~\newline
 Description\+:~\newline
 Use this command to apply a pseudo free energy of $ e $ to the set of base pairs $ (i,j), \ldots, (i+(k-1), j-(k-1)) $. Energies are expected to be given in $ \mathrm{kcal/mol} $.
\end{DoxyEnumerate}\hypertarget{file_formats_domains_syntax}{}\doxyparagraph{Valid domain extension commands}\label{file_formats_domains_syntax}

\begin{DoxyEnumerate}
\item {\bfseries{\char`\"{}\+Add ligand binding to unpaired motif (a.\+k.\+a. unstructured domains)\char`\"{}}}\+:~\newline
 Syntax\+:
\begin{DoxyCode}{0}
\DoxyCodeLine{UD m e [LOOP] }

\end{DoxyCode}
~\newline
 Description\+:~\newline
 Add ligand binding to the unpaired sequence motif $ m $ (given in IUPAC format, capital letters) with binding energy $ e $ in particular loop type(s).~\newline
 Example\+:
\begin{DoxyCode}{0}
\DoxyCodeLine{UD AAA -\/5.0 A}

\end{DoxyCode}
~\newline
 The above example applies a binding free energy of $ -5\,\mathrm{kcal/mol} $ for a motif AAA that may be present in all loop types.
\end{DoxyEnumerate}\hypertarget{plots}{}\doxysection{Plotting}\label{plots}
Create Plots of Secondary Structures, Feature Motifs, and Sequence Alignments.\hypertarget{plots_utils_ss}{}\doxysubsection{Producing secondary structure graphs}\label{plots_utils_ss}
\begin{DoxyVerb}int PS_rna_plot (char *string,
                 char *structure,
                 char *file)
\end{DoxyVerb}
 Produce a secondary structure graph in Post\+Script and write it to \textquotesingle{}file\textquotesingle{}.

\begin{DoxyVerb}int PS_rna_plot_a (char *string,
                   char *structure,
                   char *file,
                   char *pre,
                   char *post)
\end{DoxyVerb}
 Produce a secondary structure graph in Post\+Script including additional annotation macros and write it to \textquotesingle{}file\textquotesingle{}.

\begin{DoxyVerb}int gmlRNA (char *string,
            char *structure,
            char *ssfile,
            char option)
\end{DoxyVerb}
 Produce a secondary structure graph in Graph Meta Language (gml) and write it to a file.

\begin{DoxyVerb}int ssv_rna_plot (char *string,
                  char *structure,
                  char *ssfile)
\end{DoxyVerb}
 Produce a secondary structure graph in SStruct\+View format.

\begin{DoxyVerb}int svg_rna_plot (char *string,
                  char *structure,
                  char *ssfile)
\end{DoxyVerb}
 Produce a secondary structure plot in SVG format and write it to a file.

\begin{DoxyVerb}int xrna_plot (char *string,
               char *structure,
               char *ssfile)
\end{DoxyVerb}
 Produce a secondary structure plot for further editing in XRNA.

\begin{DoxyVerb}int rna_plot_type
\end{DoxyVerb}
 Switch for changing the secondary structure layout algorithm.

Two low-\/level functions provide direct access to the graph layout algorithms\+:

\begin{DoxyVerb}int simple_xy_coordinates (short *pair_table,
                           float *X,
                           float *Y)
\end{DoxyVerb}
 Calculate nucleotide coordinates for a secondary structure plot the {\itshape Simple way}.

\begin{DoxyVerb}int naview_xy_coordinates (short *pair_table,
                           float *X,
                           float *Y)
\end{DoxyVerb}
 Calculate nucleotide coordinates for a secondary structure plot using the {\itshape Naview} algorithm.


\begin{DoxySeeAlso}{See also}
\mbox{\hyperlink{PS__dot_8h}{PS\+\_\+dot.\+h}} and naview.\+h for more detailed descriptions.
\end{DoxySeeAlso}
\hypertarget{plots_utils_dot}{}\doxysubsection{Producing (colored) dot plots for base pair probabilities}\label{plots_utils_dot}
\begin{DoxyVerb}int PS_color_dot_plot (char *string,
                       cpair *pi,
                       char *filename)
\end{DoxyVerb}


\begin{DoxyVerb}int PS_color_dot_plot_turn (char *seq,
                            cpair *pi,
                            char *filename,
                            int winSize)
\end{DoxyVerb}


\begin{DoxyVerb}int PS_dot_plot_list (char *seq,
                      char *filename,
                      plist *pl,
                      plist *mf,
                      char *comment)
\end{DoxyVerb}
 Produce a Post\+Script dot-\/plot from two pair lists.

\begin{DoxyVerb}int PS_dot_plot_turn (char *seq,
                      struct plist *pl,
                      char *filename,
                      int winSize)
\end{DoxyVerb}


\begin{DoxySeeAlso}{See also}
\mbox{\hyperlink{PS__dot_8h}{PS\+\_\+dot.\+h}} for more detailed descriptions.
\end{DoxySeeAlso}
\hypertarget{plots_utils_aln}{}\doxysubsection{Producing (colored) alignments}\label{plots_utils_aln}
\begin{DoxyVerb}int PS_color_aln (const char *structure,
                  const char *filename,
                  const char *seqs[],
                  const char *names[])
\end{DoxyVerb}
 Produce a Post\+Script sequence alignment color-\/annotated by consensus structure.
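
The low-\/level layout functions {\ttfamily simple\+\_\+xy\+\_\+coordinates()} and {\ttfamily naview\+\_\+xy\+\_\+coordinates()} listed under secondary structure graphs take a {\ttfamily short $\ast$pair\+\_\+table} rather than a dot-\/bracket string. The following is a minimal sketch of how such a table might be derived from a plain dot-\/bracket string. It assumes a 1-\/based layout in which element 0 holds the sequence length and element $ i $ holds the pairing partner of position $ i $ (0 if unpaired). The helper name {\ttfamily sketch\+\_\+pair\+\_\+table()} is hypothetical and not part of the library; in practice the library\textquotesingle{}s own {\ttfamily vrna\+\_\+ptable()} would normally be used instead.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of the library): convert a plain
 * dot-bracket string into a 1-based pair table.
 * Assumed layout: pt[0] = length, pt[i] = partner of i, 0 if unpaired.
 * Returns NULL on unbalanced brackets, unsupported characters,
 * overlong input, or allocation failure. The caller frees the result. */
short *
sketch_pair_table(const char *structure)
{
  assert(structure != NULL);

  size_t n = strlen(structure);
  if (n > (size_t)SHRT_MAX)
    return NULL;                      /* positions must fit into a short */

  short   *pt    = calloc(n + 2, sizeof *pt);      /* zero = unpaired   */
  short   *stack = malloc((n + 1) * sizeof *stack);/* open bracket pos. */
  size_t  top    = 0;

  if (pt == NULL || stack == NULL)
    goto fail;

  pt[0] = (short)n;

  for (size_t i = 1; i <= n; i++) {
    char c = structure[i - 1];

    if (c == '(') {
      stack[top++] = (short)i;        /* remember opening position */
    } else if (c == ')') {
      if (top == 0)
        goto fail;                    /* closing bracket without partner */

      short j = stack[--top];
      pt[i] = j;                      /* store the pair in both directions */
      pt[j] = (short)i;
    } else if (c != '.') {
      goto fail;                      /* only '(', ')' and '.' are handled */
    }
  }

  if (top != 0)
    goto fail;                        /* unmatched opening bracket(s) */

  free(stack);
  return pt;

fail:
  free(pt);
  free(stack);
  return NULL;
}
```

For the hairpin {\ttfamily ((((....))))}, the sketch stores the length 12 in element 0 and pairs position 1 with 12 and position 4 with 9. Only the basic bracket pair {\ttfamily ()} is handled here; the extended notation with additional bracket types is deliberately left out.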