1=head1 NAME
2
3RNA - interface to the ViennaRNA C-library (libRNA.a)
4
5=head1 SYNOPSIS
6
7  use RNA;
8  $seq = "CGCAGGGAUACCCGCG";
9  ($struct, $mfe) = RNA::fold($seq);  #predict mfe structure of $seq
10  RNA::PS_rna_plot($seq, $struct, "rna.ps");  # write PS plot to rna.ps
11  $F = RNA::pf_fold($seq);   # compute partition function and pair pobabilities
12  RNA::PS_dot_plot($seq, "dot.ps");          # write dot plot to dot.ps
13  ...
14
15=head1 DESCRIPTION
16
17The RNA.pm package gives access to almost all functions in the libRNA.a
18library of the Vienna RNA PACKAGE. The Perl wrapper is generated using
19SWIG http://www.swig.org/ with relatively little manual intervention.
20For each C function in the library the perl package provides a function
21of the same name and calling convention (with few exceptions). For
22detailed information you should therefore also consult the documentation
23of the library (info RNAlib).
24
25Note that in general C arrays are wrapped into opaque objects that can
26only be accessed via helper functions. SWIG provides a couple of general
27purpose helper functions, see the section at the end of this file. C
28structures are wrapped into Perl objects using SWIG's shadow class
29mechanism, resulting in a tied hash with keys named after the structure
30members.
31
32For the interrested reader we list for each scalar type of the
33corepsonding C variable in brackets, and point out the header files
34containing the C declaration.
35
36=head2 Folding Routines
37
38Minimum free Energy Folding (from fold.h)
39
40=over 4
41
42=item fold SEQUENCE
43
44=item fold SEQUENCE, CONSTRAINTS
45
46computes the minimum free energy structure of the string SEQUENCE and returns
47the predicted structure and energy, e.g.
48
49  ($structure, $mfe) = RNA::fold("UGUGUCGAUGUGCUAU");
50
51If a second argument is supplied and
52L<$fold_constrained|/$fold_constrained>==1 the CONSTRAINTS string is
53used to specify constraints on the predicted structure.  The
54characters '|', 'x', '<', '>' mark bases that are paired, unpaired,
55paired upstream, or downstream, respectively; matching brackets "( )"
56denote base pairs, dots '.' are used for unconstrained bases.
57
58In the two argument version the CONSTRAINTS string is modified and holds the
59predicted structure upon return. This is done for backwards compatibility only,
60and might change in future versions.
61
62=item energy_of_struct SEQUENCE, STRUCTURE
63
64returns the energy of SEQUENCE on STRUCTURE (in kcal/mol). The string structure
65must hold a valid secondary structure in bracket notation.
66
67=item update_fold_params
68
69recalculate the pair matrix and energy parameters after a change in folding
70parameters. In many cases (such as changes to
71L<$temperature|/$temperature>) the fold() routine will call
72update_fold_params automatically when necessary.
73
74=item free_arrays
75
76frees memory allocated internally when calling L<fold|/fold>.
77
78
79=item cofold SEQUENCE
80
81=item cofold SEQUENCE, CONSTRAINTS
82
83works as fold, but SEQUENCE may be the concatenation of two RNAs in order
84compute their hybridization structure. E.g.:
85
86  $seq1  ="CGCAGGGAUACCCGCG";
87  $seq2  ="GCGCCCAUAGGGACGC";
88  $RNA::cut_point = length($seq1)+1;
89  ($costruct, $comfe) = RNA::cofold($seq1 . $seq2);
90
91=item duplexfold SEQ1 SEQ2
92
93compute the structure upon hybridization of SEQ1 and SEQ2. In contrast to
94cofold only intra-molecular pairs are allowed. Thus, the algorithm runs in
95O(n1*n2) time where n1 and n2 are the lengths of the sequences. The result
96is returned in a C struct containing the innermost base pair (i,j) the
97structure and energy. E.g:
98
99  $seq1 ="CGCAGGGAUACCCGCG";
100  $seq2 ="GCGCCCAUAGGGACGC";
101  $dup  = RNA::duplexfold($seq1, $seq2);
102  print "Region ", $dup->{i}+1-length($seq1), " to ",
103        $dup->{i}, " of seq1 ",
104        "pairs up with ", $dup->{j}, " to ",
105        $dup->{j}+length($dup->{structure})-length($seq1)-2,
106        " of seq2\n";
107
108=back
109
110Partition function Folding (from part_func.h)
111
112=over 4
113
114=item pf_fold SEQUENCE
115
116=item pf_fold SEQUENCE, CONSTRAINTS
117
118calculates the partition function over all possible secondary
119structures and the matrix of pair probabilities for SEQUENCE and
120returns a two element list consisting of a string summarizing possible
121structures. See below on how to access the pair probability matrix. As
122with L<fold|/fold> the second argument can be used to specify folding
123constraints. Constraints are implemented by excluding base pairings
124that contradict the constraint, but without bonus
125energies. Constraints of type '|' (paired base) are ignored.  In the
126two argument version CONSTRAINTS is modified to contain the structure
127string on return (obsolete feature, for backwards compatibility only)
128
129=item get_pr I, J
130
131After calling C<pf_fold> the global C variable C<pr> points to the
132computed pair probabilities. Perl access to the C is facilitated by
133the C<get_pr> helper function that looks up and returns the
134probability of the pair (I,J).
135
136=item free_pf_arrays
137
138frees memory allocated for pf_fold
139
140=item update_pf_params LENGTH
141
142recalculate energy parameters for pf_fold. In most cases (such as
143simple changes to L<$temperature|/$temperature>) C<pf_fold>
144will take appropriate action automatically.
145
146=item pbacktrack SEQUENCE
147
148return a random structure chosen according to it's Boltzmann probability.
149Use to produce samples representing the thermodynamic ensemble of
150structures.
151
152  RNA::pf_fold($sequence);
153  for (1..1000) {
154     push @sample, RNA::pbacktrack($sequence);
155  }
156
157=item co_pf_fold SEQUENCE
158
159=item co_pf_fold SEQUENCE, CONSTRAINTS
160
161calculates the partition function over all possible secondary
162structures and the matrix of pair probabilities for SEQUENCE.
163SEQUENCE is a concatenation of two sequences (see cofold).
164Returns a five element list consisting of a string summarizing possible
165structures as first element. The second element is the Gibbs free energy of Sequence 1 (as computed also with pf_fold), the third element the Gibbs free energy of Sequence 2. The fourth element is the Gibbs free energy of all structures that have INTERmolecular base pairs, and finally the fifth element is the Gibbs free energy of the whole ensemble (dimers as well as monomers).
166See above on how to access the pair probability matrix. As
167with L<fold|/fold> the second argument can be used to specify folding
168constraints. Constraints are implemented by excluding base pairings
169that contradict the constraint, but without bonus
170energies. Constraints of type '|' (paired base) are ignored.  In the
171two argument version CONSTRAINTS is modified to contain the structure
172string on return (obsolete feature, for backwards compatibility only)
173
174=item free_co_pf_arrays
175
176frees memory allocated for co_pf_fold
177
178=item update_pf_co_params LENGTH
179
180recalculate energy parameters for co_pf_fold. In most cases (such as
181simple changes to L<$temperature|/$temperature>) C<co_pf_fold>
182will take appropriate action automatically.
183
184=item get_concentrations FdAB, FdAA, FdBB, FA, FB, CONCA, CONCB
185
186calculates equilibrium concentrations of the three dimers AB, AA, and BB, as well as the two monomers A and B out of the free energies of the duplexes (FdAB, FdAA, FdBB, these are the fourth elements returned by co_pf_fold), the monomers (FA, FB (e.g. the second and third elements returned by co_pf_fold with sequences AB) and the start concentrations of A and B. It returns as first element the concentration of AB dimer, than AA and BB dimer, as fourth element the A monomer concentration, and as fifth and last element the B monomer concentration.
187So, to compute concentrations, you first have to run 3 co_pf_folds (with sequences AB, AA and BB).
188
189=back
190
191Suboptimal Folding (from subopt.h)
192
193=over 4
194
195=item subopt SEQUENCE, CONSTRAINTS, DELTA
196
197=item subopt SEQUENCE, CONSTRAINTS, DELTA, FILEHANDLE
198
199compute all structures of SEQUENCE within DELTA*0.01 kcal/mol of the
200optimum. If specified, results are written to FILEHANDLE and nothing
201is returned. Else, the C function returnes a list of C structs of type
202SOLUTION. The list is wrapped by SWIG as a perl object that can be
203accesses as follows:
204
205  $solution = subopt($seq, undef, 500);
206  for (0..$solution->size()-1) {
207     printf "%s %6.2f\n",  $solution->get($_)->{structure},
208			   $solution->get($_)->{energy};
209  }
210
211=back
212
213Alignment Folding (from alifold.h)
214
215=over 4
216
217=item alifold REF
218
219=item fold REF, CONSTRAINTS
220
221similar to fold() but compute the consensus structure for a set of aligned
222sequences. E.g.:
223
224  @align = ("GCCAUCCGAGGGAAAGGUU",
225	    "GAUCGACAGCGUCU-AUCG",
226	    "CCGUCUUUAUGAGUCCGGC");
227  ($consens_struct, $consens_en) = RNA::alifold(\@align);
228
229=item consensus REF
230=item consens_mis REF
231
232compute a simple consensus sequence or "most informative sequence" form an
233alignment. The simple consensus returns the most frequent character for
234each column, the MIS uses the IUPAC symbol that contains all characters
235that are overrepresented in the column.
236
237  $mis = consensus_mis(\@align);
238
239
240=back
241
242Inverse Folding (from inverse.h)
243
244=over 4
245
246=item inverse_fold START, TARGET
247
248find a sequence that folds into structure TARGET, by optimizing the
249sequence until its mfe structure (as returned by L<fold|/fold>) is
250TARGET. Startpoint of the optimization is the sequence START. Returns
251a list containing the sequence found and the final value of the cost
252function, i.e. 0 if the search was successful. A random start sequence
253can be generated using L<random_string|/random_string>.
254
255=item inverse_pf_fold START, TARGET
256
257optimizes a sequence (beginning with START) by maximising the
258frequency of the structure TARGET in the thermodynamic ensemble
259of structures. Returns a list containing the optimized sequence and
260the final value of the cost function. The cost function is given by
261C<energy_of_struct(seq, TARGET) - pf_fold(seq)>, i.e.C<-RT*log(p(TARGET))>
262
263=item $final_cost [float]
264
265holds the value of the cost function where the optimization in
266C<inverse_pf_fold> should stop. For values <=0 the optimization will
267only terminate at a local optimimum (which might take very long to reach).
268
269=item $symbolset [char *]
270
271the string symbolset holds the allowed characters to be used by
272C<inverse_fold> and C<inverse_pf_fold>, the default alphabet is "AUGC"
273
274
275=item $give_up [int]
276
277If non-zero stop optimization when its clear that no exact solution
278can be found. Else continue and eventually return an approximate
279solution. Default 0.
280
281=back
282
283Cofolding of two RNA molecules (from cofold.h)
284
285=over 4
286
287
288=back
289
290Global Variables to Modify Folding (from fold_vars.h)
291
292=over 4
293
294=item $noGU [int]
295
296Do not allow GU pairs to form, default 0.
297
298=item $no_closingGU [int]
299
300allow GU only inside stacks, default 0.
301
302=item $tetra_loop [int]
303
304Fold with specially stable 4-loops, default 1.
305
306=item $energy_set [int]
307
3080 = BP; 1=any mit GC; 2=any mit AU-parameter, default 0.
309
310=item $dangles [int]
311
312How to compute dangling ends. 0: no dangling end energies, 1: "normal"
313dangling ends (default), 2: simplified dangling ends, 3: "normal" +
314co-axial stacking. Note that L<pf_fold|/pf_fold> treats cases 1 and 3
315as 2. The same holds for the main computation in L<subopt|/subopt>,
316however subopt will re-evalute energies using
317L<energy_of_struct|energy_of_struct> for cases 1 and 3. See the more
318detailed discussion in RNAlib.texinfo.
319
320=item $nonstandards [char *]
321
322contains allowed non standard bases, default empty string ""
323
324=item $temperature [double]
325
326temperature in degrees Celsius for rescaling parameters, default 37C.
327
328=item $logML [int]
329
330use logarithmic multiloop energy function in
331L<energy_of_struct|/energy_of_struct>, default 0.
332
333=item $noLonelyPairs [int]
334
335consider only structures without isolated base pairs (helices of length 1).
336For L<pf_fold|/pf_fold> only eliminates pairs
337that can B<only> occur as isolated pairs. Default 0.
338
339=item $base_pair [struct bond *]
340
341list of base pairs from last call to L<fold|/fold>. Better use
342the structure string returned by  L<fold|/fold>.
343
344=item $pf_scale [double]
345
346scaling factor used by L<pf_fold|/pf_fold> to avoid overflows. Should
347be set to exp(-F/(RT*length)) where F is a guess for the ensmble free
348energy (e.g. use the mfe).
349
350
351=item $fold_constrained [int]
352
353apply constraints in the folding algorithms, default 0.
354
355=item $do_backtrack [int]
356
357If 0 do not compute the pair probabilities in L<pf_fold|/pf_fold>
358(only the partition function). Default 1.
359
360=item $backtrack_type [char]
361
362usually 'F'; 'C' require (1,N) to be bonded; 'M' backtrack as if the
363sequence was part of a multi loop. Used by L<inverse_fold|/inverse_fold>
364
365=item $pr [double *]
366
367the base pairing prob. matrix computed by L<pf_fold|/pf_fold>.
368
369=item $iindx [int *]
370
371Array of indices for moving withing the C<pr> array. Better use
372L<get_pr|/get_pr>.
373
374=back
375
376from move_set.h
377
378=over 4
379
380=item move_standard SEQUENCE, STRUCTURE, MOVE_TYPE, VERBOSITY, SHIFTS, noLP
381
382Walking method to find local minima. There are three different kinds available
383which can be cosen using the MOVE_TYPE enum:
3840 - GRADIENT: take the neighbouring structure with the lowest energy
3851 - FIRST: take the first neighbour with a lower energy
3862 - ADAPTIVE: randomly choose a neighbour
387STRUCTURE is the start structure and will also be used to return the target structure.
388Others are options set as integers.
389
390=back
391
392=head2 Parsing and Comparing Structures
393
394from RNAstruct.h: these functions convert between strings
395representating secondary structures with various levels of coarse
396graining. See the documentation of the C library for details
397
398=over 4
399
400=item b2HIT STRUCTURE
401
402Full -> HIT [incl. root]
403
404=item b2C STRUCTURE
405
406Full -> Coarse [incl. root]
407
408=item b2Shapiro STRUCTURE
409
410Full -> weighted Shapiro [i.r.]
411
412=item add_root STRUCTURE
413
414{Tree} -> ({Tree}R)
415
416=item expand_Shapiro COARSE
417
418add S for stacks to coarse struct
419
420=item expand_Full STRUCTURE
421
422Full -> FFull
423
424=item unexpand_Full FSTRUCTURE
425
426FFull -> Full
427
428=item unweight WCOARSE
429
430remove weights from coarse struct
431
432=item unexpand_aligned_F ALIGN
433
434
435
436=item parse_structure STRUCTURE
437
438computes structure statistics, and fills the following global variables:
439
440$loops    [int] number of loops (and stacks)
441$unpaired [int] number of unpaired positions
442$pairs    [int] number of paired positions
443$loop_size[int *]  holds all loop sizes
444$loop_degree[int *] holds all loop degrees
445$helix_size[int *] holds all helix lengths
446
447=back
448
449from treedist.h: routines for computing tree-edit distances between structures
450
451=over 4
452
453=item make_tree XSTRUCT
454
455convert a structure string as produced by the expand_... functions to a
456Tree, useable as input to tree_edit_distance.
457
458=item tree_edit_distance T1, T2
459
460compare to structures using tree editing. C<T1>, C<T2> must have been
461created using C<tree_edit_distance>
462
463=item print_tree T
464
465mainly for debugging
466
467=item free_tree T
468
469free space allocated by make_tree
470
471=back
472
473from stringdist.h routines to compute structure distances via string-editing
474
475=over 4
476
477=item Make_swString STRUCTURE
478
479[ returns swString * ]
480make input for string_edit_distance
481
482=item string_edit_distance S1, S2
483
484[ returns float  ]
485compare to structures using string alignment. C<S1>, C<S2> should be
486created using C<Make_swString>
487
488=back
489
490from profiledist
491
492=over
493
494=item Make_bp_profile LENGTH
495
496[ returns (float *) ]
497condense pair probability matrix C<pr> into a vector containing
498probabilities for unpaired, upstream paired and downstream paired.
499This resulting probability profile is used as input for
500profile_edit_distance
501
502=item profile_edit_distance T1, T2
503
504[ returns float ]
505align two probability profiles produced by C<Make_bp_profile>
506
507=item print_bppm T
508
509[ returns void ]
510print string representation of probability profile
511
512=item free_profile T
513
514[ returns void ]
515free space allocated in Make_bp_profile
516
517=back
518
519Global variables for computing structure distances
520
521=over 4
522
523=item $edit_backtrack [int]
524
525set to 1 if you want backtracking
526
527=item $aligned_line [(char *)[2]]
528
529containes alignmed structures after computing structure distance with
530C<edit_backtrack==1>
531
532=item $cost_matrix [int]
533
5340 usual costs (default), 1 Shapiro's costs
535
536=back
537
538=head2 Utilities (from utils.h)
539
540=over 4
541
542=item space SIZE
543
544allocate memory from C. Usually not needed in Perl
545
546=item nrerror MESSGAE
547
548die with error message. Better use Perl's C<die>
549
550=item $xsubi [unsigned short[3]]
551
552libRNA uses the rand48 48bit random number generator if available, the
553current random  number is always stored in $xsubi.
554
555=item init_rand
556
557initialize the $xsubi random number from current time
558
559=item urn
560
561returns a random number between 0 and 1 using the random number
562generator from the RNA library.
563
564=item int_urn FROM, TO
565
566returns random integer in the range [FROM..TO]
567
568=item time_stamp
569
570current date in a string. In perl you might as well use C<locatime>
571
572=item random_string LENGTH, SYMBOLS
573
574returns a string of length LENGTH using characters from the string
575SYMBOLS
576
577=item hamming S1, S2
578
579calculate hamming distance of the strings C<S1> and C<S2>.
580
581
582=item pack_structure STRUCTURE
583
584pack secondary structure, using a 5:1 compression via 3
585encoding. Returns the packed string.
586
587=item unpack_structure PACKED
588
589unpacks a secondary structure packed with pack_structure
590
591=item make_pair_table STRUCTURE
592
593returns a pair table as a newly allocated (short *) C array, such
594that: table[i]=j if (i.j) pair or 0 if i is unpaired, table[0]
595contains the length of the structure.
596
597=item bp_distance STRUCTURE1, STRUCTURE2
598
599returns the base pair distance of the two STRUCTURES. dist = {number
600of base pairs in one structure but not in the other} same as edit
601distance with open-pair close-pair as move-set
602
603=back
604
605from PS_plot.h
606
607=over 4
608
609=item PS_rna_plot SEQUENCE, STRUCTURE, FILENAME
610
611write PostScript drawing of structure to FILENAME. Returns 1 on
612sucess, 0 else.
613
614=item PS_rna_plot_a SEQUENCE, STRUCTURE, FILENAME, PRE, POST
615
616write PostScript drawing of structure to FILENAME. The strings PRE and
617POST contain PostScript code that is included verbatim in the plot just
618before (after) the data.  Returns 1 on sucess, 0 else.
619
620=item gmlRNA SEQUENCE, STRUCTURE, FILENAME, OPTION
621
622write structure drawing in gml (Graph Meta Language) to
623FILENAME. OPTION should be a single character. If uppercase the gml
624output will include the SEQUENCE as node labels. IF OPTION equal 'x'
625or 'X' write graph with coordinates (else only connectivity
626information). Returns 1 on sucess, 0 else.
627
628=item ssv_rna_plot SEQUENCE, STRUCTURE, SSFILE
629
630write structure drfawing as coord file for SStructView Returns 1 on
631sucess, 0 else.
632
633=item xrna_plot SEQUENCE, STRUCTURE, SSFILE
634
635write structure drawing as ".ss" file for further editing in XRNA.
636Returns 1 on sucess, 0 else.
637
638=item PS_dot_plot SEQUENCE, FILENAME
639
640write a PostScript dot plot of the pair probability matix to
641FILENAME. Returns 1 on sucess, 0 else.
642
643=item $rna_plot_type [int]
644
645Select layout algorithm for structure drawings. Currently available
6460= simple coordinates, 1= naview, default 1.
647
648=back
649
650from read_epars.c
651
652=over 4
653
654=item read_parameter_file FILENAME
655
656read energy parameters from FILENAME
657
658=item write_parameter_file FILENAME
659
660write energy parameters to FILENAME
661
662=back
663
664=head2 SWIG helper functions
665
666The package includes generic helper functions to access C arrays
667of type C<int>, C<float> and C<double>, such as:
668
669=over 4
670
671=item intP_getitem POINTER, INDEX
672
673return the element INDEX from the array
674
675=item intP_setitem POINTER, INDEX, VALUE
676
677set element INDEX to VALUE
678
679=item new_intP NELEM
680
681allocate a new C array of integers with NELEM elements and return the pointer
682
683=item delete_intP POINTER
684
685deletes the C array by calling free()
686
687=back
688
689substituting C<intP> with C<floatP>, C<doubleP>, C<ushortP>,
690C<shortP>, gives the corresponding functions for arrays of float or
691double, unsigned short, and short. You need to know the correct C
692type however, and the functions work only for arrays of simple types.
693Note, that the shortP... functions were used for unsigned short in previous
694versions, while starting with v1.8.3 it can only access signed short arrays.
695
696On the lowest level the C<cdata> function gives direct access to any data
697in the form of a Perl string.
698
699=over
700
701=item cdata POINTER, SIZE
702
703copies SIZE bytes at POINTER to a Perl string (with binary data)
704
705=item memmove POINTER, STRING
706
707copies the (binary) string STRING to the memory location pointed to by
708POINTER.
709Note: memmove is broken in current swig versions (e.g. 1.3.31)
710
711=back
712
713In combination with Perl's C<unpack> this provides a generic way to convert
714C data structures to Perl. E.g.
715
716  RNA::parse_structure($structure);  # fills the $RNA::loop_degree array
717  @ldegrees = unpack "I*", RNA::cdata($RNA::loop_degree, ($RNA::loops+1)*4);
718
719Warning: using these functions with wrong arguments will corrupt your
720memory and lead to a segmentation fault.
721
722=head1 AUTHOR
723
724Ivo L. Hofacker <ivo@tbi.univie.ac.at>
725
726=cut
727