1=head1 NAME 2 3RNA - interface to the ViennaRNA C-library (libRNA.a) 4 5=head1 SYNOPSIS 6 7 use RNA; 8 $seq = "CGCAGGGAUACCCGCG"; 9 ($struct, $mfe) = RNA::fold($seq); #predict mfe structure of $seq 10 RNA::PS_rna_plot($seq, $struct, "rna.ps"); # write PS plot to rna.ps 11 $F = RNA::pf_fold($seq); # compute partition function and pair pobabilities 12 RNA::PS_dot_plot($seq, "dot.ps"); # write dot plot to dot.ps 13 ... 14 15=head1 DESCRIPTION 16 17The RNA.pm package gives access to almost all functions in the libRNA.a 18library of the Vienna RNA PACKAGE. The Perl wrapper is generated using 19SWIG http://www.swig.org/ with relatively little manual intervention. 20For each C function in the library the perl package provides a function 21of the same name and calling convention (with few exceptions). For 22detailed information you should therefore also consult the documentation 23of the library (info RNAlib). 24 25Note that in general C arrays are wrapped into opaque objects that can 26only be accessed via helper functions. SWIG provides a couple of general 27purpose helper functions, see the section at the end of this file. C 28structures are wrapped into Perl objects using SWIG's shadow class 29mechanism, resulting in a tied hash with keys named after the structure 30members. 31 32For the interrested reader we list for each scalar type of the 33corepsonding C variable in brackets, and point out the header files 34containing the C declaration. 35 36=head2 Folding Routines 37 38Minimum free Energy Folding (from fold.h) 39 40=over 4 41 42=item fold SEQUENCE 43 44=item fold SEQUENCE, CONSTRAINTS 45 46computes the minimum free energy structure of the string SEQUENCE and returns 47the predicted structure and energy, e.g. 48 49 ($structure, $mfe) = RNA::fold("UGUGUCGAUGUGCUAU"); 50 51If a second argument is supplied and 52L<$fold_constrained|/$fold_constrained>==1 the CONSTRAINTS string is 53used to specify constraints on the predicted structure. The 54characters '|', 'x', '<', '>' mark bases that are paired, unpaired, 55paired upstream, or downstream, respectively; matching brackets "( )" 56denote base pairs, dots '.' are used for unconstrained bases. 57 58In the two argument version the CONSTRAINTS string is modified and holds the 59predicted structure upon return. This is done for backwards compatibility only, 60and might change in future versions. 61 62=item energy_of_struct SEQUENCE, STRUCTURE 63 64returns the energy of SEQUENCE on STRUCTURE (in kcal/mol). The string structure 65must hold a valid secondary structure in bracket notation. 66 67=item update_fold_params 68 69recalculate the pair matrix and energy parameters after a change in folding 70parameters. In many cases (such as changes to 71L<$temperature|/$temperature>) the fold() routine will call 72update_fold_params automatically when necessary. 73 74=item free_arrays 75 76frees memory allocated internally when calling L<fold|/fold>. 77 78 79=item cofold SEQUENCE 80 81=item cofold SEQUENCE, CONSTRAINTS 82 83works as fold, but SEQUENCE may be the concatenation of two RNAs in order 84compute their hybridization structure. E.g.: 85 86 $seq1 ="CGCAGGGAUACCCGCG"; 87 $seq2 ="GCGCCCAUAGGGACGC"; 88 $RNA::cut_point = length($seq1)+1; 89 ($costruct, $comfe) = RNA::cofold($seq1 . $seq2); 90 91=item duplexfold SEQ1 SEQ2 92 93compute the structure upon hybridization of SEQ1 and SEQ2. In contrast to 94cofold only intra-molecular pairs are allowed. Thus, the algorithm runs in 95O(n1*n2) time where n1 and n2 are the lengths of the sequences. The result 96is returned in a C struct containing the innermost base pair (i,j) the 97structure and energy. E.g: 98 99 $seq1 ="CGCAGGGAUACCCGCG"; 100 $seq2 ="GCGCCCAUAGGGACGC"; 101 $dup = RNA::duplexfold($seq1, $seq2); 102 print "Region ", $dup->{i}+1-length($seq1), " to ", 103 $dup->{i}, " of seq1 ", 104 "pairs up with ", $dup->{j}, " to ", 105 $dup->{j}+length($dup->{structure})-length($seq1)-2, 106 " of seq2\n"; 107 108=back 109 110Partition function Folding (from part_func.h) 111 112=over 4 113 114=item pf_fold SEQUENCE 115 116=item pf_fold SEQUENCE, CONSTRAINTS 117 118calculates the partition function over all possible secondary 119structures and the matrix of pair probabilities for SEQUENCE and 120returns a two element list consisting of a string summarizing possible 121structures. See below on how to access the pair probability matrix. As 122with L<fold|/fold> the second argument can be used to specify folding 123constraints. Constraints are implemented by excluding base pairings 124that contradict the constraint, but without bonus 125energies. Constraints of type '|' (paired base) are ignored. In the 126two argument version CONSTRAINTS is modified to contain the structure 127string on return (obsolete feature, for backwards compatibility only) 128 129=item get_pr I, J 130 131After calling C<pf_fold> the global C variable C<pr> points to the 132computed pair probabilities. Perl access to the C is facilitated by 133the C<get_pr> helper function that looks up and returns the 134probability of the pair (I,J). 135 136=item free_pf_arrays 137 138frees memory allocated for pf_fold 139 140=item update_pf_params LENGTH 141 142recalculate energy parameters for pf_fold. In most cases (such as 143simple changes to L<$temperature|/$temperature>) C<pf_fold> 144will take appropriate action automatically. 145 146=item pbacktrack SEQUENCE 147 148return a random structure chosen according to it's Boltzmann probability. 149Use to produce samples representing the thermodynamic ensemble of 150structures. 151 152 RNA::pf_fold($sequence); 153 for (1..1000) { 154 push @sample, RNA::pbacktrack($sequence); 155 } 156 157=item co_pf_fold SEQUENCE 158 159=item co_pf_fold SEQUENCE, CONSTRAINTS 160 161calculates the partition function over all possible secondary 162structures and the matrix of pair probabilities for SEQUENCE. 163SEQUENCE is a concatenation of two sequences (see cofold). 164Returns a five element list consisting of a string summarizing possible 165structures as first element. The second element is the Gibbs free energy of Sequence 1 (as computed also with pf_fold), the third element the Gibbs free energy of Sequence 2. The fourth element is the Gibbs free energy of all structures that have INTERmolecular base pairs, and finally the fifth element is the Gibbs free energy of the whole ensemble (dimers as well as monomers). 166See above on how to access the pair probability matrix. As 167with L<fold|/fold> the second argument can be used to specify folding 168constraints. Constraints are implemented by excluding base pairings 169that contradict the constraint, but without bonus 170energies. Constraints of type '|' (paired base) are ignored. In the 171two argument version CONSTRAINTS is modified to contain the structure 172string on return (obsolete feature, for backwards compatibility only) 173 174=item free_co_pf_arrays 175 176frees memory allocated for co_pf_fold 177 178=item update_pf_co_params LENGTH 179 180recalculate energy parameters for co_pf_fold. In most cases (such as 181simple changes to L<$temperature|/$temperature>) C<co_pf_fold> 182will take appropriate action automatically. 183 184=item get_concentrations FdAB, FdAA, FdBB, FA, FB, CONCA, CONCB 185 186calculates equilibrium concentrations of the three dimers AB, AA, and BB, as well as the two monomers A and B out of the free energies of the duplexes (FdAB, FdAA, FdBB, these are the fourth elements returned by co_pf_fold), the monomers (FA, FB (e.g. the second and third elements returned by co_pf_fold with sequences AB) and the start concentrations of A and B. It returns as first element the concentration of AB dimer, than AA and BB dimer, as fourth element the A monomer concentration, and as fifth and last element the B monomer concentration. 187So, to compute concentrations, you first have to run 3 co_pf_folds (with sequences AB, AA and BB). 188 189=back 190 191Suboptimal Folding (from subopt.h) 192 193=over 4 194 195=item subopt SEQUENCE, CONSTRAINTS, DELTA 196 197=item subopt SEQUENCE, CONSTRAINTS, DELTA, FILEHANDLE 198 199compute all structures of SEQUENCE within DELTA*0.01 kcal/mol of the 200optimum. If specified, results are written to FILEHANDLE and nothing 201is returned. Else, the C function returnes a list of C structs of type 202SOLUTION. The list is wrapped by SWIG as a perl object that can be 203accesses as follows: 204 205 $solution = subopt($seq, undef, 500); 206 for (0..$solution->size()-1) { 207 printf "%s %6.2f\n", $solution->get($_)->{structure}, 208 $solution->get($_)->{energy}; 209 } 210 211=back 212 213Alignment Folding (from alifold.h) 214 215=over 4 216 217=item alifold REF 218 219=item fold REF, CONSTRAINTS 220 221similar to fold() but compute the consensus structure for a set of aligned 222sequences. E.g.: 223 224 @align = ("GCCAUCCGAGGGAAAGGUU", 225 "GAUCGACAGCGUCU-AUCG", 226 "CCGUCUUUAUGAGUCCGGC"); 227 ($consens_struct, $consens_en) = RNA::alifold(\@align); 228 229=item consensus REF 230=item consens_mis REF 231 232compute a simple consensus sequence or "most informative sequence" form an 233alignment. The simple consensus returns the most frequent character for 234each column, the MIS uses the IUPAC symbol that contains all characters 235that are overrepresented in the column. 236 237 $mis = consensus_mis(\@align); 238 239 240=back 241 242Inverse Folding (from inverse.h) 243 244=over 4 245 246=item inverse_fold START, TARGET 247 248find a sequence that folds into structure TARGET, by optimizing the 249sequence until its mfe structure (as returned by L<fold|/fold>) is 250TARGET. Startpoint of the optimization is the sequence START. Returns 251a list containing the sequence found and the final value of the cost 252function, i.e. 0 if the search was successful. A random start sequence 253can be generated using L<random_string|/random_string>. 254 255=item inverse_pf_fold START, TARGET 256 257optimizes a sequence (beginning with START) by maximising the 258frequency of the structure TARGET in the thermodynamic ensemble 259of structures. Returns a list containing the optimized sequence and 260the final value of the cost function. The cost function is given by 261C<energy_of_struct(seq, TARGET) - pf_fold(seq)>, i.e.C<-RT*log(p(TARGET))> 262 263=item $final_cost [float] 264 265holds the value of the cost function where the optimization in 266C<inverse_pf_fold> should stop. For values <=0 the optimization will 267only terminate at a local optimimum (which might take very long to reach). 268 269=item $symbolset [char *] 270 271the string symbolset holds the allowed characters to be used by 272C<inverse_fold> and C<inverse_pf_fold>, the default alphabet is "AUGC" 273 274 275=item $give_up [int] 276 277If non-zero stop optimization when its clear that no exact solution 278can be found. Else continue and eventually return an approximate 279solution. Default 0. 280 281=back 282 283Cofolding of two RNA molecules (from cofold.h) 284 285=over 4 286 287 288=back 289 290Global Variables to Modify Folding (from fold_vars.h) 291 292=over 4 293 294=item $noGU [int] 295 296Do not allow GU pairs to form, default 0. 297 298=item $no_closingGU [int] 299 300allow GU only inside stacks, default 0. 301 302=item $tetra_loop [int] 303 304Fold with specially stable 4-loops, default 1. 305 306=item $energy_set [int] 307 3080 = BP; 1=any mit GC; 2=any mit AU-parameter, default 0. 309 310=item $dangles [int] 311 312How to compute dangling ends. 0: no dangling end energies, 1: "normal" 313dangling ends (default), 2: simplified dangling ends, 3: "normal" + 314co-axial stacking. Note that L<pf_fold|/pf_fold> treats cases 1 and 3 315as 2. The same holds for the main computation in L<subopt|/subopt>, 316however subopt will re-evalute energies using 317L<energy_of_struct|energy_of_struct> for cases 1 and 3. See the more 318detailed discussion in RNAlib.texinfo. 319 320=item $nonstandards [char *] 321 322contains allowed non standard bases, default empty string "" 323 324=item $temperature [double] 325 326temperature in degrees Celsius for rescaling parameters, default 37C. 327 328=item $logML [int] 329 330use logarithmic multiloop energy function in 331L<energy_of_struct|/energy_of_struct>, default 0. 332 333=item $noLonelyPairs [int] 334 335consider only structures without isolated base pairs (helices of length 1). 336For L<pf_fold|/pf_fold> only eliminates pairs 337that can B<only> occur as isolated pairs. Default 0. 338 339=item $base_pair [struct bond *] 340 341list of base pairs from last call to L<fold|/fold>. Better use 342the structure string returned by L<fold|/fold>. 343 344=item $pf_scale [double] 345 346scaling factor used by L<pf_fold|/pf_fold> to avoid overflows. Should 347be set to exp(-F/(RT*length)) where F is a guess for the ensmble free 348energy (e.g. use the mfe). 349 350 351=item $fold_constrained [int] 352 353apply constraints in the folding algorithms, default 0. 354 355=item $do_backtrack [int] 356 357If 0 do not compute the pair probabilities in L<pf_fold|/pf_fold> 358(only the partition function). Default 1. 359 360=item $backtrack_type [char] 361 362usually 'F'; 'C' require (1,N) to be bonded; 'M' backtrack as if the 363sequence was part of a multi loop. Used by L<inverse_fold|/inverse_fold> 364 365=item $pr [double *] 366 367the base pairing prob. matrix computed by L<pf_fold|/pf_fold>. 368 369=item $iindx [int *] 370 371Array of indices for moving withing the C<pr> array. Better use 372L<get_pr|/get_pr>. 373 374=back 375 376from move_set.h 377 378=over 4 379 380=item move_standard SEQUENCE, STRUCTURE, MOVE_TYPE, VERBOSITY, SHIFTS, noLP 381 382Walking method to find local minima. There are three different kinds available 383which can be cosen using the MOVE_TYPE enum: 3840 - GRADIENT: take the neighbouring structure with the lowest energy 3851 - FIRST: take the first neighbour with a lower energy 3862 - ADAPTIVE: randomly choose a neighbour 387STRUCTURE is the start structure and will also be used to return the target structure. 388Others are options set as integers. 389 390=back 391 392=head2 Parsing and Comparing Structures 393 394from RNAstruct.h: these functions convert between strings 395representating secondary structures with various levels of coarse 396graining. See the documentation of the C library for details 397 398=over 4 399 400=item b2HIT STRUCTURE 401 402Full -> HIT [incl. root] 403 404=item b2C STRUCTURE 405 406Full -> Coarse [incl. root] 407 408=item b2Shapiro STRUCTURE 409 410Full -> weighted Shapiro [i.r.] 411 412=item add_root STRUCTURE 413 414{Tree} -> ({Tree}R) 415 416=item expand_Shapiro COARSE 417 418add S for stacks to coarse struct 419 420=item expand_Full STRUCTURE 421 422Full -> FFull 423 424=item unexpand_Full FSTRUCTURE 425 426FFull -> Full 427 428=item unweight WCOARSE 429 430remove weights from coarse struct 431 432=item unexpand_aligned_F ALIGN 433 434 435 436=item parse_structure STRUCTURE 437 438computes structure statistics, and fills the following global variables: 439 440$loops [int] number of loops (and stacks) 441$unpaired [int] number of unpaired positions 442$pairs [int] number of paired positions 443$loop_size[int *] holds all loop sizes 444$loop_degree[int *] holds all loop degrees 445$helix_size[int *] holds all helix lengths 446 447=back 448 449from treedist.h: routines for computing tree-edit distances between structures 450 451=over 4 452 453=item make_tree XSTRUCT 454 455convert a structure string as produced by the expand_... functions to a 456Tree, useable as input to tree_edit_distance. 457 458=item tree_edit_distance T1, T2 459 460compare to structures using tree editing. C<T1>, C<T2> must have been 461created using C<tree_edit_distance> 462 463=item print_tree T 464 465mainly for debugging 466 467=item free_tree T 468 469free space allocated by make_tree 470 471=back 472 473from stringdist.h routines to compute structure distances via string-editing 474 475=over 4 476 477=item Make_swString STRUCTURE 478 479[ returns swString * ] 480make input for string_edit_distance 481 482=item string_edit_distance S1, S2 483 484[ returns float ] 485compare to structures using string alignment. C<S1>, C<S2> should be 486created using C<Make_swString> 487 488=back 489 490from profiledist 491 492=over 493 494=item Make_bp_profile LENGTH 495 496[ returns (float *) ] 497condense pair probability matrix C<pr> into a vector containing 498probabilities for unpaired, upstream paired and downstream paired. 499This resulting probability profile is used as input for 500profile_edit_distance 501 502=item profile_edit_distance T1, T2 503 504[ returns float ] 505align two probability profiles produced by C<Make_bp_profile> 506 507=item print_bppm T 508 509[ returns void ] 510print string representation of probability profile 511 512=item free_profile T 513 514[ returns void ] 515free space allocated in Make_bp_profile 516 517=back 518 519Global variables for computing structure distances 520 521=over 4 522 523=item $edit_backtrack [int] 524 525set to 1 if you want backtracking 526 527=item $aligned_line [(char *)[2]] 528 529containes alignmed structures after computing structure distance with 530C<edit_backtrack==1> 531 532=item $cost_matrix [int] 533 5340 usual costs (default), 1 Shapiro's costs 535 536=back 537 538=head2 Utilities (from utils.h) 539 540=over 4 541 542=item space SIZE 543 544allocate memory from C. Usually not needed in Perl 545 546=item nrerror MESSGAE 547 548die with error message. Better use Perl's C<die> 549 550=item $xsubi [unsigned short[3]] 551 552libRNA uses the rand48 48bit random number generator if available, the 553current random number is always stored in $xsubi. 554 555=item init_rand 556 557initialize the $xsubi random number from current time 558 559=item urn 560 561returns a random number between 0 and 1 using the random number 562generator from the RNA library. 563 564=item int_urn FROM, TO 565 566returns random integer in the range [FROM..TO] 567 568=item time_stamp 569 570current date in a string. In perl you might as well use C<locatime> 571 572=item random_string LENGTH, SYMBOLS 573 574returns a string of length LENGTH using characters from the string 575SYMBOLS 576 577=item hamming S1, S2 578 579calculate hamming distance of the strings C<S1> and C<S2>. 580 581 582=item pack_structure STRUCTURE 583 584pack secondary structure, using a 5:1 compression via 3 585encoding. Returns the packed string. 586 587=item unpack_structure PACKED 588 589unpacks a secondary structure packed with pack_structure 590 591=item make_pair_table STRUCTURE 592 593returns a pair table as a newly allocated (short *) C array, such 594that: table[i]=j if (i.j) pair or 0 if i is unpaired, table[0] 595contains the length of the structure. 596 597=item bp_distance STRUCTURE1, STRUCTURE2 598 599returns the base pair distance of the two STRUCTURES. dist = {number 600of base pairs in one structure but not in the other} same as edit 601distance with open-pair close-pair as move-set 602 603=back 604 605from PS_plot.h 606 607=over 4 608 609=item PS_rna_plot SEQUENCE, STRUCTURE, FILENAME 610 611write PostScript drawing of structure to FILENAME. Returns 1 on 612sucess, 0 else. 613 614=item PS_rna_plot_a SEQUENCE, STRUCTURE, FILENAME, PRE, POST 615 616write PostScript drawing of structure to FILENAME. The strings PRE and 617POST contain PostScript code that is included verbatim in the plot just 618before (after) the data. Returns 1 on sucess, 0 else. 619 620=item gmlRNA SEQUENCE, STRUCTURE, FILENAME, OPTION 621 622write structure drawing in gml (Graph Meta Language) to 623FILENAME. OPTION should be a single character. If uppercase the gml 624output will include the SEQUENCE as node labels. IF OPTION equal 'x' 625or 'X' write graph with coordinates (else only connectivity 626information). Returns 1 on sucess, 0 else. 627 628=item ssv_rna_plot SEQUENCE, STRUCTURE, SSFILE 629 630write structure drfawing as coord file for SStructView Returns 1 on 631sucess, 0 else. 632 633=item xrna_plot SEQUENCE, STRUCTURE, SSFILE 634 635write structure drawing as ".ss" file for further editing in XRNA. 636Returns 1 on sucess, 0 else. 637 638=item PS_dot_plot SEQUENCE, FILENAME 639 640write a PostScript dot plot of the pair probability matix to 641FILENAME. Returns 1 on sucess, 0 else. 642 643=item $rna_plot_type [int] 644 645Select layout algorithm for structure drawings. Currently available 6460= simple coordinates, 1= naview, default 1. 647 648=back 649 650from read_epars.c 651 652=over 4 653 654=item read_parameter_file FILENAME 655 656read energy parameters from FILENAME 657 658=item write_parameter_file FILENAME 659 660write energy parameters to FILENAME 661 662=back 663 664=head2 SWIG helper functions 665 666The package includes generic helper functions to access C arrays 667of type C<int>, C<float> and C<double>, such as: 668 669=over 4 670 671=item intP_getitem POINTER, INDEX 672 673return the element INDEX from the array 674 675=item intP_setitem POINTER, INDEX, VALUE 676 677set element INDEX to VALUE 678 679=item new_intP NELEM 680 681allocate a new C array of integers with NELEM elements and return the pointer 682 683=item delete_intP POINTER 684 685deletes the C array by calling free() 686 687=back 688 689substituting C<intP> with C<floatP>, C<doubleP>, C<ushortP>, 690C<shortP>, gives the corresponding functions for arrays of float or 691double, unsigned short, and short. You need to know the correct C 692type however, and the functions work only for arrays of simple types. 693Note, that the shortP... functions were used for unsigned short in previous 694versions, while starting with v1.8.3 it can only access signed short arrays. 695 696On the lowest level the C<cdata> function gives direct access to any data 697in the form of a Perl string. 698 699=over 700 701=item cdata POINTER, SIZE 702 703copies SIZE bytes at POINTER to a Perl string (with binary data) 704 705=item memmove POINTER, STRING 706 707copies the (binary) string STRING to the memory location pointed to by 708POINTER. 709Note: memmove is broken in current swig versions (e.g. 1.3.31) 710 711=back 712 713In combination with Perl's C<unpack> this provides a generic way to convert 714C data structures to Perl. E.g. 715 716 RNA::parse_structure($structure); # fills the $RNA::loop_degree array 717 @ldegrees = unpack "I*", RNA::cdata($RNA::loop_degree, ($RNA::loops+1)*4); 718 719Warning: using these functions with wrong arguments will corrupt your 720memory and lead to a segmentation fault. 721 722=head1 AUTHOR 723 724Ivo L. Hofacker <ivo@tbi.univie.ac.at> 725 726=cut 727