            *** ALF - Alignment Free Sequence Comparison ***
                   http://www.seqan.de/projects/alf
                            January, 2012

---------------------------------------------------------------------------
Table of Contents
---------------------------------------------------------------------------

  1. Overview
  2. Installation
  3. Usage
  4. Output Format
  5. Example
  6. Contact and Reference
  7. Version History

---------------------------------------------------------------------------
1. Overview
---------------------------------------------------------------------------

ALF calculates the pairwise similarity of sequences using alignment-free
methods. All implemented methods are based on k-mer counts. More details
can be found in the online documentation of the alignment-free methods
(www.seqan.de). By default, ALF uses the N2 similarity measure.

---------------------------------------------------------------------------
2. Installation
---------------------------------------------------------------------------

ALF is distributed with SeqAn - The C++ Sequence Analysis Library (see
http://www.seqan.de). To build ALF from Git, do the following:

  1) git clone https://github.com/seqan/seqan.git
  2) mkdir -p build/Release
  3) cd build/Release
  4) cmake ../../seqan -DCMAKE_BUILD_TYPE=Release
  5) make alf
  6) ./apps/alf/alf --help

On success, the executable file alf has been built and a brief usage
description is printed.

For more information about retrieving SeqAn and its prerequisites, please
visit

  https://www.seqan.de/getting-started/

---------------------------------------------------------------------------
3. Usage
---------------------------------------------------------------------------

To get a short usage description of ALF, execute alf -h or alf --help.

Usage: alf [OPTION]... -i <MULTI FASTA FILE>

ALF expects one DNA (multi-)Fasta file. It computes the pairwise score for
every pair of sequences in the file and returns a matrix of these scores.
The default behaviour can be modified by specifying the following options
at the command line:

---------------------------------------------------------------------------
3.1. Main Options
---------------------------------------------------------------------------

  [ -i ],  [ --input-file ]

  Name of the multi fasta input file. Mandatory.

  [ -o ],  [ --output-file ]

  Name of the file to which the tab delimited matrix with pairwise scores
  will be written. Default: stdout.

  [ -m ],  [ --method ]

  Method that will be used for sequence comparison.
  Default: N2 [N2, D2, D2Star, D2z]

  [ -k ],  [ --k-mer-size ]

  Size of the k-mers that will be counted. Default: 4 [integer]

  [ -mo ],  [ --bg-model-order ]

  Order of the background Markov model for N2, D2Star, D2z.
  Default: 1 [integer]

  [ -rc ],  [ --reverse-complement ]

  N2 only. Specify how the k-mer counts from the reverse and forward
  strand should be combined. By default, only the input sequence is used
  for the comparison. Select 'bothStrands' to calculate the pairwise score
  using both strands from the input sequences. Default: input sequence
  only. ['bothStrands','mean','min','max']

  [ -mm ],  [ --mismatches ]

  N2 only. Select -mm 1 if you want to include all words with one mismatch
  in the k-mer neighbourhood. Default: Exact counts only [0,1]

  [ -mmw ],  [ --mismatch-weight ]

  N2 only. Weight of counts for words with mismatches, only used in
  combination with -mm 1. Default: 0.1 [Double]

  [ -kwf ],  [ --k-mer-weights-file ]

  N2 only. Print k-mer weights for every sequence to this file.

  [ -v ],  [ --verbose ]

  Specify this option to print details on progress to the screen.

  [ -h ],  [ --help ]

  Display the help message.

---------------------------------------------------------------------------
4. Output Format
---------------------------------------------------------------------------

The program returns a (tab delimited) matrix with pairwise scores for all
sequences from the input fasta file, for example:

  1       0.046   0.052
  0.046   1       0.992
  0.052   0.992   1

---------------------------------------------------------------------------
5. Example
---------------------------------------------------------------------------

These examples use the fasta file "small.fasta" which can be found in
seqan/apps/alf/example/. Copy this file to the directory where you
execute alf.

(1) Run ALF with default settings on two sequences:

  ./alf -i small.fasta

Output:

  1           0.0463497
  0.0463497   1

(2) Calculate scores using N2 (-m N2), counting words of length 5 (-k 5) on
both strands (-rc both_strands), including words with one mismatch in the
word neighbourhood (-mm 1) with a weight of 0.5 (-mmw 0.5) and a background
Markov model of order 1 (-mo 1), writing the output to a file
(-o results.txt), saving all k-mer weights to a file (-kwf kmerWeights.txt):

  ./alf -m N2 -k 5 -mo 1 -rc both_strands -mm 1 -mmw 0.5 -i small.fasta \
        -o results.txt -kwf kmerWeights.txt

---------------------------------------------------------------------------
6. Contact and Reference
---------------------------------------------------------------------------

For questions or comments, contact:
  Jonathan Goeke <goeke@molgen.mpg.de>

Please reference the following publication if you used ALF or the N2 method
for your analysis:

  Jonathan Goeke, Marcel H. Schulz, Julia Lasserre, and Martin Vingron.
  Estimation of Pairwise Sequence Similarity of Mammalian Enhancers with
  Word Neighbourhood Counts. Bioinformatics (2012).

---------------------------------------------------------------------------
7. Version History
---------------------------------------------------------------------------

* 2012-07-17: Version 1.1
  - Updated ALF to use the new ArgumentParser for command line parsing.
  - Changed long parameter names to use --parameter-name instead of
    --parameterName.

* 2012-01-05: Version 1.0
  - Initial Release of ALF.
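
---------------------------------------------------------------------------
Appendix: Reading the Output Matrix (Sketch)
---------------------------------------------------------------------------

The tab delimited score matrix described in section 4 can be loaded for
further analysis with a few lines of Python. The following is a minimal
sketch and not part of ALF: the function name read_score_matrix is
hypothetical, and the code assumes that the output contains only numeric,
tab separated fields with one matrix row per line. The embedded example
data is the matrix shown in section 4.

```python
import csv
import io


def read_score_matrix(handle):
    """Parse a tab delimited matrix of pairwise scores into a list of rows.

    Hypothetical helper; assumes every non-empty field is a number.
    """
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        # Ignore empty fields (e.g. from a trailing tab) and blank lines.
        values = [float(field) for field in fields if field.strip()]
        if values:
            rows.append(values)
    return rows


# Example matrix copied from section 4 (three sequences).
example = "1\t0.046\t0.052\n0.046\t1\t0.992\n0.052\t0.992\t1\n"
matrix = read_score_matrix(io.StringIO(example))

# Entry [i][j] is the pairwise score of sequence i and sequence j.
print(matrix[1][2])
```

To read a file written with -o, pass an open file handle instead, for
example open("results.txt", newline="").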