RPS Blast: Reversed Position Specific Blast


RPS-BLAST (Reverse PSI-BLAST) searches a query sequence against a database
of profiles. This is the opposite of PSI-BLAST, which searches a profile
against a database of sequences, hence the 'Reverse'. RPS-BLAST
uses a BLAST-like algorithm, finding single- or double-word hits
and then performing an ungapped extension on these candidate matches.
If a sufficiently high-scoring ungapped alignment is produced, a gapped
extension is performed and those (gapped) alignments with a sufficiently
low expect value are reported. This procedure is in contrast to IMPALA,
which performs a Smith-Waterman calculation between the query and
each profile, rather than using a word-hit approach to identify
matches that should be extended.

RPS-BLAST uses a BLAST database, with the addition of some other files that
contain a precomputed lookup table for the profiles to allow the search
to proceed faster. Unfortunately it was not possible to make this
lookup table architecture independent (like the BLAST databases themselves),
so one cannot take an RPS-BLAST database prepared on a big-endian
system (e.g., Solaris Sparc) and run it on a little-endian system
(e.g., NT). The RPS-BLAST database must be prepared again for the
little-endian system.

The CD-Search databases for RPS-BLAST can be found at:

    ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/

It is necessary to untar the archive and run copymat and formatdb.
It is not necessary to run makemat on the databases from this
directory.

RPS-BLAST was coded by Sergei Shavirin with some help from Tom Madden.
RPS-BLAST reuses some of the IMPALA code for precomputing the lookup tables
and all of the IMPALA code for evaluating the statistical significance of
a match.


1. Binary files used in RPS Blast:

The following binary files are used to set up and run RPS Blast:

makemat  : primary profile preprocessor
           (converts a collection of binary profiles, created by the -C
           option of PSI-BLAST, into portable ASCII form);

copymat  : secondary profile preprocessor
           (converts ASCII matrices, produced by the primary preprocessor,
           into a database that can be read into memory quickly);

formatdb : general BLAST database formatter;

rpsblast : search program (searches a database of score
           matrices, prepared by copymat, producing BLAST-like output).

2. Conversion of profiles into searchable database

*Note*: if you are starting with *.mtx files obtained from the NCBI FTP site
or another source, you should skip the steps listed in 2.1.

2.1. Primary preprocessing

Prepare the following files:

i.   a collection of PSI-BLAST-generated profiles with arbitrary
     names and suffix .chk;

ii.  a collection of "profile master sequences", associated with
     the profiles, each in a separate file with an arbitrary name and a
     3 character suffix starting with c;
     the sequences can have deflines; they need not be sequences in nr or
     in any other sequence database; if the sequences have deflines, then
     the deflines must be unique;

iii. a list of profile file names, one per line, named
     <database_name>.pn;

iv.  a list of master sequence file names, one per line, in the same
     order as the list of profile names, named
     <database_name>.sn;

The following files will be created:

a.   a collection of ASCII files, corresponding to each of the
     original profiles, named
     <profile_name>.mtx;

b.   a list of ASCII matrix files, named
     <database_name>.mn;

c.   an ASCII file with auxiliary information, named
     <database_name>.aux;

Arguments to makemat:

   -P  database name (required)
   -G  Cost to open a gap (optional)
       default = 11
   -E  Cost to extend a gap (optional)
       default = 1
   -U  Underlying amino acid scoring matrix (optional)
       default = BLOSUM62
   -d  Underlying sequence database used to create profiles (optional)
       default = nr
   -z  Effective size of sequence database given by -d
       default = current size of -d option
       Note: It may make sense to use -z without -d when the
       profiles were created with an older, smaller version of an
       existing database
   -S  Scaling factor for matrix outputs to avoid round-off problems
       default = PRO_DEFAULT_SCALING_UP (currently defined as 100)
       Use 1.0 to have no scaling
       Output scores will be scaled back down to a unit scale to make
       them look more like BLAST scores, but we found working with a larger
       scale to help with round-off problems.
   -H  get help (overrides all other arguments)

Note: It is not enforced that the values of -G and -E passed to makemat
were actually used in making the checkpoints. However, the values fed
in to makemat are propagated to copymat and rpsblast.

ATTENTION: It is strongly recommended to use -S 1 - the scaling factor
           should be set to 1 for rpsblast at this point in time.

2.2. Secondary preprocessing

Prepare the following files:

i.   a collection of ASCII files, corresponding to each of the
     original profiles, named
     <profile_name>.mtx
     (created by makemat);

ii.  a collection of "profile master sequences", associated with
     the profiles, each in a separate file with an arbitrary name and a
     3 character suffix starting with c;

iii. a list of ASCII matrix files, named
     <database_name>.mn
     (created by makemat);

iv.  a list of master sequence file names, one per
     line, in the same order as the list of matrix names, named
     <database_name>.sn;

v.   an ASCII file with auxiliary information, named
     <database_name>.aux
     (created by makemat);

The files input to copymat are in ASCII format and thus portable
between machines with different encodings for machine-readable files.

The following files will be created:

a.   a huge binary file, containing all profile matrices, named
     <database_name>.rps;

b.   a huge binary file, containing the lookup table for the Blast search
     corresponding to the matrices, named
     <database_name>.loo;

c.   a file containing the concatenation of all FASTA "profile master
     sequences", named
     <database_name> (without extension).

Arguments to copymat:

   -P  database name (required)
   -H  get help (overrides all other arguments)
   -r  format data for RPS Blast

ATTENTION: the "-r" parameter has to be set to TRUE to format data for
           RPS Blast at this step.

NOTE: copymat requires a fair amount of memory, as it first constructs
the lookup table in memory before writing it to disk. Users have
found that they require a machine with at least 500 MB of memory for this
task.

2.3. Creating a BLAST database from the <database_name> file containing
     all "profile master sequences"

The "formatdb" program should be run to create a regular BLAST database of
all "profile master sequences":

    formatdb -i <database_name> -o T

3. Search

Arguments to RPS Blast:

   -i  query sequence file (required)
   -p  query sequence is protein (if FALSE, a 6-frame translation will be
       conducted, as in the blastx program)
   -P  database of profiles (required)
   -o  output file (optional)
       default = stdout
   -e  Expectation value threshold (E), (optional, same as for BLAST)
       default = 10
   -m  alignment view (optional, same as for BLAST)
   -z  effective length of database (optional)
       -1 = length given via -z option to makemat
       default (0) implies length is actual length of profile library
       adjusted for end effects
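
The steps in sections 2 and 3 can be tied together with a small helper
script. The Python sketch below is illustrative only and is not part of the
RPS-BLAST distribution. It writes the <database_name>.pn and
<database_name>.sn list files in matching order, then assembles (without
running) the makemat, copymat, formatdb and rpsblast command lines using only
the options documented above. The function names and the example file names
(kinase.chk, kinase.csq, query.fa, mydb) are hypothetical, and spelling TRUE
as "T" for -r and -p is an assumption based on the "formatdb -o T" example.

```python
import tempfile
from pathlib import Path


def write_list_files(db_name, pairs, out_dir="."):
    """Write <db_name>.pn and <db_name>.sn from (profile, master_sequence) pairs.

    Both lists are written from the same sequence of pairs, so line N of the
    .sn file always corresponds to line N of the .pn file.
    """
    for profile, master in pairs:
        # Profiles carry the .chk suffix (section 2.1, item i).
        if not profile.endswith(".chk"):
            raise ValueError(f"profile file must end in .chk: {profile}")
        # Master sequence files carry a 3 character suffix starting with c
        # (section 2.1, item ii).
        suffix = Path(master).suffix[1:]
        if len(suffix) != 3 or not suffix.startswith("c"):
            raise ValueError(f"bad master sequence suffix: {master}")

    out = Path(out_dir)
    pn_path = out / f"{db_name}.pn"
    sn_path = out / f"{db_name}.sn"
    pn_path.write_text("".join(f"{profile}\n" for profile, _ in pairs))
    sn_path.write_text("".join(f"{master}\n" for _, master in pairs))
    return pn_path, sn_path


def build_commands(db_name, query_file, query_is_protein=True):
    """Return the four command lines as argument lists; nothing is executed."""
    protein_flag = "T" if query_is_protein else "F"  # assumed T/F spelling
    return [
        # 2.1: primary preprocessing, with the recommended scaling factor 1
        ["makemat", "-P", db_name, "-S", "1"],
        # 2.2: secondary preprocessing; -r must be TRUE (assumed to be "T")
        ["copymat", "-P", db_name, "-r", "T"],
        # 2.3: regular BLAST database of the profile master sequences
        ["formatdb", "-i", db_name, "-o", "T"],
        # 3: the search itself
        ["rpsblast", "-i", query_file, "-p", protein_flag, "-P", db_name],
    ]


if __name__ == "__main__":
    # Hypothetical file names, written to a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        pn, sn = write_list_files(
            "mydb",
            [("kinase.chk", "kinase.csq"), ("sh3.chk", "sh3.csq")],
            out_dir=tmp,
        )
        print(pn.read_text(), sn.read_text(), sep="")
    for command in build_commands("mydb", "query.fa"):
        print(" ".join(command))
```

When starting from *.mtx files (for example the CD-Search databases), the
makemat entry would be skipped, as noted at the start of section 2.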