RPS Blast: Reversed Position Specific Blast


RPS-BLAST (Reverse PSI-BLAST) searches a query sequence against a database
of profiles.  This is the opposite of PSI-BLAST, which searches a profile
against a database of sequences, hence the 'Reverse'.  RPS-BLAST
uses a BLAST-like algorithm, finding single- or double-word hits
and then performing an ungapped extension on these candidate matches.
If a sufficiently high-scoring ungapped alignment is produced, a gapped
extension is performed and those (gapped) alignments with a sufficiently
low expect value are reported.  This procedure contrasts with IMPALA,
which performs a Smith-Waterman calculation between the query and
each profile rather than using a word-hit approach to identify
matches that should be extended.
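The ungapped extension step described above can be illustrated with a toy
sketch.  This is not the RPS-BLAST implementation: the function name, the
profile representation (one dict of residue scores per column), the default
mismatch score of -4 and the X-drop stopping rule are all assumptions made
for illustration, and the gapped extension and expect-value steps are left
out entirely.

```python
def ungapped_extend(query, profile, q_start, p_start, word_len, x_drop):
    """Extend a word hit in both directions without gaps (toy sketch).

    profile[j][residue] is the score of `residue` at profile column j.
    Returns (best_score, q_begin, q_end), with q_end exclusive.
    """
    def score(qi, pj):
        # -4 is an assumed score for residues missing from a column.
        return profile[pj].get(query[qi], -4)

    # Score of the seed word itself.
    best = sum(score(q_start + k, p_start + k) for k in range(word_len))
    running = best
    q_begin, q_end = q_start, q_start + word_len

    # Extend right until the running score falls x_drop below the best.
    qi, pj = q_start + word_len, p_start + word_len
    while qi < len(query) and pj < len(profile):
        running += score(qi, pj)
        if running > best:
            best, q_end = running, qi + 1
        elif best - running > x_drop:
            break
        qi += 1
        pj += 1

    # Extend left from the seed, starting again from the best score so far.
    running = best
    qi, pj = q_start - 1, p_start - 1
    while qi >= 0 and pj >= 0:
        running += score(qi, pj)
        if running > best:
            best, q_begin = running, qi
        elif best - running > x_drop:
            break
        qi -= 1
        pj -= 1

    return best, q_begin, q_end


# Hypothetical 5-column profile that rewards the residues A, C, D, E, F.
toy_profile = [{residue: 5} for residue in "ACDEF"]
hit = ungapped_extend("WWACDEFWW", toy_profile, 3, 1, 3, 10)
```

In this toy case the seed word "CDE" grows to cover "ACDEF".  A real search
would pass such a hit on to gapped extension only if its score were high
enough.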

RPS-BLAST uses a BLAST database, with the addition of some other files that
contain a precomputed lookup table for the profiles to allow the search
to proceed faster.  Unfortunately, it was not possible to make this
lookup table architecture independent (like the BLAST databases themselves),
so one cannot take an RPS-BLAST database prepared on a big-endian
system (e.g., Solaris Sparc) and run it on a little-endian system
(e.g., NT).  The RPS-BLAST database must be prepared again on the
little-endian system.
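A quick way to find out which kind of system you are on is to ask Python for
the host byte order.  The sketch below is only a convenience check; the
function names are hypothetical, and byte order is the only architecture
difference the text above names, so a match does not guarantee that a copied
database will work.

```python
import sys


def host_byte_order():
    """Return 'little' or 'big' for the machine running this script."""
    return sys.byteorder


def same_byte_order(built_on, running_on=None):
    """True if the database was built on the byte order used for the search.

    `built_on` and `running_on` are 'little' or 'big'; `running_on`
    defaults to the current machine.
    """
    if running_on is None:
        running_on = host_byte_order()
    return built_on == running_on


# A database prepared on a big-endian machine, searched on a little-endian one:
needs_rebuild = not same_byte_order("big", "little")
```

If `needs_rebuild` is true, rerun the preparation steps on the machine that
will run the search.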

The CD-Search databases for RPS-BLAST can be found at:

 ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/

It is necessary to untar the archive and run copymat and formatdb.
It is not necessary to run makemat on the databases from this
directory.

RPS-BLAST was coded by Sergei Shavirin with some help from Tom Madden.
RPS-BLAST reuses some of the IMPALA code for precomputing the lookup tables
and all of the IMPALA code for evaluating the statistical significance of a match.


1. Binary files used in RPS Blast:

The following binary files are used to set up and run RPS Blast:

makemat   : primary profile preprocessor
  (converts a collection of binary profiles, created by the -C option
   of PSI-BLAST, into portable ASCII form);

copymat   : secondary profile preprocessor
  (converts ASCII matrices, produced by the primary preprocessor,
   into a database that can be read into memory quickly);

formatdb  : general BLAST database formatter;

rpsblast  : search program (searches a database of score
  matrices, prepared by copymat, producing BLAST-like output).

2. Conversion of profiles into a searchable database

*Note*: if you are starting with *.mtx files obtained from the NCBI FTP site or
another source, you should skip the steps listed in 2.1.

2.1. Primary preprocessing

Prepare the following files:

i.    a collection of PSI-BLAST-generated profiles with arbitrary
      names and suffix .chk;

ii.   a collection of "profile master sequences", associated with
      the profiles, each in a separate file with an arbitrary name and a
      3-character suffix starting with c;
      the sequences can have deflines; they need not be sequences in nr or
      in any other sequence database; if the sequences have deflines, then
      the deflines must be unique;

iii.  a list of profile file names, one per line, named
      <database_name>.pn;

iv.   a list of master sequence file names, one per line, in the same
      order as the list of profile names, named
      <database_name>.sn.
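The two list files can be generated rather than typed by hand.  The sketch
below rests on assumptions that are not part of this document: each profile
foo.chk has its master sequence in foo.csq (".csq" is just one suffix that
satisfies the "3 characters starting with c" rule), the lists hold bare file
names, and makemat is run from the same directory.  The function name is
hypothetical.

```python
from pathlib import Path


def write_profile_lists(directory, database_name, seq_suffix=".csq"):
    """Write <database_name>.pn and <database_name>.sn in matching order."""
    if not (len(seq_suffix) == 4 and seq_suffix.startswith(".c")):
        raise ValueError("suffix must be 3 characters starting with c")

    directory = Path(directory)
    profiles = sorted(directory.glob("*.chk"))

    # Every profile needs a master sequence file (paired by stem: an assumption).
    missing = [p.name for p in profiles if not p.with_suffix(seq_suffix).exists()]
    if missing:
        raise FileNotFoundError("no master sequence for: " + ", ".join(missing))

    pn_path = directory / f"{database_name}.pn"
    sn_path = directory / f"{database_name}.sn"
    # Both lists are written from the same sorted sequence, so line N of
    # the .sn file always belongs to line N of the .pn file.
    pn_path.write_text("".join(f"{p.name}\n" for p in profiles))
    sn_path.write_text(
        "".join(f"{p.with_suffix(seq_suffix).name}\n" for p in profiles)
    )
    return pn_path, sn_path
```

Calling write_profile_lists(".", "mydb") in a directory of .chk and .csq
files would produce mydb.pn and mydb.sn.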

The following files will be created:

a.    a collection of ASCII files, corresponding to each of the
      original profiles, named
      <profile_name>.mtx;

b.    a list of ASCII matrix files, named
      <database_name>.mn;

c.    an ASCII file with auxiliary information, named
      <database_name>.aux.

Arguments to makemat:

    -P database name (required)
    -G Cost to open a gap (optional)
       default = 11
    -E Cost to extend a gap (optional)
       default = 1
    -U Underlying amino acid scoring matrix (optional)
       default = BLOSUM62
    -d Underlying sequence database used to create profiles (optional)
       default = nr
    -z Effective size of sequence database given by -d
       default = current size of -d option
       Note: It may make sense to use -z without -d when the
       profiles were created with an older, smaller version of an
       existing database
    -S Scaling factor for matrix outputs to avoid round-off problems
       default = PRO_DEFAULT_SCALING_UP (currently defined as 100)
       Use 1.0 to have no scaling
       Output scores will be scaled back down to a unit scale to make
       them look more like BLAST scores, but we found that working with
       a larger scale helps with round-off problems.
    -H get help (overrides all other arguments)

Note: makemat does not check that the values of -G and -E passed to it
were actually used in making the checkpoints. However, the values fed
in to makemat are propagated to copymat and rpsblast.

ATTENTION: It is strongly recommended to use -S 1; the scaling factor
           should be set to 1 for rpsblast at this point in time.
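As a sketch of how these arguments fit together, the snippet below assembles
a makemat command line from the options listed above, with -S fixed at 1 as
recommended.  The helper name and the database name "mydb" are hypothetical;
the gap costs shown are simply the documented defaults written out.

```python
import shlex


def makemat_command(database_name, gap_open=11, gap_extend=1, scaling=1):
    """Build the argument list for a makemat run (nothing is executed)."""
    return [
        "makemat",
        "-P", database_name,
        "-G", str(gap_open),    # should match the cost used for the checkpoints
        "-E", str(gap_extend),  # should match the cost used for the checkpoints
        "-S", str(scaling),     # 1 is recommended for rpsblast
    ]


command = makemat_command("mydb")
print(shlex.join(command))
```

The list can be handed to subprocess.run(command, check=True) once makemat
is on the PATH.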

2.2. Secondary preprocessing

Prepare the following files:

i.    a collection of ASCII files, corresponding to each of the
      original profiles, named
      <profile_name>.mtx
      (created by makemat);

ii.   a collection of "profile master sequences", associated with
      the profiles, each in a separate file with an arbitrary name and a
      3-character suffix starting with c;

iii.  a list of ASCII matrix files, named
      <database_name>.mn
      (created by makemat);

iv.   a list of master sequence file names, one per
      line, in the same order as the list of matrix names, named
      <database_name>.sn;

v.    an ASCII file with auxiliary information, named
      <database_name>.aux
      (created by makemat).

The files input to copymat are in ASCII format and thus portable
between machines with different encodings for machine-readable files.

The following files will be created:

a.    a huge binary file, containing all profile matrices, named
      <database_name>.rps;

b.    a huge binary file, containing the lookup table for the BLAST search
      corresponding to the matrices, named
      <database_name>.loo;

c.    a file containing the concatenation of all FASTA "profile master
      sequences", named
      <database_name> (without extension).

Arguments to copymat:

    -P database name (required)
    -H get help (overrides all other arguments)
    -r format data for RPS Blast

ATTENTION: the "-r" parameter has to be set to TRUE to format data for
           RPS Blast at this step.

NOTE: copymat requires a fair amount of memory, as it first constructs
the lookup table in memory before writing it to disk.  Users have
found that they require a machine with at least 500 MB of memory for this
task.

2.3. Creating a BLAST database from the <database_name> file containing
     all "profile master sequences"

The "formatdb" program should be run to create a regular BLAST database of all
"profile master sequences":

    formatdb -i <database_name> -o T
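Steps 2.2 and 2.3 run back to back, so they can be scripted together.  The
sketch below only builds the two command lines and names the files copymat
is described as creating.  Writing TRUE as "-r T" is an assumption, modelled
on the "-o T" spelling in the formatdb command above; the helper names and
the database name "mydb" are hypothetical.

```python
def build_database_commands(database_name):
    """Return the copymat and formatdb argument lists, in run order."""
    return [
        # "-r T" assumes TRUE is spelled T, as in formatdb's "-o T".
        ["copymat", "-P", database_name, "-r", "T"],
        ["formatdb", "-i", database_name, "-o", "T"],
    ]


def expected_copymat_outputs(database_name):
    """File names copymat is described as creating for this database."""
    return [f"{database_name}.rps", f"{database_name}.loo", database_name]


commands = build_database_commands("mydb")
outputs = expected_copymat_outputs("mydb")
```

Each entry of `commands` can be passed to subprocess.run(entry, check=True)
in order.  Checking that every name in `outputs` exists before calling
formatdb catches a copymat run that failed, for example for lack of memory.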

3. Search

Arguments to RPS Blast:

   -i  query sequence file (required)
   -p  whether the query sequence is protein (if FALSE, a 6-frame
       translation will be conducted, as in the blastx program)
   -P  database of profiles (required)
   -o  output file (optional)
       default = stdout
   -e  Expectation value threshold (E) (optional, same as for BLAST)
       default = 10
   -m  alignment view (optional, same as for BLAST)
   -z  effective length of database (optional)
       -1 = length given via -z option to makemat
       default (0) implies length is actual length of profile library
          adjusted for end effects
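A search command can be assembled from the options above in the same way.
This is a sketch under stated assumptions: the helper name and the file
names are hypothetical, and spelling the -p value as T or F follows the
"-o T" convention of formatdb rather than anything stated in this section.

```python
def rpsblast_command(query_file, profile_db, protein_query=True,
                     evalue=10, effective_length=0, output_file=None):
    """Build the argument list for an rpsblast search (nothing is executed)."""
    command = [
        "rpsblast",
        "-i", query_file,
        "-P", profile_db,
        "-p", "T" if protein_query else "F",  # F: 6-frame translation
        "-e", str(evalue),
        "-z", str(effective_length),  # 0: actual library length; -1: makemat's -z
    ]
    if output_file is not None:
        command += ["-o", output_file]  # otherwise output goes to stdout
    return command


# A nucleotide query searched against the profile database "mydb".
search = rpsblast_command("query.fa", "mydb", protein_query=False,
                          output_file="hits.txt")
```

Passing effective_length=-1 would reuse the length given to makemat via its
own -z option.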