"""Bio.SearchIO support for BLAST+ output formats.

This module adds support for parsing BLAST+ outputs. BLAST+ is a rewrite of
NCBI's legacy BLAST (Basic Local Alignment Search Tool), based on the NCBI
C++ toolkit. The BLAST+ suite is available as command line programs or on
NCBI's web page.

Bio.SearchIO.BlastIO was tested on the following BLAST+ flavors and versions:

  - flavors: blastn, blastp, blastx, tblastn, tblastx
  - versions: 2.2.22+, 2.2.26+

You should also be able to parse outputs from a local BLAST+ search or from
NCBI's web interface. Although the module was not tested against all BLAST+
versions, it should still be able to parse their outputs. Please submit a bug
report if you stumble upon an unparseable file.

Some output formats from the BLAST legacy suite (BLAST+'s predecessor) may
still be parsed by this module. However, results are not guaranteed. You may
try to use the Bio.Blast module to parse them instead.

More information about BLAST is available through these links:
  - Publication: http://www.biomedcentral.com/1471-2105/10/421
  - Web interface: http://blast.ncbi.nlm.nih.gov/
  - User guide: http://www.ncbi.nlm.nih.gov/books/NBK1762/


Supported Formats
=================

Bio.SearchIO.BlastIO supports the following BLAST+ output formats:

  - XML        - 'blast-xml'  - parsing, indexing, writing
  - Tabular    - 'blast-tab'  - parsing, indexing, writing
  - Plain text - 'blast-text' - parsing


blast-xml
=========

The blast-xml parser follows the BLAST XML DTD published at:
http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.mod.dtd

It provides the following attributes for each SearchIO object:

+----------------+-------------------------+-----------------------------+
| Object         | Attribute               | XML Element                 |
+================+=========================+=============================+
| QueryResult    | target                  | BlastOutput_db              |
|                +-------------------------+-----------------------------+
|                | program                 | BlastOutput_program         |
|                +-------------------------+-----------------------------+
|                | reference               | BlastOutput_reference       |
|                +-------------------------+-----------------------------+
|                | version                 | BlastOutput_version*        |
|                +-------------------------+-----------------------------+
|                | description             | Iteration_query-def         |
|                +-------------------------+-----------------------------+
|                | id                      | Iteration_query-ID          |
|                +-------------------------+-----------------------------+
|                | seq_len                 | Iteration_query-len         |
|                +-------------------------+-----------------------------+
|                | param_evalue_threshold  | Parameters_expect           |
|                +-------------------------+-----------------------------+
|                | param_entrez_query      | Parameters_entrez-query     |
|                +-------------------------+-----------------------------+
|                | param_filter            | Parameters_filter           |
|                +-------------------------+-----------------------------+
|                | param_gap_extend        | Parameters_gap-extend       |
|                +-------------------------+-----------------------------+
|                | param_gap_open          | Parameters_gap-open         |
|                +-------------------------+-----------------------------+
|                | param_include           | Parameters_include          |
|                +-------------------------+-----------------------------+
|                | param_matrix            | Parameters_matrix           |
|                +-------------------------+-----------------------------+
|                | param_pattern           | Parameters_pattern          |
|                +-------------------------+-----------------------------+
|                | param_score_match       | Parameters_sc-match         |
|                +-------------------------+-----------------------------+
|                | param_score_mismatch    | Parameters_sc-mismatch      |
|                +-------------------------+-----------------------------+
|                | stat_db_num             | Statistics_db-num           |
|                +-------------------------+-----------------------------+
|                | stat_db_len             | Statistics_db-len           |
|                +-------------------------+-----------------------------+
|                | stat_eff_space          | Statistics_eff-space        |
|                +-------------------------+-----------------------------+
|                | stat_entropy            | Statistics_entropy          |
|                +-------------------------+-----------------------------+
|                | stat_hsp_len            | Statistics_hsp-len          |
|                +-------------------------+-----------------------------+
|                | stat_kappa              | Statistics_kappa            |
|                +-------------------------+-----------------------------+
|                | stat_lambda             | Statistics_lambda           |
+----------------+-------------------------+-----------------------------+
| Hit            | accession               | Hit_accession               |
|                +-------------------------+-----------------------------+
|                | description             | Hit_def                     |
|                +-------------------------+-----------------------------+
|                | id                      | Hit_id                      |
|                +-------------------------+-----------------------------+
|                | seq_len                 | Hit_len                     |
+----------------+-------------------------+-----------------------------+
| HSP            | bitscore                | Hsp_bit-score               |
|                +-------------------------+-----------------------------+
|                | density                 | Hsp_density                 |
|                +-------------------------+-----------------------------+
|                | evalue                  | Hsp_evalue                  |
|                +-------------------------+-----------------------------+
|                | gap_num                 | Hsp_gaps                    |
|                +-------------------------+-----------------------------+
|                | ident_num               | Hsp_identity                |
|                +-------------------------+-----------------------------+
|                | pos_num                 | Hsp_positive                |
|                +-------------------------+-----------------------------+
|                | bitscore_raw            | Hsp_score                   |
+----------------+-------------------------+-----------------------------+
| HSPFragment    | aln_span                | Hsp_align-len               |
| (also via      +-------------------------+-----------------------------+
| HSP)           | hit_frame               | Hsp_hit-frame               |
|                +-------------------------+-----------------------------+
|                | hit_start               | Hsp_hit-from                |
|                +-------------------------+-----------------------------+
|                | hit_end                 | Hsp_hit-to                  |
|                +-------------------------+-----------------------------+
|                | hit                     | Hsp_hseq                    |
|                +-------------------------+-----------------------------+
|                | aln_annotation          | Hsp_midline                 |
|                +-------------------------+-----------------------------+
|                | pattern_start           | Hsp_pattern-from            |
|                +-------------------------+-----------------------------+
|                | pattern_end             | Hsp_pattern-to              |
|                +-------------------------+-----------------------------+
|                | query_frame             | Hsp_query-frame             |
|                +-------------------------+-----------------------------+
|                | query_start             | Hsp_query-from              |
|                +-------------------------+-----------------------------+
|                | query_end               | Hsp_query-to                |
|                +-------------------------+-----------------------------+
|                | query                   | Hsp_qseq                    |
+----------------+-------------------------+-----------------------------+
* may be modified
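
As an illustration of the mapping in the table above, the sketch below reads a
few of the listed elements using only the standard library. It is not the
blast-xml parser itself: the ELEMENT_TO_ATTR subset, the type casts, the
collect_attributes helper, and the trimmed SAMPLE_XML document are all
assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Subset of the table above: XML element -> (attribute name, assumed caster).
ELEMENT_TO_ATTR = {
    "BlastOutput_program": ("program", str),
    "BlastOutput_db": ("target", str),
    "Iteration_query-len": ("seq_len", int),
    "Statistics_db-num": ("stat_db_num", int),
}

# Hypothetical, heavily trimmed document; real BLAST XML has more nesting.
SAMPLE_XML = (
    "<BlastOutput>"
    "<BlastOutput_program>blastn</BlastOutput_program>"
    "<BlastOutput_db>refseq_rna</BlastOutput_db>"
    "<Iteration>"
    "<Iteration_query-len>61</Iteration_query-len>"
    "<Statistics_db-num>2933984</Statistics_db-num>"
    "</Iteration>"
    "</BlastOutput>"
)


def collect_attributes(xml_text):
    # Walk every element and keep the ones named in ELEMENT_TO_ATTR.
    attrs = {}
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag in ELEMENT_TO_ATTR:
            name, caster = ELEMENT_TO_ATTR[elem.tag]
            attrs[name] = caster(elem.text)
    return attrs


attrs = collect_attributes(SAMPLE_XML)
print(sorted(attrs))
```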

You may notice that in BLAST XML files, BLAST sometimes replaces your true
sequence ID with its own generated ID. For example, the query IDs become
'Query_1', 'Query_2', and so on, while the hit IDs sometimes become
'gnl|BL_ORD_ID|1', 'gnl|BL_ORD_ID|2', and so on. In these cases, BLAST lumps the
true sequence IDs together with their descriptions.

The blast-xml parser is aware of these modifications and will attempt to extract
the true sequence IDs out of the descriptions. So when accessing QueryResult or
Hit objects, you will use the non-BLAST-generated IDs.

Conversely, the blast-xml writer will try to concatenate the true sequence IDs
with their descriptions and use the BLAST-generated IDs. This enables you to
write BLAST XML files using SearchIO as if they were written by a real BLAST
program.
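
The ID handling described above can be sketched roughly as follows. This is a
simplified illustration, not the code SearchIO uses: recover_id,
restore_generated, and the prefix check are hypothetical, and the sketch assumes
the true ID is the first whitespace-delimited token of the description.

```python
# Prefixes of the IDs that BLAST generates on its own, as described above.
GENERATED_PREFIXES = ("Query_", "gnl|BL_ORD_ID|")


def recover_id(blast_id, definition):
    # Parsing direction: return (true_id, description) for a query or hit.
    if blast_id.startswith(GENERATED_PREFIXES):
        true_id, _, description = definition.partition(" ")
        return true_id, description
    return blast_id, definition


def restore_generated(generated_id, true_id, description):
    # Writing direction: lump the true ID back into the description.
    return generated_id, (true_id + " " + description).strip()


# Hypothetical identifiers, used only to show the round trip.
parsed = recover_id("Query_1", "gi|1|ref|NM_1| Homo sapiens mRNA")
written = restore_generated("Query_1", *parsed)
print(parsed)
print(written)
```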


blast-tab
=========

The default format for blast-tab support is the variant without comments
(-outfmt 6 flag). Commented BLAST tabular files (-outfmt 7 flag) may be parsed,
indexed, or written using the keyword argument 'comments' set to True:

    # blast-tab defaults to parsing uncommented files
    >>> from Bio import SearchIO
    >>> uncommented = 'Blast/tab_2226_tblastn_004.txt'
    >>> qresult = SearchIO.read(uncommented, 'blast-tab')
    >>> qresult
    QueryResult(id='gi|11464971:4-101', 5 hits)

    # set the keyword argument to parse commented files
    >>> commented = 'Blast/tab_2226_tblastn_008.txt'
    >>> qresult = SearchIO.read(commented, 'blast-tab', comments=True)
    >>> qresult
    QueryResult(id='gi|11464971:4-101', 5 hits)

For uncommented files, the parser defaults to using BLAST's default column
ordering: 'qseqid sseqid pident length mismatch gapopen qstart qend sstart send
evalue bitscore'.
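
To make the default column ordering concrete, here is a minimal sketch that
pairs the values of one uncommented row with those 12 column names. It uses only
the standard library; parse_default_row and the sample row are made up for this
example, and all values are left as strings.

```python
TAB = "\t"

# BLAST's default column ordering, as listed above.
DEFAULT_FIELDS = (
    "qseqid sseqid pident length mismatch gapopen "
    "qstart qend sstart send evalue bitscore"
).split()


def parse_default_row(line):
    # Split one uncommented row on tabs and pair each value with its column.
    values = line.rstrip().split(TAB)
    if len(values) != len(DEFAULT_FIELDS):
        raise ValueError("expected 12 columns, got %d" % len(values))
    return dict(zip(DEFAULT_FIELDS, values))


# Hypothetical row, not taken from a real BLAST run.
sample = TAB.join([
    "query1", "subject1", "98.50", "200", "3", "0",
    "1", "200", "51", "250", "1e-100", "360",
])
row = parse_default_row(sample)
print(row["sseqid"], row["pident"], row["evalue"])
```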

If you want to parse an uncommented file with a customized column order, you can
use the 'fields' keyword argument to pass the custom column order. The names of
the columns follow BLAST's naming. For example, 'qseqid' is the column for the
query sequence ID. These names may be passed either as a Python list or as a
space-separated string.

    # pass the custom column names as a Python list
    >>> fname = 'Blast/tab_2226_tblastn_009.txt'
    >>> custom_fields = ['qseqid', 'sseqid']
    >>> qresult = next(SearchIO.parse(fname, 'blast-tab', fields=custom_fields))
    >>> qresult
    QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits)

    # pass the custom column names as a space-separated string
    >>> fname = 'Blast/tab_2226_tblastn_009.txt'
    >>> custom_fields = 'qseqid sseqid'
    >>> qresult = next(SearchIO.parse(fname, 'blast-tab', fields=custom_fields))
    >>> qresult
    QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits)

You may also use the 'std' field name as an alias for BLAST's default 12
columns, just like when you run a command line BLAST search.
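
How such a 'fields' value could be normalized is sketched below. The
normalize_fields helper is hypothetical and only illustrates accepting either
input form and expanding 'std'; SearchIO's own handling may differ.

```python
# The 12 default columns that the 'std' alias stands for.
STD_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def normalize_fields(fields):
    # Accept a list or a space-separated string, then expand 'std' in place.
    if isinstance(fields, str):
        fields = fields.split()
    expanded = []
    for name in fields:
        expanded.extend(STD_FIELDS if name == "std" else [name])
    return expanded


print(normalize_fields("qseqid sseqid"))
print(normalize_fields(["std", "qlen"]))
```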

Note that the 'fields' keyword argument will be ignored if the parsed file is
commented. Commented files have their column ordering stated explicitly in the
file, so there is no need to specify it again in SearchIO.
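
For illustration, the column labels of a commented file can be read from its
'# Fields:' comment line. The sketch below assumes that line format;
read_declared_fields and the sample lines are hypothetical, and no attempt is
made to translate the labels into the short column names used elsewhere in this
document.

```python
def read_declared_fields(lines):
    # Return the labels from the first '# Fields:' comment, or None if absent.
    prefix = "# Fields: "
    for line in lines:
        if line.startswith(prefix):
            labels = line[len(prefix):].split(",")
            return [label.strip() for label in labels]
    return None


# Hypothetical header of a commented tabular file.
sample_lines = [
    "# BLASTN 2.2.26+",
    "# Query: query1 hypothetical example sequence",
    "# Fields: query id, subject id, % identity, evalue, bit score",
    "# 1 hits found",
]
declared = read_declared_fields(sample_lines)
print(declared)
```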

The 'comments' and 'fields' keyword arguments apply to parsing, indexing, and
writing.

blast-tab provides the following attributes for each SearchIO object:

+-------------+-------------------+--------------+
| Object      | Attribute         | Column name  |
+=============+===================+==============+
| QueryResult | accession         | qacc         |
|             +-------------------+--------------+
|             | accession_version | qaccver      |
|             +-------------------+--------------+
|             | gi                | qgi          |
|             +-------------------+--------------+
|             | seq_len           | qlen         |
|             +-------------------+--------------+
|             | id                | qseqid       |
+-------------+-------------------+--------------+
| Hit         | accession         | sacc         |
|             +-------------------+--------------+
|             | accession_version | saccver      |
|             +-------------------+--------------+
|             | gi                | sgi          |
|             +-------------------+--------------+
|             | gi_all            | sallgi       |
|             +-------------------+--------------+
|             | id_all            | sallseqid    |
|             +-------------------+--------------+
|             | seq_len           | slen         |
|             +-------------------+--------------+
|             | id                | sseqid       |
+-------------+-------------------+--------------+
| HSP         | bitscore          | bitscore     |
|             +-------------------+--------------+
|             | btop              | btop         |
|             +-------------------+--------------+
|             | evalue            | evalue       |
|             +-------------------+--------------+
|             | gapopen_num       | gapopen      |
|             +-------------------+--------------+
|             | gap_num           | gaps         |
|             +-------------------+--------------+
|             | ident_num         | nident       |
|             +-------------------+--------------+
|             | ident_pct         | pident       |
|             +-------------------+--------------+
|             | mismatch_num      | mismatch     |
|             +-------------------+--------------+
|             | pos_pct           | ppos         |
|             +-------------------+--------------+
|             | pos_num           | positive     |
|             +-------------------+--------------+
|             | bitscore_raw      | score        |
+-------------+-------------------+--------------+
| HSPFragment | frames            | frames*      |
| (also via   +-------------------+--------------+
| HSP)        | aln_span          | length       |
|             +-------------------+--------------+
|             | query_end         | qend         |
|             +-------------------+--------------+
|             | query_frame       | qframe       |
|             +-------------------+--------------+
|             | query             | qseq         |
|             +-------------------+--------------+
|             | query_start       | qstart       |
|             +-------------------+--------------+
|             | hit_end           | send         |
|             +-------------------+--------------+
|             | hit_frame         | sframe       |
|             +-------------------+--------------+
|             | hit               | sseq         |
|             +-------------------+--------------+
|             | hit_start         | sstart       |
+-------------+-------------------+--------------+
* When 'frames' is present, both `query_frame` and `hit_frame` will be present
  as well. It is recommended that you use these instead of 'frames' directly.
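
The relationship between these attributes can be sketched as below, assuming the
'frames' value holds the query and hit frames separated by a slash (for example
'1/-2'). The split_frames helper is hypothetical.

```python
def split_frames(frames):
    # Split a 'frames' value into integer (query_frame, hit_frame).
    query_frame, hit_frame = frames.split("/")
    return int(query_frame), int(hit_frame)


query_frame, hit_frame = split_frames("1/-2")
print(query_frame, hit_frame)
```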

If the parsed file is commented, the following attributes may be available as
well:

+--------------+---------------+----------------------------+
| Object       | Attribute     | Value                      |
+==============+===============+============================+
| QueryResult  | description   | query description          |
|              +---------------+----------------------------+
|              | fields        | columns in the output file |
|              +---------------+----------------------------+
|              | program       | BLAST flavor               |
|              +---------------+----------------------------+
|              | rid           | remote search ID           |
|              +---------------+----------------------------+
|              | target        | target database            |
|              +---------------+----------------------------+
|              | version       | BLAST version              |
+--------------+---------------+----------------------------+


blast-text
==========

The BLAST plain text output format has been known to change considerably between
BLAST versions. NCBI itself has recommended that users not rely on the plain
text output for parsing-related work.

However, in some cases parsing the plain text output may still be useful.
SearchIO provides parsing support for the plain text output, but guarantees only
a minimum level of support. Writing a parser that fully supports plain text
output for all BLAST versions is not a priority at the moment.

If you do have a BLAST plain text file that cannot be parsed and would like to
submit a patch, we are more than happy to accept it.

The blast-text parser provides the following object attributes:

+-----------------+-------------------------+----------------------------------+
| Object          | Attribute               | Value                            |
+=================+=========================+==================================+
| QueryResult     | description             | query sequence description       |
|                 +-------------------------+----------------------------------+
|                 | id                      | query sequence ID                |
|                 +-------------------------+----------------------------------+
|                 | program                 | BLAST flavor                     |
|                 +-------------------------+----------------------------------+
|                 | seq_len                 | full length of query sequence    |
|                 +-------------------------+----------------------------------+
|                 | target                  | target database of the search    |
|                 +-------------------------+----------------------------------+
|                 | version                 | BLAST version                    |
+-----------------+-------------------------+----------------------------------+
| Hit             | evalue                  | hit-level evalue, from the hit   |
|                 |                         | table                            |
|                 +-------------------------+----------------------------------+
|                 | id                      | hit sequence ID                  |
|                 +-------------------------+----------------------------------+
|                 | description             | hit sequence description         |
|                 +-------------------------+----------------------------------+
|                 | score                   | hit-level score, from the hit    |
|                 |                         | table                            |
|                 +-------------------------+----------------------------------+
|                 | seq_len                 | full length of hit sequence      |
+-----------------+-------------------------+----------------------------------+
| HSP             | evalue                  | hsp-level evalue                 |
|                 +-------------------------+----------------------------------+
|                 | bitscore                | hsp-level bit score              |
|                 +-------------------------+----------------------------------+
|                 | bitscore_raw            | hsp-level score                  |
|                 +-------------------------+----------------------------------+
|                 | gap_num                 | number of gaps in alignment      |
|                 +-------------------------+----------------------------------+
|                 | ident_num               | number of identical residues     |
|                 |                         | in alignment                     |
|                 +-------------------------+----------------------------------+
|                 | pos_num                 | number of positive matches in    |
|                 |                         | alignment                        |
+-----------------+-------------------------+----------------------------------+
| HSPFragment     | aln_annotation          | alignment homology string        |
| (also via       +-------------------------+----------------------------------+
| HSP)            | aln_span                | length of alignment fragment     |
|                 +-------------------------+----------------------------------+
|                 | hit                     | hit sequence                     |
|                 +-------------------------+----------------------------------+
|                 | hit_end                 | hit sequence end coordinate      |
|                 +-------------------------+----------------------------------+
|                 | hit_frame               | hit sequence reading frame       |
|                 +-------------------------+----------------------------------+
|                 | hit_start               | hit sequence start coordinate    |
|                 +-------------------------+----------------------------------+
|                 | hit_strand              | hit sequence strand              |
|                 +-------------------------+----------------------------------+
|                 | query                   | query sequence                   |
|                 +-------------------------+----------------------------------+
|                 | query_end               | query sequence end coordinate    |
|                 +-------------------------+----------------------------------+
|                 | query_frame             | query sequence reading frame     |
|                 +-------------------------+----------------------------------+
|                 | query_start             | query sequence start coordinate  |
|                 +-------------------------+----------------------------------+
|                 | query_strand            | query sequence strand            |
+-----------------+-------------------------+----------------------------------+

"""
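
# Explicit relative imports, so the subpackage also imports under Python 3.
from .blast_tab import *
from .blast_xml import *
from .blast_text import *


# Run the doctests in the module docstring when executed as a script.
if __name__ == "__main__":
    from Bio._utils import run_doctest
    run_doctest()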