1
2
3
4
5
6 """Bio.SearchIO parser for BLAT output formats.
7
8 This module adds support for parsing BLAT outputs. BLAT (BLAST-Like Alignment
9 Tool) is a sequence homology search program initially built for annotating
10 the human genome.
11
12 Bio.SearchIO.BlastIO was tested using standalone BLAT version 34, psLayout
13 version 3. It should be able to parse psLayout version 4 without problems.
14
15 More information on BLAT is available from these sites:
16
17 - Publication: http://genome.cshlp.org/content/12/4/656
18 - User guide: http://genome.ucsc.edu/goldenPath/help/blatSpec.html
19 - Source download: http://www.soe.ucsc.edu/~kent/src
20 - Executable download: http://hgdownload.cse.ucsc.edu/admin/exe/
21 - Blat score calculation: http://genome.ucsc.edu/FAQ/FAQblat.html#blat4
22
23
24 Supported Formats
25 =================
26
27 BlatIO supports parsing, indexing, and writing for both PSL and PSLX output
28 formats, with or without header. To parse, index, or write PSLX files, use the
29 'pslx' keyword argument and set it to True.
30
31 # blat-psl defaults to PSL files
32 >>> from Bio import SearchIO
33 >>> psl = 'Blat/psl_34_004.psl'
34 >>> qresult = SearchIO.read(psl, 'blat-psl')
35 >>> qresult
36 QueryResult(id='hg19_dna', 10 hits)
37
38 # set the pslx flag to parse PSLX files
39 >>> pslx = 'Blat/pslx_34_004.pslx'
40 >>> qresult = SearchIO.read(pslx, 'blat-psl', pslx=True)
41 >>> qresult
42 QueryResult(id='hg19_dna', 10 hits)
43
44 For parsing and indexing, you do not need to specify whether the file has a
45 header or not. For writing, if you want to write a header, you can set the
46 'header' keyword argument to True. This will write a 'psLayout version 3' header
47 to your output file.
48
49 from Bio import SearchIO
50 qresult = SearchIO.read(psl, 'blat-psl')
51 SearchIO.write(qresult, 'header.psl', header=True)
52 <stdout> (1, 10, 19, 23)
53
54 Note that the number of HSPFragments written may exceed the number of HSP
55 objects. This is because in PSL files, it is possible to have single matches
56 consisting of noncontiguous sequence fragments. This is where the HSPFragment
57 object comes into play. These fragments are grouped into a single HSP because
58 they share the same statistics (e.g. match numbers, BLAT score, etc.). However,
59 they do not share the same sequence attributes, such as the start and end
60 coordinates, making them distinct objects.
61
62 In addition to parsing PSL(X) files, BlatIO also computes the percent identities
63 and scores of your search results. This is done using the calculation formula
64 posted here: http://genome.ucsc.edu/FAQ/FAQblat.html#blat4. It mimics the score
65 and percent identity calculation done by UCSC's web BLAT service.
66
67 Since BlatIO parses the file in a single pass, it expects all results from
68 the same query to be in consecutive rows. If the results from one query are
69 spread in nonconsecutive rows, BlatIO will consider them to be separate
70 QueryResult objects.
71
72 In most cases, the PSL(X) format uses the same coordinate system as Python
73 (zero-based, half open). These coordinates are anchored on the plus strand.
74 However, if the query aligns on the minus strand, BLAT will anchor the qStarts
75 coordinates on the minus strand instead. BlatIO is aware of this, and will
76 re-anchor the qStarts coordinates to the plus strand whenever it sees a minus
77 strand query match. Conversely, when you write out to a PSL(X) file, BlatIO will
78 reanchor qStarts to the minus strand again.
79
80 BlatIO provides the following attribute-column mapping:
81
82 +----------------+-------------------------+-----------------------------------+
83 | Object | Attribute | Column Name, Value |
84 +================+=========================+===================================+
85 | QueryResutl | id | Q name, query sequence ID |
86 | +-------------------------+-----------------------------------+
87 | | seq_len | Q size, query sequence full |
88 | | | length |
89 +----------------+-------------------------+-----------------------------------+
90 | Hit | id | T name, hit sequence ID |
91 | +-------------------------+-----------------------------------+
92 | | seq_len | T size, hit sequence full length |
93 +----------------+-------------------------+-----------------------------------+
94 | HSP | hit_end | T end, end coordinate of the last |
95 | | | hit fragment |
96 | +-------------------------+-----------------------------------+
97 | | hit_gap_num | T gap bases, number of bases |
98 | | | inserted in hit |
99 | +-------------------------+-----------------------------------+
100 | | hit_gapopen_num | T gap count, number of hit gap |
101 | | | inserts |
102 | +-------------------------+-----------------------------------+
103 | | hit_span_all | blockSizes, sizes of each |
104 | | | fragment |
105 | +-------------------------+-----------------------------------+
106 | | hit_start | T start, start coordinate of the |
107 | | | first hit fragment |
108 | +-------------------------+-----------------------------------+
109 | | hit_start_all | tStarts, start coordinate of each |
110 | | | hit fragment |
111 | +-------------------------+-----------------------------------+
112 | | match_num | match, number of non-repeat |
113 | | | matches |
114 | +-------------------------+-----------------------------------+
115 | | mismatch_num | mismatch, number of mismatches |
116 | +-------------------------+-----------------------------------+
117 | | match_rep_num | rep. match, number of matches |
118 | | | that are part of repeats |
119 | +-------------------------+-----------------------------------+
120 | | n_num | N's, number of N bases |
121 | +-------------------------+-----------------------------------+
122 | | query_end | Q end, end coordinate of the last |
123 | +-------------------------+-----------------------------------+
124 | | | query fragment |
125 | | query_gap_num | Q gap bases, number of bases |
126 | | | inserted in query |
127 | +-------------------------+-----------------------------------+
128 | | query_gapopen_num | Q gap count, number of query gap |
129 | | | inserts |
130 | +-------------------------+-----------------------------------+
131 | | query_span_all | blockSizes, sizes of each |
132 | | | fragment |
133 | +-------------------------+-----------------------------------+
134 | | query_start | Q start, start coordinate of the |
135 | | | first query block |
136 | +-------------------------+-----------------------------------+
137 | | query_start_all | qStarts, start coordinate of each |
138 | | | query fragment |
139 | +-------------------------+-----------------------------------+
140 | | `len`* | block count, the number of blocks |
141 | | | in the alignment |
142 +----------------+-------------------------+-----------------------------------+
143 | HSPFragment | hit | hit sequence, if present |
144 | +-------------------------+-----------------------------------+
145 | | hit_strand | strand, hit sequence strand |
146 | +-------------------------+-----------------------------------+
147 | | query | query sequence, if present |
148 | +-------------------------+-----------------------------------+
149 | | query_strand | strand, query sequence strand |
150 +----------------+-------------------------+-----------------------------------+
151 * You can obtain the number of blocks / fragments in the HSP by invoking `len`
152 on the HSP
153
154 In addition to the column mappings above, BlatIO also provides the following
155 object attributes:
156
157 +----------------+-------------------------+-----------------------------------+
158 | Object | Attribute | Value |
159 +================+=========================+===================================+
160 | HSP | gapopen_num | Q gap count + T gap count, total |
161 | | | number of gap openings |
162 | +-------------------------+-----------------------------------+
163 | | ident_num | matches + repmatches, total |
164 | | | number of identical residues |
165 | +-------------------------+-----------------------------------+
166 | | ident_pct | percent identity, calculated |
167 | | | using UCSC's formula |
168 | +-------------------------+-----------------------------------+
169 | | query_is_protein | boolean, whether the query |
170 | | | sequence is a protein |
171 | +-------------------------+-----------------------------------+
172 | | score | HSP score, calculated using |
173 | | | UCSC's formula |
174 +----------------+-------------------------+-----------------------------------+
175
176 Finally, the default HSP and HSPFragment properties are also provided. See the
177 HSP and HSPFragment documentation for more details on these properties.
178
179 """
180
181 import re
182 from math import log
183
184 from Bio._py3k import _as_bytes, _bytes_to_string
185 from Bio.Alphabet import generic_dna
186 from Bio.SearchIO._index import SearchIndexer
187 from Bio.SearchIO._model import QueryResult, Hit, HSP, HSPFragment
188
189
190 __all__ = ['BlatPslParser', 'BlatPslIndexer', 'BlatPslWriter']
191
192
193
194 _PTR_ROW_CHECK = r'^\d+\s+\d+\s+\d+\s+\d+'
195 _RE_ROW_CHECK = re.compile(_PTR_ROW_CHECK)
196 _RE_ROW_CHECK_IDX = re.compile(_as_bytes(_PTR_ROW_CHECK))
197
198
200 """Transforms the given comma-separated string into a list.
201
202 Arguments:
203 csv_string -- Comma-separated string to transform.
204 caster -- Cast function to use on each list item.
205
206 """
207 filtered = (x for x in filter(None, csv_string.split(',')))
208 if caster is None:
209 return list(filtered)
210 else:
211 return [caster(x) for x in filtered]
212
213
215 """Reorients block starts into the opposite strand's coordinates.
216
217 Arguments:
218 starts -- List of integers, start coordinates.
219 start -- Integer, 'Q start' or 'T start' column
220 blksizes -- List of integers, block sizes.
221 seqlen -- Integer of total sequence length.
222 strand -- Integer denoting sequence strand.
223
224 """
225 assert len(starts) == len(blksizes), \
226 "Unequal start coordinates and block sizes list (%r vs %r)" \
227 % (len(starts), len(blksizes))
228
229
230 if strand >= 0:
231 return starts
232 else:
233
234
235 return [seqlen - start - blksize for
236 start, blksize in zip(starts, blksizes)]
237
238
240
241
242 if len(psl['strand']) == 2:
243 if psl['strand'][1] == '+':
244 return psl['tend'] == psl['tstarts'][-1] + \
245 3 * psl['blocksizes'][-1]
246 elif psl['strand'][1] == '-':
247 return psl['tstart'] == psl['tsize'] - \
248 (psl['tstarts'][-1] + 3 * psl['blocksizes'][-1])
249
250 return False
251
252
254
255
256 size_mul = 3 if is_protein else 1
257 millibad = 0
258
259 qali_size = size_mul * (psl['qend'] - psl['qstart'])
260 tali_size = psl['tend'] - psl['tstart']
261 ali_size = min(qali_size, tali_size)
262 if ali_size <= 0:
263 return 0
264
265 size_dif = qali_size - tali_size
266 size_dif = 0 if size_dif < 0 else size_dif
267
268 total = size_mul * (psl['matches'] + psl['repmatches'] + psl['mismatches'])
269 if total != 0:
270 millibad = (1000 * (psl['mismatches'] * size_mul + psl['qnuminsert'] +
271 round(3 * log(1 + size_dif)))) / total
272
273 return millibad
274
275
277
278
279 size_mul = 3 if is_protein else 1
280 return size_mul * (psl['matches'] + (psl['repmatches'] >> 1)) - \
281 size_mul * psl['mismatches'] - psl['qnuminsert'] - psl['tnuminsert']
282
283
285
286 is_protein = _is_protein(psl)
287
288
289 if is_protein:
290 qstrand = 0
291 else:
292 qstrand = 1 if psl['strand'][0] == '+' else -1
293
294 try:
295 hstrand = 1 if psl['strand'][1] == '+' else -1
296 except IndexError:
297 hstrand = 1
298
299
300 qstarts = _reorient_starts(psl['qstarts'],
301 psl['blocksizes'], psl['qsize'], qstrand)
302
303 if len(psl['strand']) == 2:
304 hstarts = _reorient_starts(psl['tstarts'],
305 psl['blocksizes'], psl['tsize'], hstrand)
306 else:
307 hstarts = psl['tstarts']
308
309
310 assert len(qstarts) == len(hstarts) == len(psl['blocksizes'])
311 query_range_all = zip(qstarts, [x + y for x, y in
312 zip(qstarts, psl['blocksizes'])])
313 hit_range_all = zip(hstarts, [x + y for x, y in
314 zip(hstarts, psl['blocksizes'])])
315
316 if 'tseqs' in psl and 'qseqs' in psl:
317 assert len(psl['tseqs']) == len(psl['qseqs']) == \
318 len(query_range_all) == len(hit_range_all)
319 else:
320 assert len(query_range_all) == len(hit_range_all)
321
322 frags = []
323
324 for idx, qcoords in enumerate(query_range_all):
325 hseqlist = psl.get('tseqs')
326 hseq = '' if not hseqlist else hseqlist[idx]
327 qseqlist = psl.get('qseqs')
328 qseq = '' if not qseqlist else qseqlist[idx]
329 frag = HSPFragment(hid, qid, hit=hseq, query=qseq)
330
331 frag.alphabet = generic_dna
332
333 frag.query_start = qcoords[0]
334 frag.query_end = qcoords[1]
335 frag.hit_start = hit_range_all[idx][0]
336 frag.hit_end = hit_range_all[idx][1]
337
338 frag.query_strand = qstrand
339 frag.hit_strand = hstrand
340 frags.append(frag)
341
342
343 hsp = HSP(frags)
344
345 assert hsp.query_start == psl['qstart']
346 assert hsp.query_end == psl['qend']
347 assert hsp.hit_start == psl['tstart']
348 assert hsp.hit_end == psl['tend']
349
350 assert hsp.query_span_all == hsp.hit_span_all == psl['blocksizes']
351
352 hsp.match_num = psl['matches']
353 hsp.mismatch_num = psl['mismatches']
354 hsp.match_rep_num = psl['repmatches']
355 hsp.n_num = psl['ncount']
356 hsp.query_gapopen_num = psl['qnuminsert']
357 hsp.query_gap_num = psl['qbaseinsert']
358 hsp.hit_gapopen_num = psl['tnuminsert']
359 hsp.hit_gap_num = psl['tbaseinsert']
360
361 hsp.ident_num = psl['matches'] + psl['repmatches']
362 hsp.gapopen_num = psl['qnuminsert'] + psl['tnuminsert']
363 hsp.gap_num = psl['qbaseinsert'] + psl['tbaseinsert']
364 hsp.query_is_protein = is_protein
365 hsp.ident_pct = 100.0 - _calc_millibad(psl, is_protein) * 0.1
366 hsp.score = _calc_score(psl, is_protein)
367
368 hsp._has_hit_strand = len(psl['strand']) == 2
369
370 return hsp
371
372
374
375 """Parser for the BLAT PSL format."""
376
377 - def __init__(self, handle, pslx=False):
381
383
384 if not self.line:
385 raise StopIteration
386
387
388
389 while not re.search(_RE_ROW_CHECK, self.line.strip()):
390 self.line = self.handle.readline()
391 if not self.line:
392 raise StopIteration
393
394
395 for qresult in self._parse_qresult():
396 qresult.program = 'blat'
397 yield qresult
398
400 """Returns a dictionary of parsed column values."""
401 assert self.line
402 cols = filter(None, self.line.strip().split('\t'))
403 self._validate_cols(cols)
404
405 psl = {}
406 psl['qname'] = cols[9]
407 psl['qsize'] = int(cols[10])
408 psl['tname'] = cols[13]
409 psl['tsize'] = int(cols[14])
410 psl['matches'] = int(cols[0])
411 psl['mismatches'] = int(cols[1])
412 psl['repmatches'] = int(cols[2])
413 psl['ncount'] = int(cols[3])
414 psl['qnuminsert'] = int(cols[4])
415 psl['qbaseinsert'] = int(cols[5])
416 psl['tnuminsert'] = int(cols[6])
417 psl['tbaseinsert'] = int(cols[7])
418 psl['strand'] = cols[8]
419 psl['qstart'] = int(cols[11])
420 psl['qend'] = int(cols[12])
421 psl['tstart'] = int(cols[15])
422 psl['tend'] = int(cols[16])
423 psl['blockcount'] = int(cols[17])
424 psl['blocksizes'] = _list_from_csv(cols[18], int)
425 psl['qstarts'] = _list_from_csv(cols[19], int)
426 psl['tstarts'] = _list_from_csv(cols[20], int)
427 if self.pslx:
428 psl['qseqs'] = _list_from_csv(cols[21])
429 psl['tseqs'] = _list_from_csv(cols[22])
430
431 return psl
432
434 if not self.pslx:
435 assert len(cols) == 21, "Invalid PSL line: %r. " \
436 "Expected 21 tab-separated columns, found %i" % (self.line, len(cols))
437 else:
438 assert len(cols) == 23, "Invalid PSLX line: %r. " \
439 "Expected 23 tab-separated columns, found %i" % (self.line, len(cols))
440
442 """Generator function that returns QueryResult objects."""
443
444 state_EOF = 0
445 state_QRES_NEW = 1
446 state_QRES_SAME = 3
447 state_HIT_NEW = 2
448 state_HIT_SAME = 4
449
450 qres_state = None
451 file_state = None
452 prev_qid, prev_hid = None, None
453 cur, prev = None, None
454 hit_list, hsp_list = [], []
455
456 while True:
457
458 if cur is not None:
459 prev = cur
460 prev_qid = cur_qid
461 prev_hid = cur_hid
462
463 if self.line:
464 cur = self._parse_row()
465 cur_qid = cur['qname']
466 cur_hid = cur['tname']
467 else:
468 file_state = state_EOF
469
470 cur_qid, cur_hid = None, None
471
472
473 if prev_qid != cur_qid:
474 qres_state = state_QRES_NEW
475 else:
476 qres_state = state_QRES_SAME
477
478 if prev_hid != cur_hid or qres_state == state_QRES_NEW:
479 hit_state = state_HIT_NEW
480 else:
481 hit_state = state_HIT_SAME
482
483 if prev is not None:
484
485 hsp = _create_hsp(prev_hid, prev_qid, prev)
486 hsp_list.append(hsp)
487
488 if hit_state == state_HIT_NEW:
489
490 hit = Hit(hsp_list)
491 hit.seq_len = prev['tsize']
492 hit_list.append(hit)
493 hsp_list = []
494
495
496 if qres_state == state_QRES_NEW or file_state == state_EOF:
497 qresult = QueryResult(prev_qid)
498 for hit in hit_list:
499 qresult.absorb(hit)
500 qresult.seq_len = prev['qsize']
501 yield qresult
502
503 if file_state == state_EOF:
504 break
505 hit_list = []
506
507 self.line = self.handle.readline()
508
509
511
512 """Indexer class for BLAT PSL output."""
513
514 _parser = BlatPslParser
515
516 - def __init__(self, filename, pslx=False):
518
520 """Iterates over the file handle; yields key, start offset, and length."""
521 handle = self._handle
522 handle.seek(0)
523
524 query_id_idx = 9
525 qresult_key = None
526 tab_char = _as_bytes('\t')
527
528 start_offset = handle.tell()
529 line = handle.readline()
530
531
532 while not re.search(_RE_ROW_CHECK_IDX, line.strip()):
533 start_offset = handle.tell()
534 line = handle.readline()
535 if not line:
536 raise StopIteration
537
538
539 while True:
540 end_offset = handle.tell()
541
542 cols = line.strip().split(tab_char)
543 if qresult_key is None:
544 qresult_key = list(filter(None, cols))[query_id_idx]
545 else:
546 curr_key = list(filter(None, cols))[query_id_idx]
547
548 if curr_key != qresult_key:
549 yield _bytes_to_string(qresult_key), start_offset, \
550 end_offset - start_offset
551 qresult_key = curr_key
552 start_offset = end_offset - len(line)
553
554 line = handle.readline()
555 if not line:
556 yield _bytes_to_string(qresult_key), start_offset, \
557 end_offset - start_offset
558 break
559
561 """Returns the raw string of a QueryResult object from the given offset."""
562 handle = self._handle
563 handle.seek(offset)
564 query_id_idx = 9
565 qresult_key = None
566 qresult_raw = _as_bytes('')
567 tab_char = _as_bytes('\t')
568
569 while True:
570 line = handle.readline()
571 if not line:
572 break
573 cols = line.strip().split(tab_char)
574 if qresult_key is None:
575 qresult_key = list(filter(None, cols))[query_id_idx]
576 else:
577 curr_key = list(filter(None, cols))[query_id_idx]
578 if curr_key != qresult_key:
579 break
580 qresult_raw += line
581
582 return qresult_raw
583
584
586
587 """Writer for the blat-psl format."""
588
589 - def __init__(self, handle, header=False, pslx=False):
594
596 handle = self.handle
597 qresult_counter, hit_counter, hsp_counter, frag_counter = 0, 0, 0, 0
598
599 if self.header:
600 handle.write(self._build_header())
601
602 for qresult in qresults:
603 if qresult:
604 handle.write(self._build_row(qresult))
605 qresult_counter += 1
606 hit_counter += len(qresult)
607 hsp_counter += sum([len(hit) for hit in qresult])
608 frag_counter += sum([len(hit.fragments) for hit in qresult])
609
610 return qresult_counter, hit_counter, hsp_counter, frag_counter
611
613
614 header = 'psLayout version 3\n'
615
616
617 header += "\nmatch\tmis- \trep. \tN's\tQ gap\tQ gap\tT gap\tT "
618 "gap\tstrand\tQ \tQ \tQ \tQ \tT \tT \tT "
619 "\tT \tblock\tblockSizes \tqStarts\t tStarts\n " \
620 "\tmatch\tmatch\t \tcount\tbases\tcount\tbases\t \tname "
621 "\tsize\tstart\tend\tname \tsize\tstart\tend\tcount"
622 "\n%s\n" % ('-' * 159)
623
624 return header
625
627 """Returns a string or one row or more of the QueryResult object."""
628
629
630
631
632
633 qresult_lines = []
634
635 for hit in qresult:
636 for hsp in hit.hsps:
637
638 line = []
639 line.append(hsp.match_num)
640 line.append(hsp.mismatch_num)
641 line.append(hsp.match_rep_num)
642 line.append(hsp.n_num)
643 line.append(hsp.query_gapopen_num)
644 line.append(hsp.query_gap_num)
645 line.append(hsp.hit_gapopen_num)
646 line.append(hsp.hit_gap_num)
647
648
649 assert hsp.query_span_all == hsp.hit_span_all
650 block_sizes = hsp.query_span_all
651
652
653 if hsp[0].query_strand >= 0:
654 strand = '+'
655 else:
656 strand = '-'
657 qstarts = _reorient_starts([x[0] for x in hsp.query_range_all],
658 hsp.query_span_all, qresult.seq_len, hsp[0].query_strand)
659
660 if hsp[0].hit_strand == 1:
661 hstrand = 1
662
663 if hsp._has_hit_strand:
664 strand += '+'
665 else:
666 hstrand = -1
667 strand += '-'
668 hstarts = _reorient_starts([x[0] for x in hsp.hit_range_all],
669 hsp.hit_span_all, hit.seq_len, hstrand)
670
671 line.append(strand)
672 line.append(qresult.id)
673 line.append(qresult.seq_len)
674 line.append(hsp.query_start)
675 line.append(hsp.query_end)
676 line.append(hit.id)
677 line.append(hit.seq_len)
678 line.append(hsp.hit_start)
679 line.append(hsp.hit_end)
680 line.append(len(hsp))
681 line.append(','.join((str(x) for x in block_sizes)) + ',')
682 line.append(','.join((str(x) for x in qstarts)) + ',')
683 line.append(','.join((str(x) for x in hstarts)) + ',')
684
685 if self.pslx:
686 line.append(','.join((str(x.seq) for x in hsp.query_all)) + ',')
687 line.append(','.join((str(x.seq) for x in hsp.hit_all)) + ',')
688
689 qresult_lines.append('\t'.join((str(x) for x in line)))
690
691 return '\n'.join(qresult_lines) + '\n'
692
693
694
695 if __name__ == "__main__":
696 from Bio._utils import run_doctest
697 run_doctest()
698