Package Bio :: Package SearchIO :: Module BlatIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.SearchIO.BlatIO

  1  # Copyright 2012 by Wibowo Arindrarto.  All rights reserved. 
  2  # This code is part of the Biopython distribution and governed by its 
  3  # license.  Please see the LICENSE file that should have been included 
  4  # as part of this package. 
  5   
  6  """Bio.SearchIO parser for BLAT output formats. 
  7   
  8  This module adds support for parsing BLAT outputs. BLAT (BLAST-Like Alignment 
  9  Tool) is a sequence homology search program initially built for annotating 
 10  the human genome. 
 11   
 12  Bio.SearchIO.BlastIO was tested using standalone BLAT version 34, psLayout 
 13  version 3. It should be able to parse psLayout version 4 without problems. 
 14   
 15  More information on BLAT is available from these sites: 
 16   
 17    - Publication: http://genome.cshlp.org/content/12/4/656 
 18    - User guide: http://genome.ucsc.edu/goldenPath/help/blatSpec.html 
 19    - Source download: http://www.soe.ucsc.edu/~kent/src 
 20    - Executable download: http://hgdownload.cse.ucsc.edu/admin/exe/ 
 21    - Blat score calculation: http://genome.ucsc.edu/FAQ/FAQblat.html#blat4 
 22   
 23   
 24  Supported Formats 
 25  ================= 
 26   
 27  BlatIO supports parsing, indexing, and writing for both PSL and PSLX output 
 28  formats, with or without header. To parse, index, or write PSLX files, use the 
 29  'pslx' keyword argument and set it to True. 
 30   
 31      # blat-psl defaults to PSL files 
 32      >>> from Bio import SearchIO 
 33      >>> psl = 'Blat/psl_34_004.psl' 
 34      >>> qresult = SearchIO.read(psl, 'blat-psl') 
 35      >>> qresult 
 36      QueryResult(id='hg19_dna', 10 hits) 
 37   
 38      # set the pslx flag to parse PSLX files 
 39      >>> pslx = 'Blat/pslx_34_004.pslx' 
 40      >>> qresult = SearchIO.read(pslx, 'blat-psl', pslx=True) 
 41      >>> qresult 
 42      QueryResult(id='hg19_dna', 10 hits) 
 43   
 44  For parsing and indexing, you do not need to specify whether the file has a 
 45  header or not. For writing, if you want to write a header, you can set the 
 46  'header' keyword argument to True. This will write a 'psLayout version 3' header 
 47  to your output file. 
 48   
 49      from Bio import SearchIO 
 50      qresult = SearchIO.read(psl, 'blat-psl') 
 51      SearchIO.write(qresult, 'header.psl', header=True) 
 52      <stdout> (1, 10, 19, 23) 
 53   
 54  Note that the number of HSPFragments written may exceed the number of HSP 
 55  objects. This is because in PSL files, it is possible to have single matches 
 56  consisting of noncontiguous sequence fragments. This is where the HSPFragment 
 57  object comes into play. These fragments are grouped into a single HSP because 
 58  they share the same statistics (e.g. match numbers, BLAT score, etc.). However, 
 59  they do not share the same sequence attributes, such as the start and end 
 60  coordinates, making them distinct objects. 
 61   
 62  In addition to parsing PSL(X) files, BlatIO also computes the percent identities 
 63  and scores of your search results. This is done using the calculation formula 
 64  posted here: http://genome.ucsc.edu/FAQ/FAQblat.html#blat4. It mimics the score 
 65  and percent identity calculation done by UCSC's web BLAT service. 
 66   
 67  Since BlatIO parses the file in a single pass, it expects all results from 
 68  the same query to be in consecutive rows. If the results from one query are 
 69  spread in nonconsecutive rows, BlatIO will consider them to be separate 
 70  QueryResult objects. 
 71   
 72  In most cases, the PSL(X) format uses the same coordinate system as Python 
 73  (zero-based, half open). These coordinates are anchored on the plus strand. 
 74  However, if the query aligns on the minus strand, BLAT will anchor the qStarts 
 75  coordinates on the minus strand instead. BlatIO is aware of this, and will 
 76  re-anchor the qStarts coordinates to the plus strand whenever it sees a minus 
 77  strand query match. Conversely, when you write out to a PSL(X) file, BlatIO will 
 78  reanchor qStarts to the minus strand again. 
 79   
 80  BlatIO provides the following attribute-column mapping: 
 81   
 82  +----------------+-------------------------+-----------------------------------+ 
 83  | Object         | Attribute               | Column Name, Value                | 
 84  +================+=========================+===================================+ 
 85  | QueryResutl    | id                      | Q name, query sequence ID         | 
 86  |                +-------------------------+-----------------------------------+ 
 87  |                | seq_len                 | Q size, query sequence full       | 
 88  |                |                         | length                            | 
 89  +----------------+-------------------------+-----------------------------------+ 
 90  | Hit            | id                      | T name, hit sequence ID           | 
 91  |                +-------------------------+-----------------------------------+ 
 92  |                | seq_len                 | T size, hit sequence full length  | 
 93  +----------------+-------------------------+-----------------------------------+ 
 94  | HSP            | hit_end                 | T end, end coordinate of the last | 
 95  |                |                         | hit fragment                      | 
 96  |                +-------------------------+-----------------------------------+ 
 97  |                | hit_gap_num             | T gap bases, number of bases      | 
 98  |                |                         | inserted in hit                   | 
 99  |                +-------------------------+-----------------------------------+ 
100  |                | hit_gapopen_num         | T gap count, number of hit gap    | 
101  |                |                         | inserts                           | 
102  |                +-------------------------+-----------------------------------+ 
103  |                | hit_span_all            | blockSizes, sizes of each         | 
104  |                |                         | fragment                          | 
105  |                +-------------------------+-----------------------------------+ 
106  |                | hit_start               | T start, start coordinate of the  | 
107  |                |                         | first hit fragment                | 
108  |                +-------------------------+-----------------------------------+ 
109  |                | hit_start_all           | tStarts, start coordinate of each | 
110  |                |                         | hit fragment                      | 
111  |                +-------------------------+-----------------------------------+ 
112  |                | match_num               | match, number of non-repeat       | 
113  |                |                         | matches                           | 
114  |                +-------------------------+-----------------------------------+ 
115  |                | mismatch_num            | mismatch, number of mismatches    | 
116  |                +-------------------------+-----------------------------------+ 
117  |                | match_rep_num           | rep. match, number of matches     | 
118  |                |                         | that are part of repeats          | 
119  |                +-------------------------+-----------------------------------+ 
120  |                | n_num                   | N's, number of N bases            | 
121  |                +-------------------------+-----------------------------------+ 
122  |                | query_end               | Q end, end coordinate of the last | 
123  |                +-------------------------+-----------------------------------+ 
124  |                |                         | query fragment                    | 
125  |                | query_gap_num           | Q gap bases, number of bases      | 
126  |                |                         | inserted in query                 | 
127  |                +-------------------------+-----------------------------------+ 
128  |                | query_gapopen_num       | Q gap count, number of query gap  | 
129  |                |                         | inserts                           | 
130  |                +-------------------------+-----------------------------------+ 
131  |                | query_span_all          | blockSizes, sizes of each         | 
132  |                |                         | fragment                          | 
133  |                +-------------------------+-----------------------------------+ 
134  |                | query_start             | Q start, start coordinate of the  | 
135  |                |                         | first query block                 | 
136  |                +-------------------------+-----------------------------------+ 
137  |                | query_start_all         | qStarts, start coordinate of each | 
138  |                |                         | query fragment                    | 
139  |                +-------------------------+-----------------------------------+ 
140  |                | `len`*                  | block count, the number of blocks | 
141  |                |                         | in the alignment                  | 
142  +----------------+-------------------------+-----------------------------------+ 
143  | HSPFragment    | hit                     | hit sequence, if present          | 
144  |                +-------------------------+-----------------------------------+ 
145  |                | hit_strand              | strand, hit sequence strand       | 
146  |                +-------------------------+-----------------------------------+ 
147  |                | query                   | query sequence, if present        | 
148  |                +-------------------------+-----------------------------------+ 
149  |                | query_strand            | strand, query sequence strand     | 
150  +----------------+-------------------------+-----------------------------------+ 
151  * You can obtain the number of blocks / fragments in the HSP by invoking `len` 
152    on the HSP 
153   
154  In addition to the column mappings above, BlatIO also provides the following 
155  object attributes: 
156   
157  +----------------+-------------------------+-----------------------------------+ 
158  | Object         | Attribute               | Value                             | 
159  +================+=========================+===================================+ 
160  | HSP            | gapopen_num             | Q gap count + T gap count, total  | 
161  |                |                         |  number of gap openings           | 
162  |                +-------------------------+-----------------------------------+ 
163  |                | ident_num               | matches + repmatches, total       | 
164  |                |                         | number of identical residues      | 
165  |                +-------------------------+-----------------------------------+ 
166  |                | ident_pct               | percent identity, calculated      | 
167  |                |                         | using UCSC's formula              | 
168  |                +-------------------------+-----------------------------------+ 
169  |                | query_is_protein        | boolean, whether the query        | 
170  |                |                         | sequence is a protein             | 
171  |                +-------------------------+-----------------------------------+ 
172  |                | score                   | HSP score, calculated using       | 
173  |                |                         | UCSC's formula                    | 
174  +----------------+-------------------------+-----------------------------------+ 
175   
176  Finally, the default HSP and HSPFragment properties are also provided. See the 
177  HSP and HSPFragment documentation for more details on these properties. 
178   
179  """ 
180  import re 
181  from math import log 
182   
183  from Bio._py3k import _as_bytes, _bytes_to_string 
184  from Bio._py3k import zip 
185   
186  from Bio.Alphabet import generic_dna 
187  from Bio.SearchIO._index import SearchIndexer 
188  from Bio.SearchIO._model import QueryResult, Hit, HSP, HSPFragment 
189   
190   
191  __all__ = ['BlatPslParser', 'BlatPslIndexer', 'BlatPslWriter'] 
192   
193   
194  # precompile regex patterns 
195  _PTR_ROW_CHECK = r'^\d+\s+\d+\s+\d+\s+\d+' 
196  _RE_ROW_CHECK = re.compile(_PTR_ROW_CHECK) 
197  _RE_ROW_CHECK_IDX = re.compile(_as_bytes(_PTR_ROW_CHECK)) 
198   
199   
200 -def _list_from_csv(csv_string, caster=None):
201 """Transforms the given comma-separated string into a list. 202 203 Arguments: 204 csv_string -- Comma-separated string to transform. 205 caster -- Cast function to use on each list item. 206 207 """ 208 if caster is None: 209 return [x for x in csv_string.split(',') if x] 210 else: 211 return [caster(x) for x in csv_string.split(',') if x]
212 213
214 -def _reorient_starts(starts, blksizes, seqlen, strand):
215 """Reorients block starts into the opposite strand's coordinates. 216 217 Arguments: 218 starts -- List of integers, start coordinates. 219 start -- Integer, 'Q start' or 'T start' column 220 blksizes -- List of integers, block sizes. 221 seqlen -- Integer of total sequence length. 222 strand -- Integer denoting sequence strand. 223 224 """ 225 assert len(starts) == len(blksizes), \ 226 "Unequal start coordinates and block sizes list (%r vs %r)" \ 227 % (len(starts), len(blksizes)) 228 # see: http://genome.ucsc.edu/goldenPath/help/blatSpec.html 229 # no need to reorient if it's already the positive strand 230 if strand >= 0: 231 return starts 232 else: 233 # the plus-oriented coordinate is calculated by this: 234 # plus_coord = length - minus_coord - block_size 235 return [seqlen - start - blksize for 236 start, blksize in zip(starts, blksizes)]
237 238
239 -def _is_protein(psl):
240 # check if query is protein or not 241 # adapted from http://genome.ucsc.edu/FAQ/FAQblat.html#blat4 242 if len(psl['strand']) == 2: 243 if psl['strand'][1] == '+': 244 return psl['tend'] == psl['tstarts'][-1] + \ 245 3 * psl['blocksizes'][-1] 246 elif psl['strand'][1] == '-': 247 return psl['tstart'] == psl['tsize'] - \ 248 (psl['tstarts'][-1] + 3 * psl['blocksizes'][-1]) 249 250 return False
251 252
253 -def _calc_millibad(psl, is_protein):
254 # calculates millibad 255 # adapted from http://genome.ucsc.edu/FAQ/FAQblat.html#blat4 256 size_mul = 3 if is_protein else 1 257 millibad = 0 258 259 qali_size = size_mul * (psl['qend'] - psl['qstart']) 260 tali_size = psl['tend'] - psl['tstart'] 261 ali_size = min(qali_size, tali_size) 262 if ali_size <= 0: 263 return 0 264 265 size_dif = qali_size - tali_size 266 size_dif = 0 if size_dif < 0 else size_dif 267 268 total = size_mul * (psl['matches'] + psl['repmatches'] + psl['mismatches']) 269 if total != 0: 270 millibad = (1000 * (psl['mismatches'] * size_mul + psl['qnuminsert'] + 271 round(3 * log(1 + size_dif)))) / total 272 273 return millibad
274 275
276 -def _calc_score(psl, is_protein):
277 # calculates score 278 # adapted from http://genome.ucsc.edu/FAQ/FAQblat.html#blat4 279 size_mul = 3 if is_protein else 1 280 return size_mul * (psl['matches'] + (psl['repmatches'] >> 1)) - \ 281 size_mul * psl['mismatches'] - psl['qnuminsert'] - psl['tnuminsert']
282 283
284 -def _create_hsp(hid, qid, psl):
285 # protein flag 286 is_protein = _is_protein(psl) 287 # strand 288 #if query is protein, strand is 0 289 if is_protein: 290 qstrand = 0 291 else: 292 qstrand = 1 if psl['strand'][0] == '+' else -1 293 # try to get hit strand, if it exists 294 try: 295 hstrand = 1 if psl['strand'][1] == '+' else -1 296 except IndexError: 297 hstrand = 1 # hit strand defaults to plus 298 299 # query block starts 300 qstarts = _reorient_starts(psl['qstarts'], 301 psl['blocksizes'], psl['qsize'], qstrand) 302 # hit block starts 303 if len(psl['strand']) == 2: 304 hstarts = _reorient_starts(psl['tstarts'], 305 psl['blocksizes'], psl['tsize'], hstrand) 306 else: 307 hstarts = psl['tstarts'] 308 # set query and hit coords 309 # this assumes each block has no gaps (which seems to be the case) 310 assert len(qstarts) == len(hstarts) == len(psl['blocksizes']) 311 query_range_all = list(zip(qstarts, [x + y for x, y in 312 zip(qstarts, psl['blocksizes'])])) 313 hit_range_all = list(zip(hstarts, [x + y for x, y in 314 zip(hstarts, psl['blocksizes'])])) 315 # check length of sequences and coordinates, all must match 316 if 'tseqs' in psl and 'qseqs' in psl: 317 assert len(psl['tseqs']) == len(psl['qseqs']) == \ 318 len(query_range_all) == len(hit_range_all) 319 else: 320 assert len(query_range_all) == len(hit_range_all) 321 322 frags = [] 323 # iterating over query_range_all, but hit_range_all works just as well 324 for idx, qcoords in enumerate(query_range_all): 325 hseqlist = psl.get('tseqs') 326 hseq = '' if not hseqlist else hseqlist[idx] 327 qseqlist = psl.get('qseqs') 328 qseq = '' if not qseqlist else qseqlist[idx] 329 frag = HSPFragment(hid, qid, hit=hseq, query=qseq) 330 # set alphabet 331 frag.alphabet = generic_dna 332 # set coordinates 333 frag.query_start = qcoords[0] 334 frag.query_end = qcoords[1] 335 frag.hit_start = hit_range_all[idx][0] 336 frag.hit_end = hit_range_all[idx][1] 337 # and strands 338 frag.query_strand = qstrand 339 frag.hit_strand = hstrand 340 frags.append(frag) 341 342 # create hsp object 343 hsp = HSP(frags) 344 # check if start and end are set correctly 345 assert hsp.query_start == psl['qstart'] 346 assert hsp.query_end == psl['qend'] 347 assert hsp.hit_start == psl['tstart'] 348 assert hsp.hit_end == psl['tend'] 349 # and check block spans as well 350 assert hsp.query_span_all == hsp.hit_span_all == psl['blocksizes'] 351 # set its attributes 352 hsp.match_num = psl['matches'] 353 hsp.mismatch_num = psl['mismatches'] 354 hsp.match_rep_num = psl['repmatches'] 355 hsp.n_num = psl['ncount'] 356 hsp.query_gapopen_num = psl['qnuminsert'] 357 hsp.query_gap_num = psl['qbaseinsert'] 358 hsp.hit_gapopen_num = psl['tnuminsert'] 359 hsp.hit_gap_num = psl['tbaseinsert'] 360 361 hsp.ident_num = psl['matches'] + psl['repmatches'] 362 hsp.gapopen_num = psl['qnuminsert'] + psl['tnuminsert'] 363 hsp.gap_num = psl['qbaseinsert'] + psl['tbaseinsert'] 364 hsp.query_is_protein = is_protein 365 hsp.ident_pct = 100.0 - _calc_millibad(psl, is_protein) * 0.1 366 hsp.score = _calc_score(psl, is_protein) 367 # helper flag, for writing 368 hsp._has_hit_strand = len(psl['strand']) == 2 369 370 return hsp
371 372
373 -class BlatPslParser(object):
374 375 """Parser for the BLAT PSL format.""" 376
377 - def __init__(self, handle, pslx=False):
378 self.handle = handle 379 self.line = self.handle.readline() 380 self.pslx = pslx
381
382 - def __iter__(self):
383 # break out if it's an empty file 384 if not self.line: 385 raise StopIteration 386 387 # read through header 388 # this assumes that the result row match the regex 389 while not re.search(_RE_ROW_CHECK, self.line.strip()): 390 self.line = self.handle.readline() 391 if not self.line: 392 raise StopIteration 393 394 # parse into query results 395 for qresult in self._parse_qresult(): 396 qresult.program = 'blat' 397 yield qresult
398
399 - def _parse_row(self):
400 """Returns a dictionary of parsed column values.""" 401 assert self.line 402 cols = [x for x in self.line.strip().split('\t') if x] 403 self._validate_cols(cols) 404 405 psl = {} 406 psl['qname'] = cols[9] # qName 407 psl['qsize'] = int(cols[10]) # qSize 408 psl['tname'] = cols[13] # tName 409 psl['tsize'] = int(cols[14]) # tSize 410 psl['matches'] = int(cols[0]) # matches 411 psl['mismatches'] = int(cols[1]) # misMatches 412 psl['repmatches'] = int(cols[2]) # repMatches 413 psl['ncount'] = int(cols[3]) # nCount 414 psl['qnuminsert'] = int(cols[4]) # qNumInsert 415 psl['qbaseinsert'] = int(cols[5]) # qBaseInsert 416 psl['tnuminsert'] = int(cols[6]) # tNumInsert 417 psl['tbaseinsert'] = int(cols[7]) # tBaseInsert 418 psl['strand'] = cols[8] # strand 419 psl['qstart'] = int(cols[11]) # qStart 420 psl['qend'] = int(cols[12]) # qEnd 421 psl['tstart'] = int(cols[15]) # tStart 422 psl['tend'] = int(cols[16]) # tEnd 423 psl['blockcount'] = int(cols[17]) # blockCount 424 psl['blocksizes'] = _list_from_csv(cols[18], int) # blockSizes 425 psl['qstarts'] = _list_from_csv(cols[19], int) # qStarts 426 psl['tstarts'] = _list_from_csv(cols[20], int) # tStarts 427 if self.pslx: 428 psl['qseqs'] = _list_from_csv(cols[21]) # query sequence 429 psl['tseqs'] = _list_from_csv(cols[22]) # hit sequence 430 431 return psl
432
433 - def _validate_cols(self, cols):
434 if not self.pslx: 435 assert len(cols) == 21, "Invalid PSL line: %r. " \ 436 "Expected 21 tab-separated columns, found %i" % (self.line, len(cols)) 437 else: 438 assert len(cols) == 23, "Invalid PSLX line: %r. " \ 439 "Expected 23 tab-separated columns, found %i" % (self.line, len(cols))
440
441 - def _parse_qresult(self):
442 """Generator function that returns QueryResult objects.""" 443 # state values, determines what to do for each line 444 state_EOF = 0 445 state_QRES_NEW = 1 446 state_QRES_SAME = 3 447 state_HIT_NEW = 2 448 state_HIT_SAME = 4 449 # initial dummy values 450 qres_state = None 451 file_state = None 452 prev_qid, prev_hid = None, None 453 cur, prev = None, None 454 hit_list, hsp_list = [], [] 455 456 while True: 457 # store previous line's parsed values for all lines after the first 458 if cur is not None: 459 prev = cur 460 prev_qid = cur_qid 461 prev_hid = cur_hid 462 # only parse the result row if it's not EOF 463 if self.line: 464 cur = self._parse_row() 465 cur_qid = cur['qname'] 466 cur_hid = cur['tname'] 467 else: 468 file_state = state_EOF 469 # mock values, since we have nothing to parse 470 cur_qid, cur_hid = None, None 471 472 # get the state of hit and qresult 473 if prev_qid != cur_qid: 474 qres_state = state_QRES_NEW 475 else: 476 qres_state = state_QRES_SAME 477 # new hits are hits with different ids or hits in a new qresult 478 if prev_hid != cur_hid or qres_state == state_QRES_NEW: 479 hit_state = state_HIT_NEW 480 else: 481 hit_state = state_HIT_SAME 482 483 if prev is not None: 484 # create fragment and HSP and set their attributes 485 hsp = _create_hsp(prev_hid, prev_qid, prev) 486 hsp_list.append(hsp) 487 488 if hit_state == state_HIT_NEW: 489 # create Hit and set its attributes 490 hit = Hit(hsp_list) 491 hit.seq_len = prev['tsize'] 492 hit_list.append(hit) 493 hsp_list = [] 494 495 # create qresult and yield if we're at a new qresult or at EOF 496 if qres_state == state_QRES_NEW or file_state == state_EOF: 497 qresult = QueryResult(id=prev_qid) 498 for hit in hit_list: 499 qresult.absorb(hit) 500 qresult.seq_len = prev['qsize'] 501 yield qresult 502 # if we're at EOF, break 503 if file_state == state_EOF: 504 break 505 hit_list = [] 506 507 self.line = self.handle.readline()
508 509
510 -class BlatPslIndexer(SearchIndexer):
511 512 """Indexer class for BLAT PSL output.""" 513 514 _parser = BlatPslParser 515
516 - def __init__(self, filename, pslx=False):
517 SearchIndexer.__init__(self, filename, pslx=pslx)
518
519 - def __iter__(self):
520 """Iterates over the file handle; yields key, start offset, and length.""" 521 handle = self._handle 522 handle.seek(0) 523 # denotes column location for query identifier 524 query_id_idx = 9 525 qresult_key = None 526 tab_char = _as_bytes('\t') 527 528 start_offset = handle.tell() 529 line = handle.readline() 530 # read through header 531 # this assumes that the result row match the regex 532 while not re.search(_RE_ROW_CHECK_IDX, line.strip()): 533 start_offset = handle.tell() 534 line = handle.readline() 535 if not line: 536 raise StopIteration 537 538 # and index the qresults 539 while True: 540 end_offset = handle.tell() 541 542 cols = [x for x in line.strip().split(tab_char) if x] 543 if qresult_key is None: 544 qresult_key = cols[query_id_idx] 545 else: 546 curr_key = cols[query_id_idx] 547 548 if curr_key != qresult_key: 549 yield _bytes_to_string(qresult_key), start_offset, \ 550 end_offset - start_offset 551 qresult_key = curr_key 552 start_offset = end_offset - len(line) 553 554 line = handle.readline() 555 if not line: 556 yield _bytes_to_string(qresult_key), start_offset, \ 557 end_offset - start_offset 558 break
559
560 - def get_raw(self, offset):
561 """Returns the raw string of a QueryResult object from the given offset.""" 562 handle = self._handle 563 handle.seek(offset) 564 query_id_idx = 9 565 qresult_key = None 566 qresult_raw = _as_bytes('') 567 tab_char = _as_bytes('\t') 568 569 while True: 570 line = handle.readline() 571 if not line: 572 break 573 cols = [x for x in line.strip().split(tab_char) if x] 574 if qresult_key is None: 575 qresult_key = cols[query_id_idx] 576 else: 577 curr_key = cols[query_id_idx] 578 if curr_key != qresult_key: 579 break 580 qresult_raw += line 581 582 return qresult_raw
583 584
585 -class BlatPslWriter(object):
586 587 """Writer for the blat-psl format.""" 588
589 - def __init__(self, handle, header=False, pslx=False):
590 self.handle = handle 591 # flag for writing header or not 592 self.header = header 593 self.pslx = pslx
594
595 - def write_file(self, qresults):
596 handle = self.handle 597 qresult_counter, hit_counter, hsp_counter, frag_counter = 0, 0, 0, 0 598 599 if self.header: 600 handle.write(self._build_header()) 601 602 for qresult in qresults: 603 if qresult: 604 handle.write(self._build_row(qresult)) 605 qresult_counter += 1 606 hit_counter += len(qresult) 607 hsp_counter += sum(len(hit) for hit in qresult) 608 frag_counter += sum(len(hit.fragments) for hit in qresult) 609 610 return qresult_counter, hit_counter, hsp_counter, frag_counter
611
612 - def _build_header(self):
613 # for now, always use the psLayout version 3 614 header = 'psLayout version 3\n' 615 616 # adapted from BLAT's source: lib/psl.c#L496 617 header += "\nmatch\tmis- \trep. \tN's\tQ gap\tQ gap\tT gap\tT " 618 "gap\tstrand\tQ \tQ \tQ \tQ \tT \tT \tT " 619 "\tT \tblock\tblockSizes \tqStarts\t tStarts\n " \ 620 "\tmatch\tmatch\t \tcount\tbases\tcount\tbases\t \tname " 621 "\tsize\tstart\tend\tname \tsize\tstart\tend\tcount" 622 "\n%s\n" % ('-' * 159) 623 624 return header
625
626 - def _build_row(self, qresult):
627 """Returns a string or one row or more of the QueryResult object.""" 628 # For now, our writer writes the row according to the order in 629 # the QueryResult and Hit objects. 630 # This is different from BLAT's native output, where the rows are 631 # grouped by strand. 632 # Should we tweak the behavior to better mimic the native output? 633 qresult_lines = [] 634 635 for hit in qresult: 636 for hsp in hit.hsps: 637 638 line = [] 639 line.append(hsp.match_num) 640 line.append(hsp.mismatch_num) 641 line.append(hsp.match_rep_num) 642 line.append(hsp.n_num) 643 line.append(hsp.query_gapopen_num) 644 line.append(hsp.query_gap_num) 645 line.append(hsp.hit_gapopen_num) 646 line.append(hsp.hit_gap_num) 647 648 # check spans 649 assert hsp.query_span_all == hsp.hit_span_all 650 block_sizes = hsp.query_span_all 651 652 # set strand and starts 653 if hsp[0].query_strand >= 0: # since it may be a protein seq 654 strand = '+' 655 else: 656 strand = '-' 657 qstarts = _reorient_starts([x[0] for x in hsp.query_range_all], 658 hsp.query_span_all, qresult.seq_len, hsp[0].query_strand) 659 660 if hsp[0].hit_strand == 1: 661 hstrand = 1 662 # only write hit strand if it was present in the source file 663 if hsp._has_hit_strand: 664 strand += '+' 665 else: 666 hstrand = -1 667 strand += '-' 668 hstarts = _reorient_starts([x[0] for x in hsp.hit_range_all], 669 hsp.hit_span_all, hit.seq_len, hstrand) 670 671 line.append(strand) 672 line.append(qresult.id) 673 line.append(qresult.seq_len) 674 line.append(hsp.query_start) 675 line.append(hsp.query_end) 676 line.append(hit.id) 677 line.append(hit.seq_len) 678 line.append(hsp.hit_start) 679 line.append(hsp.hit_end) 680 line.append(len(hsp)) 681 line.append(','.join((str(x) for x in block_sizes)) + ',') 682 line.append(','.join((str(x) for x in qstarts)) + ',') 683 line.append(','.join((str(x) for x in hstarts)) + ',') 684 685 if self.pslx: 686 line.append(','.join((str(x.seq) for x in hsp.query_all)) + ',') 687 line.append(','.join((str(x.seq) for x in hsp.hit_all)) + ',') 688 689 qresult_lines.append('\t'.join((str(x) for x in line))) 690 691 return '\n'.join(qresult_lines) + '\n'
692 693 694 # if not used as a module, run the doctest 695 if __name__ == "__main__": 696 from Bio._utils import run_doctest 697 run_doctest() 698