Package Bio :: Package SearchIO :: Module FastaIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.SearchIO.FastaIO

  1  # Adapted from Bio.AlignIO.FastaIO copyright 2008-2011 by Peter Cock. 
  2  # Copyright 2012 by Wibowo Arindrarto. 
  3  # All rights reserved. 
  4  # This code is part of the Biopython distribution and governed by its 
  5  # license.  Please see the LICENSE file that should have been included 
  6  # as part of this package. 
  7   
  8  """Bio.SearchIO support for Bill Pearson's FASTA tools. 
  9   
 10  This module adds support for parsing FASTA outputs. FASTA is a suite of 
 11  programs that finds regions of local or global similarity between protein 
 12  or nucleotide sequences, either by searching databases or identifying 
 13  local duplications. 
 14   
 15  Bio.SearchIO.FastaIO was tested on the following FASTA flavors and versions: 
 16   
 17      - flavors: fasta, ssearch, tfastx 
 18      - versions: 35, 36 
 19   
 20  Other flavors and/or versions may introduce some bugs. Please file a bug report 
 21  if you see such problems to Biopython's bug tracker. 
 22   
 23  More information on FASTA are available through these links: 
 24   
 25      - Website: http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml 
 26      - User guide: http://fasta.bioch.virginia.edu/fasta_www2/fasta_guide.pdf 
 27   
 28   
 29  Supported Formats 
 30  ================= 
 31   
 32  Bio.SearchIO.FastaIO supports parsing and indexing FASTA outputs triggered by 
 33  the -m 10 flag. Other formats that mimic other programs (e.g. the BLAST tabular 
 34  format using the -m 8 flag) may be parseable but using SearchIO's other parsers 
 35  (in this case, using the 'blast-tab' parser). 
 36   
 37   
 38  fasta-m10 
 39  ========= 
 40   
 41  Note that in FASTA -m 10 outputs, HSPs from different strands are considered to 
 42  be from different hits. They are listed as two separate entries in the hit 
 43  table. FastaIO recognizes this and will group HSPs with the same hit ID into a 
 44  single Hit object, regardless of strand. 
 45   
 46  FASTA also sometimes output extra sequences adjacent to the HSP match. These 
 47  extra sequences are discarded by FastaIO. Only regions containing the actual 
 48  sequence match are extracted. 
 49   
 50  The following object attributes are provided: 
 51   
 52  +-----------------+-------------------------+----------------------------------+ 
 53  | Object          | Attribute               | Value                            | 
 54  +=================+=========================+==================================+ 
 55  | QueryResult     | description             | query sequence description       | 
 56  |                 +-------------------------+----------------------------------+ 
 57  |                 | id                      | query sequence ID                | 
 58  |                 +-------------------------+----------------------------------+ 
 59  |                 | program                 | FASTA flavor                     | 
 60  |                 +-------------------------+----------------------------------+ 
 61  |                 | seq_len                 | full length of query sequence    | 
 62  |                 +-------------------------+----------------------------------+ 
 63  |                 | target                  | target search database           | 
 64  |                 +-------------------------+----------------------------------+ 
 65  |                 | version                 | FASTA version                    | 
 66  +-----------------+-------------------------+----------------------------------+ 
 67  | Hit             | seq_len                 | full length of the hit sequence  | 
 68  +-----------------+-------------------------+----------------------------------+ 
 69  | HSP             | bitscore                | \*_bits line                     | 
 70  |                 +-------------------------+----------------------------------+ 
 71  |                 | evalue                  | \*_expect line                   | 
 72  |                 +-------------------------+----------------------------------+ 
 73  |                 | ident_pct               | \*_ident line                    | 
 74  |                 +-------------------------+----------------------------------+ 
 75  |                 | init1_score             | \*_init1 line                    | 
 76  |                 +-------------------------+----------------------------------+ 
 77  |                 | initn_score             | \*_initn line                    | 
 78  |                 +-------------------------+----------------------------------+ 
 79  |                 | opt_score               | \*_opt line, \*_s-w opt line     | 
 80  |                 +-------------------------+----------------------------------+ 
 81  |                 | pos_pct                 | \*_sim line                      | 
 82  |                 +-------------------------+----------------------------------+ 
 83  |                 | sw_score                | \*_score line                    | 
 84  |                 +-------------------------+----------------------------------+ 
 85  |                 | z_score                 | \*_z-score line                  | 
 86  +-----------------+-------------------------+----------------------------------+ 
 87  | HSPFragment     | aln_annotation          | al_cons block, if present        | 
 88  | (also via HSP)  +-------------------------+----------------------------------+ 
 89  |                 | hit                     | hit sequence                     | 
 90  |                 +-------------------------+----------------------------------+ 
 91  |                 | hit_end                 | hit sequence end coordinate      | 
 92  |                 +-------------------------+----------------------------------+ 
 93  |                 | hit_start               | hit sequence start coordinate    | 
 94  |                 +-------------------------+----------------------------------+ 
 95  |                 | hit_strand              | hit sequence strand              | 
 96  |                 +-------------------------+----------------------------------+ 
 97  |                 | query                   | query sequence                   | 
 98  |                 +-------------------------+----------------------------------+ 
 99  |                 | query_end               | query sequence end coordinate    | 
100  |                 +-------------------------+----------------------------------+ 
101  |                 | query_start             | query sequence start coordinate  | 
102  |                 +-------------------------+----------------------------------+ 
103  |                 | query_strand            | query sequence strand            | 
104  +-----------------+-------------------------+----------------------------------+ 
105   
106  """ 
107   
108  import re 
109   
110  from Bio._py3k import _as_bytes, _bytes_to_string 
111  from Bio.Alphabet import generic_dna, generic_protein 
112  from Bio.File import UndoHandle 
113  from Bio.SearchIO._index import SearchIndexer 
114  from Bio.SearchIO._model import QueryResult, Hit, HSP, HSPFragment 
115   
116   
117  __all__ = ['FastaM10Parser', 'FastaM10Indexer'] 
118   
119  __docformat__ = "restructuredtext en" 
120   
121   
122  # precompile regex patterns 
123  # regex for program name 
124  _RE_FLAVS = re.compile(r't?fast[afmsxy]|pr[sf][sx]|lalign|[gs]?[glso]search') 
125  # regex for sequence ID and length ~ deals with both \n and \r\n 
126  _PTR_ID_DESC_SEQLEN = r'>>>(.+?)\s+(.*?) *- (\d+) (?:aa|nt)\s*$' 
127  _RE_ID_DESC_SEQLEN = re.compile(_PTR_ID_DESC_SEQLEN) 
128  _RE_ID_DESC_SEQLEN_IDX = re.compile(_as_bytes(_PTR_ID_DESC_SEQLEN)) 
129  # regex for qresult, hit, or hsp attribute value 
130  _RE_ATTR = re.compile(r'^; [a-z]+(_[ \w-]+):\s+(.*)$') 
131  # regex for capturing excess start and end sequences in alignments 
132  _RE_START_EXC = re.compile(r'^-*') 
133  _RE_END_EXC = re.compile(r'-*$') 
134   
135  # attribute name mappings 
136  _HSP_ATTR_MAP = { 
137      '_initn': ('initn_score', int), 
138      '_init1': ('init1_score', int), 
139      '_opt': ('opt_score', int), 
140      '_s-w opt': ('opt_score', int), 
141      '_z-score': ('z_score', float), 
142      '_bits': ('bitscore', float), 
143      '_expect': ('evalue', float), 
144      '_score': ('sw_score', int), 
145      '_ident': ('ident_pct', float), 
146      '_sim': ('pos_pct', float), 
147  } 
148   
149  # state flags 
150  _STATE_NONE = 0 
151  _STATE_QUERY_BLOCK = 1 
152  _STATE_HIT_BLOCK = 2 
153  _STATE_CONS_BLOCK = 3 
154   
155   
156 -def _set_qresult_hits(qresult, hit_rows=[]):
157 """Helper function for appending Hits without alignments into QueryResults.""" 158 for hit_row in hit_rows: 159 hit_id, remainder = hit_row.split(' ', 1) 160 # TODO: parse hit and hsp properties properly; by dealing with: 161 # - any character in the description (brackets, spaces, etc.) 162 # - possible [f] or [r] presence (for frame info) 163 # - possible presence of E2() column 164 # - possible incomplete hit_id due to column length limit 165 # The current method only looks at the Hit ID, none of the things above 166 if hit_id not in qresult: 167 frag = HSPFragment(hit_id, qresult.id) 168 hsp = HSP([frag]) 169 hit = Hit([hsp]) 170 qresult.append(hit) 171 172 return qresult
173 174
175 -def _set_hsp_seqs(hsp, parsed, program):
176 """Helper function for the main parsing code. 177 178 :param hsp: HSP whose properties will be set 179 :type hsp: HSP 180 :param parsed: parsed values of the HSP attributes 181 :type parsed: dictionary {string: object} 182 :param program: program name 183 :type program: string 184 185 """ 186 # get aligned sequences and check if they have equal lengths 187 start = 0 188 for seq_type in ('hit', 'query'): 189 if 'tfast' not in program: 190 pseq = parsed[seq_type] 191 # adjust start and end coordinates based on the amount of 192 # filler characters 193 start, stop = _get_aln_slice_coords(pseq) 194 start_adj = len(re.search(_RE_START_EXC, pseq['seq']).group(0)) 195 stop_adj = len(re.search(_RE_END_EXC, pseq['seq']).group(0)) 196 start = start + start_adj 197 stop = stop + start_adj - stop_adj 198 parsed[seq_type]['seq'] = pseq['seq'][start:stop] 199 assert len(parsed['query']['seq']) == len(parsed['hit']['seq']), "%r %r" \ 200 % (len(parsed['query']['seq']), len(parsed['hit']['seq'])) 201 if 'similarity' in hsp.aln_annotation: 202 # only using 'start' since FASTA seems to have trimmed the 'excess' 203 # end part 204 hsp.aln_annotation['similarity'] = hsp.aln_annotation['similarity'][start:] 205 # hit or query works equally well here 206 assert len(hsp.aln_annotation['similarity']) == len(parsed['hit']['seq']) 207 208 # query and hit sequence types must be the same 209 assert parsed['query']['_type'] == parsed['hit']['_type'] 210 type_val = parsed['query']['_type'] # hit works fine too 211 alphabet = generic_dna if type_val == 'D' else generic_protein 212 setattr(hsp.fragment, 'alphabet', alphabet) 213 214 for seq_type in ('hit', 'query'): 215 # get and set start and end coordinates 216 start = int(parsed[seq_type]['_start']) 217 end = int(parsed[seq_type]['_stop']) 218 219 setattr(hsp.fragment, seq_type + '_start', min(start, end) - 1) 220 setattr(hsp.fragment, seq_type + '_end', max(start, end)) 221 # set seq and alphabet 222 setattr(hsp.fragment, seq_type, parsed[seq_type]['seq']) 223 224 if alphabet is not generic_protein: 225 # get strand from coordinate; start <= end is plus 226 # start > end is minus 227 if start <= end: 228 setattr(hsp.fragment, seq_type + '_strand', 1) 229 else: 230 setattr(hsp.fragment, seq_type + '_strand', -1) 231 else: 232 setattr(hsp.fragment, seq_type + '_strand', 0)
233 234
235 -def _get_aln_slice_coords(parsed_hsp):
236 """Helper function for the main parsing code. 237 238 To get the actual pairwise alignment sequences, we must first 239 translate the un-gapped sequence based coordinates into positions 240 in the gapped sequence (which may have a flanking region shown 241 using leading - characters). To date, I have never seen any 242 trailing flanking region shown in the m10 file, but the 243 following code should also cope with that. 244 245 Note that this code seems to work fine even when the "sq_offset" 246 entries are prsent as a result of using the -X command line option. 247 """ 248 seq = parsed_hsp['seq'] 249 seq_stripped = seq.strip('-') 250 disp_start = int(parsed_hsp['_display_start']) 251 start = int(parsed_hsp['_start']) 252 stop = int(parsed_hsp['_stop']) 253 254 if start <= stop: 255 start = start - disp_start 256 stop = stop - disp_start + 1 257 else: 258 start = disp_start - start 259 stop = disp_start - stop + 1 260 stop += seq_stripped.count('-') 261 assert 0 <= start and start < stop and stop <= len(seq_stripped), \ 262 "Problem with sequence start/stop,\n%s[%i:%i]\n%s" \ 263 % (seq, start, stop, parsed_hsp) 264 return start, stop
265 266
267 -class FastaM10Parser(object):
268 """Parser for Bill Pearson's FASTA suite's -m 10 output.""" 269
270 - def __init__(self, handle, __parse_hit_table=False):
271 self.handle = UndoHandle(handle) 272 self._preamble = self._parse_preamble()
273
274 - def __iter__(self):
275 for qresult in self._parse_qresult(): 276 # re-set desc, for hsp query description 277 qresult.description = qresult.description 278 yield qresult
279
280 - def _parse_preamble(self):
281 """Parses the Fasta preamble for Fasta flavor and version.""" 282 preamble = {} 283 while True: 284 self.line = self.handle.readline() 285 # this should be the line just before the first qresult 286 if self.line.startswith('Query'): 287 break 288 # try to match for version line 289 elif self.line.startswith(' version'): 290 preamble['version'] = self.line.split(' ')[2] 291 else: 292 # try to match for flavor line 293 flav_match = re.match(_RE_FLAVS, self.line.lower()) 294 if flav_match: 295 preamble['program'] = flav_match.group(0) 296 297 return preamble
298
299 - def __parse_hit_table(self):
300 """Parses hit table rows.""" 301 # move to the first row 302 self.line = self.handle.readline() 303 # parse hit table until we see an empty line 304 hit_rows = [] 305 while self.line and not self.line.strip(): 306 hit_rows.append(self.line.strip()) 307 self.line = self.handle.readline() 308 return hit_rows
309
310 - def _parse_qresult(self):
311 # initial qresult value 312 qresult = None 313 hit_rows = [] 314 # state values 315 state_QRES_NEW = 1 316 state_QRES_HITTAB = 3 317 state_QRES_CONTENT = 5 318 state_QRES_END = 7 319 320 while True: 321 322 # one line before the hit table 323 if self.line.startswith('The best scores are:'): 324 qres_state = state_QRES_HITTAB 325 # the end of a query or the file altogether 326 elif self.line.strip() == '>>>///' or not self.line: 327 qres_state = state_QRES_END 328 # the beginning of a new query 329 elif not self.line.startswith('>>>') and '>>>' in self.line: 330 qres_state = state_QRES_NEW 331 # the beginning of the query info and its hits + hsps 332 elif self.line.startswith('>>>') and not \ 333 self.line.strip() == '>>><<<': 334 qres_state = state_QRES_CONTENT 335 # default qres mark 336 else: 337 qres_state = None 338 339 if qres_state is not None: 340 if qres_state == state_QRES_HITTAB: 341 # parse hit table if flag is set 342 hit_rows = self.__parse_hit_table() 343 344 elif qres_state == state_QRES_END: 345 yield _set_qresult_hits(qresult, hit_rows) 346 break 347 348 elif qres_state == state_QRES_NEW: 349 # if qresult is filled, yield it first 350 if qresult is not None: 351 yield _set_qresult_hits(qresult, hit_rows) 352 regx = re.search(_RE_ID_DESC_SEQLEN, self.line) 353 query_id = regx.group(1) 354 seq_len = regx.group(3) 355 desc = regx.group(2) 356 qresult = QueryResult(id=query_id) 357 qresult.seq_len = int(seq_len) 358 # get target from the next line 359 self.line = self.handle.readline() 360 qresult.target = [x for x in self.line.split(' ') if x][1].strip() 361 if desc is not None: 362 qresult.description = desc 363 # set values from preamble 364 for key, value in self._preamble.items(): 365 setattr(qresult, key, value) 366 367 elif qres_state == state_QRES_CONTENT: 368 assert self.line[3:].startswith(qresult.id), self.line 369 for hit, strand in self._parse_hit(query_id): 370 # HACK: re-set desc, for hsp hit and query description 371 hit.description = hit.description 372 hit.query_description = qresult.description 373 # if hit is not in qresult, append it 374 if hit.id not in qresult: 375 qresult.append(hit) 376 # otherwise, it might be the same hit with a different strand 377 else: 378 # make sure strand is different and then append hsp to 379 # existing hit 380 for hsp in hit.hsps: 381 assert strand != hsp.query_strand 382 qresult[hit.id].append(hsp) 383 384 self.line = self.handle.readline()
385
386 - def _parse_hit(self, query_id):
387 while True: 388 self.line = self.handle.readline() 389 if self.line.startswith('>>'): 390 break 391 392 state = _STATE_NONE 393 strand = None 394 hsp_list = [] 395 while True: 396 peekline = self.handle.peekline() 397 # yield hit if we've reached the start of a new query or 398 # the end of the search 399 if peekline.strip() in [">>><<<", ">>>///"] or \ 400 (not peekline.startswith('>>>') and '>>>' in peekline): 401 # append last parsed_hsp['hit']['seq'] line 402 if state == _STATE_HIT_BLOCK: 403 parsed_hsp['hit']['seq'] += self.line.strip() 404 elif state == _STATE_CONS_BLOCK: 405 hsp.aln_annotation['similarity'] += \ 406 self.line.strip('\r\n') 407 # process HSP alignment and coordinates 408 _set_hsp_seqs(hsp, parsed_hsp, self._preamble['program']) 409 hit = Hit(hsp_list) 410 hit.description = hit_desc 411 hit.seq_len = seq_len 412 yield hit, strand 413 hsp_list = [] 414 break 415 # yield hit and create a new one if we're still in the same query 416 elif self.line.startswith('>>'): 417 # try yielding, if we have hsps 418 if hsp_list: 419 _set_hsp_seqs(hsp, parsed_hsp, self._preamble['program']) 420 hit = Hit(hsp_list) 421 hit.description = hit_desc 422 hit.seq_len = seq_len 423 yield hit, strand 424 hsp_list = [] 425 # try to get the hit id and desc, and handle cases without descs 426 try: 427 hit_id, hit_desc = self.line[2:].strip().split(' ', 1) 428 except ValueError: 429 hit_id = self.line[2:].strip().split(' ', 1)[0] 430 hit_desc = '' 431 # create the HSP object for Hit 432 frag = HSPFragment(hit_id, query_id) 433 hsp = HSP([frag]) 434 hsp_list.append(hsp) 435 # set or reset the state to none 436 state = _STATE_NONE 437 parsed_hsp = {'query': {}, 'hit': {}} 438 # create and append a new HSP if line starts with '>--' 439 elif self.line.startswith('>--'): 440 # set seq attributes of previous hsp 441 _set_hsp_seqs(hsp, parsed_hsp, self._preamble['program']) 442 # and create a new one 443 frag = HSPFragment(hit_id, query_id) 444 hsp = HSP([frag]) 445 hsp_list.append(hsp) 446 # set the state ~ none yet 447 state = _STATE_NONE 448 parsed_hsp = {'query': {}, 'hit': {}} 449 # this is either query or hit data in the HSP, depending on the state 450 elif self.line.startswith('>'): 451 if state == _STATE_NONE: 452 # make sure it's the correct query 453 assert query_id.startswith(self.line[1:].split(' ')[0]), \ 454 "%r vs %r" % (query_id, self.line) 455 state = _STATE_QUERY_BLOCK 456 parsed_hsp['query']['seq'] = '' 457 elif state == _STATE_QUERY_BLOCK: 458 # make sure it's the correct hit 459 assert hit_id.startswith(self.line[1:].split(' ')[0]) 460 state = _STATE_HIT_BLOCK 461 parsed_hsp['hit']['seq'] = '' 462 # check for conservation block 463 elif self.line.startswith('; al_cons'): 464 state = _STATE_CONS_BLOCK 465 hsp.fragment.aln_annotation['similarity'] = '' 466 elif self.line.startswith(';'): 467 # Fasta outputs do not make a clear distinction between Hit 468 # and HSPs, so we check the attribute names to determine 469 # whether it belongs to a Hit or HSP 470 regx = re.search(_RE_ATTR, self.line.strip()) 471 name = regx.group(1) 472 value = regx.group(2) 473 474 # for values before the '>...' query block 475 if state == _STATE_NONE: 476 if name in _HSP_ATTR_MAP: 477 attr_name, caster = _HSP_ATTR_MAP[name] 478 if caster is not str: 479 value = caster(value) 480 if name in ['_ident', '_sim']: 481 value *= 100 482 setattr(hsp, attr_name, value) 483 # otherwise, pool the values for processing later 484 elif state == _STATE_QUERY_BLOCK: 485 parsed_hsp['query'][name] = value 486 elif state == _STATE_HIT_BLOCK: 487 if name == '_len': 488 seq_len = int(value) 489 else: 490 parsed_hsp['hit'][name] = value 491 # for values in the hit block 492 else: 493 raise ValueError("Unexpected line: %r" % self.line) 494 # otherwise, it must be lines containing the sequences 495 else: 496 assert '>' not in self.line 497 # if we're in hit, parse into hsp.hit 498 if state == _STATE_HIT_BLOCK: 499 parsed_hsp['hit']['seq'] += self.line.strip() 500 elif state == _STATE_QUERY_BLOCK: 501 parsed_hsp['query']['seq'] += self.line.strip() 502 elif state == _STATE_CONS_BLOCK: 503 hsp.fragment.aln_annotation['similarity'] += \ 504 self.line.strip('\r\n') 505 # we should not get here! 506 else: 507 raise ValueError("Unexpected line: %r" % self.line) 508 509 self.line = self.handle.readline()
510 511
512 -class FastaM10Indexer(SearchIndexer):
513 """Indexer class for Bill Pearson's FASTA suite's -m 10 output.""" 514 515 _parser = FastaM10Parser 516
517 - def __init__(self, filename):
518 SearchIndexer.__init__(self, filename) 519 self._handle = UndoHandle(self._handle)
520
521 - def __iter__(self):
522 handle = self._handle 523 handle.seek(0) 524 start_offset = handle.tell() 525 qresult_key = None 526 query_mark = _as_bytes('>>>') 527 528 while True: 529 line = handle.readline() 530 peekline = handle.peekline() 531 end_offset = handle.tell() 532 533 if not line.startswith(query_mark) and query_mark in line: 534 regx = re.search(_RE_ID_DESC_SEQLEN_IDX, line) 535 qresult_key = _bytes_to_string(regx.group(1)) 536 start_offset = end_offset - len(line) 537 # yield whenever we encounter a new query or at the end of the file 538 if qresult_key is not None: 539 if (not peekline.startswith(query_mark) 540 and query_mark in peekline) or not line: 541 yield qresult_key, start_offset, end_offset - start_offset 542 if not line: 543 break 544 start_offset = end_offset
545
546 - def get_raw(self, offset):
547 handle = self._handle 548 qresult_raw = _as_bytes('') 549 query_mark = _as_bytes('>>>') 550 551 # read header first 552 handle.seek(0) 553 while True: 554 line = handle.readline() 555 peekline = handle.peekline() 556 qresult_raw += line 557 if not peekline.startswith(query_mark) and query_mark in peekline: 558 break 559 560 # and read the qresult raw string 561 handle.seek(offset) 562 while True: 563 # preserve whitespace, don't use read_forward 564 line = handle.readline() 565 peekline = handle.peekline() 566 qresult_raw += line 567 568 # break when we've reached qresult end 569 if (not peekline.startswith(query_mark) and query_mark in peekline) or \ 570 not line: 571 break 572 573 # append mock end marker to qresult_raw, since it's not always present 574 return qresult_raw + _as_bytes('>>><<<\n')
575 576 577 # if not used as a module, run the doctest 578 if __name__ == "__main__": 579 from Bio._utils import run_doctest 580 run_doctest() 581