Package Bio :: Package SearchIO :: Module FastaIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.SearchIO.FastaIO

  1  # Adapted from Bio.AlignIO.FastaIO copyright 2008-2011 by Peter Cock. 
  2  # Copyright 2012 by Wibowo Arindrarto. 
  3  # All rights reserved. 
  4  # This code is part of the Biopython distribution and governed by its 
  5  # license.  Please see the LICENSE file that should have been included 
  6  # as part of this package. 
  7   
  8  """Bio.SearchIO support for Bill Pearson's FASTA tools. 
  9   
 10  This module adds support for parsing FASTA outputs. FASTA is a suite of 
 11  programs that finds regions of local or global similarity between protein 
 12  or nucleotide sequences, either by searching databases or identifying 
 13  local duplications. 
 14   
 15  Bio.SearchIO.FastaIO was tested on the following FASTA flavors and versions: 
 16   
 17      - flavors: fasta, ssearch, tfastx 
 18      - versions: 35, 36 
 19   
 20  Other flavors and/or versions may introduce some bugs. Please file a bug report 
 21  if you see such problems to Biopython's bug tracker. 
 22   
 23  More information on FASTA are available through these links: 
 24   
 25      - Website: http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml 
 26      - User guide: http://fasta.bioch.virginia.edu/fasta_www2/fasta_guide.pdf 
 27   
 28   
 29  Supported Formats 
 30  ================= 
 31   
 32  Bio.SearchIO.FastaIO supports parsing and indexing FASTA outputs triggered by 
 33  the -m 10 flag. Other formats that mimic other programs (e.g. the BLAST tabular 
 34  format using the -m 8 flag) may be parseable but using SearchIO's other parsers 
 35  (in this case, using the 'blast-tab' parser). 
 36   
 37   
 38  fasta-m10 
 39  ========= 
 40   
 41  Note that in FASTA -m 10 outputs, HSPs from different strands are considered to 
 42  be from different hits. They are listed as two separate entries in the hit 
 43  table. FastaIO recognizes this and will group HSPs with the same hit ID into a 
 44  single Hit object, regardless of strand. 
 45   
 46  FASTA also sometimes output extra sequences adjacent to the HSP match. These 
 47  extra sequences are discarded by FastaIO. Only regions containing the actual 
 48  sequence match are extracted. 
 49   
 50  The following object attributes are provided: 
 51   
 52  +-----------------+-------------------------+----------------------------------+ 
 53  | Object          | Attribute               | Value                            | 
 54  +=================+=========================+==================================+ 
 55  | QueryResult     | description             | query sequence description       | 
 56  |                 +-------------------------+----------------------------------+ 
 57  |                 | id                      | query sequence ID                | 
 58  |                 +-------------------------+----------------------------------+ 
 59  |                 | program                 | FASTA flavor                     | 
 60  |                 +-------------------------+----------------------------------+ 
 61  |                 | seq_len                 | full length of query sequence    | 
 62  |                 +-------------------------+----------------------------------+ 
 63  |                 | target                  | target search database           | 
 64  |                 +-------------------------+----------------------------------+ 
 65  |                 | version                 | FASTA version                    | 
 66  +-----------------+-------------------------+----------------------------------+ 
 67  | Hit             | seq_len                 | full length of the hit sequence  | 
 68  +-----------------+-------------------------+----------------------------------+ 
 69  | HSP             | bitscore                | \*_bits line                     | 
 70  |                 +-------------------------+----------------------------------+ 
 71  |                 | evalue                  | \*_expect line                   | 
 72  |                 +-------------------------+----------------------------------+ 
 73  |                 | ident_pct               | \*_ident line                    | 
 74  |                 +-------------------------+----------------------------------+ 
 75  |                 | init1_score             | \*_init1 line                    | 
 76  |                 +-------------------------+----------------------------------+ 
 77  |                 | initn_score             | \*_initn line                    | 
 78  |                 +-------------------------+----------------------------------+ 
 79  |                 | opt_score               | \*_opt line, \*_s-w opt line     | 
 80  |                 +-------------------------+----------------------------------+ 
 81  |                 | pos_pct                 | \*_sim line                      | 
 82  |                 +-------------------------+----------------------------------+ 
 83  |                 | sw_score                | \*_score line                    | 
 84  |                 +-------------------------+----------------------------------+ 
 85  |                 | z_score                 | \*_z-score line                  | 
 86  +-----------------+-------------------------+----------------------------------+ 
 87  | HSPFragment     | aln_annotation          | al_cons block, if present        | 
 88  | (also via HSP)  +-------------------------+----------------------------------+ 
 89  |                 | hit                     | hit sequence                     | 
 90  |                 +-------------------------+----------------------------------+ 
 91  |                 | hit_end                 | hit sequence end coordinate      | 
 92  |                 +-------------------------+----------------------------------+ 
 93  |                 | hit_start               | hit sequence start coordinate    | 
 94  |                 +-------------------------+----------------------------------+ 
 95  |                 | hit_strand              | hit sequence strand              | 
 96  |                 +-------------------------+----------------------------------+ 
 97  |                 | query                   | query sequence                   | 
 98  |                 +-------------------------+----------------------------------+ 
 99  |                 | query_end               | query sequence end coordinate    | 
100  |                 +-------------------------+----------------------------------+ 
101  |                 | query_start             | query sequence start coordinate  | 
102  |                 +-------------------------+----------------------------------+ 
103  |                 | query_strand            | query sequence strand            | 
104  +-----------------+-------------------------+----------------------------------+ 
105   
106  """ 
107   
108  import re 
109   
110  from Bio._py3k import _as_bytes, _bytes_to_string 
111  from Bio.Alphabet import generic_dna, generic_protein 
112  from Bio.File import UndoHandle 
113  from Bio.SearchIO._index import SearchIndexer 
114  from Bio.SearchIO._model import QueryResult, Hit, HSP, HSPFragment 
115   
116   
117  __all__ = ['FastaM10Parser', 'FastaM10Indexer'] 
118   
119  __docformat__ = "restructuredtext en" 
120   
121   
122  # precompile regex patterns 
123  # regex for program name 
124  _RE_FLAVS = re.compile(r't?fast[afmsxy]|pr[sf][sx]|lalign|[gs]?[glso]search') 
125  # regex for sequence ID and length ~ deals with both \n and \r\n 
126  _PTR_ID_DESC_SEQLEN = r'>>>(.+?)\s+(.*?) *- (\d+) (?:aa|nt)\s*$' 
127  _RE_ID_DESC_SEQLEN = re.compile(_PTR_ID_DESC_SEQLEN) 
128  _RE_ID_DESC_SEQLEN_IDX = re.compile(_as_bytes(_PTR_ID_DESC_SEQLEN)) 
129  # regex for qresult, hit, or hsp attribute value 
130  _RE_ATTR = re.compile(r'^; [a-z]+(_[ \w-]+):\s+(.*)$') 
131  # regex for capturing excess start and end sequences in alignments 
132  _RE_START_EXC = re.compile(r'^-*') 
133  _RE_END_EXC = re.compile(r'-*$') 
134   
135  # attribute name mappings 
136  _HSP_ATTR_MAP = { 
137      '_initn': ('initn_score', int), 
138      '_init1': ('init1_score', int), 
139      '_opt': ('opt_score', int), 
140      '_s-w opt': ('opt_score', int), 
141      '_z-score': ('z_score', float), 
142      '_bits': ('bitscore', float), 
143      '_expect': ('evalue', float), 
144      '_score': ('sw_score', int), 
145      '_ident': ('ident_pct', float), 
146      '_sim': ('pos_pct', float), 
147  } 
148   
149  # state flags 
150  _STATE_NONE = 0 
151  _STATE_QUERY_BLOCK = 1 
152  _STATE_HIT_BLOCK = 2 
153  _STATE_CONS_BLOCK = 3 
154   
155   
156 -def _set_qresult_hits(qresult, hit_rows=[]):
157 """Helper function for appending Hits without alignments into QueryResults.""" 158 for hit_row in hit_rows: 159 hit_id, remainder = hit_row.split(' ', 1) 160 # TODO: parse hit and hsp properties properly; by dealing with: 161 # - any character in the description (brackets, spaces, etc.) 162 # - possible [f] or [r] presence (for frame info) 163 # - possible presence of E2() column 164 # - possible incomplete hit_id due to column length limit 165 # The current method only looks at the Hit ID, none of the things above 166 if hit_id not in qresult: 167 frag = HSPFragment(hit_id, qresult.id) 168 hsp = HSP([frag]) 169 hit = Hit([hsp]) 170 qresult.append(hit) 171 172 return qresult
173 174
175 -def _set_hsp_seqs(hsp, parsed, program):
176 """Helper function for the main parsing code. 177 178 :param hsp: HSP whose properties will be set 179 :type hsp: HSP 180 :param parsed: parsed values of the HSP attributes 181 :type parsed: dictionary {string: object} 182 :param program: program name 183 :type program: string 184 185 """ 186 # get aligned sequences and check if they have equal lengths 187 start = 0 188 for seq_type in ('hit', 'query'): 189 if 'tfast' not in program: 190 pseq = parsed[seq_type] 191 # adjust start and end coordinates based on the amount of 192 # filler characters 193 start, stop = _get_aln_slice_coords(pseq) 194 start_adj = len(re.search(_RE_START_EXC, pseq['seq']).group(0)) 195 stop_adj = len(re.search(_RE_END_EXC, pseq['seq']).group(0)) 196 start = start + start_adj 197 stop = stop + start_adj - stop_adj 198 parsed[seq_type]['seq'] = pseq['seq'][start:stop] 199 assert len(parsed['query']['seq']) == len(parsed['hit']['seq']), "%r %r" \ 200 % (len(parsed['query']['seq']), len(parsed['hit']['seq'])) 201 if 'similarity' in hsp.aln_annotation: 202 # only using 'start' since FASTA seems to have trimmed the 'excess' 203 # end part 204 hsp.aln_annotation['similarity'] = hsp.aln_annotation['similarity'][start:] 205 # hit or query works equally well here 206 assert len(hsp.aln_annotation['similarity']) == len(parsed['hit']['seq']) 207 208 # query and hit sequence types must be the same 209 assert parsed['query']['_type'] == parsed['hit']['_type'] 210 type_val = parsed['query']['_type'] # hit works fine too 211 alphabet = generic_dna if type_val == 'D' else generic_protein 212 setattr(hsp.fragment, 'alphabet', alphabet) 213 214 for seq_type in ('hit', 'query'): 215 # get and set start and end coordinates 216 start = int(parsed[seq_type]['_start']) 217 end = int(parsed[seq_type]['_stop']) 218 219 setattr(hsp.fragment, seq_type + '_start', min(start, end) - 1) 220 setattr(hsp.fragment, seq_type + '_end', max(start, end)) 221 # set seq and alphabet 222 setattr(hsp.fragment, seq_type, parsed[seq_type]['seq']) 223 224 if alphabet is not generic_protein: 225 # get strand from coordinate; start <= end is plus 226 # start > end is minus 227 if start <= end: 228 setattr(hsp.fragment, seq_type + '_strand', 1) 229 else: 230 setattr(hsp.fragment, seq_type + '_strand', -1) 231 else: 232 setattr(hsp.fragment, seq_type + '_strand', 0)
233 234
235 -def _get_aln_slice_coords(parsed_hsp):
236 """Helper function for the main parsing code. 237 238 To get the actual pairwise alignment sequences, we must first 239 translate the un-gapped sequence based coordinates into positions 240 in the gapped sequence (which may have a flanking region shown 241 using leading - characters). To date, I have never seen any 242 trailing flanking region shown in the m10 file, but the 243 following code should also cope with that. 244 245 Note that this code seems to work fine even when the "sq_offset" 246 entries are prsent as a result of using the -X command line option. 247 """ 248 seq = parsed_hsp['seq'] 249 seq_stripped = seq.strip('-') 250 disp_start = int(parsed_hsp['_display_start']) 251 start = int(parsed_hsp['_start']) 252 stop = int(parsed_hsp['_stop']) 253 254 if start <= stop: 255 start = start - disp_start 256 stop = stop - disp_start + 1 257 else: 258 start = disp_start - start 259 stop = disp_start - stop + 1 260 stop += seq_stripped.count('-') 261 assert 0 <= start and start < stop and stop <= len(seq_stripped), \ 262 "Problem with sequence start/stop,\n%s[%i:%i]\n%s" \ 263 % (seq, start, stop, parsed_hsp) 264 return start, stop
265 266
267 -class FastaM10Parser(object):
268 """Parser for Bill Pearson's FASTA suite's -m 10 output.""" 269
270 - def __init__(self, handle, __parse_hit_table=False):
271 self.handle = UndoHandle(handle) 272 self._preamble = self._parse_preamble()
273
274 - def __iter__(self):
275 for qresult in self._parse_qresult(): 276 # re-set desc, for hsp query description 277 qresult.description = qresult.description 278 yield qresult
279
280 - def _parse_preamble(self):
281 """Parses the Fasta preamble for Fasta flavor and version.""" 282 preamble = {} 283 while True: 284 self.line = self.handle.readline() 285 # this should be the line just before the first qresult 286 if self.line.startswith('Query'): 287 break 288 # try to match for version line 289 elif self.line.startswith(' version'): 290 preamble['version'] = self.line.split(' ')[2] 291 else: 292 # try to match for flavor line 293 flav_match = re.match(_RE_FLAVS, self.line.lower()) 294 if flav_match: 295 preamble['program'] = flav_match.group(0) 296 297 return preamble
298
299 - def __parse_hit_table(self):
300 """Parses hit table rows.""" 301 # move to the first row 302 self.line = self.handle.readline() 303 # parse hit table until we see an empty line 304 hit_rows = [] 305 while self.line and not self.line.strip(): 306 hit_rows.append(self.line.strip()) 307 self.line = self.handle.readline() 308 return hit_rows
309
310 - def _parse_qresult(self):
311 # initial qresult value 312 qresult = None 313 hit_rows = [] 314 # state values 315 state_QRES_NEW = 1 316 state_QRES_HITTAB = 3 317 state_QRES_CONTENT = 5 318 state_QRES_END = 7 319 320 while True: 321 322 # one line before the hit table 323 if self.line.startswith('The best scores are:'): 324 qres_state = state_QRES_HITTAB 325 # the end of a query or the file altogether 326 elif self.line.strip() == '>>>///' or not self.line: 327 qres_state = state_QRES_END 328 # the beginning of a new query 329 elif not self.line.startswith('>>>') and '>>>' in self.line: 330 qres_state = state_QRES_NEW 331 # the beginning of the query info and its hits + hsps 332 elif self.line.startswith('>>>') and not \ 333 self.line.strip() == '>>><<<': 334 qres_state = state_QRES_CONTENT 335 # default qres mark 336 else: 337 qres_state = None 338 339 if qres_state is not None: 340 if qres_state == state_QRES_HITTAB: 341 # parse hit table if flag is set 342 hit_rows = self.__parse_hit_table() 343 344 elif qres_state == state_QRES_END: 345 yield _set_qresult_hits(qresult, hit_rows) 346 break 347 348 elif qres_state == state_QRES_NEW: 349 # if qresult is filled, yield it first 350 if qresult is not None: 351 yield _set_qresult_hits(qresult, hit_rows) 352 regx = re.search(_RE_ID_DESC_SEQLEN, self.line) 353 query_id = regx.group(1) 354 seq_len = regx.group(3) 355 desc = regx.group(2) 356 qresult = QueryResult(id=query_id) 357 qresult.seq_len = int(seq_len) 358 # get target from the next line 359 self.line = self.handle.readline() 360 qresult.target = [x for x in self.line.split(' ') if x][1].strip() 361 if desc is not None: 362 qresult.description = desc 363 # set values from preamble 364 for key, value in self._preamble.items(): 365 setattr(qresult, key, value) 366 367 elif qres_state == state_QRES_CONTENT: 368 assert self.line[3:].startswith(qresult.id), self.line 369 for hit, strand in self._parse_hit(query_id): 370 # HACK: re-set desc, for hsp hit and query description 371 hit.description = hit.description 372 hit.query_description = qresult.description 373 # if hit is not in qresult, append it 374 if hit.id not in qresult: 375 qresult.append(hit) 376 # otherwise, it might be the same hit with a different strand 377 else: 378 # make sure strand is different and then append hsp to 379 # existing hit 380 for hsp in hit.hsps: 381 assert strand != hsp.query_strand 382 qresult[hit.id].append(hsp) 383 384 self.line = self.handle.readline()
385
386 - def _parse_hit(self, query_id):
387 while True: 388 self.line = self.handle.readline() 389 if self.line.startswith('>>'): 390 break 391 392 strand = None 393 hsp_list = [] 394 while True: 395 peekline = self.handle.peekline() 396 # yield hit if we've reached the start of a new query or 397 # the end of the search 398 if peekline.strip() in [">>><<<", ">>>///"] or \ 399 (not peekline.startswith('>>>') and '>>>' in peekline): 400 # append last parsed_hsp['hit']['seq'] line 401 if state == _STATE_HIT_BLOCK: 402 parsed_hsp['hit']['seq'] += self.line.strip() 403 elif state == _STATE_CONS_BLOCK: 404 hsp.aln_annotation['similarity'] += \ 405 self.line.strip('\r\n') 406 # process HSP alignment and coordinates 407 _set_hsp_seqs(hsp, parsed_hsp, self._preamble['program']) 408 hit = Hit(hsp_list) 409 hit.description = hit_desc 410 hit.seq_len = seq_len 411 yield hit, strand 412 hsp_list = [] 413 break 414 # yield hit and create a new one if we're still in the same query 415 elif self.line.startswith('>>'): 416 # try yielding, if we have hsps 417 if hsp_list: 418 _set_hsp_seqs(hsp, parsed_hsp, self._preamble['program']) 419 hit = Hit(hsp_list) 420 hit.description = hit_desc 421 hit.seq_len = seq_len 422 yield hit, strand 423 hsp_list = [] 424 # try to get the hit id and desc, and handle cases without descs 425 try: 426 hit_id, hit_desc = self.line[2:].strip().split(' ', 1) 427 except ValueError: 428 hit_id = self.line[2:].strip().split(' ', 1)[0] 429 hit_desc = '' 430 # create the HSP object for Hit 431 frag = HSPFragment(hit_id, query_id) 432 hsp = HSP([frag]) 433 hsp_list.append(hsp) 434 # set or reset the state to none 435 state = _STATE_NONE 436 parsed_hsp = {'query': {}, 'hit': {}} 437 # create and append a new HSP if line starts with '>--' 438 elif self.line.startswith('>--'): 439 # set seq attributes of previous hsp 440 _set_hsp_seqs(hsp, parsed_hsp, self._preamble['program']) 441 # and create a new one 442 frag = HSPFragment(hit_id, query_id) 443 hsp = HSP([frag]) 444 hsp_list.append(hsp) 445 # set the state ~ none yet 446 state = _STATE_NONE 447 parsed_hsp = {'query': {}, 'hit': {}} 448 # this is either query or hit data in the HSP, depending on the state 449 elif self.line.startswith('>'): 450 if state == _STATE_NONE: 451 # make sure it's the correct query 452 assert query_id.startswith(self.line[1:].split(' ')[0]), \ 453 "%r vs %r" % (query_id, self.line) 454 state = _STATE_QUERY_BLOCK 455 parsed_hsp['query']['seq'] = '' 456 elif state == _STATE_QUERY_BLOCK: 457 # make sure it's the correct hit 458 assert hit_id.startswith(self.line[1:].split(' ')[0]) 459 state = _STATE_HIT_BLOCK 460 parsed_hsp['hit']['seq'] = '' 461 # check for conservation block 462 elif self.line.startswith('; al_cons'): 463 state = _STATE_CONS_BLOCK 464 hsp.fragment.aln_annotation['similarity'] = '' 465 elif self.line.startswith(';'): 466 # Fasta outputs do not make a clear distinction between Hit 467 # and HSPs, so we check the attribute names to determine 468 # whether it belongs to a Hit or HSP 469 regx = re.search(_RE_ATTR, self.line.strip()) 470 name = regx.group(1) 471 value = regx.group(2) 472 473 # for values before the '>...' query block 474 if state == _STATE_NONE: 475 if name in _HSP_ATTR_MAP: 476 attr_name, caster = _HSP_ATTR_MAP[name] 477 if caster is not str: 478 value = caster(value) 479 if name in ['_ident', '_sim']: 480 value *= 100 481 setattr(hsp, attr_name, value) 482 # otherwise, pool the values for processing later 483 elif state == _STATE_QUERY_BLOCK: 484 parsed_hsp['query'][name] = value 485 elif state == _STATE_HIT_BLOCK: 486 if name == '_len': 487 seq_len = int(value) 488 else: 489 parsed_hsp['hit'][name] = value 490 # for values in the hit block 491 else: 492 raise ValueError("Unexpected line: %r" % self.line) 493 # otherwise, it must be lines containing the sequences 494 else: 495 assert '>' not in self.line 496 # if we're in hit, parse into hsp.hit 497 if state == _STATE_HIT_BLOCK: 498 parsed_hsp['hit']['seq'] += self.line.strip() 499 elif state == _STATE_QUERY_BLOCK: 500 parsed_hsp['query']['seq'] += self.line.strip() 501 elif state == _STATE_CONS_BLOCK: 502 hsp.fragment.aln_annotation['similarity'] += \ 503 self.line.strip('\r\n') 504 # we should not get here! 505 else: 506 raise ValueError("Unexpected line: %r" % self.line) 507 508 self.line = self.handle.readline()
509 510
511 -class FastaM10Indexer(SearchIndexer):
512 """Indexer class for Bill Pearson's FASTA suite's -m 10 output.""" 513 514 _parser = FastaM10Parser 515
516 - def __init__(self, filename):
517 SearchIndexer.__init__(self, filename) 518 self._handle = UndoHandle(self._handle)
519
520 - def __iter__(self):
521 handle = self._handle 522 handle.seek(0) 523 start_offset = handle.tell() 524 qresult_key = None 525 query_mark = _as_bytes('>>>') 526 527 while True: 528 line = handle.readline() 529 peekline = handle.peekline() 530 end_offset = handle.tell() 531 532 if not line.startswith(query_mark) and query_mark in line: 533 regx = re.search(_RE_ID_DESC_SEQLEN_IDX, line) 534 qresult_key = _bytes_to_string(regx.group(1)) 535 start_offset = end_offset - len(line) 536 # yield whenever we encounter a new query or at the end of the file 537 if qresult_key is not None: 538 if (not peekline.startswith(query_mark) 539 and query_mark in peekline) or not line: 540 yield qresult_key, start_offset, end_offset - start_offset 541 if not line: 542 break 543 start_offset = end_offset
544
545 - def get_raw(self, offset):
546 handle = self._handle 547 qresult_raw = _as_bytes('') 548 query_mark = _as_bytes('>>>') 549 550 # read header first 551 handle.seek(0) 552 while True: 553 line = handle.readline() 554 peekline = handle.peekline() 555 qresult_raw += line 556 if not peekline.startswith(query_mark) and query_mark in peekline: 557 break 558 559 # and read the qresult raw string 560 handle.seek(offset) 561 while True: 562 # preserve whitespace, don't use read_forward 563 line = handle.readline() 564 peekline = handle.peekline() 565 qresult_raw += line 566 567 # break when we've reached qresult end 568 if (not peekline.startswith(query_mark) and query_mark in peekline) or \ 569 not line: 570 break 571 572 # append mock end marker to qresult_raw, since it's not always present 573 return qresult_raw + _as_bytes('>>><<<\n')
574 575 576 # if not used as a module, run the doctest 577 if __name__ == "__main__": 578 from Bio._utils import run_doctest 579 run_doctest() 580