Package Bio :: Package AlignIO
[hide private]
[frames] | no frames]

Source Code for Package Bio.AlignIO

  1  # Copyright 2008-2010 by Peter Cock.  All rights reserved. 
  2  # This code is part of the Biopython distribution and governed by its 
  3  # license.  Please see the LICENSE file that should have been included 
  4  # as part of this package. 
  5   
  6  """Multiple sequence alignment input/output as alignment objects. 
  7   
  8  The Bio.AlignIO interface is deliberately very similar to Bio.SeqIO, and in 
  9  fact the two are connected internally.  Both modules use the same set of file 
 10  format names (lower case strings).  From the user's perspective, you can read 
 11  in a PHYLIP file containing one or more alignments using Bio.AlignIO, or you 
 12  can read in the sequences within these alignmenta using Bio.SeqIO. 
 13   
 14  Bio.AlignIO is also documented at U{http://biopython.org/wiki/AlignIO} and by 
 15  a whole chapter in our tutorial: 
 16   - U{http://biopython.org/DIST/docs/tutorial/Tutorial.html} 
 17   - U{http://biopython.org/DIST/docs/tutorial/Tutorial.pdf} 
 18   
 19  Input 
 20  ===== 
 21  For the typical special case when your file or handle contains one and only 
 22  one alignment, use the function Bio.AlignIO.read().  This takes an input file 
 23  handle (or in recent versions of Biopython a filename as a string), format 
 24  string and optional number of sequences per alignment.  It will return a single 
 25  MultipleSeqAlignment object (or raise an exception if there isn't just one 
 26  alignment): 
 27   
 28      >>> from Bio import AlignIO 
 29      >>> align = AlignIO.read("Phylip/interlaced.phy", "phylip") 
 30      >>> print(align) 
 31      SingleLetterAlphabet() alignment with 3 rows and 384 columns 
 32      -----MKVILLFVLAVFTVFVSS---------------RGIPPE...I-- CYS1_DICDI 
 33      MAHARVLLLALAVLATAAVAVASSSSFADSNPIRPVTDRAASTL...VAA ALEU_HORVU 
 34      ------MWATLPLLCAGAWLLGV--------PVCGAAELSVNSL...PLV CATH_HUMAN 
 35   
 36  For the general case, when the handle could contain any number of alignments, 
 37  use the function Bio.AlignIO.parse(...) which takes the same arguments, but 
 38  returns an iterator giving MultipleSeqAlignment objects (typically used in a 
 39  for loop). If you want random access to the alignments by number, turn this 
 40  into a list: 
 41   
 42      >>> from Bio import AlignIO 
 43      >>> alignments = list(AlignIO.parse("Emboss/needle.txt", "emboss")) 
 44      >>> print(alignments[2]) 
 45      SingleLetterAlphabet() alignment with 2 rows and 120 columns 
 46      -KILIVDDQYGIRILLNEVFNKEGYQTFQAANGLQALDIVTKER...--- ref_rec 
 47      LHIVVVDDDPGTCVYIESVFAELGHTCKSFVRPEAAEEYILTHP...HKE gi|94967506|receiver 
 48   
 49  Most alignment file formats can be concatenated so as to hold as many 
 50  different multiple sequence alignments as possible.  One common example 
 51  is the output of the tool seqboot in the PHLYIP suite.  Sometimes there 
 52  can be a file header and footer, as seen in the EMBOSS alignment output. 
 53   
 54  Output 
 55  ====== 
 56  Use the function Bio.AlignIO.write(...), which takes a complete set of 
 57  Alignment objects (either as a list, or an iterator), an output file handle 
 58  (or filename in recent versions of Biopython) and of course the file format:: 
 59   
 60      from Bio import AlignIO 
 61      alignments = ... 
 62      count = SeqIO.write(alignments, "example.faa", "fasta") 
 63   
 64  If using a handle make sure to close it to flush the data to the disk:: 
 65   
 66      from Bio import AlignIO 
 67      alignments = ... 
 68      with open("example.faa", "w") as handle: 
 69          count = SeqIO.write(alignments, handle, "fasta") 
 70   
 71  In general, you are expected to call this function once (with all your 
 72  alignments) and then close the file handle.  However, for file formats 
 73  like PHYLIP where multiple alignments are stored sequentially (with no file 
 74  header and footer), then multiple calls to the write function should work as 
 75  expected when using handles. 
 76   
 77  If you are using a filename, the repeated calls to the write functions will 
 78  overwrite the existing file each time. 
 79   
 80  Conversion 
 81  ========== 
 82  The Bio.AlignIO.convert(...) function allows an easy interface for simple 
 83  alignnment file format conversions. Additionally, it may use file format 
 84  specific optimisations so this should be the fastest way too. 
 85   
 86  In general however, you can combine the Bio.AlignIO.parse(...) function with 
 87  the Bio.AlignIO.write(...) function for sequence file conversion. Using 
 88  generator expressions provides a memory efficient way to perform filtering or 
 89  other extra operations as part of the process. 
 90   
 91  File Formats 
 92  ============ 
 93  When specifying the file format, use lowercase strings.  The same format 
 94  names are also used in Bio.SeqIO and include the following: 
 95   
 96   - clustal   - Output from Clustal W or X, see also the module Bio.Clustalw 
 97                 which can be used to run the command line tool from Biopython. 
 98   - emboss    - EMBOSS tools' "pairs" and "simple" alignment formats. 
 99   - fasta     - The generic sequence file format where each record starts with 
100                 an identifer line starting with a ">" character, followed by 
101                 lines of sequence. 
102   - fasta-m10 - For the pairswise alignments output by Bill Pearson's FASTA 
103                 tools when used with the -m 10 command line option for machine 
104                 readable output. 
105   - ig        - The IntelliGenetics file format, apparently the same as the 
106                 MASE alignment format. 
107   - nexus     - Output from NEXUS, see also the module Bio.Nexus which can also 
108                 read any phylogenetic trees in these files. 
109   - phylip    - Interlaced PHYLIP, as used by the PHLIP tools. 
110   - phylip-sequential - Sequential PHYLIP. 
111   - phylip-relaxed - PHYLIP like format allowing longer names. 
112   - stockholm - A richly annotated alignment file format used by PFAM. 
113   
114  Note that while Bio.AlignIO can read all the above file formats, it cannot 
115  write to all of them. 
116   
117  You can also use any file format supported by Bio.SeqIO, such as "fasta" or 
118  "ig" (which are listed above), PROVIDED the sequences in your file are all the 
119  same length. 
120  """ 
121   
122   
123  from __future__ import print_function 
124  from Bio._py3k import basestring 
125   
126  __docformat__ = "epytext en"  # not just plaintext 
127   
128  #TODO 
129  # - define policy on reading aligned sequences with gaps in 
130  #   (e.g. - and . characters) including how the alphabet interacts 
131  # 
132  # - Can we build the to_alignment(...) functionality 
133  #   into the generic Alignment class instead? 
134  # 
135  # - How best to handle unique/non unique record.id when writing. 
136  #   For most file formats reading such files is fine; The stockholm 
137  #   parser would fail. 
138  # 
139  # - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf) 
140  #   http://www.bioperl.org/wiki/MSF_multiple_alignment_format 
141   
142  from Bio.Align import MultipleSeqAlignment 
143  from Bio.Align.Generic import Alignment 
144  from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet 
145  from Bio.File import as_handle 
146   
147  from . import StockholmIO 
148  from . import ClustalIO 
149  from . import NexusIO 
150  from . import PhylipIO 
151  from . import EmbossIO 
152  from . import FastaIO 
153   
154  #Convention for format names is "mainname-subtype" in lower case. 
155  #Please use the same names as BioPerl and EMBOSS where possible. 
156   
157  _FormatToIterator = {  # "fasta" is done via Bio.SeqIO 
158                       "clustal": ClustalIO.ClustalIterator, 
159                       "emboss": EmbossIO.EmbossIterator, 
160                       "fasta-m10": FastaIO.FastaM10Iterator, 
161                       "nexus": NexusIO.NexusIterator, 
162                       "phylip": PhylipIO.PhylipIterator, 
163                       "phylip-sequential": PhylipIO.SequentialPhylipIterator, 
164                       "phylip-relaxed": PhylipIO.RelaxedPhylipIterator, 
165                       "stockholm": StockholmIO.StockholmIterator, 
166                       } 
167   
168  _FormatToWriter = {  # "fasta" is done via Bio.SeqIO 
169                       # "emboss" : EmbossIO.EmbossWriter, (unfinished) 
170                     "nexus": NexusIO.NexusWriter, 
171                     "phylip": PhylipIO.PhylipWriter, 
172                     "phylip-sequential": PhylipIO.SequentialPhylipWriter, 
173                     "phylip-relaxed": PhylipIO.RelaxedPhylipWriter, 
174                     "stockholm": StockholmIO.StockholmWriter, 
175                     "clustal": ClustalIO.ClustalWriter, 
176                     } 
177   
178   
179 -def write(alignments, handle, format):
180 """Write complete set of alignments to a file. 181 182 Arguments: 183 - alignments - A list (or iterator) of Alignment objects (ideally the 184 new MultipleSeqAlignment objects), or (if using Biopython 185 1.54 or later) a single alignment object. 186 - handle - File handle object to write to, or filename as string 187 (note older versions of Biopython only took a handle). 188 - format - lower case string describing the file format to write. 189 190 You should close the handle after calling this function. 191 192 Returns the number of alignments written (as an integer). 193 """ 194 from Bio import SeqIO 195 196 #Try and give helpful error messages: 197 if not isinstance(format, basestring): 198 raise TypeError("Need a string for the file format (lower case)") 199 if not format: 200 raise ValueError("Format required (lower case string)") 201 if format != format.lower(): 202 raise ValueError("Format string '%s' should be lower case" % format) 203 204 if isinstance(alignments, Alignment): 205 #This raised an exception in older versions of Biopython 206 alignments = [alignments] 207 208 with as_handle(handle, 'w') as fp: 209 #Map the file format to a writer class 210 if format in _FormatToWriter: 211 writer_class = _FormatToWriter[format] 212 count = writer_class(fp).write_file(alignments) 213 elif format in SeqIO._FormatToWriter: 214 #Exploit the existing SeqIO parser to do the dirty work! 215 #TODO - Can we make one call to SeqIO.write() and count the alignments? 216 count = 0 217 for alignment in alignments: 218 if not isinstance(alignment, Alignment): 219 raise TypeError( 220 "Expect a list or iterator of Alignment objects.") 221 SeqIO.write(alignment, fp, format) 222 count += 1 223 elif format in _FormatToIterator or format in SeqIO._FormatToIterator: 224 raise ValueError("Reading format '%s' is supported, but not writing" 225 % format) 226 else: 227 raise ValueError("Unknown format '%s'" % format) 228 229 assert isinstance(count, int), "Internal error - the underlying %s " \ 230 "writer should have returned the alignment count, not %s" \ 231 % (format, repr(count)) 232 233 return count
234 235 236 #This is a generator function!
237 -def _SeqIO_to_alignment_iterator(handle, format, alphabet=None, seq_count=None):
238 """Uses Bio.SeqIO to create an MultipleSeqAlignment iterator (PRIVATE). 239 240 Arguments: 241 - handle - handle to the file. 242 - format - string describing the file format. 243 - alphabet - optional Alphabet object, useful when the sequence type 244 cannot be automatically inferred from the file itself 245 (e.g. fasta, phylip, clustal) 246 - seq_count - Optional integer, number of sequences expected in each 247 alignment. Recommended for fasta format files. 248 249 If count is omitted (default) then all the sequences in the file are 250 combined into a single MultipleSeqAlignment. 251 """ 252 from Bio import SeqIO 253 assert format in SeqIO._FormatToIterator 254 255 if seq_count: 256 #Use the count to split the records into batches. 257 seq_record_iterator = SeqIO.parse(handle, format, alphabet) 258 259 records = [] 260 for record in seq_record_iterator: 261 records.append(record) 262 if len(records) == seq_count: 263 yield MultipleSeqAlignment(records, alphabet) 264 records = [] 265 if len(records) > 0: 266 raise ValueError("Check seq_count argument, not enough sequences?") 267 else: 268 #Must assume that there is a single alignment using all 269 #the SeqRecord objects: 270 records = list(SeqIO.parse(handle, format, alphabet)) 271 if records: 272 yield MultipleSeqAlignment(records, alphabet) 273 raise StopIteration
274 275
276 -def _force_alphabet(alignment_iterator, alphabet):
277 """Iterate over alignments, over-riding the alphabet (PRIVATE).""" 278 #Assume the alphabet argument has been pre-validated 279 given_base_class = _get_base_alphabet(alphabet).__class__ 280 for align in alignment_iterator: 281 if not isinstance(_get_base_alphabet(align._alphabet), 282 given_base_class): 283 raise ValueError("Specified alphabet %s clashes with " 284 "that determined from the file, %s" 285 % (repr(alphabet), repr(align._alphabet))) 286 for record in align: 287 if not isinstance(_get_base_alphabet(record.seq.alphabet), 288 given_base_class): 289 raise ValueError("Specified alphabet %s clashes with " 290 "that determined from the file, %s" 291 % (repr(alphabet), repr(record.seq.alphabet))) 292 record.seq.alphabet = alphabet 293 align._alphabet = alphabet 294 yield align
295 296
297 -def parse(handle, format, seq_count=None, alphabet=None):
298 """Iterate over an alignment file as MultipleSeqAlignment objects. 299 300 Arguments: 301 - handle - handle to the file, or the filename as a string 302 (note older versions of Biopython only took a handle). 303 - format - string describing the file format. 304 - alphabet - optional Alphabet object, useful when the sequence type 305 cannot be automatically inferred from the file itself 306 (e.g. fasta, phylip, clustal) 307 - seq_count - Optional integer, number of sequences expected in each 308 alignment. Recommended for fasta format files. 309 310 If you have the file name in a string 'filename', use: 311 312 >>> from Bio import AlignIO 313 >>> filename = "Emboss/needle.txt" 314 >>> format = "emboss" 315 >>> for alignment in AlignIO.parse(filename, format): 316 ... print("Alignment of length %i" % alignment.get_alignment_length()) 317 Alignment of length 124 318 Alignment of length 119 319 Alignment of length 120 320 Alignment of length 118 321 Alignment of length 125 322 323 If you have a string 'data' containing the file contents, use: 324 325 from Bio import AlignIO 326 from StringIO import StringIO 327 my_iterator = AlignIO.parse(StringIO(data), format) 328 329 Use the Bio.AlignIO.read() function when you expect a single record only. 330 """ 331 from Bio import SeqIO 332 333 #Try and give helpful error messages: 334 if not isinstance(format, basestring): 335 raise TypeError("Need a string for the file format (lower case)") 336 if not format: 337 raise ValueError("Format required (lower case string)") 338 if format != format.lower(): 339 raise ValueError("Format string '%s' should be lower case" % format) 340 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 341 isinstance(alphabet, AlphabetEncoder)): 342 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 343 if seq_count is not None and not isinstance(seq_count, int): 344 raise TypeError("Need integer for seq_count (sequences per alignment)") 345 346 with as_handle(handle, 'rU') as fp: 347 #Map the file format to a sequence iterator: 348 if format in _FormatToIterator: 349 iterator_generator = _FormatToIterator[format] 350 if alphabet is None: 351 i = iterator_generator(fp, seq_count) 352 else: 353 try: 354 #Initially assume the optional alphabet argument is supported 355 i = iterator_generator(fp, seq_count, alphabet=alphabet) 356 except TypeError: 357 #It isn't supported. 358 i = _force_alphabet(iterator_generator(fp, seq_count), 359 alphabet) 360 361 elif format in SeqIO._FormatToIterator: 362 #Exploit the existing SeqIO parser to the dirty work! 363 i = _SeqIO_to_alignment_iterator(fp, format, 364 alphabet=alphabet, 365 seq_count=seq_count) 366 else: 367 raise ValueError("Unknown format '%s'" % format) 368 369 #This imposes some overhead... wait until we drop Python 2.4 to fix it 370 for a in i: 371 yield a
372 373
374 -def read(handle, format, seq_count=None, alphabet=None):
375 """Turns an alignment file into a single MultipleSeqAlignment object. 376 377 Arguments: 378 - handle - handle to the file, or the filename as a string 379 (note older versions of Biopython only took a handle). 380 - format - string describing the file format. 381 - alphabet - optional Alphabet object, useful when the sequence type 382 cannot be automatically inferred from the file itself 383 (e.g. fasta, phylip, clustal) 384 - seq_count - Optional integer, number of sequences expected in each 385 alignment. Recommended for fasta format files. 386 387 If the handle contains no alignments, or more than one alignment, 388 an exception is raised. For example, using a PFAM/Stockholm file 389 containing one alignment: 390 391 >>> from Bio import AlignIO 392 >>> filename = "Clustalw/protein.aln" 393 >>> format = "clustal" 394 >>> alignment = AlignIO.read(filename, format) 395 >>> print("Alignment of length %i" % alignment.get_alignment_length()) 396 Alignment of length 411 397 398 If however you want the first alignment from a file containing 399 multiple alignments this function would raise an exception. 400 401 >>> from Bio import AlignIO 402 >>> filename = "Emboss/needle.txt" 403 >>> format = "emboss" 404 >>> alignment = AlignIO.read(filename, format) 405 Traceback (most recent call last): 406 ... 407 ValueError: More than one record found in handle 408 409 Instead use: 410 411 >>> from Bio import AlignIO 412 >>> filename = "Emboss/needle.txt" 413 >>> format = "emboss" 414 >>> alignment = next(AlignIO.parse(filename, format)) 415 >>> print("First alignment has length %i" % alignment.get_alignment_length()) 416 First alignment has length 124 417 418 You must use the Bio.AlignIO.parse() function if you want to read multiple 419 records from the handle. 420 """ 421 iterator = parse(handle, format, seq_count, alphabet) 422 try: 423 first = next(iterator) 424 except StopIteration: 425 first = None 426 if first is None: 427 raise ValueError("No records found in handle") 428 try: 429 second = next(iterator) 430 except StopIteration: 431 second = None 432 if second is not None: 433 raise ValueError("More than one record found in handle") 434 if seq_count: 435 assert len(first) == seq_count 436 return first
437 438
439 -def convert(in_file, in_format, out_file, out_format, alphabet=None):
440 """Convert between two alignment files, returns number of alignments. 441 442 - in_file - an input handle or filename 443 - in_format - input file format, lower case string 444 - output - an output handle or filename 445 - out_file - output file format, lower case string 446 - alphabet - optional alphabet to assume 447 448 NOTE - If you provide an output filename, it will be opened which will 449 overwrite any existing file without warning. This may happen if even the 450 conversion is aborted (e.g. an invalid out_format name is given). 451 """ 452 #TODO - Add optimised versions of important conversions 453 #For now just off load the work to SeqIO parse/write 454 with as_handle(in_file, 'rU') as in_handle: 455 #Don't open the output file until we've checked the input is OK: 456 alignments = parse(in_handle, in_format, None, alphabet) 457 458 #This will check the arguments and issue error messages, 459 #after we have opened the file which is a shame. 460 with as_handle(out_file, 'w') as out_handle: 461 count = write(alignments, out_handle, out_format) 462 463 return count
464 465 466 if __name__ == "__main__": 467 from Bio._utils import run_doctest 468 run_doctest() 469