Package GenBank

Code to work with GenBank formatted files.

Rather than using Bio.GenBank, you are now encouraged to use Bio.SeqIO with
the "genbank" or "embl" format names to parse GenBank or EMBL files into
SeqRecord and SeqFeature objects (see the Biopython tutorial for details).

Using Bio.GenBank directly to parse GenBank files is only useful if you want
to obtain GenBank-specific Record objects, which are a much closer
representation of the raw file contents than the SeqRecord alternative from
the FeatureParser (used in Bio.SeqIO).

To use the Bio.GenBank parser, there are two helper functions:

read                  Parse a handle containing a single GenBank record
                      as Bio.GenBank specific Record objects.
parse                 Iterate over a handle containing multiple GenBank
                      records as Bio.GenBank specific Record objects.

The following internal classes are not intended for direct use and may
be deprecated in a future release.

Classes:
Iterator              Iterate through a file of GenBank entries
ErrorFeatureParser    Catch errors caused during parsing.
FeatureParser         Parse GenBank data in SeqRecord and SeqFeature objects.
RecordParser          Parse GenBank data into a Record object.

Exceptions:
ParserFailureError    Exception indicating a failure in the parser (i.e.
                      scanner or consumer)
LocationParserError   Exception indicating a problem with the spark based
                      location parser.

Classes
  Iterator
Iterator interface to move over a file of GenBank entries one at a time (OBSOLETE).
  ParserFailureError
Failure caused by some kind of problem in the parser.
  LocationParserError
Could not properly parse out a location from a GenBank file.
  FeatureParser
Parse GenBank files into Seq + Feature objects (OBSOLETE).
  RecordParser
Parse GenBank files into Record objects (OBSOLETE).
  _BaseGenBankConsumer
Abstract GenBank consumer providing useful general functions (PRIVATE).
  _FeatureConsumer
Create a SeqRecord object with Features to return (PRIVATE).
  _RecordConsumer
Create a GenBank Record object from scanner generated information (PRIVATE).
Functions
 
_pos(pos_str, offset=0)
Build a Position object (PRIVATE).
 
_loc(loc_str, expected_seq_length, strand)
FeatureLocation from non-compound non-complement location (PRIVATE).
 
_split_compound_loc(compound_loc)
Split a tricky compound location string (PRIVATE).
 
parse(handle)
Iterate over GenBank formatted entries as Record objects.
 
read(handle)
Read a handle containing a single GenBank entry as a Record object.
 
_test()
Run the Bio.GenBank module's doctests.
Variables
  GENBANK_INDENT = 12
  GENBANK_SPACER = ' '
  FEATURE_KEY_INDENT = 5
  FEATURE_QUALIFIER_INDENT = 21
  FEATURE_KEY_SPACER = ' '
  FEATURE_QUALIFIER_SPACER = ' '
  _solo_location = '[<>]?\\d+'
  _pair_location = '[<>]?\\d+\\.\\.[<>]?\\d+'
  _between_location = '\\d+\\^\\d+'
  _within_position = '\\(\\d+\\.\\d+\\)'
  _re_within_position = re.compile(r'\(\d+\.\d+\)')
  _within_location = '([<>]?\\d+|\\(\\d+\\.\\d+\\))\\.\\.([<>]?\...
  _oneof_position = 'one\\-of\\(\\d+(,\\d+)+\\)'
  _re_oneof_position = re.compile(r'one-of\(\d+(,\d+)+\)')
  _oneof_location = '([<>]?\\d+|one\\-of\\(\\d+(,\\d+)+\\))\\.\\...
  _simple_location = '\\d+\\.\\.\\d+'
  _re_simple_location = re.compile(r'^\d+\.\.\d+$')
  _re_simple_compound = re.compile(r'^(join|order|bond)\(\d+\.\....
  _complex_location = '([a-zA-z][a-zA-Z0-9_]*(\\.[a-zA-Z0-9]+)?\...
  _re_complex_location = re.compile(r'^([a-zA-z][a-zA-Z0-9_]*(\....
  _possibly_complemented_complex_location = '(([a-zA-z][a-zA-Z0-...
  _re_complex_compound = re.compile(r'^(join|order|bond)\((([a-z...
  __package__ = 'Bio.GenBank'
Function Details

_pos(pos_str, offset=0)

Build a Position object (PRIVATE).

For an end position, leave offset as zero (default):

>>> _pos("5")
ExactPosition(5)

For a start position, set offset to minus one (for Python counting):

>>> _pos("5", -1)
ExactPosition(4)

This also covers fuzzy positions:

>>> p = _pos("<5")
>>> p
BeforePosition(5)
>>> print(p)
<5
>>> int(p)
5

>>> _pos(">5")
AfterPosition(5)

An end position is assumed by default, so note the integer behaviour:

>>> p = _pos("one-of(5,8,11)")
>>> p
OneOfPosition(11, choices=[ExactPosition(5), ExactPosition(8), ExactPosition(11)])
>>> print(p)
one-of(5,8,11)
>>> int(p)
11

>>> _pos("(8.10)")
WithinPosition(10, left=8, right=10)

Fuzzy start positions:

>>> p = _pos("<5", -1)
>>> p
BeforePosition(4)
>>> print(p)
<4
>>> int(p)
4

Notice how the integer behaviour changes too!

>>> p = _pos("one-of(5,8,11)", -1)
>>> p
OneOfPosition(4, choices=[ExactPosition(4), ExactPosition(7), ExactPosition(10)])
>>> print(p)
one-of(4,7,10)
>>> int(p)
4
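
The offset convention above reflects how GenBank's one-based, inclusive
coordinates map onto Python's zero-based, half-open slices. The following is
a minimal sketch using plain integers only. The helper name to_python_slice
is hypothetical (it is not part of Bio.GenBank) and it assumes an exact
"start..end" string with no fuzzy positions:

```python
def to_python_slice(location):
    """Convert an exact GenBank 'start..end' string to a Python slice.

    Hypothetical helper for illustration; fuzzy positions are not handled.
    """
    start_str, end_str = location.split("..")
    # The start gets an offset of -1; the end is left unchanged.
    return slice(int(start_str) - 1, int(end_str))


sequence = "ACGTACGTAC"
region = to_python_slice("3..6")
print(region.start, region.stop)  # 2 6
print(sequence[region])           # GTAC
```

Because the end coordinate is inclusive in GenBank and exclusive in Python,
only the start needs adjusting, which is why _pos uses offset=-1 for start
positions and the default offset for end positions.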

_loc(loc_str, expected_seq_length, strand)

FeatureLocation from non-compound non-complement location (PRIVATE).

Simple examples,

>>> _loc("123..456", 1000, +1)
FeatureLocation(ExactPosition(122), ExactPosition(456), strand=1)
>>> _loc("<123..>456", 1000, strand = -1)
FeatureLocation(BeforePosition(122), AfterPosition(456), strand=-1)

A more complex location using within positions,

>>> _loc("(9.10)..(20.25)", 1000, 1)
FeatureLocation(WithinPosition(8, left=8, right=9), WithinPosition(25, left=20, right=25), strand=1)

Notice how that will act as though it has overall start 8 and end 25.

Zero length between feature,

>>> _loc("123^124", 1000, 0)
FeatureLocation(ExactPosition(123), ExactPosition(123), strand=0)

The expected sequence length is needed for a special case, a between
position at the start/end of a circular genome:

>>> _loc("1000^1", 1000, 1)
FeatureLocation(ExactPosition(1000), ExactPosition(1000), strand=1)

Apart from this special case, between positions P^Q must have P+1==Q,

>>> _loc("123^456", 1000, 1)
Traceback (most recent call last):
   ...
ValueError: Invalid between location '123^456'
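
The rule for between positions described above can be restated as a small
check. This is only a sketch of the stated rule, not the code used inside
_loc. The function name is_valid_between is hypothetical:

```python
def is_valid_between(loc_str, seq_length):
    """Return True if 'P^Q' is an acceptable between location.

    Hypothetical helper restating the documented rule; not Bio.GenBank code.
    """
    p_str, q_str = loc_str.split("^")
    p, q = int(p_str), int(q_str)
    if p + 1 == q:
        return True  # adjacent bases, the normal case
    # Special case: the site spans the origin of a circular sequence.
    return p == seq_length and q == 1


print(is_valid_between("123^124", 1000))  # True
print(is_valid_between("1000^1", 1000))   # True
print(is_valid_between("123^456", 1000))  # False
```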

_split_compound_loc(compound_loc)

Split a tricky compound location string (PRIVATE).

>>> list(_split_compound_loc("123..145"))
['123..145']
>>> list(_split_compound_loc("123..145,200..209"))
['123..145', '200..209']
>>> list(_split_compound_loc("one-of(200,203)..300"))
['one-of(200,203)..300']
>>> list(_split_compound_loc("complement(123..145),200..209"))
['complement(123..145)', '200..209']
>>> list(_split_compound_loc("123..145,one-of(200,203)..209"))
['123..145', 'one-of(200,203)..209']
>>> list(_split_compound_loc("123..145,one-of(200,203)..one-of(209,211),300"))
['123..145', 'one-of(200,203)..one-of(209,211)', '300']
>>> list(_split_compound_loc("123..145,complement(one-of(200,203)..one-of(209,211)),300"))
['123..145', 'complement(one-of(200,203)..one-of(209,211))', '300']
>>> list(_split_compound_loc("123..145,200..one-of(209,211),300"))
['123..145', '200..one-of(209,211)', '300']
>>> list(_split_compound_loc("123..145,200..one-of(209,211)"))
['123..145', '200..one-of(209,211)']
>>> list(_split_compound_loc("complement(149815..150200),complement(293787..295573),NC_016402.1:6618..6676,181647..181905"))
['complement(149815..150200)', 'complement(293787..295573)', 'NC_016402.1:6618..6676', '181647..181905']
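
What makes these strings tricky is that commas also appear inside one-of(...)
groups, so a plain split on "," would break them. One way to reproduce the
results above is to split only on commas at parenthesis depth zero. The
sketch below uses that depth-counting approach, which may differ from how
_split_compound_loc is actually implemented. The name split_top_level is
hypothetical:

```python
def split_top_level(compound_loc):
    """Yield comma-separated parts, ignoring commas inside parentheses.

    Hypothetical illustration; assumes the parentheses are balanced.
    """
    depth = 0
    start = 0
    for index, char in enumerate(compound_loc):
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        elif char == "," and depth == 0:
            # A comma outside any parentheses separates two parts.
            yield compound_loc[start:index]
            start = index + 1
    # Whatever remains after the last top-level comma is the final part.
    yield compound_loc[start:]


print(list(split_top_level("123..145,one-of(200,203)..one-of(209,211),300")))
```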

parse(handle)

Iterate over GenBank formatted entries as Record objects.

>>> from Bio import GenBank
>>> with open("GenBank/NC_000932.gb") as handle:
...     for record in GenBank.parse(handle):
...         print(record.accession)
['NC_000932']

To get SeqRecord objects use Bio.SeqIO.parse(..., format="gb")
instead.

read(handle)

Read a handle containing a single GenBank entry as a Record object.

>>> from Bio import GenBank
>>> with open("GenBank/NC_000932.gb") as handle:
...     record = GenBank.read(handle)
...     print(record.accession)
['NC_000932']

To get a SeqRecord object use Bio.SeqIO.read(..., format="gb")
instead.


Variables Details

_within_location

Value:
'([<>]?\\d+|\\(\\d+\\.\\d+\\))\\.\\.([<>]?\\d+|\\(\\d+\\.\\d+\\))'

_oneof_location

Value:
'([<>]?\\d+|one\\-of\\(\\d+(,\\d+)+\\))\\.\\.([<>]?\\d+|one\\-of\\(\\d\
+(,\\d+)+\\))'

_re_simple_compound

Value:
re.compile(r'^(join|order|bond)\(\d+\.\.\d+(,\d+\.\.\d+)*\)$')
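
To see what the simple patterns accept, the sketch below recompiles two of
the expressions listed in this document with the standard re module and
tries them on a few location strings. The local variable names and sample
strings are chosen for illustration, and the classification order is an
assumption rather than a description of how the parser dispatches:

```python
import re

# Patterns copied from the values listed in this document.
re_simple_location = re.compile(r"^\d+\.\.\d+$")
re_simple_compound = re.compile(r"^(join|order|bond)\(\d+\.\.\d+(,\d+\.\.\d+)*\)$")

samples = [
    "123..456",
    "join(1..10,20..30)",
    "join(<1..10,20..30)",  # fuzzy start, so not "simple"
    "complement(1..10)",    # complement is not join/order/bond
]

for text in samples:
    if re_simple_location.match(text):
        kind = "simple location"
    elif re_simple_compound.match(text):
        kind = "simple compound"
    else:
        kind = "needs a more general pattern"
    print(text, "->", kind)
```

Only exact start..end pairs match these two expressions. Fuzzy, complemented
or cross-referenced locations fall through to the more general patterns
listed in the remainder of this section.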

_complex_location

Value:
'([a-zA-z][a-zA-Z0-9_]*(\\.[a-zA-Z0-9]+)?\\:)?([<>]?\\d+\\.\\.[<>]?\\d\
+|[<>]?\\d+|\\d+\\^\\d+|([<>]?\\d+|\\(\\d+\\.\\d+\\))\\.\\.([<>]?\\d+|\
\\(\\d+\\.\\d+\\))|([<>]?\\d+|one\\-of\\(\\d+(,\\d+)+\\))\\.\\.([<>]?\\
\d+|one\\-of\\(\\d+(,\\d+)+\\)))'

_re_complex_location

Value:
re.compile(r'^([a-zA-z][a-zA-Z0-9_]*(\.[a-zA-Z0-9]+)?:)?([<>]?\d+\.\.[\
<>]?\d+|[<>]?\d+|\d+\^\d+|([<>]?\d+|\(\d+\.\d+\))\.\.([<>]?\d+|\(\d+\.\
\d+\))|([<>]?\d+|one-of\(\d+(,\d+)+\))\.\.([<>]?\d+|one-of\(\d+(,\d+)+\
\)))$')

_possibly_complemented_complex_location

Value:
'(([a-zA-z][a-zA-Z0-9_]*(\\.[a-zA-Z0-9]+)?\\:)?([<>]?\\d+\\.\\.[<>]?\\\
d+|[<>]?\\d+|\\d+\\^\\d+|([<>]?\\d+|\\(\\d+\\.\\d+\\))\\.\\.([<>]?\\d+\
|\\(\\d+\\.\\d+\\))|([<>]?\\d+|one\\-of\\(\\d+(,\\d+)+\\))\\.\\.([<>]?\
\\d+|one\\-of\\(\\d+(,\\d+)+\\)))|complement\\(([a-zA-z][a-zA-Z0-9_]*(\
\\.[a-zA-Z0-9]+)?\\:)?([<>]?\\d+\\.\\.[<>]?\\d+|[<>]?\\d+|\\d+\\^\\d+|\
([<>]?\\d+|\\(\\d+\\.\\d+\\))\\.\\.([<>]?\\d+|\\(\\d+\\.\\d+\\))|([<>]\
?\\d+|one\\-of\\(\\d+(,\\d+)+\\))\\.\\.([<>]?\\d+|one\\-of\\(\\d+(,\\d\
+)+\\)))\\))'

_re_complex_compound

Value:
re.compile(r'^(join|order|bond)\((([a-zA-z][a-zA-Z0-9_]*(\.[a-zA-Z0-9]\
+)?:)?([<>]?\d+\.\.[<>]?\d+|[<>]?\d+|\d+\^\d+|([<>]?\d+|\(\d+\.\d+\))\\
.\.([<>]?\d+|\(\d+\.\d+\))|([<>]?\d+|one-of\(\d+(,\d+)+\))\.\.([<>]?\d\
+|one-of\(\d+(,\d+)+\)))|complement\(([a-zA-z][a-zA-Z0-9_]*(\.[a-zA-Z0\
-9]+)?:)?([<>]?\d+\.\.[<>]?\d+|[<>]?\d+|\d+\^\d+|([<>]?\d+|\(\d+\.\d+\\
))\.\.([<>]?\d+|\(\d+\.\d+\))|([<>]?\d+|one-of\(\d+(,\d+)+\))\.\.([<>]\
?\d+|one-of\(\d+(,\d+)+\)))\))(,(([a-zA-z][a-zA-Z0-9_]*(\.[a-zA-Z0-9]+\
)?:)?([<>]?\d+\.\.[<>]?\d+|[<>]?\d+|\d+\^\d+|([<>]?\d+|\(\d+\.\d+\))\.\
...