Package Bio :: Package GenBank :: Module Scanner :: Class InsdcScanner

Class InsdcScanner

object --+
         |
        InsdcScanner
Known Subclasses:

Basic functions for breaking up a GenBank/EMBL file into subsections.

The International Nucleotide Sequence Database Collaboration (INSDC) is a
partnership between the DDBJ, EMBL, and GenBank.  These organisations all
use the same "Feature Table" layout in their plain text flat file formats.

However, the header and sequence sections of an EMBL file are very
different in layout to those produced by GenBank/DDBJ.

Instance Methods
 
__init__(self, debug=0)
x.__init__(...) initializes x; see help(type(x)) for signature
 
set_handle(self, handle)
 
find_start(self)
Read lines until the ID/LOCUS line is found, and return that line.
 
parse_header(self)
Return list of strings making up the header
 
parse_features(self, skip=False)
Return list of tuples for the features (if present)
 
parse_feature(self, feature_key, lines)
Expects a feature as a list of strings, returns a tuple (key, location, qualifiers)
 
parse_footer(self)
Returns a tuple containing a list of any miscellaneous strings, and the sequence
 
_feed_first_line(self, consumer, line)
Handle the LOCUS/ID line, passing data to the consumer
 
_feed_header_lines(self, consumer, lines)
Handle the header lines (list of strings), passing data to the consumer
 
_feed_feature_table(self, consumer, feature_tuples)
Handle the feature table (list of tuples), passing data to the consumer
 
_feed_misc_lines(self, consumer, lines)
Handle any lines between features and sequence (list of strings), passing data to the consumer
 
feed(self, handle, consumer, do_features=True)
Feed a set of data into the consumer.
 
parse(self, handle, do_features=True)
Returns a SeqRecord (with SeqFeatures if do_features=True)
 
parse_records(self, handle, do_features=True)
Returns a SeqRecord object iterator
 
parse_cds_features(self, handle, alphabet=ProteinAlphabet(), tags2id=('protein_id', 'locus_tag', 'product'))
Returns SeqRecord object iterator

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

Class Variables
  RECORD_START = 'XXX'
  HEADER_WIDTH = 3
  FEATURE_START_MARKERS = ['XXX***FEATURES***XXX']
  FEATURE_END_MARKERS = ['XXX***END FEATURES***XXX']
  FEATURE_QUALIFIER_INDENT = 0
  FEATURE_QUALIFIER_SPACER = ''
  SEQUENCE_HEADERS = ['XXX']
Properties

Inherited from object: __class__

Method Details

__init__(self, debug=0)
(Constructor)

x.__init__(...) initializes x; see help(type(x)) for signature

Overrides: object.__init__
(inherited documentation)

find_start(self)

Read lines until the ID/LOCUS line is found, and return that line.

Any preamble (such as the header used by the NCBI on *.seq.gz archives)
will be ignored.
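
The preamble-skipping behaviour described above can be sketched in plain Python. This is a simplified illustration, not the Biopython source: the function name find_record_start and its record_start parameter are hypothetical stand-ins for the method and the RECORD_START class variable, the "LOCUS" marker and the sample text are made up for the example, and returning None at end of input is an assumption.

```python
import io


def find_record_start(handle, record_start="LOCUS"):
    """Skip preamble lines and return the first line starting with record_start.

    Hypothetical helper: returns None if the handle runs out of lines first.
    """
    for line in handle:
        if line.startswith(record_start):
            return line
    return None


# Made-up input: two preamble lines, then the start of a record.
text = (
    "EXAMPLE.SEQ   Some archive banner line\n"
    "\n"
    "LOCUS       EXAMPLE1\n"
    "DEFINITION  Made-up record.\n"
)
handle = io.StringIO(text)
first_line = find_record_start(handle)
```

After the call, the handle is positioned just past the ID/LOCUS line, which is the state parse_header() says it assumes.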

parse_header(self)

Return list of strings making up the header

New line characters are removed.

Assumes you have just read in the ID/LOCUS line.

parse_features(self, skip=False)

Return list of tuples for the features (if present)

Each feature is returned as a tuple (key, location, qualifiers)
where key and location are strings (e.g. "CDS" and
"complement(join(490883..490885,1..879))") while qualifiers
is a list of two string tuples (feature qualifier keys and values).

Assumes you have already read to the start of the features table.
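
Because the qualifiers are a list of two-string tuples rather than a dictionary, a key such as db_xref can appear more than once. The sketch below shows one way a caller might unpack a feature tuple and group the qualifier values by key. The feature tuple is written by hand in the documented layout (it is not produced by the scanner here), and the helper name qualifiers_to_dict is hypothetical.

```python
from collections import defaultdict


def qualifiers_to_dict(qualifiers):
    """Group (key, value) qualifier tuples by key, keeping every value in order."""
    grouped = defaultdict(list)
    for key, value in qualifiers:
        grouped[key].append(value)
    return dict(grouped)


# Hand-written feature tuple in the (key, location, qualifiers) layout.
feature = (
    "CDS",
    "complement(join(490883..490885,1..879))",
    [
        ("locus_tag", '"NEQ001"'),
        ("codon_start", "1"),
        ("db_xref", '"GI:41614797"'),
        ("db_xref", '"GeneID:2732620"'),
    ],
)
key, location, qualifiers = feature
by_key = qualifiers_to_dict(qualifiers)
```

Keeping a list per key avoids silently dropping the second db_xref, which a plain dict(qualifiers) would do.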

parse_feature(self, feature_key, lines)

Expects a feature as a list of strings, returns a tuple (key, location, qualifiers)

        For example given this GenBank feature:

             CDS             complement(join(490883..490885,1..879))
                             /locus_tag="NEQ001"
                             /note="conserved hypothetical [Methanococcus jannaschii];
                             COG1583:Uncharacterized ACR; IPR001472:Bipartite nuclear
                             localization signal; IPR002743: Protein of unknown
                             function DUF57"
                             /codon_start=1
                             /transl_table=11
                             /product="hypothetical protein"
                             /protein_id="NP_963295.1"
                             /db_xref="GI:41614797"
                             /db_xref="GeneID:2732620"
                             /translation="MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKK
                             EKYFNFTLIPKKDIIENKRYYLIISSPDKRFIEVLHNKIKDLDIITIGLAQFQLRKTK
                             KFDPKLRFPWVTITPIVLREGKIVILKGDKYYKVFVKRLEELKKYNLIKKKEPILEEP
                             IEISLNQIKDGWKIIDVKDRYYDFRNKSFSAFSNWLRDLKEQSLRKYNNFCGKNFYFE
                             EAIFEGFTFYKTVSIRIRINRGEAVYIGTLWKELNVYRKLDKEEREFYKFLYDCGLGS
                             LNSMGFGFVNTKKNSAR"

        Then you should pass in feature_key="CDS" and the rest of the data as a list of strings
        lines=["complement(join(490883..490885,1..879))", ..., "LNSMGFGFVNTKKNSAR"]
        where the leading spaces and trailing newlines have been removed.

        Returns tuple containing: (key as string, location string, qualifiers as list)
        as follows for this example:

        key = "CDS", string
        location = "complement(join(490883..490885,1..879))", string
        qualifiers = list of string tuples:

        [('locus_tag', '"NEQ001"'),
         ('note', '"conserved hypothetical [Methanococcus jannaschii];\nCOG1583:..."'),
         ('codon_start', '1'),
         ('transl_table', '11'),
         ('product', '"hypothetical protein"'),
         ('protein_id', '"NP_963295.1"'),
         ('db_xref', '"GI:41614797"'),
         ('db_xref', '"GeneID:2732620"'),
         ('translation', '"MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKK\nEKYFNFT..."')]

        In the above example, the "note" and "translation" were edited for compactness,
        and they would contain multiple new line characters (displayed above as \n).

        If a qualifier is quoted (in this case, everything except codon_start and
        transl_table) then the quotes are NOT removed.

        Note that no whitespace is removed.
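
Since the quotes and embedded new lines are left in place, callers usually need a clean-up step of their own. The following is a minimal sketch of such a step, not part of this class: the helper name clean_qualifier is hypothetical, and the choice to join translation lines with no separator and all other values with a single space is an assumption made for the example. The translation value is shortened.

```python
def clean_qualifier(key, value):
    """Strip surrounding double quotes and join wrapped lines.

    Assumption: translations are joined with no separator, while any other
    value has its lines joined with a single space.
    """
    if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
        value = value[1:-1]
    separator = "" if key == "translation" else " "
    return separator.join(value.split("\n"))


# Raw qualifier tuples in the form described above (quotes and \n retained).
raw = [
    ("codon_start", "1"),
    ("product", '"hypothetical protein"'),
    ("translation", '"MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKK\nEKYFNFT"'),
]
cleaned = [(key, clean_qualifier(key, value)) for key, value in raw]
```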
        

_feed_first_line(self, consumer, line)

Handle the LOCUS/ID line, passing data to the consumer

This should be implemented by the EMBL / GenBank specific subclass

Used by the parse_records() and parse() methods.

_feed_header_lines(self, consumer, lines)

Handle the header lines (list of strings), passing data to the consumer

This should be implemented by the EMBL / GenBank specific subclass

Used by the parse_records() and parse() methods.

_feed_feature_table(self, consumer, feature_tuples)

Handle the feature table (list of tuples), passing data to the consumer

Used by the parse_records() and parse() methods.

_feed_misc_lines(self, consumer, lines)

Handle any lines between features and sequence (list of strings), passing data to the consumer

This should be implemented by the EMBL / GenBank specific subclass

Used by the parse_records() and parse() methods.

feed(self, handle, consumer, do_features=True)

Feed a set of data into the consumer.

This method is intended for use with the "old" code in Bio.GenBank

Arguments:
handle - A handle with the information to parse.
consumer - The consumer that should be informed of events.
do_features - Boolean, should the features be parsed?
              Skipping the features can be much faster.

Return values:
True  - Passed a record
False - Did not find a record
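
The scanner/consumer split can be illustrated with a toy example: the scanner reads lines and reports events, while the consumer decides what to do with them. Everything in this sketch is hypothetical, including the ListConsumer class, its first_line and header_line methods, and the toy_feed function. It does not reproduce the real Bio.GenBank consumer interface, only the shape of the interaction and the True/False return convention.

```python
class ListConsumer:
    """Hypothetical consumer that just records the events it is told about."""

    def __init__(self):
        self.events = []

    def first_line(self, line):
        self.events.append(("first_line", line))

    def header_line(self, line):
        self.events.append(("header_line", line))


def toy_feed(lines, consumer, record_start="LOCUS"):
    """Push one record's lines into consumer; return True if a record was found."""
    iterator = iter(lines)
    for line in iterator:
        if line.startswith(record_start):
            consumer.first_line(line)
            break
    else:
        return False  # ran out of lines without finding a record
    for line in iterator:
        if line.startswith("//"):
            break  # end-of-record marker
        consumer.header_line(line)
    return True


consumer = ListConsumer()
found = toy_feed(["preamble", "LOCUS       X", "DEFINITION  toy", "//"], consumer)
missing = toy_feed(["nothing here"], ListConsumer())
```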

parse(self, handle, do_features=True)

Returns a SeqRecord (with SeqFeatures if do_features=True)

See also the method parse_records() for use on multi-record files.

parse_records(self, handle, do_features=True)

Returns a SeqRecord object iterator

Each record (from the ID/LOCUS line to the // line) becomes a SeqRecord

The SeqRecord objects include SeqFeatures if do_features=True

This method is intended for use in Bio.SeqIO
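
The record boundaries described above (from the ID/LOCUS line to the // line) lend themselves to a lazy iterator. The sketch below shows that idea on raw lines only. It is an illustration under stated assumptions rather than the method's implementation: split_records is a hypothetical name, the "LOCUS" marker stands in for RECORD_START, and the records are yielded as lists of strings, not as SeqRecord objects.

```python
import io


def split_records(handle, record_start="LOCUS"):
    """Yield each record as a list of lines, from the start line to the // line."""
    record = None
    for line in handle:
        line = line.rstrip("\n")
        if record is None:
            # Outside a record: ignore everything until a start line appears.
            if line.startswith(record_start):
                record = [line]
        else:
            record.append(line)
            if line == "//":
                yield record
                record = None


# Made-up two-record input.
text = "LOCUS       A\nORIGIN\n//\nLOCUS       B\nORIGIN\n//\n"
records = list(split_records(io.StringIO(text)))
```

Because the function is a generator, only one record's lines are held in memory at a time, which matters for large multi-record files.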

parse_cds_features(self, handle, alphabet=ProteinAlphabet(), tags2id=('protein_id', 'locus_tag', 'product'))

Returns SeqRecord object iterator

Each CDS feature becomes a SeqRecord.

alphabet - Used for any sequence found in a translation field.
tags2id  - Tuple of three strings, the feature qualifier keys to use
           for the record id, name and description.

This method is intended for use in Bio.SeqIO
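
How tags2id maps qualifiers onto the record id, name and description can be sketched as follows. This is not the method's actual code: pick_identifiers is a hypothetical helper, and taking the first value for each tag, stripping the quotes, and using a placeholder for a missing tag are all assumptions made for illustration.

```python
def pick_identifiers(qualifiers, tags2id=("protein_id", "locus_tag", "product")):
    """Return (id, name, description) looked up from a list of qualifier tuples.

    Assumptions: the first value for each tag wins, surrounding quotes are
    stripped, and a missing tag yields a placeholder string.
    """
    first_value = {}
    for key, value in qualifiers:
        first_value.setdefault(key, value.strip('"'))
    return tuple(first_value.get(tag, "<unknown>") for tag in tags2id)


# Qualifier tuples taken from the CDS example in parse_feature().
qualifiers = [
    ("locus_tag", '"NEQ001"'),
    ("product", '"hypothetical protein"'),
    ("protein_id", '"NP_963295.1"'),
]
record_id, name, description = pick_identifiers(qualifiers)
```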