Package Alphabet

Alphabets used in Seq objects etc. to declare sequence type and letters.

This is used by sequences which contain a finite number of similar words.

Classes
  Alphabet
Generic alphabet base class.
  SingleLetterAlphabet
Generic alphabet with letters of size one.
  ProteinAlphabet
Generic single letter protein alphabet.
  NucleotideAlphabet
Generic single letter nucleotide alphabet.
  DNAAlphabet
Generic single letter DNA alphabet.
  RNAAlphabet
Generic single letter RNA alphabet.
  SecondaryStructure
Alphabet used to describe secondary structure.
  ThreeLetterProtein
Three letter protein alphabet.
  AlphabetEncoder
  Gapped
  HasStopCodon
Functions
  _get_base_alphabet(alphabet)
Returns the non-gapped non-stop-codon Alphabet object (PRIVATE).
  _ungap(alphabet)
Returns the alphabet without any gap encoder (PRIVATE).
  _consensus_base_alphabet(alphabets)
Returns a common but often generic base alphabet object (PRIVATE).
  _consensus_alphabet(alphabets)
Returns a common but often generic alphabet object (PRIVATE).
  _check_type_compatible(alphabets)
Returns True except for DNA+RNA or Nucleotide+Protein (PRIVATE).
  _verify_alphabet(sequence)
Check all letters in sequence are in the alphabet (PRIVATE).
Variables
  generic_alphabet = Alphabet()
  single_letter_alphabet = SingleLetterAlphabet()
  generic_protein = ProteinAlphabet()
  generic_nucleotide = NucleotideAlphabet()
  generic_dna = DNAAlphabet()
  generic_rna = RNAAlphabet()
Function Details

_consensus_base_alphabet(alphabets)

Returns a common but often generic base alphabet object (PRIVATE).

This throws away any AlphabetEncoder information, e.g. Gapped alphabets.

Note that DNA+RNA -> Nucleotide, and Nucleotide+Protein -> generic single letter. These DO NOT raise an exception!
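
The fallback rule described above can be sketched without Biopython installed. The classes below are hypothetical stand-ins that only mirror the class names listed on this page, and consensus_base_class is an illustrative name, not part of Bio.Alphabet. This sketch assumes the consensus is simply the most specific class shared by every input; the real function may differ in detail.

```python
# Hypothetical stand-ins mirroring the class names on this page.
# These are NOT the Biopython classes themselves.
class Alphabet: pass
class SingleLetterAlphabet(Alphabet): pass
class ProteinAlphabet(SingleLetterAlphabet): pass
class NucleotideAlphabet(SingleLetterAlphabet): pass
class DNAAlphabet(NucleotideAlphabet): pass
class RNAAlphabet(NucleotideAlphabet): pass


def consensus_base_class(alphabets):
    """Return the most specific class shared by all alphabets (sketch)."""
    alphabets = list(alphabets)
    if not alphabets:
        return Alphabet
    # Walk the first alphabet's ancestors from most to least specific
    # and stop at the first class that every alphabet is an instance of.
    for candidate in type(alphabets[0]).__mro__:
        if all(isinstance(a, candidate) for a in alphabets):
            return candidate
    return Alphabet


# DNA + RNA falls back to the nucleotide class rather than failing.
print(consensus_base_class([DNAAlphabet(), RNAAlphabet()]).__name__)
# Nucleotide + protein falls back to the generic single-letter class.
print(consensus_base_class([DNAAlphabet(), ProteinAlphabet()]).__name__)
```

No exception is raised in either case, which is the behaviour the note above warns about.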

_consensus_alphabet(alphabets)

Returns a common but often generic alphabet object (PRIVATE).

>>> from Bio.Alphabet import IUPAC
>>> _consensus_alphabet([IUPAC.extended_protein, IUPAC.protein])
ExtendedIUPACProtein()
>>> _consensus_alphabet([generic_protein, IUPAC.protein])
ProteinAlphabet()

Note that DNA+RNA -> Nucleotide, and Nucleotide+Protein -> generic single letter. These DO NOT raise an exception!

>>> _consensus_alphabet([generic_dna, generic_nucleotide])
NucleotideAlphabet()
>>> _consensus_alphabet([generic_dna, generic_rna])
NucleotideAlphabet()
>>> _consensus_alphabet([generic_dna, generic_protein])
SingleLetterAlphabet()
>>> _consensus_alphabet([single_letter_alphabet, generic_protein])
SingleLetterAlphabet()

This is aware of Gapped and HasStopCodon and new letters added by other AlphabetEncoders. This WILL raise an exception if more than one gap character or stop symbol is present.

>>> from Bio.Alphabet import IUPAC
>>> _consensus_alphabet([Gapped(IUPAC.extended_protein), HasStopCodon(IUPAC.protein)])
HasStopCodon(Gapped(ExtendedIUPACProtein(), '-'), '*')
>>> _consensus_alphabet([Gapped(IUPAC.protein, "-"), Gapped(IUPAC.protein, "=")])
Traceback (most recent call last):
    ...
ValueError: More than one gap character present
>>> _consensus_alphabet([HasStopCodon(IUPAC.protein, "*"), HasStopCodon(IUPAC.protein, "+")])
Traceback (most recent call last):
    ...
ValueError: More than one stop symbol present
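
The gap and stop-symbol check can be illustrated in isolation. This is a minimal sketch under stated assumptions: Encoder, Gapped and HasStopCodon below are hypothetical stand-in wrappers (the base alphabet is just a placeholder string), and collect_symbols is an illustrative name rather than a Bio.Alphabet function. Only the error messages are taken from the doctest output above.

```python
# Hypothetical stand-in wrappers; NOT the Biopython encoder classes.
class Encoder:
    def __init__(self, alphabet):
        self.alphabet = alphabet  # the wrapped alphabet (or another encoder)


class Gapped(Encoder):
    def __init__(self, alphabet, gap_char="-"):
        super().__init__(alphabet)
        self.gap_char = gap_char


class HasStopCodon(Encoder):
    def __init__(self, alphabet, stop_symbol="*"):
        super().__init__(alphabet)
        self.stop_symbol = stop_symbol


def collect_symbols(alphabets):
    """Return (gap_char, stop_symbol), or raise if either is ambiguous."""
    gaps, stops = set(), set()
    for alpha in alphabets:
        # Unwrap nested encoders, e.g. HasStopCodon(Gapped(base)).
        while isinstance(alpha, Encoder):
            if isinstance(alpha, Gapped):
                gaps.add(alpha.gap_char)
            elif isinstance(alpha, HasStopCodon):
                stops.add(alpha.stop_symbol)
            alpha = alpha.alphabet
    if len(gaps) > 1:
        raise ValueError("More than one gap character present")
    if len(stops) > 1:
        raise ValueError("More than one stop symbol present")
    return (gaps.pop() if gaps else None, stops.pop() if stops else None)


# One gap character and one stop symbol across the inputs is fine.
print(collect_symbols([Gapped("protein"), HasStopCodon("protein")]))
```

Collecting the symbols into sets makes the conflict test a simple length check, regardless of how deeply the encoders are nested.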

_check_type_compatible(alphabets)

Returns True except for DNA+RNA or Nucleotide+Protein (PRIVATE).

>>> _check_type_compatible([generic_dna, generic_nucleotide])
True
>>> _check_type_compatible([generic_dna, generic_rna])
False
>>> _check_type_compatible([generic_dna, generic_protein])
False
>>> _check_type_compatible([single_letter_alphabet, generic_protein])
True

This relies on the Alphabet subclassing hierarchy. It does not check things like gap characters or stop symbols.
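
As an illustration of how a check like this can lean on the subclass hierarchy, here is a minimal sketch. The classes are hypothetical stand-ins mirroring the names on this page, and types_compatible is an illustrative name; it is written to reproduce the four doctest results above, not copied from the Biopython source.

```python
# Hypothetical stand-ins mirroring the class names on this page.
# These are NOT the Biopython classes themselves.
class Alphabet: pass
class SingleLetterAlphabet(Alphabet): pass
class ProteinAlphabet(SingleLetterAlphabet): pass
class NucleotideAlphabet(SingleLetterAlphabet): pass
class DNAAlphabet(NucleotideAlphabet): pass
class RNAAlphabet(NucleotideAlphabet): pass


def types_compatible(alphabets):
    """Return False for DNA+RNA or Nucleotide+Protein, else True (sketch)."""
    alphabets = list(alphabets)

    def seen(cls):
        # isinstance also matches subclasses, so a DNA alphabet
        # counts as a nucleotide alphabet here.
        return any(isinstance(a, cls) for a in alphabets)

    if seen(DNAAlphabet) and seen(RNAAlphabet):
        return False
    if seen(NucleotideAlphabet) and seen(ProteinAlphabet):
        return False
    return True


print(types_compatible([DNAAlphabet(), NucleotideAlphabet()]))
print(types_compatible([DNAAlphabet(), RNAAlphabet()]))
```

Because only isinstance is consulted, gap characters and stop symbols play no part in the result, consistent with the note above.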

_verify_alphabet(sequence)

Check all letters in sequence are in the alphabet (PRIVATE).

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
...              IUPAC.protein)
>>> _verify_alphabet(my_seq)
True

This example has an X, which is not in the IUPAC protein alphabet (use the IUPAC extended protein alphabet for sequences that contain X):

>>> bad_seq = Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVFX",
...                IUPAC.protein)
>>> _verify_alphabet(bad_seq)
False

This replaces Bio.utils.verify_alphabet() since we are deprecating that. Potentially this could be added to the Alphabet object, and I would like it to be an option when creating a Seq object... but that might slow things down.
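
The letter check itself is easy to sketch on plain strings. The function name letters_in_alphabet and the constant PROTEIN_LETTERS are hypothetical; the constant holds the twenty standard amino acid letters and stands in for an alphabet's letters. Raising ValueError for an empty letter set is an assumption of this sketch.

```python
# The twenty standard amino acid one-letter codes (stand-in for an
# alphabet's letters; a hypothetical constant, not a Biopython name).
PROTEIN_LETTERS = "ACDEFGHIKLMNPQRSTVWY"


def letters_in_alphabet(sequence, letters):
    """Return True if every character of sequence occurs in letters."""
    if not letters:
        # Assumption: a sequence cannot be verified against an
        # alphabet that defines no letters.
        raise ValueError("Alphabet does not define letters.")
    allowed = set(letters)  # set membership keeps each lookup cheap
    return all(letter in allowed for letter in sequence)


good = "MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF"
print(letters_in_alphabet(good, PROTEIN_LETTERS))
print(letters_in_alphabet(good + "X", PROTEIN_LETTERS))
```

The check is linear in the sequence length, which is the cost the note above has in mind when it worries about running it every time a Seq object is created.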