Gene predictions to protein sequences
Contributed by Brad Chapman
Note: This requires the propose Biopython GFF module, which has not yet been integrated in Biopython. You will need to install it first to run the examples
Setting this up, we import the required modules and parse our input FASTA file into a standard python dictionary, using SeqIO. SeqIO is also used for writing the output file. We pass the output function a python generator, which SeqIO will convert into FASTA files one at a time. This allows the script to be generalized to arbitrarily large data sets without running into memory issues.
from __future__ import with_statement import sys import os import operator from Bio import SeqIO from Bio.SeqRecord import SeqRecord from BCBio import GFF def main(glimmer_file, ref_file): with open(ref_file) as in_handle: ref_recs = SeqIO.to_dict(SeqIO.parse(in_handle, "fasta")) base, ext = os.path.splitext(glimmer_file) out_file = "%s-proteins.fa" % base with open(out_file, "w") as out_handle: SeqIO.write(protein_recs(glimmer_file, ref_recs), out_handle, "fasta")
Next, we provide a function to parse the GlimmerHMM output. Thanks to the GFF library, this is very easy. We iterate over a group of lines at a time to handle very large output files, and provide as input the reference records we parsed earlier with SeqIO:
def glimmer_predictions(in_handle, ref_recs): """Parse Glimmer output, generating SeqRecord and SeqFeatures for predictions """ for rec in GFF.parse(in_handle, target_lines=1000, base_dict=ref_recs): yield rec
This is the meat of the program. We loop over each of the records in the GlimmerHMM output, then over each of the gene features in that record. Each of these SeqFeatures contains a gene prediction with exons present as sub_features of the top level feature. Using the location of these exons, we extract the nucleotide sequence for the current gene and append it to a list. We then add together the list to get the full sequence. If the gene is on the reverse strand, the sequence is reverse complemented. Finally, we translate the sequence to protein. A SeqRecord with the sequence and prediction identifier is yielded, providing the generator we used earlier to write the output file:
def protein_recs(glimmer_file, ref_recs): """Generate protein records from GlimmerHMM gene predictions. """ with open(glimmer_file) as in_handle: for rec in glimmer_predictions(in_handle, ref_recs): for feature in rec.features: seq_exons =  for cds in feature.sub_features: seq_exons.append(rec.seq[ cds.location.nofuzzy_start: cds.location.nofuzzy_end]) gene_seq = reduce(operator.add, seq_exons) if feature.strand == -1: gene_seq = gene_seq.reverse_complement() protein_seq = gene_seq.translate() yield SeqRecord(protein_seq, feature.qualifiers["ID"], "", "")
Compare the full scripts for:
This highlights how using standard libraries with standardized formats can save coding versus rolling your own. The GFF3 version is more general since it allows multiple contigs to be processed concurrently, and requires less custom code to solve the problem. This allows you to focus on the biological problem, and avoid bugs in the parsing code.
Example files to run the scripts are available thanks to Michael Kuhn: