Split fasta file

From Biopython
Revision as of 05:06, 17 April 2009 by Davidw (Talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

PLAN is a free online service for BLAST based sequence annotation provided by the Noble institute. At present the people that run PLAN limit user's queries to 1000 records. This recipe shows how to split a fasta file containing thousands of records into smaller files with filenames that include the range of records included in the file.

from Bio import SeqIO
in_name = "huge_run.fasta"
records = list(SeqIO.parse(open(in_name, 'r'), "fasta"))
# How many files of exactly 1000 records can we make?
nfiles = len(records)/1000
for filenumber in range(nfiles):
	split_min = filenumber*1000 
	split_max = ((filenumber+1)*1000)-1
	seqs = records[split_min:split_max]
	out_name = ("_%s-%s." % (split_min+1, split_max+1)).join(in_name.split("."))
  	out_handle = open(out_name, 'w')
	SeqIO.write(seqs, out_handle, "fasta")
# That will make a bunch of files with 1000 records, but you will almost always 
# have some "dregs" to mop up. 
if len(records) / 1000 != 0: #checking for leftovers
	lsplit_min = (nfiles+1)*1000
	last_name = ("_%s-%s." % (lsplit_min, len(records)-1)).join(in_name.split("."))
	last_handle = open(last_name, 'w')
	lseqs = records[lsplit_min:-1]
	SeqIO.write(lseqs, last_handle, "fasta")

The basic approach is do use SeqIO.parse() to read the contents of a fasta file into a list, then pick out subsets to write with SeqIO.write() . Perhaps the trickiest looking line is the one that sets each new file's name:

out_name = ("_%s-%s." % (split_min+1, split_max+1)).join(in_name.split("."))

which can be broken down thus:

>>> insert = "_%s-%s." % (1, 1000)
>>> insert
>>> split_name = "huge_run.fasta".split(".")
['huge_run', 'fasta']
Personal tools