Talk:Split fasta file
From Biopython
I like the idea for this cookbook example, but I don't like the implementation so much, mainly because you start by loading the whole file into memory! An iterator approach could keep only one record in memory at a time (or at least, only one batch in memory at a time). Peter
- That's what happens when you let a lab-rat write code! I should have thought about this, but I was keen to get something up as an example of how this would work. Your approach below is obviously a much better and more general one. I've set it up at Split_large_file and made this page redirect to that one. Feel free to comment on or edit that entry. --Davidw 09:28, 18 April 2009 (UTC)
We should retitle this example as the idea isn't FASTA specific - any SeqIO format would do.
How about this - note that the idea is actually very general:
def batch_iterator(iterator, batch_size):
    """Returns lists of length batch_size.

    This can be used on any iterator, for example to batch up
    SeqRecord objects from Bio.SeqIO.parse(...), or to batch
    Alignment objects from Bio.AlignIO.parse(...), or simply
    lines from a file handle.

    This is a generator function, and it returns lists of the
    entries from the supplied iterator.  Each list will have
    batch_size entries, although the final list may be shorter.
    """
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = next(iterator)
            except StopIteration:
                entry = None
            if entry is None:
                # End of file
                break
            # Store the entry just fetched (do not fetch another one here)
            batch.append(entry)
        if batch:
            # Skip the empty batch left over when the total is an exact multiple
            yield batch

from Bio import SeqIO

record_iter = SeqIO.parse("SRR014849.fastq", "fastq")
for i, batch in enumerate(batch_iterator(record_iter, 10000)):
    filename = "group_%i.fastq" % (i + 1)
    with open(filename, "w") as handle:
        count = SeqIO.write(batch, handle, "fastq")
    print("Wrote %i records to %s" % (count, filename))
And the output, using SRR014849.fastq from a compressed file available at the NCBI:
Wrote 10000 records to group_1.fastq
Wrote 10000 records to group_2.fastq
Wrote 10000 records to group_3.fastq
Wrote 10000 records to group_4.fastq
Wrote 7348 records to group_5.fastq
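As a quick check that the batching idea also works on plain lines from a file handle, here is a minimal sketch using itertools.islice instead of an explicit inner loop. The name batched_lists and the in-memory test data are made up for illustration; this is an alternative way to get the same kind of batches, not the code that produced the output above.

```python
import io
from itertools import islice


def batched_lists(iterable, batch_size):
    """Yield lists of up to batch_size items from any iterable.

    Hypothetical helper for illustration; only one batch is held
    in memory at a time.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        # islice pulls at most batch_size items and stops early at the end
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Seven lines in an in-memory "file", batched three at a time.
handle = io.StringIO("".join("line %i\n" % i for i in range(1, 8)))
sizes = [len(batch) for batch in batched_lists(handle, 3)]
print(sizes)  # [3, 3, 1]
```

Since it only relies on the iterator protocol, the same function should accept the iterator returned by Bio.SeqIO.parse(...) as well.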
You could tweak the final section of the script to use filenames labelled as in your example if you like. Peter
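For instance, if the aim is to label each output file after the input file, a small helper along these lines could supply the filename inside the loop. The helper name batch_filename, its label parameter, and the naming pattern are all assumptions made for illustration, since the labelling from the original example is not shown on this page.

```python
import os


def batch_filename(source, index, label="group"):
    """Build an output name such as 'SRR014849_group_1.fastq'.

    Hypothetical helper: the naming pattern is an assumption for
    illustration, not the scheme used in the original example.
    """
    # Split 'SRR014849.fastq' into the stem and the '.fastq' extension
    stem, extension = os.path.splitext(os.path.basename(source))
    return "%s_%s_%i%s" % (stem, label, index, extension)


names = [batch_filename("SRR014849.fastq", i + 1) for i in range(3)]
print(names)
```

In the script, filename = batch_filename("SRR014849.fastq", i + 1) would then take the place of the "group_%i.fastq" line.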

