[Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?
cjfields at illinois.edu
Sat Jun 5 08:56:13 EDT 2010
On Jun 5, 2010, at 6:51 AM, Peter wrote:
> On Sat, Jun 5, 2010 at 11:59 AM, Chris Fields wrote:
>> On Jun 4, 2010, at 2:04 PM, Peter wrote:
>>> But (thus far) no sequence data is stored in HDF5 format (is it?).
>> There will be a presentation this year at BOSC on BioHDF (HDF5 for bioinformatics).
>> There is a website:
> It looks like they are making good progress - with SAM/BAM conversion to and
> from BioHDF in place. Still, as they say:
>>>> The current BioHDF distribution is a pipeline prototype designed to show
>>>> the suitability of HDF5 as a biological data store and to determine how to
>>>> best implement an HDF5-based bioinformatics pipeline. It is in source code
>>>> format only. The code builds a set of command-line tools which allow
>>>> uploading and extracting DNA/RNA sequence and alignment data from
>>>> next-generation gene sequencers. These files have been provided with the
>>>> same BSD license used by HDF5
>>>> Please be aware that the code contained in it will be in a high state of flux
>>>> in the immediate future.
> This certainly looks like something to keep an eye on.
> In any case, getting back to the thread's purpose - Bio.SeqIO.index() aims to
> give random access to sequences by their ID for many different file formats.
> There has been little interest in extending this to support gzipped files.
> However, extending the code to store the id/offset lookup table on disk with
> SQLite3 (rather than in memory as a Python dict) would seem welcome. I'll be
> refreshing the github branch where I was working on this earlier in the year...
We have seen (on the BioPerl side) some interest in supporting gzip/bzip and other compressed input via the PerlIO layer, and also in using SQLite through AnyDBM. Mark Jensen actually did a little work along these lines, though I'm not sure how complete the support is at the moment.
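
For anyone following the thread, here is a rough sketch of the id/offset lookup table idea Peter describes above, using only the Python standard library. It is not the code in Bio.SeqIO.index() or in Peter's branch. It assumes an uncompressed FASTA file, takes the first word after ">" as the record ID, and the names build_offset_index, fetch_record and the "offsets" table are made up for illustration.

```python
import sqlite3


def build_offset_index(fasta_path, db_path):
    """Store the byte offset of each FASTA record in a SQLite table.

    Hypothetical helper for illustration only. A duplicate ID raises
    sqlite3.IntegrityError because of the PRIMARY KEY constraint.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS offsets "
        "(id TEXT PRIMARY KEY, offset INTEGER)"
    )
    # Binary mode so that tell() reports true byte offsets.
    with open(fasta_path, "rb") as handle:
        offset = handle.tell()
        line = handle.readline()
        while line:
            if line.startswith(b">"):
                fields = line[1:].split(None, 1)
                if fields:  # skip a bare ">" with no ID
                    con.execute(
                        "INSERT INTO offsets VALUES (?, ?)",
                        (fields[0].decode(), offset),
                    )
            offset = handle.tell()
            line = handle.readline()
    con.commit()
    return con


def fetch_record(fasta_path, con, seq_id):
    """Seek to the stored offset and return the raw text of one record."""
    row = con.execute(
        "SELECT offset FROM offsets WHERE id = ?", (seq_id,)
    ).fetchone()
    if row is None:
        raise KeyError(seq_id)
    with open(fasta_path, "rb") as handle:
        handle.seek(row[0])
        lines = [handle.readline()]  # the ">" header line
        for line in handle:
            if line.startswith(b">"):  # next record starts here
                break
            lines.append(line)
    return b"".join(lines).decode()
```

Passing a file name as db_path keeps the table on disk, so it can be reused between runs without rescanning the sequence file; passing ":memory:" behaves more like the current in-memory dict. The same seek-to-offset trick does not carry over directly to gzipped files, since a plain gzip stream has to be decompressed from the start to reach a given position.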