Sequence Cleaner

From Biopython
(Difference between revisions)
Jump to: navigation, search
(Script)
 
(7 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
==Description==
 
==Description==
  
I want to  share my script using biopython to clean sequences up , you should know analyzing poor data takes CPU time and interpreting the results from poor data takes people time, so it's always important to make a preprocessing.  
+
I want to  share my script using biopython to clean sequences up. You should know that analyzing poor data takes CPU time and interpreting the results from poor data takes people time, so it's always important to make a preprocessing.  
 
   
 
   
 
Let me call my script as “Sequence_cleaner” and the big idea is to remove duplicate sequences, remove too short sequences ( the user defines the minimum length)  and remove sequences which have too many unknown nucleotides (N)  ( the user defines the % of N is allows ) and in the end the user can choose if he/she wants to have a file as output or print the result.
 
Let me call my script as “Sequence_cleaner” and the big idea is to remove duplicate sequences, remove too short sequences ( the user defines the minimum length)  and remove sequences which have too many unknown nucleotides (N)  ( the user defines the % of N is allows ) and in the end the user can choose if he/she wants to have a file as output or print the result.
Line 7: Line 7:
 
== Script==
 
== Script==
 
<python>
 
<python>
 +
import sys
 
from Bio import SeqIO
 
from Bio import SeqIO
 
   
 
   
Line 12: Line 13:
 
     #create our hash table to add the sequences
 
     #create our hash table to add the sequences
 
     sequences={}
 
     sequences={}
 
+
 
     #Using the biopython fasta parse we can read our fasta input
 
     #Using the biopython fasta parse we can read our fasta input
 
     for seq_record in SeqIO.parse(fasta_file, "fasta"):
 
     for seq_record in SeqIO.parse(fasta_file, "fasta"):
Line 25: Line 26:
 
             else:
 
             else:
 
                 sequences[sequence]+="_"+seq_record.id
 
                 sequences[sequence]+="_"+seq_record.id
 +
  
           
 
 
     #Write the clean sequences
 
     #Write the clean sequences
               
+
 
     #Create a file in the same directory where you ran this script
 
     #Create a file in the same directory where you ran this script
 
     output_file=open("clear_"+fasta_file,"w+")
 
     output_file=open("clear_"+fasta_file,"w+")
Line 35: Line 36:
 
             output_file.write(">"+sequences[sequence]+"\n"+sequence+"\n")
 
             output_file.write(">"+sequences[sequence]+"\n"+sequence+"\n")
 
     output_file.close()
 
     output_file.close()
 +
   
 +
    print "CLEAN!!!\nPlease check clear_"+fasta_file
  
  
 +
userParameters=sys.argv[1:]
 +
 +
try:
 +
    if len(userParameters)==1:
 +
        sequence_cleaner(userParameters[0])
 +
    elif len(userParameters)==2:
 +
        sequence_cleaner(userParameters[0],float(userParameters[1]))
 +
    elif len(userParameters)==3:
 +
        sequence_cleaner(userParameters[0],float(userParameters[1]),float(userParameters[2]))
 +
    else:
 +
        print "There is a problem!"
 +
except:
 +
    print "There is a problem!"
 +
 +
 +
#python sequence_cleaner.py Aip_coral.fasta 10 10
  
#sequence_cleaner("Aip_coral.fasta",10,10)
 
  
"You should call the function 'sequence_cleaner', there are 3 basic parameters:
 
"                    #1st: your fasta file
 
"                    #2nd: the user defines the minimum length (default value 0 ( It means you don't have to care about the minimum lenght)
 
"                    #3rd: the user defines the % of N is allowed (default value 100 ( It means  you dont care to 'N' in your sequences))
 
"FYI if you don't care about the 2nd and the 3rd parameters you are just gonna remove the duplicate sequences"
 
 
</python>
 
</python>
  
== Questions==
+
Using command line, you should run python sequence_cleaner.py INPUT-(1st) MIN_LENGHT-(2nd) MIN_%-(3rd) - there are 3 basic parameters:
Any question send an email to: genivaldo.gueiros@gmail.com
+
        #1st: your fasta file
 +
        #2nd: the user defines the minimum length (default value 0 ( It means you don't have to care about the minimum lenght)
 +
        #3rd: the user defines the % of N is allowed (default value 100 ( It means  you dont care to 'N' in your sequences))
 +
 
 +
        For exemple: python sequence_cleaner.py Aip_coral.fasta 10 10
 +
FYI: if you don't care about the 2nd and the 3rd parameters you are just gonna remove the duplicate sequences
 +
 
 +
== Questions, Suggestions or Improvement==
 +
Send an email to genivaldo.gueiros@gmail.com
  
  
 
[[Category:Cookbook]]
 
[[Category:Cookbook]]

Latest revision as of 01:48, 15 June 2013

Description

I want to share my script using biopython to clean sequences up. You should know that analyzing poor data takes CPU time and interpreting the results from poor data takes people time, so it's always important to make a preprocessing.

Let me call my script as “Sequence_cleaner” and the big idea is to remove duplicate sequences, remove too short sequences ( the user defines the minimum length) and remove sequences which have too many unknown nucleotides (N) ( the user defines the % of N is allows ) and in the end the user can choose if he/she wants to have a file as output or print the result.

Script

import sys
from Bio import SeqIO
 
def sequence_cleaner(fasta_file,min_length=0,por_n=100):
    #create our hash table to add the sequences
    sequences={}
 
    #Using the biopython fasta parse we can read our fasta input
    for seq_record in SeqIO.parse(fasta_file, "fasta"):
        #Take the current sequence
        sequence=str(seq_record.seq).upper()
        #Check if the current sequence is according to the user parameters
        if (len(sequence)>=min_length and (float(sequence.count("N"))/float(len(sequence)))*100<=por_n):
       # If the sequence passed in the test "is It clean?" and It isnt in the hash table , the sequence and Its id are going to be in the hash
            if sequence not in sequences:
                sequences[sequence]=seq_record.id
       #If It is already in the hash table, We're just gonna concatenate the ID of the current sequence to another one that is already in the hash table
            else:
                sequences[sequence]+="_"+seq_record.id
 
 
    #Write the clean sequences
 
    #Create a file in the same directory where you ran this script
    output_file=open("clear_"+fasta_file,"w+")
    #Just Read the Hash Table and write on the file as a fasta format
    for sequence in sequences:
            output_file.write(">"+sequences[sequence]+"\n"+sequence+"\n")
    output_file.close()
 
    print "CLEAN!!!\nPlease check clear_"+fasta_file
 
 
userParameters=sys.argv[1:]
 
try:
    if len(userParameters)==1:
        sequence_cleaner(userParameters[0])
    elif len(userParameters)==2:
        sequence_cleaner(userParameters[0],float(userParameters[1]))
    elif len(userParameters)==3:
        sequence_cleaner(userParameters[0],float(userParameters[1]),float(userParameters[2]))
    else:
        print "There is a problem!"
except:
    print "There is a problem!"
 
 
#python sequence_cleaner.py Aip_coral.fasta 10 10

Using command line, you should run python sequence_cleaner.py INPUT-(1st) MIN_LENGHT-(2nd) MIN_%-(3rd) - there are 3 basic parameters:

        #1st: your fasta file 
        #2nd: the user defines the minimum length (default value 0 ( It means you don't have to care about the minimum lenght)
        #3rd: the user defines the % of N is allowed (default value 100 ( It means  you dont care to 'N' in your sequences))
        For exemple: python sequence_cleaner.py Aip_coral.fasta 10 10

FYI: if you don't care about the 2nd and the 3rd parameters you are just gonna remove the duplicate sequences

Questions, Suggestions or Improvement

Send an email to genivaldo.gueiros@gmail.com

Personal tools
Namespaces
Variants
Actions
Navigation
Toolbox