Introduction to BioPython¶
Python offers a variety of functions to work with text data (Strings) that make it easier to work with biological data such as DNA or protein sequences. BioPython library provides a set of classes that dedicated to parsing and analysis of different type of biological data. The functions avaiable in BioPython helps researcher to progammatically process the data. Below we’ll see some of the features in Biopython for working with sequence and structure data. To install Biopython library run pip install biopython
. For more details regarding Biopython installation and tutorials, please refer to the Biopython wiki.
To check the version of Biopython, run the following command.
import Bio
print(Bio.__version__)
1.78
Sequence Object¶
To work with sequences, we’ll need the Bio.Seq
class which has the required functions for reading and writing sequence data. Once we have imported this class we can create objects having required data. The example below shows constructing a sequence object with a DNA sequence and then using the complement()
and translate()
functions to find the sequence of the complementary strand and the translated protein sequence, respectively.
from Bio.Seq import Seq
new_seqeunce = Seq('AATTGGAACCTT')
print("Original sequence:", new_seqeunce)
print("The sequence for the complementary strand is: ",new_seqeunce.complement())
print("The translated protein sequence would be: ",new_seqeunce.translate())
Original sequence: AATTGGAACCTT
The sequence for the complementary strand is: TTAACCTTGGAA
The translated protein sequence would be: NWNL
To create a sequence object by reading sequence from a file, we can use the SeqIO
class. The parse()
function in this class can read and write sequences in different formats. This function take two arguments - file name and format, and return an iterator having all the sequences. The code below shows reading a file having multiple sequences and printing the sequences using the seq attribute. Note that this would print sequnces without any annotations. The description
attribute of SeqIO object can used to print the description of a sequence as given in the input file. The sequence file used in the code below is available here.
from Bio import SeqIO
count=0
for all_seqs in SeqIO.parse("kinases.txt", "fasta"):
print(all_seqs.description)
print(all_seqs.seq)
count += 1
if(count == 3):
break
consensus seq
YELLKVL-GKGAFGE--------------------------------------------------------------------V-Y---KAR-D-----------KD-------------------T------G-----E--E--VAVKV----LK-K---G----E-SS----------------K--K-R--K-------------E-------F-LR-EIKI-LKK------L------------------R--------H---PNIVRLYGVF-----------Q-----E-D----------E----D----P----LY--------LVME-YM-------P-GG----SLFD---L---L-R---K--R---R------------------------------------------------------------------------------G-------L-------S-E-------KTLRFIAAQIASGLEYLH-S-K-GI---------IHRDLKPENIL---L--------D--------------------------S------------------D-G-HVKLADF---GL---AR-----------E-----------L-Y----------S---D-------D---------------------------------G-Y--R-------------------T----------T-----T-FVG-T----P-------------------RY--MAPEV--L---------L-------G-------------GG--Y-SKKSDVWSL---GVLLYELLTG---------G--K---P-----P-F------P----G-E-------S---------------NE---E----LL----E-----------KIL------K---------------------G--------------------Y----------------------R---------------------L---P-I-P---E--P----D----C-----S-----P--E---AK---DLIKK--------CLQK-DPEKRPT
sp|Q22RR1||agc:agc-sar|Tetrahymena sp|Q22RR1||agc:agc-sar|Tetrahymena
LVYIKEL-GRGNFGS--------------------------------------------------------------------V-N---LVN-E-----------KQ-------------------K------D-----R--L--FAMKV----FI-S---N----Y-IFqr--------------G--I-E--K-------------Y-------I-LR-EKST-LQK------C------------------N--------F---PFIMRYYRSF-----------Y-------D----------N----Y----H----IY--------FMNE-FI-------N-GM----DFFT---V---M-R---E--V---G------------------------------------------------------------------------------L-------F-------N-K-------QQAQFYTAFIILTLQYLN-N-Q-GI---------VYRDLKPENIL---I--------D--------------------------H------------------E-G-WPKLVDM---GT---AK-----------Y-----------L------------Y---D-------Kn--------------------------------Q-Q--L-------------------T----------Rty---S-LVG-T----P-------------------HY--MAPEV--I---------Q-------Q-------------KG--Y-GFAVDIYSL---GVILYELLV----------G--Y---L-----P-Y------Ged--V-D-------D---------------PI---E----VY----Q-----------LIL------E---------------------G-------------------------------------------R---------------------L---G-F-P---N--H----M----K-----N-----R--L---SK---KLISQ--------LMAK-SPEVRLG
sp|Q234E6||agc:agc-sar|Tetrahymena sp|Q234E6||agc:agc-sar|Tetrahymena
LIVIKKL-GFGQFGS--------------------------------------------------------------------V-F---LVK-E-----------KG-------------------K------K-----K--L--YGLKC----VS-K---A----Q-VVeq--------------S--L-E--K-------------H-------I-QN-EKQV-MEF------N------------------N--------F---PFVMKFLRSF-----------K-------D----------D----R----C----IY--------FLLE-FI-------Q-GM----ELFD---V---I-R---E--I---G------------------------------------------------------------------------------L-------L-------S-T-------YDSQFYIGSLILTLEYLH-S-N-YI---------IYRDIKPENIM---V--------D--------------------------H------------------A-G-YLKLIDM---GT---AK-----------I-----------M------------K---S-------Ka--------------------------------G-T--V-------------------T----------Rtf---T-IIG-T----P-------------------HY--MAPEV--I---------S-------G-------------KG--Y-NFLVDLWSV---GICLYEFMC----------G--Y---V-----P-F------Aee--A-E-------D---------------PY---E----IY----E-----------EII------K---------------------K-------------------------------------------E---------------------I---Q-F-P---A--Y----M----K-----D-----A--V---AK---QLMLQ--------LLNK-IPEIRLG
The write()
function takes three arguments — 1) a sequence object, 2) filename, and 3) file format. The code below reads a fasta file with multiple sequences and then save the first 10 sequences in a new file.
all_seqs = []
for seq_record in SeqIO.parse("kinases.txt", "fasta"):
all_seqs.append(seq_record)
SeqIO.write(all_seqs[0:10],"test.aln","clustal")
10
Multiple Sequence Alignment¶
The AlignIO
class has functions to parse alignment files. The read()
and write()
functions have a similar syntax to the corresponding functions in the SeqIO
class. The alignment object stores sequences in 2D array format such that the rows are number of sequences and columns represent alignment length. To extract a sub-set of an alignment, slicing feature can be used. The code below shows reading an alignment file in fasta format followed by selecting a portion of this alignment and save it in a new file in clustal format. The subset is extracted by giving the range for the rows and columns within square brackets. The numbering for both rows and columns starts with zero. In the example below first ten sequences in the alignment are selected since range of rows is :10
and the colums range is 3:12
.
from Bio import AlignIO
align1 = AlignIO.read("kinases.txt", "fasta")
print(align1)
Alignment with 5489 rows and 1094 columns
YELLKVL-GKGAFGE-----------------------------...RPT consensus
LVYIKEL-GRGNFGS-----------------------------...RLG sp|Q22RR1||agc:agc-sar|Tetrahymena
LIVIKKL-GFGQFGS-----------------------------...RLG sp|Q234E6||agc:agc-sar|Tetrahymena
LEVVKKL-GNGQFGS-----------------------------...RTG sp|Q23KG5||agc:agc-sar|Tetrahymena
LIYIKTL-AFGQFGP-----------------------------...RLG sp|Q23DN8||agc:agc-sar|Tetrahymena
LIYIKKL-GVGQFGS-----------------------------...RLG sp|I7MFS4||agc:agc-sar|Tetrahymena
MIYIKKL-GFGQFGS-----------------------------...RLG sp|I7M3B5||agc:agc-sar|Tetrahymena
MISIKKL-GFGQFGS-----------------------------...RLG sp|I7MD55||agc:agc-sar|Tetrahymena
LIYIKKL-GEGQFGI-----------------------------...RLG sp|Q869J9|pkg-2|agc:pkg|Paramecium
FEMIRVL-GKGCAGK-----------------------------...RIR sp|A8N3F0||agc:agc-unique|Coprinopsis
TERMVVL-KGNLFGF-----------------------------...RLG sp|D6RN25||agc:agc-unique|Coprinopsis
YSLTRFL-SQGAQGK-----------------------------...RIH sp|A8P464||agc:agc-unique|Coprinopsis
FDIISSY-NADSDTR-----------------------------...RLN sp|A8N0Q1||agc:agc-unique|Coprinopsis
FEKKFVL-GKGSYGK-----------------------------...RLQ sp|Q4Q7M5||agc:agc-unique|Leishmania
FEILKHI-GSGAYGE-----------------------------...RLG sp|Q19858||agc:agc-unique|Caenorhabditis
FEMMRVL-GKGCAGK-----------------------------...RM- sp|A0A0B7FBE1|cbk1|agc:ndr|Thanatephorus
FELLKVL-GVGSFGR-----------------------------...RLG sp|P34102|pkgc|agc:agc-unique|Dictyostelium
FDLLKVL-GVGSFGR-----------------------------...RLG sp|F4PNA6|pkgc|agc:agc-unique|Dictyostelium
...
------------NSE-----------------------------...--- sp|W5L2T9|mlkl|tkl:tkl-unique|Astyanax
sub_alignment = align1[:10,3:12] #[row range, column range]
print(sub_alignment)
AlignIO.write(sub_alignment,"msa1.aln","clustal")
Alignment with 10 rows and 9 columns
LKVL-GKGA consensus
IKEL-GRGN sp|Q22RR1||agc:agc-sar|Tetrahymena
IKKL-GFGQ sp|Q234E6||agc:agc-sar|Tetrahymena
VKKL-GNGQ sp|Q23KG5||agc:agc-sar|Tetrahymena
IKTL-AFGQ sp|Q23DN8||agc:agc-sar|Tetrahymena
IKKL-GVGQ sp|I7MFS4||agc:agc-sar|Tetrahymena
IKKL-GFGQ sp|I7M3B5||agc:agc-sar|Tetrahymena
IKKL-GFGQ sp|I7MD55||agc:agc-sar|Tetrahymena
IKKL-GEGQ sp|Q869J9|pkg-2|agc:pkg|Paramecium
IRVL-GKGC sp|A8N3F0||agc:agc-unique|Coprinopsis
1
# %load msa1.aln
CLUSTAL X (1.81) multiple sequence alignment
consensus LKVL-GKGA
sp|Q22RR1||agc:agc-sar|Tetrahy IKEL-GRGN
sp|Q234E6||agc:agc-sar|Tetrahy IKKL-GFGQ
sp|Q23KG5||agc:agc-sar|Tetrahy VKKL-GNGQ
sp|Q23DN8||agc:agc-sar|Tetrahy IKTL-AFGQ
sp|I7MFS4||agc:agc-sar|Tetrahy IKKL-GVGQ
sp|I7M3B5||agc:agc-sar|Tetrahy IKKL-GFGQ
sp|I7MD55||agc:agc-sar|Tetrahy IKKL-GFGQ
sp|Q869J9|pkg-2|agc:pkg|Parame IKKL-GEGQ
sp|A8N3F0||agc:agc-unique|Copr IRVL-GKGC
Running BLAST over the internet¶
Biopython offers a functionality to programmatically run BLAST on the NCBI servers using the Bio.Blast
class.
To run blast online at NCBI servers, Bio.Blast
can be used which has different function to run Blast and also to parse the output. The NCBIWWW
library has qblast()
function that takes three arguments &emdash; 1) blast program (blastp, blastn, etc.), 2) database (any of the databases available at NCBI, and 3) sequence. Once the blast serach is over the output can be saved in a file. This output would be in XML format. You can use read()
function within the NCBIXML class to parse this output. The code below shows running a blast search using qblast
against the non-redundant database available at in NCBI.
The output file saved in the previous step has all the hits identified in the Blast search. These hits follow a hierarchical manner such that each result would have multiple alignments and within each alignment would be multiple high scoring pairs (hsps) i.e. Blast object \(\longrightarrow\) Alignment \(\longrightarrow\) hsps. For more details on this you may refer to the Blast documentation available at NCBI.
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
result_ncbi = NCBIWWW.qblast("blastn", "nt", "8332116")
with open("my_blast.xml", "w") as file_handle:
file_handle.write(result_ncbi.read())
result_handle = open("my_blast.xml")
blast_record = NCBIXML.read(result_handle)
count = 0
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
print(hsp)
count += 1
if(count==5):
break
Score 482 (435 bits), expectation 4.3e-117, alignment length 624
Query: 59 ACAGAAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGA...GTA 678
|| ||||||||| |||| | |||| || |||| |||| | ||||... ||
Sbjct: 278 ACCGAAAATGGGCAGAGGAGTGAATTATATGGCAATGACACCTGA...TTA 901
Score 468 (423 bits), expectation 2.7e-113, alignment length 590
Query: 63 AAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAA...CCC 649
|||||||| ||| |||| | || ||||| |||||||| || |||...|||
Sbjct: 11 AAAATGGGTAGACGAATGGATTATTTGGCGATGAAAACCGAGCAA...CCC 600
Score 448 (405 bits), expectation 7.2e-108, alignment length 597
Query: 87 TTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGAT...TAG 679
||||||||||||||||| ||| |||| |||||||| |||| |||...|||
Sbjct: 81 TTGGCCATGAAAACTGAGCAAATGGCGTTGGCTAATTTGATAGAT...TAG 677
Score 441 (398 bits), expectation 1.1e-105, alignment length 593
Query: 65 AATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATT...CTG 655
||||||||| ||| | | | ||||||||||||||||||| ... ||
Sbjct: 254 AATGGGGAG-GAA--GGATAATTTGGCCATGAAAACTGATCC---...ATG 838
Score 439 (397 bits), expectation 3.7e-105, alignment length 596
Query: 63 AAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAA...CTG 655
||||||||||| ||| ||| ||||| |||| |||||||| |...|||
Sbjct: 170 AAAATGGGGAGG---ATGGAGTTTTTGGCTATGAGAACTGATCCA...CTG 756
The hsps
object has several attributes including the Blast statistics such as evalue, score, positives, etc. These can be used to extract hits based on certain conditions. E.g., the code below shows saving hits from the previous Blast search with evalue greater than 1e-105 to a new file.
with open("new_file.txt", "w") as file_handle:
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if (hsp.expect < 1e-105):
file_handle.write(str(hsp)+"\n")
file_handle.write("\n")
print("DONE")
DONE
BLAST search using sequence file¶
To run the Blast search using a sequence file instead of gi number, we first need to create a seqeunce object and then pass it on to the qblast
function as shown below. To run this code, save the protein sequence below in a new file example1.fasta.
MFHPGMTSQPSTSNQMYYDPLYGAEQIVQCNPMDYHQANILCGMQYFNNSHNRYPLLPQMPPQFTNDHPY DFPNVPTISTLDEASSFNGFLIPSQPSSYNNNNISCVFTPTPCTSSQASSQPPPTPTVNPTPIPPNAGAV LTTAMDSCQQISHVLQCYQQGGEDSDFVRKAIESLVKKLKDKRIELDALITAVTSNGKQPTGCVTIQRSL DGRLQVAGRKGVPHVVYARIWRWPKVSKNELVKLVQCQTSSDHPDNICINPYHYERVVSNRITSADQSLH VENSPMKSEYLGDAGVIDSCSDWPNTPPDNNFNGGFAPDQPQLVTPIISDIPIDLNQIYVPTPPQLLDNW CSIIYYELDTPIGETFKVSARDHGKVIVDGGMDPHGENEGRLCLGALSNVHRTEASEKARIHIGRGVELT AHADGNISITSNCKIFVRSGYLDYTHGSEYSSKAHRFTPNESSFTVFDIRWAYMQMLRRSRSSNEAVRAQ AAAVAGYAPMSVMPAIMPDSGVDRMRRDFCTIAISFVKAWGDVYQRKTIKETPCWIEVTLHRPLQILDQL LKNSSQFGSS
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
seq_file = open('example1.fasta')
result_handle2 = NCBIWWW.qblast("blastp", "nr", seq_file.read())
seq_file.close()
with open("test_blast.xml", "w") as out_handle:
out_handle.write(result_handle2.read())
blast_output = open("test_blast.xml")
blast_record = NCBIXML.read(blast_output)
print(blast_record.alignments[0])
sp|P45897.1| RecName: Full=Dwarfin sma-4; AltName: Full=MAD protein homolog 3 [Caenorhabditis elegans] >gb|AAA97605.1| SMA-4 [Caenorhabditis elegans]
Length = 570
Let’s say we need only the alignment with the mouse sequence, then, to print first 50 characters of each alignment with the mouse sequence along with corresponding statistics, the following code can be used.
for alignment in blast_record.alignments:
if "musculus" in alignment.title:
print(alignment.title)
for hsp in alignment.hsps:
print(hsp.query[0:50])
print(hsp.match[0:50])
print(hsp.sbjct[0:50])
print(hsp.positives, hsp.score, hsp.expect)
ref|NP_001351897.1| mothers against decapentaplegic homolog 4 isoform 3 [Mus musculus] >gb|ADO32892.1| mothers against decapentaplegic-like protein 4 splice variant 3 [Mus musculus]
TAMDSCQQISHVLQCYQQGGEDSDFVRKAIESLVKKLKDKRIELDALITA
T+ D+C I H L C++QGGE F ++AIESLVKKLK+K+ ELD+LITA
TSNDACLSIVHSLMCHRQGGESETFAKRAIESLVKKLKEKKDELDSLITA
272 794.0 6.86183e-96
To save the Blast output in csv format, we can use the csv
library as shown below.
import csv
csv_out = open("blast_out.csv", "w", newline='')
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
new_row = csv.writer(csv_out, delimiter=",")
new_row.writerow([alignment.title.split("|")[-1], hsp.positives, hsp.expect])
csv_out.close()