My goal is to extract the sequences of antimicrobial peptides (AMPs) from the NCBI protein database, searching with keywords such as "Antimicrobial Peptides" and "AMPs" and restricting the sequence length.
I have written the code below, which uses the Biopython library to run the search and fetch the matching sequences.
The Code
from Bio import Entrez
from Bio import SeqIO

def scrape_fasta_sequence(keywords):
    # Provide your email address to NCBI
    Entrez.email = 'mabdullahafzal02@gmail.com'
    # Create the query string with filters: RefSeq records only, sequence length 7 to 50
    query = f'{keywords} AND srcdb_refseq[PROP] AND 7:50[SLEN]'
    # Search for protein IDs that match the query
    handle = Entrez.esearch(db='protein', term=query)
    record = Entrez.read(handle)
    handle.close()
    id_list = record['IdList']
    # Use the protein IDs to fetch the sequences, one record per request
    fasta_sequences = []
    for protein_id in id_list:
        handle = Entrez.efetch(db='protein', id=protein_id, rettype='fasta', retmode='text')
        fasta_sequence = SeqIO.read(handle, 'fasta')
        fasta_sequences.append(fasta_sequence)
        handle.close()
    return fasta_sequences

# Example usage
keywords = 'Antimicrobial Peptides OR AMPs'
fasta_sequences = scrape_fasta_sequence(keywords)
for fasta_sequence in fasta_sequences:
    print(fasta_sequence)
Output
The output was not specific to antimicrobial peptides. It also included other proteins that I did not need. [output snippet]
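To make "not specific" concrete, one option I can think of is to filter the fetched records locally by their description line and length. This is only a minimal sketch: the keyword list `AMP_TERMS` and the helper names are my own assumptions, and the demo records are made-up stand-ins (any object with `.description` and `.seq`, such as a Biopython `SeqRecord`, should work the same way).

```python
from types import SimpleNamespace

# Hypothetical keyword list; extend it with whatever terms matter for your search.
AMP_TERMS = ("antimicrobial", "defensin", "cathelicidin")

def looks_like_amp(record, terms=AMP_TERMS, min_len=7, max_len=50):
    """Return True if the description mentions an AMP term and the length is in range."""
    description = record.description.lower()
    has_term = any(term in description for term in terms)
    return has_term and min_len <= len(record.seq) <= max_len

def filter_amps(records, **kwargs):
    """Keep only the records that pass looks_like_amp."""
    return [record for record in records if looks_like_amp(record, **kwargs)]

# Made-up stand-in records (not real NCBI entries), used only to show the filter.
records = [
    SimpleNamespace(id="demo_1",
                    description="antimicrobial peptide [Example organism]",
                    seq="KKLLKKLLKKLLKK"),
    SimpleNamespace(id="demo_2",
                    description="ribosomal protein [Example organism]",
                    seq="MKVRASVKKICRNCKIIKRRG"),
]

kept = filter_amps(records)
print([record.id for record in kept])  # prints ['demo_1']
```

A description-based filter like this can only be as good as the annotation of each record, so it would miss AMPs whose title does not mention any of the listed terms.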
Could you help me modify the code so that it returns only antimicrobial peptides?
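One direction I am considering is tightening the query itself. My understanding is that an unquoted, untagged term is searched across all fields, so any record that merely mentions the words can match. The sketch below quotes each phrase, restricts it to a single search field, and groups the OR terms in parentheses before the filters are applied. The helper name `build_amp_query` is hypothetical, and the choice of the `[Title]` field is an assumption on my part that I have not verified against real results.

```python
def build_amp_query(phrases, min_len=7, max_len=50, field="Title"):
    """Build an Entrez query string restricted to one search field.

    Each phrase is quoted so it is searched as a phrase, tagged with the
    chosen field, and the OR group is parenthesized before the filters.
    """
    tagged = " OR ".join(f'"{phrase}"[{field}]' for phrase in phrases)
    return f"({tagged}) AND srcdb_refseq[PROP] AND {min_len}:{max_len}[SLEN]"

query = build_amp_query(["antimicrobial peptide", "antimicrobial peptides"])
print(query)
# prints ("antimicrobial peptide"[Title] OR "antimicrobial peptides"[Title]) AND srcdb_refseq[PROP] AND 7:50[SLEN]
```

The resulting string would be passed as `term=query` to `Entrez.esearch(db='protein', ...)` in place of the f-string in my function. Since `esearch` returns only 20 IDs by default, I would probably also need to set `retmax` to see more than the first few hits.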