I have a reference file (.fasta) and a list of gene IDs. For each ID in the gene ID list, I need to get the corresponding sequence into a text file. How can I automate this?
Things I've tried so far:
- sed
sed -n -e '/{GENEID1}/,/>/p' referencefile.fasta | sed $d >> seqs.txt
with '>' being the character at which I'd like sed to stop. The second sed removes the last line, because the range also grabs the header line of the next sequence. This works if I run it once for a single ID, but if I try
cat geneID.txt | xargs sed -n -e '/{}/,/>/p' referencefile.fasta >> seqs.txt
then I get just a list of IDs, with no sequences. It also takes a very long time, so I assume sed is reading through the whole reference file, but I don't see why it won't grab the sequences.
- grep
grep -o -P '(?={GENEID}).*(?=>)'
Here I have the same issue: it works for a single ID, but not with xargs or a loop.
- for loop
for LINE in $(cat geneIDs.txt); do
    echo $LINE >> seqs.txt
    sed -n -e '/$LINE/,/>/p' referencefile.fasta | sed $d >> seqs.txt
done
I'm also open to trying something in Python, though I'm not that well-versed in it yet. My preliminary attempt is based on this question here. I have a test ID list of 10 lines, which I tried to run like this:
t = open('test.txt', 'r')
test = t.readlines()
test = test.split()
t.close()
with open('referencefile.fasta', 'r') as ref:
    for line in ref:
        for i in test:
            if i in line:
                print(line)
With this one, I couldn't get even a single sequence from the reference file, with or without the loop.
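One thing that may be going wrong in the Python attempt, offered as a guess rather than a confirmed diagnosis: readlines() returns a list of lines, and a list has no split() method, so the third line would raise an error before the reference file is ever opened. Each line also keeps its trailing newline. A minimal sketch of reading the IDs instead, assuming one ID per line (parse_ids is a hypothetical helper name, not something from the linked question):

```python
def parse_ids(lines):
    """Return the set of IDs from an iterable of lines, one ID per line.

    Assumption: each line holds exactly one ID; blank lines are skipped.
    """
    ids = set()
    for line in lines:
        stripped = line.strip()  # drop the trailing newline and stray spaces
        if stripped:
            ids.add(stripped)
    return ids


# In-memory stand-in for the lines of the ID file.
example_lines = ["000000F\n", "000001F\n", "\n"]
print(sorted(parse_ids(example_lines)))
```

With a real file, the same function could be called as parse_ids(open('test.txt')), since a file object iterates over its lines.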
Can you guys spot the issue? Why won't any of these give me sequences?
Thanks in advance!
Edited to add:
Example reference:
>000000F
ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattttagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg
>000001F
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>000002F
TGCGTGAGGTGCTAGGGATGACAATTGAAAAGAGGACATTGATCGATCACTTGACTCATTTCAGAAAGGAGTTTGGGTTGTCCAACAAGTTGAGGGGGATGATCATCAGGCATCCTGAGT
test IDs: 000000F, 000001F
Ideal result:
000000F ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattttagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg
000001F NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Current result:
000000F 000001F
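To make the target concrete, here is a rough sketch of the behaviour I'm after, written in Python. It assumes that the ID is the first whitespace-delimited token after '>' on a header line, and that a sequence may span several lines until the next header; extract_sequences is a hypothetical name, and the sequences in the example are shortened stand-ins for the reference above.

```python
def extract_sequences(fasta_lines, wanted_ids):
    """Yield (record_id, sequence) for each FASTA record whose ID is wanted."""
    record_id, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record; emit it if wanted.
            if record_id in wanted_ids:
                yield record_id, "".join(chunks)
            # Assumption: the ID is the first token after '>'.
            tokens = line[1:].split()
            record_id = tokens[0] if tokens else ""
            chunks = []
        else:
            chunks.append(line)
    # Emit the final record, which has no following header to close it.
    if record_id in wanted_ids:
        yield record_id, "".join(chunks)


# Shortened stand-in for the example reference.
example_fasta = [
    ">000000F\n", "ctatcttcga\n",
    ">000001F\n", "NNNNNNNNNN\n",
    ">000002F\n", "TGCGTGAGGT\n",
]
for record_id, sequence in extract_sequences(example_fasta, {"000000F", "000001F"}):
    print(record_id, sequence)
```

Matching on the whole ID rather than on a substring avoids, for example, an ID like 00000F also matching a header such as >100000F.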