I have one file (index1) with 17,270,877 IDs, and another file (read1) with a subset of these IDs (17,211,741). For both files, the IDs are on every 4th line.
I need a new file (index2) that contains only the IDs in read1. For each of those IDs I also need to grab the next 3 lines from index1. So I'll end up with index2, whose format exactly matches index1 except that it only contains IDs from read1.
I am trying to implement the methods I've read here, but I'm stumbling on these two points: 1) I need to check IDs on every 4th line, but I need all of the data in index1 (in order) because I have to write the 3 lines that follow each ID. 2) Unlike that post, which is about searching for one string in a large file, I'm searching for a huge number of strings in another huge file.
Can someone point me in the right direction? Maybe none of those 5 methods is ideal for this. I don't know any information theory, but we have plenty of RAM, so I think holding the data in RAM for searching would be the most efficient option. I'm really not sure.
Here is a sample of what the index file looks like (IDs start with @M00347):
@M00347:30:000000000-BCWL3:1:1101:15589:1332 1:N:0:0
CCTAAGGTTCGG
+
CDDDDFFFFFCB
@M00347:30:000000000-BCWL3:1:1101:15667:1332 1:N:0:0
CGCCATGCATCC
+
BBCCBBFFFFFF
@M00347:30:000000000-BCWL3:1:1101:15711:1332 1:N:0:0
TTTGGTTCCCGG
+
CDCDECCFFFCB
read1 looks very similar, but the lines before and after the '+' are different.
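
To make the in-RAM idea concrete, here is a minimal Python sketch of what I have in mind. It is not one of the 5 methods from that post and it is untested on the real files. It assumes every record is exactly 4 lines and that IDs can be matched on the part of the ID line before the first space; the function names (read_records, record_key, filter_index) are made up for illustration. The IDs from read1 go into a set, then index1 is streamed once, in order, and a record is written only if its ID is in the set.

```python
from itertools import islice


def read_records(handle):
    """Yield successive 4-line records (lists of lines) from an open file."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        yield record


def record_key(header_line):
    """Return the ID line up to the first whitespace (assumed matching key)."""
    parts = header_line.split(None, 1)
    return parts[0] if parts else ""


def filter_index(index_path, read_path, out_path):
    """Write to out_path the index records whose ID also occurs in read_path.

    Returns the number of records written.
    """
    # Hold all read1 IDs in memory; set membership tests take constant
    # time on average, so each index1 record costs one lookup.
    with open(read_path) as reads:
        wanted = {record_key(rec[0]) for rec in read_records(reads)}

    kept = 0
    # Stream index1 once, so the output keeps the original record order.
    with open(index_path) as index, open(out_path, "w") as out:
        for rec in read_records(index):
            if record_key(rec[0]) in wanted:
                out.writelines(rec)  # ID line plus the 3 lines after it
                kept += 1
    return kept
```

Calling filter_index("index1", "read1", "index2") would then produce index2. Only the set of read1 IDs has to fit in RAM; index1 itself is never loaded whole.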