I need to go through a really large VCF file and find the rows whose column values match a list of IDs.

Here is what I have tried so far, but it does not work:

target_id=('id1' 'id2' 'id3' ...)

awk '!/#/' file_in | cut -f3,10-474|
for id in $target_id
do
    grep "target"
done

It only loops through the file looking for the first id in the target_id list.

Is there a way to loop through the file looking for all the ids in the target_id list? I want to output the entire row (the 3rd and 10th-474th columns) if the 3rd column matches.

lambda
  • The argument of your `for` loop is a single string, hence it is executed only once. Also, the variable substitution `${a list of ids}` is nonsense. While environment variables are permitted to contain spaces, shell variables are not. – user1934428 Jul 19 '19 at 05:06
  • @user1934428 Sorry for the ugly codes, just edited – lambda Jul 19 '19 at 07:08
  • Ok .. "piping to a for loop" does not really make sense, or is too complicated. Please try piping to "while read line" instead. Additionally, please don't forget to add an ending backslash after the pipe, to tell the shell that your command does not end at the carriage return just after the pipe. You first need to define a for loop to go through all values of target_id, then, inside this loop, use the while read line to read file_in one line at a time and grep ... – kalou.net Jul 19 '19 at 12:26

1 Answer

You can get the behaviour you wanted from the for loop with a single grep that searches for all the target IDs at once, for example:

egrep "id1|id2|id3"

This should also be faster, since you don't have to fork a new grep process for each target_id.

You mentioned that file_in (the VCF file) is huge. As long as you stay under the filesystem's limits, the size itself is not a problem. For example, ext2 and ext3 have a maximum file size of up to 2 TiB, and ext4 has a maximum of 16 TiB.

However, you may hit the limit on the size of command-line arguments if the $target_id list gets too big.
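If the list does get that large, one way around the argument-size limit is to keep the IDs in a file and let grep read its patterns from there. The following is only a sketch: the file names rows.tsv and ids.txt and the sample rows are made up for the demonstration, and rows.tsv stands in for the output of the cut step.

```shell
#!/bin/bash
# Hypothetical demo data: a tiny tab-separated stand-in for the cut output
# (column 1 = ID, remaining columns = sample fields).
workdir=$(mktemp -d)
printf 'id1\t0/1\t1/1\nid10\t0/0\t0/1\nid2\t1/1\t0/0\nid9\t0/0\t0/0\n' > "$workdir/rows.tsv"

# One ID per line; the list lives in a file, not on the command line.
printf 'id1\nid2\n' > "$workdir/ids.txt"

# -F: treat each line of ids.txt as a fixed string, not a regex
# -w: match whole words only, so id1 does not match id10
# -f: read the patterns from a file
matched=$(grep -w -F -f "$workdir/ids.txt" "$workdir/rows.tsv")
printf '%s\n' "$matched"

rm -r "$workdir"
```

Note that grep still looks at the whole line, so an ID that happens to appear as a word in one of the other columns would also match.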

The full pipeline is below. (A \ at the end of a line tells the shell that the command continues on the next line, which lets you split a very long command across several lines.)

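#!/bin/bash

# Space-separated list of IDs to search for
target_id="id1 id2 id3"

# 1. Drop the VCF header lines (those starting with #)
# 2. Keep column 3 (the ID) and columns 10-474
# 3. Keep rows matching any ID; tr joins the IDs with | into one
#    extended regex (egrep is equivalent to grep -E)
awk '!/^#/' file_in | \
cut -f3,10-474 | \
egrep "$(echo $target_id | tr ' ' '|')"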
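The question asks for rows where the 3rd column itself matches, whereas a plain grep matches anywhere in the row and would also accept id10 when looking for id1. As an alternative, here is a minimal awk sketch that compares column 3 exactly. The files demo.vcf and ids.txt and their contents are invented for the demonstration, and the loop runs to the last column (NF) rather than to 474 because the demo file only has 10 columns.

```shell
#!/bin/bash
# Hypothetical demo input: a minimal VCF-like file with 10 columns.
workdir=$(mktemp -d)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n'
  printf '1\t100\tid1\tA\tG\t.\t.\t.\tGT\t0/1\n'
  printf '1\t200\tid10\tC\tT\t.\t.\t.\tGT\t0/0\n'
  printf '1\t300\tid3\tG\tA\t.\t.\t.\tGT\t1/1\n'
} > "$workdir/demo.vcf"
printf 'id1\nid3\n' > "$workdir/ids.txt"

# First file (NR == FNR): remember each wanted ID as an array key.
# Second file: skip header lines, then print column 3 and columns 10..NF
# only when column 3 is exactly one of the wanted IDs.
result=$(awk -F'\t' -v OFS='\t' '
  NR == FNR { want[$1]; next }
  /^#/      { next }
  $3 in want {
    out = $3
    for (i = 10; i <= NF; i++) out = out OFS $i
    print out
  }
' "$workdir/ids.txt" "$workdir/demo.vcf")
printf '%s\n' "$result"

rm -r "$workdir"
```

Because the IDs are looked up as array keys, the file is read only once regardless of how many IDs there are, and the ID list never appears on the command line.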
kalou.net
  • I don't know why using egrep "id1|id2|id3|..." is not working ideally, but I managed to loop through each id and the egrep each of them. Thank you very much! – lambda Jul 19 '19 at 07:00
  • But I do wanna know why my original code might be wrong. It looked like this "target_list=$(sed -e '1d' file_in | cut -f2)". And then when using egrep, i did egrep "$(echo $target_id | tr ' \n' '|')" – lambda Jul 19 '19 at 07:04
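One possible explanation for the last comment, offered as a guess since the original input is not shown: with tr ' \n' '|', the newline that echo prints at the end is also turned into a |, so the pattern ends with an empty alternative. With GNU grep an empty alternative typically matches every line, which would make the filter appear to do nothing. The sketch below only compares the two pattern strings and reuses the placeholder IDs from the answer.

```shell
#!/bin/bash
target_id="id1 id2 id3"

# tr ' ' '|' leaves echo's trailing newline alone, and $(...) strips it.
good=$(echo $target_id | tr ' ' '|')

# tr ' \n' '|' also turns that trailing newline into a '|'.
bad=$(echo $target_id | tr ' \n' '|')

printf 'good: %s\nbad:  %s\n' "$good" "$bad"
```

If the IDs come from a file with one ID per line, something like paste -sd'|' joins them without leaving a trailing separator.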