GNU parallel used with xargs and awk

Question

I have two large tab-separated files, A.tsv and B.tsv. They look like this (the header line is not in the files):

A.tsv:  
ID AGE  
User1  18   
...

B.tsv:  
ID INCOME  
User4  49000  
...

I want to select the list of IDs in A such that 10 <= AGE <= 20, then select the rows in B that match that list, and I want to use GNU parallel for this. My attempt has two steps:

cat A.tsv | parallel --pipe -q awk '{ if ($3 >= 10 && $3 <= 20) print $1}' > list.tsv

cat list.tsv | parallel --pipe -q xargs -I% awk 'FNR==NR{a[$1];next}($1 in a)' % B.tsv > result.tsv

The first step works, but the second fails with an error like:

awk: cannot open User1 (No such file or directory)

How can I fix this? Would this method still work if A.tsv and list.tsv are 2 to 3 times larger than memory?

Does the word 'User1' exist in your list.tsv? Should it? If not, why is it there? Good luck. – shellter, Feb 12 '14 at 20:51
Yes, the word 'User1' exists in the file; the header line containing ID, AGE, or INCOME does not. I guess the error occurs because GNU parallel's --pipe argument doesn't work in the second step, so the output is treated as a file name argument rather than stdin. I don't know why. – Bamqf, Feb 12 '14 at 22:33
While I appreciate brief example files very much, it's not clear why you need parallel and xargs. It should be easy to construct a one-liner with awk that does what you want, assuming you don't have terabytes of data to process. Good luck! – shellter, Feb 12 '14 at 22:39

user32 · Accepted Answer · 2014-02-12T22:52:40.270

4

$ for I in $(seq 8 2 22); do echo -e "User$I\t$I" >> A.txt; done; cat A.txt
User8   8
User10  10
User12  12
User14  14
User16  16
User18  18
User20  20
User22  22

$ for I in $(seq 8 2 22); do echo -e "User$I\t100${I}00" >> B.txt; done; cat B.txt
User8   100800
User10  1001000
User12  1001200
User14  1001400
User16  1001600
User18  1001800
User20  1002000
User22  1002200

$ cat A.txt | parallel --pipe -q awk '{if ($2 >= 10 && $2 <= 20) print $1}' > list.txt
$ cat B.txt | parallel --pipe -q grep -f list.txt
User10  1001000
User12  1001200
User14  1001400
User16  1001600
User18  1001800
User20  1002000
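
One caveat, offered as a sketch rather than as part of the answer above: `grep -f list.txt` treats each line of the list as a regular expression and matches it anywhere in a line, so an ID such as User1 would also select User10. Adding `-F` (fixed strings) and `-w` (whole words) tightens the match. The file names `ids.txt`, `data.tsv`, `loose.txt` and `exact.txt` are made up for this demonstration.

```shell
# Hypothetical demo data: User1 is a prefix of User10 and User12.
printf 'User1\t100\nUser10\t200\nUser12\t300\n' > data.tsv
printf 'User1\n' > ids.txt

# Plain -f: the pattern "User1" matches all three lines as a substring.
grep -f ids.txt data.tsv > loose.txt

# -F: patterns are fixed strings; -w: a match must be a whole word.
grep -F -w -f ids.txt data.tsv > exact.txt

cat exact.txt
```

The same two flags can be added to the grep call in the `parallel --pipe` pipeline above. Because `-w` is word-based, IDs containing punctuation may still need an exact comparison on the first column, for example with awk.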

edited Feb 12 '14 at 22:52

answered Feb 12 '14 at 21:38


Thank you, but I need a solution that enables parallel computing. – Bamqf Feb 12 '14 at 22:35
I added `parallel --pipe -q`. – user32 Feb 12 '14 at 22:53
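
On the error in the question: `xargs -I%` replaces `%` with each input line, so awk is started as `awk '...' User1 B.tsv` and tries to open User1 as a file. A possible fix, sketched here with made-up sample data, is to pass list.tsv to awk as a real file argument and read B.tsv from standard input (`-`). The commented line shows how the same awk command might be wrapped in `parallel --pipe`; that form is an untested assumption here. It relies on list.tsv being non-empty and small enough for each awk process to hold in memory, while B.tsv is streamed in blocks and never loaded whole.

```shell
# Made-up sample data in the question's layout (tab separated, no header).
printf 'User1\t18\nUser2\t35\nUser4\t12\n' > A.tsv
printf 'User1\t30000\nUser2\t52000\nUser4\t49000\n' > B.tsv

# Step 1: IDs with 10 <= AGE <= 20 (AGE is the second column here).
awk -F'\t' '$2 >= 10 && $2 <= 20 { print $1 }' A.tsv > list.tsv

# Step 2: load list.tsv into an array, then filter B.tsv read from stdin.
# FNR==NR is true only while awk reads its first file argument (list.tsv).
awk -F'\t' 'FNR==NR { ids[$1]; next } $1 in ids' list.tsv - < B.tsv > result.tsv

# Possible parallel form of step 2 (assumes GNU parallel is installed):
#   parallel --pipe -q awk -F'\t' 'FNR==NR { ids[$1]; next } $1 in ids' list.tsv - < B.tsv

cat result.tsv
```

With `parallel --pipe`, output blocks may arrive in a different order than the input unless `-k` is given.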

juan4 · Answer 2 · 2022-05-14T18:08:34.510

0

I know this question is old (yes, I saw that it was asked 8 years, 3 months ago).

My solution uses only xargs and awk, fits on one line, needs no intermediate file, and doesn't require installing a new tool:

awk '{if ($2 >= 10 && $2 <= 20) print $1}' A.tsv | xargs -I myItem awk --assign quebuscar=myItem '$1==quebuscar {print}' B.tsv
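
Note that `xargs -I` starts one awk process per ID, so B.tsv is scanned once for every selected ID. A single-pass variant, sketched below under the assumption that the selected IDs fit in memory and that at least one ID is selected, also avoids the intermediate file by reading the ID list from standard input (`-`). The sample data and the output name `joined.tsv` are made up for illustration.

```shell
# Made-up sample data (tab separated, no header).
printf 'User8\t8\nUser10\t10\nUser20\t20\nUser22\t22\n' > A.tsv
printf 'User8\t100800\nUser10\t1001000\nUser20\t1002000\nUser22\t1002200\n' > B.tsv

# The first awk emits the matching IDs; the second reads them from stdin ("-")
# into an array, then scans B.tsv once and prints rows whose ID is in it.
awk '$2 >= 10 && $2 <= 20 { print $1 }' A.tsv |
  awk 'FNR==NR { ids[$1]; next } $1 in ids' - B.tsv > joined.tsv

cat joined.tsv
```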

edited May 14 '22 at 18:08

answered May 14 '22 at 18:05

