I have around 30 folders, and each folder contains ~75 text files. They are BGP dumps taken from routeviews.
The total amount of data adds up to ~480 GB, and I have to grep a list of IPs from it. At first I used plain grep and looped over the IP list, then over the datasets, but it was too slow.
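Simplified, the serial version looked roughly like this (prefixes.txt and the directory glob are just stand-ins for my actual IP list and folder names):

while read -r prefix; do
    for dir in */; do
        # one plain grep per prefix per folder, counting matching lines
        grep -r "${prefix}" "${dir}" | wc -l
    done
done < prefixes.txt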
I came across an SO post about running grep in parallel with xargs, and it worked well. I timed the results for one folder: ~7 s with plain grep vs. ~1.5 s with the parallel grep (24 instances at once).
Now here's the problem: when I ran this in a nested loop over the IP list and the datasets, it worked fine for the first few folders, but eventually the multiple grep processes went into uninterruptible sleep (D state), and I don't know why.
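The stuck greps show up with a D in the STAT column of ps; something like this lists them along with the kernel wait channel they are blocked in:

ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'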
The command I used for grep is -
find ./ -name "rib.${YYMM}${DATE}.${TIME}*" -print0 | xargs -0 -n1 -P24 grep ${prefix} | wc -l >> ${TIME}
where ${YYMM}, ${DATE} and ${TIME} identify the files, and ${prefix} is the loop variable holding the prefix to grep for.
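Put together, the nested loop is roughly this (simplified; prefixes.txt stands in for my actual IP list, and the date/time variables are set per folder in the real script):

while read -r prefix; do
    for dir in */; do
        (
            cd "${dir}" || exit
            # same parallel grep as above, once per prefix per folder
            find ./ -name "rib.${YYMM}${DATE}.${TIME}*" -print0 \
                | xargs -0 -n1 -P24 grep "${prefix}" | wc -l >> "${TIME}"
        )
    done
done < prefixes.txt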
Sample naming convention for files -
rib.20191201.0200_chicago
rib.20191230.2000_chile
rib.20191215.1400_sydney
Specs: I'm using an Ubuntu 16.04.2 server with 20 vCPUs and 50 GB RAM; according to htop they are definitely not overloaded.
Is there any workaround for this?
Since the IP list is huge, even one second makes a big difference in the long run.