
So I have around 30 folders, and each folder contains ~75 text files. They are actually BGP dumps taken from RouteViews.
The total amount of data sums up to ~480 GB, and now I have to grep a list of IPs from it. So I used normal grep and looped over the IP list, then looped over the datasets, but it was too slow.
I came across an SO post on using grep with xargs and it worked well. I timed the results: ~7 sec for normal grep and ~1.5 sec for parallel grep on 1 folder (using 24 instances at once).
Now here's the problem: when I did this in a nested loop over the IP list and the datasets, it worked well for the first folders, but eventually the multiple greps went into uninterruptible (D) state, and I don't know why...
The command I used for grep is -
find ./ -name "rib.${YYMM}${DATE}.${TIME}*" -print0 | xargs -0 -n1 -P24 grep ${prefix} | wc -l >> ${TIME}
where ${YYMM}, ${DATE}, and ${TIME} identify the files, and ${prefix} is the loop variable for the prefixes to be grepped.
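A tidied sketch of the same pipeline (an illustration, not a guaranteed fix): the variables are quoted and -F is added so the dots in ${prefix} are matched as literal dots rather than as regex wildcards:

find . -name "rib.${YYMM}${DATE}.${TIME}*" -print0 \
  | xargs -0 -n1 -P24 grep -F "${prefix}" \
  | wc -l >> "${TIME}"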
Sample naming convention for files -

rib.20191201.0200_chicago
rib.20191230.2000_chile
rib.20191215.1400_sydney

Here is the screenshot of htop
Specs: I'm using an Ubuntu 16.04.2 server with 20 vCPUs and 50 GB RAM; according to htop, they are definitely not overloaded.
Is there any workaround for this?
Since I have a huge IP list, even 1 sec makes a big difference in the long run.

  • You are aware that `grep` matches a `regex` and that `.` matches any character, not a dot? Are you interested in `grep` output, or just the count? Why not `grep -c`? (I also think you could try `-n3`) `is the iterable variable for prefixes` So there are many prefixes? Why not filter all prefixes in one go (assuming I/O is the slowest)? `went in uninterruptible state` Are you sure that is not caused by waiting for I/O? – KamilCuk Sep 28 '20 at 10:53
  • Yep, I'm aware of the regex, but in my case it doesn't matter as I'll either get an exact match or not. I'm more interested only in the count. The file format is just IP .... – T Wellick Sep 28 '20 at 10:56
  • There are about ~ 5000 prefixes, is it possible to filter them at one go? – T Wellick Sep 28 '20 at 10:58
  • `is it possible to filter them at one go?` Write an `awk` script. It would be slower than grep, but reading the same files 5000 times will be way slower. Even `grep -of <(patterns) | sort | uniq -c` could be faster than the I/O. – KamilCuk Sep 28 '20 at 11:00
  • Could you hint me a little with awk? – T Wellick Sep 28 '20 at 11:02
  • `seq 100 | awk 'BEGIN{ pattern[0]="1"; pattern[1]="2" } { for (i in pattern) if ($0 ~ pattern[i]) count[i]++ } END{ for (i in count) print i " occured " count[i] " times" }'` – KamilCuk Sep 28 '20 at 11:08
  • One last thing, should I run them in parallel with `nohup cmd &` or simply in the foreground? Which one is better optimized? – T Wellick Sep 28 '20 at 11:18
  • "better optimized" depends on your hardware's capabilities. Typically you'll gain from a small amount of parallelism but too much will slow things down as processes start contending for scarce/shared resources (like disk head position and scheduling); hence why things like `xargs -P` exist. – Charles Duffy Sep 28 '20 at 11:26
  • That said, even if you do want to run something in the background in a way that survives the terminal it's in closing, there's barely ever a good reason to use `nohup`. With bash's built-in `disown` and redirection of stdin/stdout/stderr you have everything nohup does in native shell. – Charles Duffy Sep 28 '20 at 11:28
  • And I fully agree with the prior comments that suggest giving up grep in favor of a one-pass approach. – Charles Duffy Sep 28 '20 at 11:31
  • If you're overwhelming, say, an NFS server all your files are stored on, you get processes waiting for syscalls to return, and that's exactly what D state _is_. – Charles Duffy Sep 28 '20 at 11:32
  • Actually I said `nohup` because it's my university's server. I have to connect to their VPN every time before I can SSH into it – T Wellick Sep 28 '20 at 11:33
  • I did `ulimit -aH` & it shows `open files` limit as `1048576`. Will increasing the limit on open files have any significant effect on performance of grep? The max limit is `5133157` – T Wellick Sep 28 '20 at 16:26
  • No impact at all. That limit tells you when attempts to open files will fail. If you aren't getting errors caused by failures, changing it won't make any difference. – Charles Duffy Sep 28 '20 at 18:36
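A fuller sketch of the single-pass approach suggested in the comments above, assuming the prefixes sit one per line in a file called prefixes.txt (a hypothetical name). Each data file is read only once, and every prefix is counted with a fixed-string match; note that if find splits the file list across several awk invocations, the per-prefix counts would need to be summed afterwards:

find . -name 'rib.*' -exec awk '
    NR == FNR { pattern[$0]; next }                       # first file: load the prefixes
    { for (p in pattern) if (index($0, p)) count[p]++ }   # fixed-string match per line
    END { for (p in count) print p, count[p] }            # one total per prefix
' prefixes.txt {} +

And for the nohup question: as the comments say, backgrounding with redirections plus bash's built-in disown gives the same survive-the-terminal behaviour, e.g. (run_counts.sh and run.log are illustrative names):

./run_counts.sh > run.log 2>&1 < /dev/null &
disown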

1 Answer


OK, finally I figured a way out!
I wrote a bash script to upload the whole 480 GB dataset to MongoDB.
It takes a significant amount of time, but it pays off in the long run when I have a lot of prefixes to search. I indexed it on the prefix field and boom!
Now I can search a prefix in less than 200 milliseconds, compared with ~7 seconds for traditional grep and ~1.5 seconds for grep with xargs.
I wrote another shell script to loop over the prefixes, and it turned a computation of days into minutes (sketched after the sample commands below).
Sample code:
To upload:
mongoimport -d DB_name -c C_name --type csv --file "F_name" --headerline
To index:
mongo --quiet --eval "db.getCollection('C_name').createIndex({PREFIX:1});" DB_name
To search:
mongo --quiet --eval "db.getCollection('C_name').find({ PREFIX: '117.193.0.0/20' }).count();" DB_name
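A rough sketch of such an upload loop, assuming each dump has already been converted to CSV with a header row (which is what --headerline expects):

find . -name 'rib.*' -print0 | while IFS= read -r -d '' f; do
    mongoimport -d DB_name -c C_name --type csv --file "$f" --headerline
done

And a sketch of the prefix loop, assuming prefixes.txt holds one prefix per line and the counts go to counts.txt (both names are illustrative):

while IFS= read -r prefix; do
    count=$(mongo --quiet --eval "db.getCollection('C_name').find({ PREFIX: '${prefix}' }).count();" DB_name)
    printf '%s %s\n' "$prefix" "$count" >> counts.txt
done < prefixes.txt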
