
This is my problem (for me, actually, a big problem).

I have a txt file with 1,130,395 lines; below is an example:

10812
10954
10963
11070
11099
10963
11070
11099
betti.bt
betti12
betti14
19432307
19442407
19451970
19461949

I have about 2000 .gz log files.

For every line of the txt file, I need a grep to be performed across all of the .gz files.

Here are example lines from the contents of the .gz files:

time=2019-02-28 00:03:32,299|requestid=30ed0f2b-9c44-47d0-abdf-b3a04dbb560e|severity=INFO |severitynumber=0|url=/user/profile/oauth/{token}|params=username:juvexamore,token:b73ad88b-b201-33ce-a924-6f4eb498e01f,userIp:10.94.66.74,dtt:No|result=SUCCESS
time=2019-02-28 00:03:37,096|requestid=8ebca6cd-04ee-4818-817d-30f78ee95731|severity=INFO |severitynumber=0|url=/user/profile/oauth/{token}|params=username:10963,token:1d99be3e-325f-3982-a668-30494cab9a96,userIp:10.94.66.74,dtt:No|result=SUCCESS

The txt file contains the usernames. I need to search the .gz files for lines where the username is present, the url contains "profile", and the line has "result=SUCCESS".

If something is found, write only this to a log file: the username found and the name of the log file in which it was found.

Is it possible to do something like this? I know that I need to use the zgrep command, but can someone help me? Is it possible to automate the process so it can run on its own?
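
For context, here is the naive brute-force approach as a minimal sketch (assuming bash and GNU zgrep; list.txt and found.log are illustrative names). It re-scans every .gz file once per username, so with over a million usernames it is far too slow in practice, which is what the answers below address:

# One zgrep pass over all the .gz files per username -- correct but very slow.
# Note: each username is treated as a regular expression by grep here.
while IFS= read -r user; do
    # -l prints only the names of files containing a match
    zgrep -l "url=/user/profile.*username:${user},.*result=SUCCESS" *.gz |
    while IFS= read -r file; do
        printf '%s found in %s\n' "$user" "$file" >> found.log
    done
done < list.txt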

Thanks all

Emanuele
  • Are any of the txt file entries even found in that sample line? – James Brown Feb 28 '19 at 17:08
  • No; I'll report another example, a line that does match one of the txt entries I gave: time=2019-02-28 00:03:37,096|requestid=8ebca6cd-04ee-4818-817d-30f78ee95731|severity=INFO |severitynumber=0|url=/user/profile/oauth/{token}|params=username:10963,token:1d99be3e-325f-3982-a668-30494cab9a96,userIp:10.94.66.74,dtt:No|result=SUCCESS – Emanuele Feb 28 '19 at 17:35
  • I should add that the txt file contains both numbers and letters, for example: betti.bt betti12 betti14 19432307 19442407 19451970 19461949 – Emanuele Feb 28 '19 at 17:37
  • Can you post an example txt file with like 10 lines and an example gz file with like 10 lines, and show the expected result for those inputs? It's unclear to me what you want to grep for. Just `grep 10812`? – KamilCuk Feb 28 '19 at 17:40

2 Answers


A rewrite using getline. It reads and hashes the file.txt usernames, then gunzips the .gz files given as parameters, splits each log line until it gets the field containing `username:`, extracts the actual username, and looks it up in the hash. Not properly tested etc. etc., standard disclaimer. Let me know if it works:

$ cat script.awk
BEGIN{
    while (( getline line < ARGV[1]) > 0 ) {       # read the username file
        a[line]                                    # and hash to a
    }
    close(ARGV[1])
    for(i=2;i<ARGC;i++) {                          # read all the other files
        cmd = "gunzip --to-stdout " ARGV[i]        # form uncompress command
        while (( cmd | getline line ) > 0 ) {      # read line by line
            m=split(line,t,"|")                    # split at pipe
            if(t[m]!="result=SUCCESS")             # check only SUCCESS records
                continue
            if(t[5]!~/profile/)                    # keep only "profile" urls (t[5] is the url field)
                continue
            n=split(t[6],b,/[=,]/)                 # username in 6th field
            for(j=1;j<=n;j++)                      # split to find it, set to u var:
                if(match(b[j],/^username:/)&&((u=substr(b[j],RSTART+RLENGTH)) in a)) {
                    print u,"found in",ARGV[i]     # output if found in a hash
                    break                          # exit the for loop once found
                }
        }
        close(cmd)
    }
}

Run it (using 2 copies of the same data):

$ awk -f script.awk file.txt log-0001.gz log-0001.gz
10963 found in log-0001.gz
10963 found in log-0001.gz
James Brown
  • I will try it right away. Can I launch it as `awk -f script.awk lista2.txt log-*.gz`, given that there are 2000 gz files? – Emanuele Mar 01 '19 at 08:22
  • `awk -f script.awk lista2.txt log*gz` should work too. There is a limit on command-line length, so if there are too many files, try to limit the number with proper globbing (see the batching sketch after these comments). And test it first with a couple of files. – James Brown Mar 01 '19 at 09:01
  • @Emanuele it'd be interesting if you could try both currently proposed solutions running `time` on them and then let us know the result. – Ed Morton Mar 01 '19 at 13:28
  • I'm kind of curious if they even worked with those reqs. – James Brown Mar 01 '19 at 14:30
  • It seems not to work because of the large number of gz files to grep; it was still running after 4 hours. – Emanuele Mar 04 '19 at 06:57
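
Regarding the command-line length limit mentioned in the comments above, a hedged batching sketch (assuming GNU xargs and the lista2.txt name from the comments); it feeds the .gz file names to script.awk in batches so no single awk invocation exceeds the limit:

# printf is a shell builtin, so the glob expansion here is not subject to
# the execve ARG_MAX limit; xargs then runs awk on at most 500 files at a time.
printf '%s\0' log-*.gz |
xargs -0 -n 500 sh -c 'awk -f script.awk lista2.txt "$@"' _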

I'd just do (untested):

zgrep -H 'url=/user/profile/oauth/{token}|params=username:.*result=SUCCESS' *.gz |
awk -F'[=:,]' -v OFS=';' 'NR==FNR{names[$0];next} $12 in names{print $12, $1}' names.txt - |
sort -u

or, probably a little more efficiently, since this removes the `NR==FNR` test for every line output by zgrep:

zgrep -H 'url=/user/profile/oauth/{token}|params=username:.*result=SUCCESS' *.gz |
awk -F'[=:,]' -v OFS=';' '
    BEGIN {
        while ( (getline line < "names.txt") > 0 ) {
            names[line]
        }
        close("names.txt")
    }
    $12 in names{print $12, $1}' |
sort -u

If a given username can only appear once in a given log file, or if you actually want multiple occurrences to produce multiple output lines, then you don't need the final `| sort -u`.
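
For reference, a minimal usage sketch assuming the question's sample data, with the usernames saved as names.txt and the two example log lines gzipped as log-0001.gz (both names are illustrative). Username 10963 is in the list and appears on a matching SUCCESS line, so the pipeline should print something like:

10963;log-0001.gz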

Ed Morton