2

The problem is this: I have several txt files, each of which records a timestamp and an IP address for every malware packet that arrives at a server. I want to create another txt file that shows, for every IP, the first time a malware packet arrived.

In general, I want to do something like this:

for every  line in file.txt
 if (ip is not present in list.txt)
 copy timestamp and ip in list.txt

I'm using awk to do it. The main problem is the "if ip is not present in list.txt" part. I'm doing this:

 {    a=$( grep -w "$3" list.txt | wc -c );
    if ( a == 0 )
   {
     #copy timestamp and ip in list.txt
   }

(I'm using $3 because the IP address is in the third column of the source file.)

I don't know how to make awk evaluate the grep command. I've also tried backticks, but that didn't work. Could someone give me a hint?

I'm testing my script on a test file like this:

10  192.168.1.1
11  192.168.1.2
12  192.165.2.4
13  122.11.22.11    
13  192.168.1.1
13  192.168.1.2
13  122.11.22.11
14  122.11.22.11
15  122.11.22.11
15  122.11.22.144
15  122.11.2.11
15  122.11.22.111

What I should obtain is:

10  192.168.1.1
11  192.168.1.2
12  192.165.2.4
13  122.11.22.11    
15  122.11.22.144
15  122.11.2.11
15  122.11.22.111

Thanks to your help, I've succeeded in creating a script that fits my needs:

awk '
FILENAME == ARGV[1] {
    ip[$2] = 1
    next
}
! ($2 in ip) {
    print $1, $2 >> ARGV[1]
    ip[$2] = 1
}
' list.txt file.txt 
papafe

4 Answers

3

Interpreting the question as "How can I evaluate the status of a command from within awk?", just use system.

{
  if( system( "cmd" ) == 0 ) {
    # the command succeeded
  }
}

So, in your case, just do:

{
  if( system( "grep -w \"" $3 "\" list.txt > /dev/null " ) == 0 ) {
    ...
  }
}

You might want to reconsider your approach to the problem, though. Grepping each time is computationally expensive, and there are better ways to approach the problem. (Read list.txt once into an array, for example.)

Also, note that you do not need to use wc. grep exits with a non-zero status when it finds no match, so use the return value rather than parsing the output.
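As a minimal sketch of the return-value approach, the script below builds two small files in a temporary directory and appends a line to list.txt only when grep finds no match. The sample data and the two-column layout (IP in $2) are assumptions made up for the demo. It also adds -F so the dots in an address are matched literally rather than as regex wildcards; that flag is my addition, not part of the command above.

```shell
#!/bin/sh
# Sketch: use grep's exit status from inside awk via system().
# File contents are invented for this demo; the IP is assumed to be in $2.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

printf '10 192.168.1.1\n' > list.txt
printf '10 192.168.1.1\n11 192.168.1.2\n13 192.168.1.1\n13 192.168.1.2\n' > file.txt

awk '{
    # grep exits 0 on a match and non-zero otherwise; its output is discarded.
    # Caution: the field is pasted into a shell command, so this is only
    # reasonable for trusted input.
    cmd = "grep -wF \"" $2 "\" list.txt > /dev/null"
    if (system(cmd) != 0) {
        print $1, $2 >> "list.txt"
        close("list.txt")   # flush so the next grep sees the new line
    }
}' file.txt

result=$(cat list.txt)
printf '%s\n' "$result"
cd / && rm -rf "$tmp"
```

This still starts one grep process per input line, so it is mainly useful for small inputs or as a quick fix.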

William Pursell
  • Or use the `-q` option of `grep` instead of verbosely redirecting output to /dev/null. – Chris Wesseling Oct 12 '11 at 15:23
  • @CharString The '-q' option for grep is not portable. Many implementations of grep will choke on it. (This may be a moot point, since I'm pretty sure -w is non-portable as well, but I think it's a good habit to avoid non-portable features where possible.) – William Pursell Oct 12 '11 at 15:27
  • Hmmm, GNU grep's manpage says "(-q is specified by POSIX.)" I agree on avoiding non-portable features. *edit*: It also says "Portable shell scripts should avoid both -q and -s" – Chris Wesseling Oct 12 '11 at 15:39
  • Thank you, that's exactly what I needed. I'll try it and let you know. I also knew that grepping is inefficient, but I was in a bit of a hurry, so I went for the quick way (that's because I don't know – papafe Oct 12 '11 at 16:02
  • the awk syntax well enough to use arrays!). Sorry for the double comment, but I had some problems! – papafe Oct 12 '11 at 16:13
2

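This will save the output of the command in the variable a. Note that $3 has to be concatenated into the command string; inside the quotes it would be passed to the shell literally.

BEGIN {  } 
{
cmd = "grep -w \"" $3 "\" list.txt | wc -c"   # build the command from the field value
cmd | getline a
close(cmd)
print a
}
END   {}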
bvk256
1

You want to use getline:

BEGIN {
    "date" | getline current_time
    close("date")
    print "Report printed on " current_time
}

That takes the output of date and puts it into the current_time variable. You should be able to do the same with your grep | wc -l.
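Applied to the question, a sketch might build the command string from the current field and read back a match count. This one uses grep -c in place of the wc pipeline (a substitution on my part), and the file contents, the IP-in-$2 layout and the "known"/"new" labels are invented for the demo.

```shell
#!/bin/sh
# Sketch: read a command's output into an awk variable with getline.
# File contents are invented for this demo; the IP is assumed to be in $2.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

printf '10 192.168.1.1\n12 192.165.2.4\n' > list.txt
printf '13 192.168.1.1\n15 122.11.2.11\n' > file.txt

result=$(awk '{
    # Concatenate the field into the command; -F matches the dots literally.
    cmd = "grep -cwF \"" $2 "\" list.txt"
    count = 0
    cmd | getline count   # grep -c prints the number of matching lines
    close(cmd)            # close the pipe before the next command is run
    print $2, (count + 0 > 0 ? "known" : "new")
}' file.txt)

printf '%s\n' "$result"
cd / && rm -rf "$tmp"
```

Closing the command after each read matters: awk keeps a pipe open under its command string, and a long input would otherwise accumulate open pipes.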

Mando Escamilla
1

But really what you want to do is get awk to read the list.txt file first, then process the other file with the list.txt data in memory. This will allow you to avoid calling system() for each line.

I assume the ip is in the 1st column of list.txt.

When you say copy timestamp and ip in list.txt, I assume you want to append some info from the current line of file.txt to the list.txt file.

awk '
    FILENAME == ARGV[1] {
        ip[$1] = 1
        next
    }
    ! ($3 in ip) {
        print $3, $(whatever_column_holds_timestamp) >> ARGV[1]
    }
' list.txt file.txt

Given the sample file and simplified requirements of your question update:

awk '! seen[$2]++' filename

will produce the output you showed. That awk program prints a line only if its IP has not been seen before.
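To see why the one-liner works: seen[$2]++ returns the old counter value (0 the first time an IP appears) and then increments it, so the negation is true only on the first occurrence, and awk's default action prints the line. A small sketch running it on the sample data from the question (the temporary file is just scaffolding for the demo):

```shell
#!/bin/sh
# Sketch: keep only the first line seen for each IP (second column).
sample=$(mktemp)
cat > "$sample" <<'EOF'
10  192.168.1.1
11  192.168.1.2
12  192.165.2.4
13  122.11.22.11
13  192.168.1.1
13  192.168.1.2
13  122.11.22.11
14  122.11.22.11
15  122.11.22.11
15  122.11.22.144
15  122.11.2.11
15  122.11.22.111
EOF

# seen[$2]++ is 0 (false) on first sight, so !0 is true and the line prints.
result=$(awk '!seen[$2]++' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

The array lives for the whole awk run, so listing several input files on the command line deduplicates across all of them.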

glenn jackman
  • In theory, this seems a good solution for my needs. I tried to use it, but in the end list.txt is a copy of file.txt and I don't know why – papafe Oct 12 '11 at 17:20
  • I was making some assumptions about the format of your files. If I don't have the columns numbers right, you'll have to update. – glenn jackman Oct 12 '11 at 17:41
  • Thanks for the help! I cannot use the "simplified" script because I need to run it over multiple source files. Regarding the first script you posted, I have actually changed the column numbers. So, following the txt files posted in the update, I put $2 on the 3rd line and $2 on the 6th line, and the 7th line became "print $1, $2 ...". But it doesn't seem to work. I'll try again tomorrow, maybe I'm making some stupid mistake! – papafe Oct 12 '11 at 20:26
  • sure you can: if you want to extract only unique IP's from many files, just do `awk '!seen[$2]++' file1 file2 ... > all.uniq`; or if you want to extract unique IP's only from each file then `for f in file1 file2 ...; do awk '...' "$f" > "$f.uniq"; done` – glenn jackman Oct 12 '11 at 20:30
  • Now I understand the problem with the first script you posted. It builds the ip array from list.txt before processing file.txt. Instead, it should also update the array as it processes the file. So an "ip[$2] = 1" line should be added after the print line. With that change the script seems to work the way I expected. Thank you very much! – papafe Oct 12 '11 at 22:02