
I need to parse Apache access log files whose lines have up to 16 space-delimited columns, that is,

xyz abc ... ... home?querystring

I need to count the total number of hits for each page in that file, that is, the total number of home page hits, ignoring the query string.

For a few lines the URL is column 16, and for others it is column 14 or 15. Hence I need to parse each line in reverse order (get the last column, ignore the query string of the last column, aggregate page hits).

I am new to Linux and shell scripting. How do I approach this, and what do I need to look into? Can you give a small code sample that would perform such a task?

ANSWER: a Perl one-liner solved the problem, using perl -lane and the autosplit array @F (scalar @F is the number of fields, so the last field is $F[scalar @F - 1]).
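
The exact one-liner wasn't posted; here is a minimal sketch consistent with those flags ($F[-1] is shorthand for the last autosplit field, and access.log is an assumed file name):

perl -lane '($url = $F[-1]) =~ s/\?.*//;   # last column, query string stripped
            $hits{$url}++;                 # aggregate hits per page
            END { print "$_ $hits{$_}" for sort keys %hits }' access.log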

user1810502
  • Please give us example lines with these possible patterns. That is, when the url is in column 16, 15, 14. – janos Dec 11 '13 at 18:55
  • Googling for "awk parse access.log" didn't return anything you liked? – tripleee Dec 11 '13 at 18:56
  • Nope. I wanted something in reverse order that also filters out the query string; one of the columns is home/product/it?xyz=13&Redirect=1, and it is the last column in each line. I need to aggregate this column after filtering out the query string. Googling only gave me parsing from the 1st column (left to right), but the column number is not constant in this scenario. – user1810502 Dec 11 '13 at 19:35
  • You solve it using the standard UNIX tool for parsing text files, i.e. awk. – Ed Morton Dec 11 '13 at 20:27

3 Answers


Well, for starters, if you are only interested in columns 14-16, I would begin by running:

cut -d\  -f14-16 <input_file.log> | awk '{ one = match($1,/www/)
                                           two = match($2,/www/)
                                           three = match($3,/www/)
                                           if (one)
                                                print $1
                                           else if (two)
                                                print $2
                                           else if (three)
                                                print $3
                                         }'

Note: there are two spaces after the d\ (the escaped space is the delimiter for cut; the second space separates it from the next argument).

You can then pretty easily count up the URLs that you see. I also think this would be a lot easier to solve with a few lines of Python or Perl.
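
A minimal sketch of that counting step, assuming the pipeline above and an input file named access.log; sed strips the query string, and sort | uniq -c does the aggregation:

cut -d\  -f14-16 access.log |
awk '{ if (match($1,/www/)) print $1
       else if (match($2,/www/)) print $2
       else if (match($3,/www/)) print $3 }' |
sed 's/?.*//' |    # drop the query string
sort | uniq -c |   # count identical URLs
sort -rn           # most-hit pages first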

aplassard

You can read input line by line using the bash read command:

while read my_variable; do
    echo "The text is: $my_variable"
done

To get input from a specific file, use the input redirect <:

while read my_variable; do
    echo "The text is: $my_variable"
done < my_logfile

Now, to get the last column, you can use the ${var##* } construction. For example, if the variable my_var is the string some_file_name, then ${my_var##*_} is the same string, but with everything before (and including) the last _ deleted.
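
For instance, a quick check in a terminal (my_var is just an illustrative name):

my_var="some_file_name"
echo "${my_var##*_}"   # prints: name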

We come up with:

while read line; do
    echo "The last column is: ${line##* }"
done < my_logfile

If you want to echo it to another file, use the >> redirect:

while read line; do
    echo "The last column is: ${line##* }" >> another_file
done < my_logfile

Now, to take away the querystring, you can use the same technique:

while read line; do
    last_column="${line##* }"
    url="${last_column%%\?*}"
    echo "The last column without querystring is: $url" >> another_file
done < my_logfile

This time, we have %%\?* instead of a ##-style expansion because we want to delete what's after the first ?, instead of what's before the last space: %% strips the longest matching suffix, while ## strips the longest matching prefix. (Note that I have escaped the character ?, which is special to bash.) You can read all about it in the bash manual, under Parameter Expansion.
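
A quick illustration, using a value shaped like the example from the comments:

last_column="home/product/it?xyz=13&Redirect=1"
echo "${last_column%%\?*}"   # prints: home/product/it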

I didn't understand where to get the page hits, but I think the main idea is there.
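
If by "page hits" the question means a per-URL count, here is a minimal sketch of that aggregation with a bash 4+ associative array (my_logfile as above):

declare -A hits
while read -r line; do
    last_column="${line##* }"     # last space-separated column
    url="${last_column%%\?*}"     # strip the query string
    (( hits["$url"]++ ))          # count hits per page
done < my_logfile

for url in "${!hits[@]}"; do
    echo "$url ${hits[$url]}"
done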

EDIT: Now the code works; I had forgotten the do bash keyword. Also, we need to use >> instead of > so as not to overwrite another_file every time we run echo "..." > another_file; by using >>, we append to the file. I have also corrected %% where I had written ##.

fonini
  • By the way, this is bash-scripting. You can run it directly from the terminal, or put in a file and call `bash myfile`, or make the file executable (`chmod +x myfile`) and run it with `./myfile`. For the executable file to work, the first line of the file must be `#!/bin/bash`, where `/bin/bash` is the path to your `bash` binary, which you can view using the command `which bash`. – fonini Dec 11 '13 at 19:10
  • Do not do any of the above. Parsing text files is the job awk was created to do and is best at. – Ed Morton Dec 11 '13 at 20:19

It's hard to say without a few lines of concrete sample input and expected output, but it sounds like all you need is:

awk -F'[ ?]' '{sum[$(NF-1)]++} END{for (url in sum) print url, sum[url]}' file

Because ? is a field separator alongside the space, the query string becomes its own trailing field, so the URL is always the second-to-last field, $(NF-1). For example:

$ cat file
xyz abc ... ... http://www.google.com?querystring
xyz abc ... ... some other http://www.google.com?querystring1
xyz abc ... some stuff we ignore http://yahoo.com?querystring1
$ 
$ awk -F'[ ?]' '{sum[$(NF-1)]++} END{for (url in sum) print url, sum[url]}' file
http://www.google.com 2
http://yahoo.com 1
Ed Morton