
I am storing a number of individually serialized PHP arrays to a file. Each line of the file contains one serialized array. For example:

a:2:{s:4:"name";s:8:"John Doe";s:3:"age";s:2:"20";}
a:2:{s:4:"name";s:8:"Jane Doe";s:3:"age";s:2:"15";}
a:2:{s:4:"name";s:12:"Steven Tyler";s:3:"age";s:2:"35";}
a:2:{s:4:"name";s:12:"Jim Morrison";s:3:"age";s:2:"25";}
a:2:{s:4:"name";s:13:"Apple Paltrow";s:3:"age";s:2:"75";}
a:2:{s:4:"name";s:12:"Drew Nickels";s:3:"age";s:2:"34";}
a:2:{s:4:"name";s:11:"Jason Proop";s:3:"age";s:2:"36";}

Here is my question:

Is it possible to "awk" this file for the following pattern: "name"*"*"

I would like to sort the lines that are found based on the contents of the second wildcard. Is it possible to accomplish this with awk?

Dennis Williamson
tambler
    I couldn't understand what it is that you want. Maybe it's just me, or you should rephrase and show an example of your desired output. – joechip Jul 14 '11 at 17:01
  • I would like to be able to provide "awk" with a field name. In this example, it could be "name" or "age". For example, if I provided "name" as the field: I would like awk to search the file for the following pattern: "name"*"*". Essentially, I would like for awk to locate the value between double quotes that follows that field on each line. I would then like awk to sort the file based on those values. Did that make sense? – tambler Jul 14 '11 at 17:17
  • I still don't understand the pattern you mention. What's with all those quotes? why would you want to search for double quotes? If you just want an awk script to list the names or ages from your file (to be sorted afterwards with the sort command), then I don't think that pattern would be useful at all. – joechip Jul 14 '11 at 18:15

3 Answers


I'm still unsure about what you want, but assuming Glenn Jackman's interpretation is correct then you would want to take his idea a bit further in order to be able to search for a given field name. E.g.,

awk -v FN="xxxx" -F '"' '{
    i=1;
    while (i<=NF-2) {
         if ($i==FN) {
              print $(i+2) "\t" $0;
              next
         } else {
              i++
         }
    }
}' filename | sort | cut -d $'\t' -f 2-

Here you would replace "xxxx" with "name", "age" or whatever field you want to use for sorting.
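
For instance, with the sample lines from the question saved as people.txt (a filename chosen here just for illustration), sorting on the "age" field looks like this (a sketch; note that cut's default delimiter is already a tab, so the -d option can be omitted):

```shell
# Recreate the question's sample file (the name "people.txt" is assumed).
cat > people.txt <<'EOF'
a:2:{s:4:"name";s:8:"John Doe";s:3:"age";s:2:"20";}
a:2:{s:4:"name";s:8:"Jane Doe";s:3:"age";s:2:"15";}
a:2:{s:4:"name";s:12:"Steven Tyler";s:3:"age";s:2:"35";}
a:2:{s:4:"name";s:12:"Jim Morrison";s:3:"age";s:2:"25";}
a:2:{s:4:"name";s:13:"Apple Paltrow";s:3:"age";s:2:"75";}
a:2:{s:4:"name";s:12:"Drew Nickels";s:3:"age";s:2:"34";}
a:2:{s:4:"name";s:11:"Jason Proop";s:3:"age";s:2:"36";}
EOF

# Decorate each line with its "age" value, sort, then strip the decoration.
awk -v FN="age" -F '"' '{
    for (i = 1; i <= NF-2; i++)
        if ($i == FN) { print $(i+2) "\t" $0; next }
}' people.txt | sort | cut -f 2-
# The Jane Doe line (age 15) comes out first, Apple Paltrow (75) last.
```

Since every age in the sample is two digits, the lexicographic sort happens to match numeric order here.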

This script is not foolproof, of course: fields cannot contain tab characters, nor can they contain keywords such as "name" or "age".

Edit: I will briefly describe what this script does. Basically, awk takes a given field name, and for each line it extracts this field's value. So for each input line, it outputs that same line, but with this field's value prepended to it, and separates both elements with a tab character. This output is taken by the sort command, which sorts it lexicographically, and thus it is mostly sorted based on that prepended value, which is the field value you selected. Once sorted this way, this is taken by the cut command, which splices it on the tab character, discarding the field that was used for sorting, and only showing the rest (which corresponds to lines from your original file, but now sorted as you wanted).
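
To see the decorate step in isolation, you can run just the awk part on its own: each output line is the field value, then a tab, then the untouched original line. A sketch with a single sample record:

```shell
# Decorate one sample record with its "name" value (no sort/cut yet).
printf '%s\n' 'a:2:{s:4:"name";s:8:"John Doe";s:3:"age";s:2:"20";}' |
awk -v FN="name" -F '"' '{
    for (i = 1; i <= NF-2; i++)
        if ($i == FN) { print $(i+2) "\t" $0; next }
}'
# Output: John Doe<TAB>a:2:{s:4:"name";s:8:"John Doe";s:3:"age";s:2:"20";}
```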

Some more details:

In AWK the -v switch defines a variable, in this case named FN (this switch is standard POSIX awk, not a Gawk extension). The -F switch defines the field separator, which is used to split each line that AWK reads from its input file. The main block between curly braces is the AWK program, which is run once for every input line. Each field of a line, as split according to the -F switch, is referenced as $1, $2, ... , $(NF-1), $NF. (NF is a built-in variable that always equals the number of fields on the current line.)

As I said, AWK reads the input line by line and runs this program for each one. For example, if it takes this line:

a:2:{s:4:"name";s:12:"Jim Morrison";s:3:"age";s:2:"25";}

Then it splits it on the double quotes, like this:

$1 = a:2:{s:4:
$2 = name
$3 = ;s:12:
$4 = Jim Morrison
$5 = ;s:3:
$6 = age
$7 = ;s:2:
$8 = 25
$9 = ;}
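
You can reproduce this split yourself with a quick one-liner (a sketch; the record is the sample line above):

```shell
# Show how awk splits one sample line on double quotes.
printf '%s\n' 'a:2:{s:4:"name";s:12:"Jim Morrison";s:3:"age";s:2:"25";}' |
awk -F '"' '{ for (i = 1; i <= NF; i++) printf "$%d = %s\n", i, $i }'
# Prints the nine fields shown above, $1 = a:2:{s:4: through $9 = ;}
```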

The script then iterates over each field searching for an exact match on FN. So if for example we have defined FN=age, the loop will stop at $6, then it will print $8 (i.e., $(6+2), which is "25" here) concatenated with a tab character and then with the whole input line itself ($0). Then the next line will be read and the whole process will begin again.

This script relies on the assumption that the keywords do not occur anywhere else in the data, and that assumption is not easy to work around: you need more insight into how the input file is structured. For most purposes such insight is available, because the same ambiguity would also trip up any serialization parser. For example, if you know that a field name (say, "age") can appear verbatim inside other fields, but only in fields that come after the age field itself, then this script works as-is. In the given example, it would be unusual for a name field to be exactly equal to "age" (same spelling, no capitalization, etc.). In any case, this is a difficult problem that entire books deal with, so I won't summarize it here; Google for "compiler theory" if you're interested.

One such insight might be the one you mention: knowing the order of the fields. In that case, this whole script is not much better than Glenn's. You could adapt his simpler script to match each field you want. For example, consider:

awk -F '"' '{print $8 "\t" $0}' filename |
sort |
cut -d $'\t' -f 2-

This script is almost identical to the one Glenn proposed, only it selects on the eighth field ("age") instead of the fourth ("name").

joechip
  • I'm sorry if my original post was unclear, but you understand precisely what I am attempting to do. I was able to use your awk script to sort my file based on the "name" field. Awk is completely foreign to me, though... Would you mind explaining what adjustments would need to be made in order to sort on a field in the 1st, 2nd, 3rd, etc... columns? It is safe to assume that I will always know the order of the field in the file. – tambler Jul 14 '11 at 22:38
  • Could you also tell me if anything could be done that would allow this script to work on files that DO contain field names within the "value" portion of the file? I am unable to guarantee that that will not happen. – tambler Jul 14 '11 at 22:40
  • @tambler: yes, it will work no matter which field it is you are selecting. Indeed, that is the reason why I proposed it instead of Glenn's, since his version is only suitable for a given field number (in his example, only the first one would work). – joechip Jul 15 '11 at 02:55
  • @tambler: I have updated the answer per your request. – joechip Jul 15 '11 at 04:02

Kind of a Schwartzian transform. I assume that the name is always the 4th quote-separated field:

awk -F '"' '{print $4 "\t" $0}' filename |
sort |
cut -d $'\t' -f 2-
glenn jackman

You could do:

sort -t '"' -k4,4 filename
sort -t '"' -k8,8n filename

for name and age, respectively, but that doesn't allow you to select the field by its name and also requires tedious field counting.
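
As a sketch of what those two commands produce (assuming the sample lines are saved as people.txt, a filename chosen here for illustration):

```shell
# Sample data from the question.
cat > people.txt <<'EOF'
a:2:{s:4:"name";s:8:"John Doe";s:3:"age";s:2:"20";}
a:2:{s:4:"name";s:8:"Jane Doe";s:3:"age";s:2:"15";}
a:2:{s:4:"name";s:12:"Steven Tyler";s:3:"age";s:2:"35";}
a:2:{s:4:"name";s:12:"Jim Morrison";s:3:"age";s:2:"25";}
a:2:{s:4:"name";s:13:"Apple Paltrow";s:3:"age";s:2:"75";}
a:2:{s:4:"name";s:12:"Drew Nickels";s:3:"age";s:2:"34";}
a:2:{s:4:"name";s:11:"Jason Proop";s:3:"age";s:2:"36";}
EOF

# Sort by name: the 4th quote-delimited field.
sort -t '"' -k4,4 people.txt | head -n 1    # the Apple Paltrow line comes first

# Sort numerically by age: the 8th quote-delimited field.
sort -t '"' -k8,8n people.txt | head -n 1   # the Jane Doe line (age 15) comes first
```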

A more robust method is presented in the script below which can be run either of these ways:

./fieldsort "name" inputfile
some_prog | ./fieldsort "name"

You can use "name" or "age" as the field name (or others if they're present).

Only gawk is used; no other utilities are required.

The chance of false positives is reduced, since only the first record is scanned for the position of the desired field, and a false positive would require a field value matching the desired field name to appear earlier in that record. These two conditions (first occurrence, first record only) also make the script faster.

A disadvantage is that it expects all records to be in the same format (number of fields, etc.).

No checking is made to ensure that a field name is selected (although it must exist) so "s" (the "string" field type), for example, would be accepted but not useful.

If multiple filenames are given on the command line, they must all have the same format. If you're using Gawk 4, you can change BEGIN to BEGINFILE and END to ENDFILE (and move the lines before getline and its comments to a new BEGIN clause) to circumvent this restriction.

#!/usr/bin/gawk -f

function isnum(x) {
    # not foolproof
    return(x == x + 0)
}

BEGIN {
    fieldname = ARGV[1]
    delete ARGV[1]

    FS = "[;:\"]"
    # since gawk doesn't have a numeric sort, pad numbers
    padstr = "000000000000"

    # process the first line to see which field we want
    # do this in the BEGIN clause to avoid repeating it for every record
    getline

    split($0, fields, FS)
    for (f = 1; f <= length(fields); f++) {
        if (fields[f] == fieldname) {
            field = f + 5
            break
        }
    }

    if (field == 0) {
        print "field '" fieldname "' not found in file '" FILENAME "'"
        exit
    }

    if (isnum($field))
        # pad will be null for non-numeric data
        pad = substr(padstr, 1, length(padstr) - length($field))

    # since we burned the first line, we need to go ahead and save it here

    # the record number is included in the index to prevent losing records
    # that have duplicate values in the field of interest

    array[pad $field, NR] = $0
}

{
    # save each of the rest of the lines in the array indexed by the field of interest

    if (isnum($field))
        pad = substr(padstr, 1, length(padstr) - length($field))

    array[pad $field, NR] = $0
}

END {
    # sort and output
    c = asorti(array, indices)
    for (i = 1; i <= c; i++)
        print array[indices[i]]
}

But I wonder: why don't you do this natively in PHP?

Dennis Williamson