I have a data file whose lines contain a large number (~5K) of dates in the format yy-mm-dd.

A typical line of the file could be:

bla bla 21-04-26 blabla blabla 18-01-28 bla bla bla bla 19-01-12 blabla

I need to perform this kind of replacement for every single date:

$ date --date="18-01-28" "+%A, %d %B %Y"
Sunday, 28 January 2018

I already solved this problem using sed (see the post scriptum for details).

I would like to use gawk, instead. I came up with this command:

$ gawk '{b = gensub(/([0-9]{2}-[0-9]{2}-[0-9]{2})/,"$(date --date=\"\\1\" \"+%A, %d %B %Y\")", "g")}; {print b}' 

The problem is that bash does not expand the date command inside gensub; in fact, I obtain:

$ echo "bla bla 21-04-26 blabla blabla 18-01-28 bla bla bla bla 19-01-12 blabla" | gawk '{b = gensub(/([0-9]{2}-[0-9]{2}-[0-9]{2})/,"$(date --date=\"\\1\" \"+%A, %d %B %Y\")", "g")}; {print b}' 
bla bla $(date --date="21-04-26" "+%A, %d %B %Y") blabla blabla $(date --date="18-01-28" "+%A, %d %B %Y") bla bla bla bla $(date --date="19-01-12" "+%A, %d %B %Y") blabla

I do not get how I could modify the gawk command to obtain the desired result:

bla bla Monday, 26 April 2021 blabla blabla Sunday, 28 January 2018 bla bla bla bla Saturday, 12 January 2019 blabla

post scriptum:

As for sed, I solved it with this script:

#!/bin/bash

#pathFile hard-coded here
pathFile='./data.txt'

#threshold to avoid "too many arguments" error with sed
maxCount=1000
counter=0

#list of dates in the data file
dateList=($(grep -Eo "[0-9]{2}-[0-9]{2}-[0-9]{2}" "$pathFile" | sort -u))

#string to pass multiple instruction to sed
sedCommand=''

for item in "${dateList[@]}"
do
    sedCommand+="s/$item/$(date --date="$item" "+%A, %d %B %Y")/g;"
    (( counter++ ))
    if [[ $counter -gt $maxCount ]]
    then
        sed -i "$sedCommand" "$pathFile"
        counter=0
        sedCommand=''
    fi
done
[[ ! -z "$sedCommand" ]] && sed -i "$sedCommand" "$pathFile"
vaeVictis
  • `command substitution` is a shell term. awk is not shell. You wouldn't talk about command substitution in C, you'd just talk about calling a function and saving its output, and you can't call a shell command directly from C. Exactly the same is true of awk. Just like in C, in awk you can call awk functions, and there are ways (e.g. `system("date")`) to call external commands like Unix date, but you can't just call those external commands directly and when you do it's not called "command substitution" it's called "calling an external command". – Ed Morton Jun 27 '21 at 12:14
  • @EdMorton I understand your point. I was talking about the command substitution of date inside gensub though. – vaeVictis Jun 27 '21 at 17:05
  • My point was that since command substitution is a shell term/concept there is no command substitution of date inside gensub or command substitution of any other command anywhere else in an awk script though. From awk you spawn a shell to call a command, but the result of that command doesn't then replace the command somehow like it does in a shell if you use command substitution - you have to manually read the output of the command. – Ed Morton Jun 27 '21 at 17:16
  • Using command substitution in shell to set `foo` to the output of some command `cmd` is `foo=$(cmd)`, but the shell equivalent of what we do in awk or C to get the output of the command is closer to `foo=''; while IFS= read -r line; do foo="$foo"$'\n'"$line"; done < <(cmd)`, i.e. it is not similar to command substitution. – Ed Morton Jun 27 '21 at 17:17

2 Answers

Gawk has built-in functions to deal with date/time (mktime and strftime), which would be MUCH faster than invoking the external date command.

Example input:

# cat file
79-03-21 | 21-01-01
79-04-17 | 20-12-31

The gawk script:

# cat date.awk
{
    while (match($0, /([0-9]{2})-([0-9]{2})-([0-9]{2})/, arr) ) {
        date = sprintf("%s-%s-%s", arr[1], arr[2], arr[3])
        #                           \_YY    \_MM    \_DD
        if (arr[1] >= 70) {
            time = sprintf("19%s %s %s  1  0  0", arr[1], arr[2], arr[3])
            #               YYYY MM DD HH MM SS
        } else {
            time = sprintf("20%s %s %s  1  0  0", arr[1], arr[2], arr[3])
        }
        secs = mktime(time)
        new_date = strftime("%A, %d %B %Y", secs)
        $0 = gensub(date, new_date, "g")
    }
    print
}

Result:

# gawk -f date.awk file
Wednesday, 21 March 1979 | Friday, 01 January 2021
Tuesday, 17 April 1979 | Thursday, 31 December 2020
pynexj
  • You could just do `year = (arr[1] < 70 ? 20 : 19) arr[1]` and then have 1 call to sprintf using that instead of 2 almost identical calls to sprintf. Why pick a year of "70" to separate the century at all though? There's no indication that `70` in the data means 1970 instead of 2070 so I think that's overcomplicating things and may be doing the wrong thing. – Ed Morton Jun 27 '21 at 12:32
  • `date = sprintf("%s-%s-%s", arr[1], arr[2], arr[3])` can just be `date=arr[0]`, and `$0 = gensub(date, new_date, "g")` can just be `gsub(date, new_date)` – Ed Morton Jun 27 '21 at 12:41
  • @EdMorton also GNU date makes a similar assumption about the limit, with 69 as threshold. I decided to set a threshold as a parameter inside the script. – vaeVictis Jun 27 '21 at 20:52

Just to show how to do "command substitution" using awk's pipes:

$ cat foo.awk
{
    while (match($0, /([0-9]{2}-[0-9]{2}-[0-9]{2})/, arr) ) {
        date = arr[1]
        cmd = "date -d " date " +'%A, %d %B %Y' "
        cmd | getline new_date
        # pipes are not closed automatically!
        close(cmd)
        $0 = gensub(date, new_date, "g")
    }
    print
}
$ cat file
79-03-21 | 21-01-01
79-04-17 | 20-12-31
$ gawk -f foo.awk file
Wednesday, 21 March 1979 | Friday, 01 January 2021
Tuesday, 17 April 1979 | Thursday, 31 December 2020
pynexj
  • Thank you, I was about to ask if you could also show something about command substitution. By the way, your first answer completely addresses the problem. – vaeVictis Jun 27 '21 at 03:45
  • If that call to getline failed you'd end up with a repeat of the previously valid date in the output with no indication whatsoever that there was an issue. See http://awk.freeshell.org/AllAboutGetline for how to call getline. You're also passing input data (`arr[1]`) unquoted to the shell which will probably not really cause any problems this time given it has to be a pair of digits but you should really be quoting it like you should all strings in shell - `cmd = "date -d '" date "' +'%A, %d %B %Y'"`. – Ed Morton Jun 27 '21 at 12:37
  • Also if you use `\047` instead of `'` throughout your script then you can call it from the command line (or inline in a shell script) instead of requiring it to be in a separate awk script file. – Ed Morton Jun 27 '21 at 12:38