I'm trying to find all URLs in an HTML file and turn each one into a link (a href) from the terminal.

This is what I have now:

awk '/((19|20)\d{2}-.*\.jpg)/ { print "<a href='"$1"'>$1</a>" }' index.html

After running this there are no errors in the terminal, but the file isn't changed at all.

The filenames are formatted like this: 2016-12-14-PHOTO-00000042.jpg.

Does somebody have a solution?

Update: Thanks for the responses already! Here is some further info to make the question complete:

Some of the index.html content:

<p>16-12-16 11:45:43: Wouter: 2016-12-16-PHOTO-00000128.jpg
</p>
<p>16-12-16 11:58:49: Jelle: Hello
</p>
<p>16-12-16 19:05:44: Jelle: 2016-12-16-PHOTO-00000133.jpg
</p>

I'm using macOS 10.12.4.

  • Let's say your string is `2016-12-14-PHOTO-00000042.jpg`; what would the expected output be? – Sahil Gulati May 19 '17 at 11:54
  • What's in your `index.html` ? – luoluo May 19 '17 at 11:55
  • Please provide a sample index.html and spell out the errors you receive. A wild guess for starters: You have multiple filenames in your index.html and the greedy catch-all creates a single match spanning everything from the beginning of the first filename to the end of the last. Try `awk '/((19|20)\d{2}-.*?\.jpg)/ { print "$1" }' index.html` (`.*` -> `.*?` in the re). – collapsar May 19 '17 at 12:00
  • Which format do you want to match ? Cause you can for example simplify this with only matching `.*?-PHOTO-.*?\.jpg` – kaldoran May 19 '17 at 12:02
  • I'm guessing it's because the single quotes [need to be escaped](http://stackoverflow.com/questions/9899001/how-to-escape-single-quote-in-awk-inside-printf/9899594). – LukStorms May 19 '17 at 12:11
  • @LukStorms you cannot escape single quotes inside any single-quote delimited script called from any shell. The answer in that question you reference is misleading as it's not escaping the single quote inside the awk script, it's breaking out of the awk script back into shell and THERE escaping the quote which is just silly when you can/should simply replace the quotes with `\047` instead. That would NOT cause a problem for the posted code though as the OP is using the single quotes to break out of awk to let the shell variable `"$1"` expand which is bad practice but not a syntax error. – Ed Morton May 19 '17 at 13:40
  • Wouter - very few tools will understand `\d`. If you want to match a digit then just use `[0-9]` or `[[:digit:]]` for theoretical portability across locales though I haven't actually heard of any case where `[0-9]` doesn't work. [edit] your question to include concise, testable sample input and expected output so we can help you. Include what you want `$1` to be, etc. - an actual use case. Also what platform (OSX? Solaris? Linux?), shell and awk version are you using? btw to be clear using `'"$1"'` CAN produce [cryptic] syntax errors but that's not a result of not escaping the single quotes. – Ed Morton May 19 '17 at 13:48
  • After a few suggestions I updated the code to: `awk '/((19|20)[0-9]{2}-.*?\.jpg)/ { print ""$1"" }' index.html` Now I get a response like this: `<p>14-12-16 <p>16-12-16 <p>16-12-16 <p>16-12-16 <p>17-12-16 <p>18-12-16` It's getting the `<p>` and the date from the beginning of the line, so `$1` isn't right. What do I have to write to get `2016-12-14-PHOTO-00000042.jpg` instead of `<p>18-12-16`? – Wouter Beugelsdijk May 19 '17 at 18:22

1 Answer

With GNU awk, using the 3rd argument to match():

$ awk 'match($0,/(.*)((19|20)[0-9]{2}-.*\.jpg)(.*)/,a) {
    $0=a[1] "<a href=\"" a[2] "\">" a[2] "</a>" a[4]
} 1' file
<p>16-12-16 11:45:43: Wouter: <a href="2016-12-16-PHOTO-00000128.jpg">2016-12-16-PHOTO-00000128.jpg</a>
</p>
<p>16-12-16 11:58:49: Jelle: Hello
</p>
<p>16-12-16 19:05:44: Jelle: <a href="2016-12-16-PHOTO-00000133.jpg">2016-12-16-PHOTO-00000133.jpg</a>
</p>

With other awks you'd use substr() with the RSTART and RLENGTH result of the match():

$ awk 'match($0,/(19|20)[0-9]{2}-.*\.jpg/) {
    a[1]=substr($0,1,RSTART-1)
    a[2]=substr($0,RSTART,RLENGTH)
    a[4]=substr($0,RSTART+RLENGTH)
    $0=a[1] "<a href=\"" a[2] "\">" a[2] "</a>" a[4]
} 1' file
<p>16-12-16 11:45:43: Wouter: <a href="2016-12-16-PHOTO-00000128.jpg">2016-12-16-PHOTO-00000128.jpg</a>
</p>
<p>16-12-16 11:58:49: Jelle: Hello
</p>
<p>16-12-16 19:05:44: Jelle: <a href="2016-12-16-PHOTO-00000133.jpg">2016-12-16-PHOTO-00000133.jpg</a>
</p>

You don't NEED to populate array a[] in that case of course, I'm just using it to contrast with what the gawk script is doing.
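
If a line can contain more than one filename, the greedy `.*` in the scripts above would treat everything from the first match to the last `.jpg` as a single name. A minimal sketch of one way around that, assuming filenames never contain a space or `<`: let `gsub()` replace every match, with `&` standing for the matched text. The function name `linkify` is made up for illustration, and `[0-9][0-9]` is written out instead of `[0-9]{2}` because some awk implementations (older mawk, for example) don't support interval expressions.

```shell
# linkify: hypothetical helper name; reads HTML on stdin, writes the result to stdout.
linkify() {
    awk '{
        # [^ <]* stops at the first space or "<", so each filename is matched
        # separately. In the replacement, & stands for the matched text.
        gsub(/(19|20)[0-9][0-9]-[^ <]*\.jpg/, "<a href=\"&\">&</a>")
    } 1'
}

# Two filenames on one line each get their own link.
printf '%s\n' '<p>x: 2016-12-16-PHOTO-00000128.jpg 2016-12-17-PHOTO-00000129.jpg' | linkify
```

Lines without a match pass through unchanged because of the trailing `1`, just as in the scripts above.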

Ed Morton
  • Gives a syntax error: `awk: syntax error at source line 1 context is >>> match($0,/(.*)((19|20)[0-9]{2}-.*\.jpg)(.*)/, <<< awk: bailing out at source line 1` – Wouter Beugelsdijk May 19 '17 at 18:16
  • You're using gawk extensions in an awk that isn't gawk. I updated my answer to show a non-gawk solution but do get gawk as right now you're missing a ton of useful functionality. – Ed Morton May 19 '17 at 18:37
  • This works beautifully, @Ed Morton. Thanks! I just learned of the existence of awk today (and I've now installed gawk) and I learned a lot! – Wouter Beugelsdijk May 19 '17 at 19:00
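
On the original observation that "the file isn't changed at all": awk only prints its result to standard output, so the input file stays as it was unless the output is saved and moved over it. A minimal sketch of that step, run against a scratch copy so nothing real is overwritten; the temporary directory and the `.new` suffix are assumptions for illustration, not part of the answer above.

```shell
# Scratch copy modelled on the index.html sample from the question.
dir=$(mktemp -d)
printf '%s\n' '<p>16-12-16 11:45:43: Wouter: 2016-12-16-PHOTO-00000128.jpg' '</p>' \
    > "$dir/index.html"

# Write the result to a second file, then replace the original only if awk succeeded.
# Redirecting straight back onto the input file would truncate it before awk reads it.
awk '{ gsub(/(19|20)[0-9][0-9]-[^ <]*\.jpg/, "<a href=\"&\">&</a>") } 1' \
    "$dir/index.html" > "$dir/index.html.new" &&
    mv "$dir/index.html.new" "$dir/index.html"

cat "$dir/index.html"
```

With GNU awk 4.1 or later, `gawk -i inplace '...' index.html` does the temporary-file step for you.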