I have a master file (Master.txt) in which each row is a string defining an HTML page and the fields are tab-delimited. The record layout is as follows:

<item_ID>   <field_1>   <field_2>   <field_3>
1   1.html  <html>[content for 1.html in HTML format]</html>    <EOF>
2   2.html  <html>[content for 2.html in HTML format]</html>    <EOF>
3   3.html  <html>[content for 3.html in HTML format]</html>    <EOF>

The HTML page is defined in <field_2>. <field_3> may not be necessary, but is included here to indicate the logical location of end_of_file.

How can I use awk to generate one file per row (each row begins with <item_ID>), where the content of the new file is <field_2> and the name of the new file is <field_1>?

I am running GnuWin32 under Windows 7 and will configure an awk solution to execute in a .bat file. Unfortunately I can't do pipelining in Windows, so I'm hoping for a single-awk-program solution.

TY in advance.

Jay Gray
  • Something like `awk -F"\t" '{print $3 > $2}' file` should do it. – fedorqui Aug 05 '14 at 15:02
  • possible duplicate of [Using the first field in AWK as file name](http://stackoverflow.com/questions/21555588/using-the-first-field-in-awk-as-file-name) – fedorqui Aug 05 '14 at 15:05
  • Can `[content for 1.html in HTML format]`, etc. contain tab characters or not? – Ed Morton Aug 05 '14 at 15:06
  • @fedorqui that would be an answer! – Kent Aug 05 '14 at 15:15
  • @Ed the only use of tabs is as a delimiter. Specifically, there are no tabs in the field with HTML content. – Jay Gray Aug 05 '14 at 15:16
  • @fedorqui Sorry - completely missed that solution when searching stackoverflow. Also - FYI - it did not pop up when options were offered after entering the title. – Jay Gray Aug 05 '14 at 15:18
  • Jay, if there are guaranteed to be no tabs in the HTML then @fedorqui's suggestion would work fine if you add `NR>1` to the front of it. – Ed Morton Aug 05 '14 at 15:35
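
For anyone wanting to try the comment suggestion end to end, here is a minimal sketch. It assumes Master.txt starts with a header row (hence `NR>1`), that the HTML contains no tabs, and that it is run from a POSIX-style shell; the sample rows are made up for illustration.

```shell
# Build a small sample file (hypothetical data); printf turns \t into real tabs.
printf '<item_ID>\t<field_1>\t<field_2>\t<field_3>\n' >  Master.txt
printf '1\t1.html\t<html>one</html>\t<EOF>\n'         >> Master.txt
printf '2\t2.html\t<html>two</html>\t<EOF>\n'         >> Master.txt

# Skip the header row, then write field 3 into the file named by field 2.
awk -F'\t' 'NR>1 { print $3 > $2 }' Master.txt
```

After this runs, 1.html and 2.html each hold one line of HTML. Without `NR>1`, the header row would also be written out, to a file literally named `<field_1>`.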

1 Answer

Assuming the HTML in field 3 may or may not contain tabs:

awk -F'\t' 'match($0,/<html>.*<\/html>/){print substr($0,RSTART,RLENGTH) > $2}' file
Ed Morton
  • @Kent yes, it is $2 (the field with the target file name) – Jay Gray Aug 05 '14 at 15:23
  • @Ed I now have several Ed_Morton_files in my library. – Jay Gray Aug 05 '14 at 15:24
  • @Kent & Jay, thanks, I updated the solution to print to $2. If there are guaranteed to be no tabs in the HTML, though, then fedorqui's solution in a comment would work just as well if you add `NR>1` to the front of it. – Ed Morton Aug 05 '14 at 15:33
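
A possible variation on the answer above, offered as a sketch rather than a tested Windows recipe: the same `match()` logic kept in a separate program file (hypothetically named split.awk) and run with `awk -f`. The sample rows are invented, and the `close($2)` call is an addition that assumes each file name appears only once in Master.txt.

```shell
# Sample input (hypothetical data); the second data row has a tab inside the HTML.
printf '<item_ID>\t<field_1>\t<field_2>\t<field_3>\n' >  Master.txt
printf '1\t1.html\t<html>one</html>\t<EOF>\n'         >> Master.txt
printf '2\t2.html\t<html>a\tb</html>\t<EOF>\n'        >> Master.txt

# Keep the awk program in its own file so no command-line quoting is involved.
cat > split.awk <<'AWK'
BEGIN { FS = "\t" }
# Take the <html>...</html> span from the whole record, embedded tabs included.
match($0, /<html>.*<\/html>/) {
    print substr($0, RSTART, RLENGTH) > $2
    close($2)   # free the file handle; assumes each file name occurs once
}
AWK

awk -f split.awk Master.txt
```

The header row never matches the `<html>...</html>` pattern, so it is skipped without needing `NR>1`. In a .bat file only the final `awk -f split.awk Master.txt` line would be needed, which avoids the problem that cmd.exe does not treat single quotes as quoting characters.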