0

Hi, I have been trying to write an awk script that processes a text file whose contents look like the following:

    . [135]Edwards Engineering Pty Ltd
       Quality Structural Steel. Specialising In Fabrication And Steel
       Stairs
       21- 23 Ada Ave, Brookvale NSW 2100
       ph: (02) 9938 5320

 . [269]Diavolo Steel Fabrication
       5 Humeside Drv, Campbellfield VIC 3061
       ph: (03) 9357 7947


       . [40]WH Williams Pty Ltd
       Your Partner For High Quality Custom-Made Metal Products
       Short lead times & unbeatable quality. Make us the first choice for
       your entire sheetmetal laser cutting,bending,welding & more.
       61- 77 Egerton St, Silverwater NSW 2128
       ph: (02) 9647 1277
            [41]www.whwilliams.com.au

and so on; the file is actually huge. The script I managed to write is:

awk '$2 ~ /\. \[/{$1=x; print}' RS=\*  FS='\n' OFS='|' Myfile > excel.csv

This command converts my text file into a CSV file with one record per company. But as you can see, the entries in the example above vary in length, so I am getting a CSV file with irregular formatting.

So what I want to do now is change the command so that it puts (1) the company title in one cell, (2) the description, if it exists, in one cell, (3) the address in one cell, (4) the phone number in one cell, and (5) the website in one cell. If any particular component doesn't exist, its cell should be left empty.

I am new to Linux, and pretty new to shell and awk too, so can anyone help me out, if this is possible at all?

Kiran Vemuri

3 Answers

0

I used the approach of converting each record, which spans multiple lines, into a single line whose parts are separated by ~. You can then write logic on top of this to convert the result to a CSV file (which I haven't done).

cat ip_file.txt | tr '\n' '~' | tr '[' '\n'

NOTE: This assumes that [ does not occur inside a record.
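The same "one record per line" idea can also be sketched without splitting on [ at all. This is only a sketch, and it assumes that records are separated by truly empty lines, as in the sample; flatten_records is a hypothetical helper name, not something from the question.

```shell
# Hypothetical helper: join the lines of each blank-line-separated record with "~".
flatten_records() {
  awk '
    BEGIN { RS = ""; FS = "\n"; OFS = "~" }   # RS="" is paragraph mode
    { $1 = $1; print }                        # assigning a field rebuilds $0 with OFS
  '
}
```

It could be used as `flatten_records < ip_file.txt`; the leading whitespace of each original line is kept, so a later step would still need to trim it.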

Raghuram
  • nope buddy. .. that's just spoiling the format i've already created! – Kiran Vemuri Jun 15 '12 at 10:34
  • can we use the match function to pull it off? i am just confused and struggling with it.. this is what happens if a noob like me starts working on some serious stuff! – Kiran Vemuri Jun 15 '12 at 10:42
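On the match question in the comment above: awk's match() sets RSTART and RLENGTH, which substr() can then use to cut out the matched text. Here is a minimal sketch for the phone lines only, assuming every phone number looks like the ones in the sample ("ph: (02) 9938 5320"); extract_phones is a hypothetical name.

```shell
# Hypothetical helper: print just the phone numbers found on "ph:" lines.
extract_phones() {
  awk '
    match($0, /ph: *\([0-9]+\) *[0-9]+ *[0-9]+/) {
        s = substr($0, RSTART, RLENGTH)   # the matched "ph: (..) .... ...." text
        sub(/^ph: */, "", s)              # drop the "ph:" label
        print s
    }
  '
}
```

This only pulls out one component, so it would have to be combined with per-record handling to build the full CSV row.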
0

I have to admit this is a somewhat complicated scenario, since you have to cope with fields that span multiple lines. The following requirements come to mind:

  • Each field may span multiple lines
  • The output must follow a specific format, here CSV, i.e. comma-separated text
  • Characters that are special in CSV must be escaped
  • Some assumptions about field formats are needed, e.g. phone numbers begin with ph: and addresses begin with a street number

Here is a code snippet for your reference:

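#!/usr/bin/awk -f
# A regular expression as RS is an extension (supported by gawk and mawk).
BEGIN{
    RS="\\.[ \t]* \\[[0-9]+\\]";   # a record starts after ". [N]"
    FS="\n";
    OFS=",";
}

# Quote a cell for CSV: double any embedded quotes, then wrap in quotes.
function csv_quote(s){
    gsub(/"/, "\"\"", s);
    return "\"" s "\"";
}

# Join lines $i.. up to (not including) the first line matching regex,
# print them as one CSV cell, and return the index of that line.
function find_next_field_until_regex(regex, i,    result, field){
    result = "";
    for (; i <= NF; i++){
        field = $i;
        gsub(/^[ \t]+|[ \t]+$/, "", field);
        if (field ~ regex){
            break;
        }
        if (field != ""){
            result = (result == "" ? field : result " " field);
        }
    }
    printf("%s%s", csv_quote(result), OFS);
    return i;
}

{
    if(NF>1){
        title = $1;
        gsub(/^[ \t]+|[ \t]+$/, "", title);
        printf("%s%s", csv_quote(title), OFS);
        i = 2;
        i = find_next_field_until_regex("^[0-9]+", i); # description
        i = find_next_field_until_regex("^ph: ", i);   # address
        i = find_next_field_until_regex("www\\.", i);  # phone
        rest = "";
        for (; i <= NF; ++i){                          # website
            field = $i;
            gsub(/^[ \t]+|[ \t]+$/, "", field);
            rest = rest field;
        }
        printf("%s\n", csv_quote(rest));
    }
}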

Also check gist snippet.
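The field-format assumptions listed above can also be applied line by line instead of scanning in order. The following is a different technique from the script above, offered as a minimal sketch: it assumes records are separated by empty lines, that address lines start with a digit, phone lines with ph:, and website lines with a [N] marker. listing_to_csv, csv_quote and join_cell are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: read the listing on stdin, write one CSV row per company.
listing_to_csv() {
  awk '
    # Wrap a cell in double quotes and double any embedded quotes.
    function csv_quote(s) {
        gsub(/"/, "\"\"", s)
        return "\"" s "\""
    }
    # Append text to a cell, separating joined lines with one space.
    function join_cell(cell, text) {
        return (cell == "" ? text : cell " " text)
    }
    BEGIN { RS = ""; FS = "\n"; OFS = "," }   # paragraph mode, one line per field
    $1 ~ /\. *\[[0-9]+\]/ {
        title = desc = addr = phone = site = ""
        for (i = 1; i <= NF; i++) {
            line = $i
            gsub(/^[ \t]+|[ \t]+$/, "", line)        # trim the line
            if (i == 1) {
                sub(/^\. *\[[0-9]+\]/, "", line)     # drop the ". [N]" marker
                title = line
            } else if (line ~ /^ph:/) {
                sub(/^ph:[ \t]*/, "", line)
                phone = line
            } else if (line ~ /^\[[0-9]+\]/) {
                sub(/^\[[0-9]+\]/, "", line)         # drop the "[N]" link marker
                site = line
            } else if (line ~ /^[0-9]/) {            # assumption: addresses start with a digit
                addr = join_cell(addr, line)
            } else if (line != "") {
                desc = join_cell(desc, line)
            }
        }
        print csv_quote(title), csv_quote(desc), csv_quote(addr), csv_quote(phone), csv_quote(site)
    }
  '
}
```

It could be run as `listing_to_csv < Myfile > excel.csv`. Because each line is classified on its own, a missing component simply leaves its cell empty, but a description line that happens to start with a digit would be misfiled as part of the address.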

Fei
0
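awk '$1 ~ /\. \[/ {
 sub(/\. \[[0-9]*]/, "", $1)            # drop the ". [N]" marker from the title
 if ($2 ~ /^ *[0-9]/) $2 = OFS$2        # no description: insert an empty cell
 n = split($0, a, OFS)
 while (n >= 3 && a[3] !~ /^ *[0-9]/)   # merge extra description lines into a[2]
 {                                      # (n >= 3 stops the loop if no address exists)
  a[2] = a[2]a[3]
  for (i=3; i<=n; ++i) a[i]=a[i+1]
  --n
 }
 print a[1],a[2],a[3],a[4],a[5] }' RS= FS='\n' OFS='|' Myfile > excel.csv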
Armali