multiple field separator single quotes ' ' and double quotes " " in awk

Question

I previously asked how to print the text inside two consecutive pairs of double quotes (" "). For example, I have the following strings:

gfdg "jkfgh" "jkfd fdgj fd-" ghjhgj
gfggf "kfdjfdgfhbg" "fhfghg" jhgj
jhfjhg "dfgdf" fgf
fgfdg "dfj jfdg jhfgjd" "hfgdh jfdhgd jkfghfd" hgjghj

And I want to print only the following:

"jkfgh" "jkfd fdgj fd-"
"kfdjfdgfhbg" "fhfghg"
"dfgdf"
"dfj jfdg jhfgjd" "hfgdh jfdhgd jkfghfd"

The answer I got was to use this command:

awk -F'"' '{for (i=2;i<5;i+=2) printf "%s%s%s%s", FS, $i, FS, (i>5-2?"\n":" ")}' sample.txt

Now I need to extend my question to cover ' ' as well, i.e. my text can be enclosed in ' ' as well as " ", as in the example below:

gfdg "jkfgh" "jkfd fdgj fd-" ghjhgj
gfggf "kfdjfdgfhbg" "fhfghg" jhgj
jhfjhg "dfgdf 'ffdg' gfd" "dgffd 'fdg'"fgf
fgfdg 'dfj "jfdg" jhfgjd' 'hfgdh jfdhgd jkfghfd' hgjghj

I would like to get the following result:

"jkfgh" "jkfd fdgj fd-"
"kfdjfdgfhbg" "fhfghg"
"dfgdf 'ffdg' gfd" "dgffd 'fdg'"
'dfj "jfdg" jhfgjd' 'hfgdh jfdhgd jkfghfd'

Can someone please help me?

@karthikmanchala I have used the command above. It works only for " ", and if I change the field separator to -F" ' " it also works for ' ', but I want single and double quotes to work together. I then tried -F"^' | \" " to get both field separators, but did not get the right result. — h ketab, Apr 10 '15 at 11:05
You could use multiple delimiters in awk. eg `awk -F'[/=]' '{print $3 "\t" $5 "\t" $8}'` — terminal ninja, Apr 10 '15 at 11:06
What would be the output if the input is `"foo"bar'buz'bar"foo'bar'"` ? — Avinash Raj, Apr 10 '15 at 11:06
@AvinashRaj the output is: `"foo"bar'buz'bar"foo'bar'"`, i.e. the whole text, because I always have a space between two single-quoted or double-quoted strings. — h ketab, Apr 10 '15 at 11:10
i think avinash means.. `"foo" bar 'buz' bar "foo 'bar'"`.. should the bar's be included in this case? — karthik manchala, Apr 10 '15 at 11:18
Can you have a newline in the text inside your quotes? Can you have an "escaped" quote, e.g. either `"foo\"bar"` or `"foo""bar"` are common escaping-constructs in CSVs? — Ed Morton, Apr 10 '15 at 12:03
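
The combined-separator idea raised in the comments can be tried directly. The following is only a sketch (the helper name `split_on_either_quote` is made up for illustration); it sets FS to the bracket expression `["']` and prints the even-numbered fields. It suggests why a plain field separator is not enough here: a quote of the other type nested inside a quoted string also splits the field.

```shell
# Sketch only: split_on_either_quote is a hypothetical helper name.
# FS is the bracket expression ["'], so BOTH quote characters split fields.
split_on_either_quote() {
  awk -F"[\"']" '{
    out = ""
    # even-numbered fields are the ones that sit "between" two separators
    for (i = 2; i <= NF; i += 2) out = out "[" $i "]"
    print out
  }'
}

split_on_either_quote <<'EOF'
gfdg "jkfgh" "jkfd fdgj fd-" ghjhgj
jhfjhg "dfgdf 'ffdg' gfd" "dgffd 'fdg'"fgf
EOF
# prints:
# [jkfgh][jkfd fdgj fd-]
# [dfgdf ][ gfd][dgffd ][]
```

The first line comes out as intended, but on the second line the nested 'ffdg' and 'fdg' are treated as separators too, so the quoted strings are cut into pieces.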

score 3 · Answer 1 · answered Apr 10 '15 at 14:37

{
  a = ""
  s = $0
  # while s contains a delimiter (either " or ')
  while (match(s, /['"]/)) {
    # save the delimiter
    c = substr(s, RSTART, 1)
    # remove up to and including the delimiter
    s = substr(s, RSTART + 1)
    # find the matching delimiter
    i = index(s, c)
    # append the saved delimiter and the first segment of s to the accumulator
    a = a " " c substr(s, 1, i)
    # remove the segment
    s = substr(s, i + 1)
  }
  # print the accumulator (dropping the first space)
  print substr(a, 2)
}
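
One possible way to run the script above, shown as a sketch: the file comes from `mktemp`, and the wrapper name `extract_quoted` is hypothetical. Saving the program to a file avoids having to escape the literal ' in `/['"]/` for the shell.

```shell
# Sketch: save the script to a temporary file, then feed it sample lines on stdin.
script=$(mktemp)
trap 'rm -f "$script"' EXIT     # remove the temporary file when the shell exits

cat > "$script" <<'AWK'
{
  a = ""
  s = $0
  while (match(s, /['"]/)) {
    c = substr(s, RSTART, 1)      # the opening delimiter
    s = substr(s, RSTART + 1)     # everything after it
    i = index(s, c)               # position of the matching delimiter
    a = a " " c substr(s, 1, i)
    s = substr(s, i + 1)
  }
  print substr(a, 2)
}
AWK

# extract_quoted is a hypothetical wrapper name used only for this example
extract_quoted() { awk -f "$script"; }

extract_quoted <<'EOF'
gfdg "jkfgh" "jkfd fdgj fd-" ghjhgj
jhfjhg "dfgdf 'ffdg' gfd" "dgffd 'fdg'"fgf
fgfdg 'dfj "jfdg" jhfgjd' 'hfgdh jfdhgd jkfghfd' hgjghj
EOF
# prints:
# "jkfgh" "jkfd fdgj fd-"
# "dfgdf 'ffdg' gfd" "dgffd 'fdg'"
# 'dfj "jfdg" jhfgjd' 'hfgdh jfdhgd jkfghfd'
```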

answered Apr 10 '15 at 14:37 by dave sines

Nice approach - +1! Wish you'd used some longer, more meaningful variable names though, and I think there's some intermediate steps in there that aren't helping the clarity so I tweaked it and added an alternative implementation at the end of my answer (http://stackoverflow.com/a/29561731/1745001). – Ed Morton Apr 10 '15 at 16:53

score 2 · Answer 2 · edited May 23 '17 at 12:29

Simplest thing is probably to go one char at a time:

$ cat tst.awk
BEGIN { FS="" }
{
    rec = ""
    for (i=1;i<=NF;i++) {
        if ( ($i=="\"") && !inSq ) {
            rec = rec (inDq ? $i : (rec ? " " : ""))
            inDq = !inDq
        }
        else if ( ($i=="'") && !inDq ) {
            rec = rec (inSq ? $i : (rec ? " " : ""))
            inSq = !inSq
        }

        if ( inDq || inSq ) {
            rec = rec $i
        }
    }
    print rec
}

$ awk -f tst.awk file
"jkfgh" "jkfd fdgj fd-"
"kfdjfdgfhbg" "fhfghg"
"dfgdf 'ffdg' gfd" "dgffd 'fdg'"
'dfj "jfdg" jhfgjd' 'hfgdh jfdhgd jkfghfd'

There may be an RE you could use with FPAT in gawk instead but I can't be bothered to think about it. The above can be made to work even when there's newlines inside your quotes in various ways, including by reading the whole file as one record using RS='^$' in gawk.
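
As a rough illustration of one of those "various ways", here is a sketch that does not rely on gawk's RS='^$'. As a portable stand-in, it joins all input lines into one string and then walks it one character at a time with substr(). The helper name `quoted_across_lines` is made up, and printing each quoted string as its own output record is an assumption about the desired output format.

```shell
# Sketch only: quoted_across_lines is a hypothetical helper name.
quoted_across_lines() {
  awk '
    # Join every input line into one string, restoring the newlines.
    { all = all $0 "\n" }
    END {
      inDq = inSq = 0
      cur = ""
      n = length(all)
      for (i = 1; i <= n; i++) {
        ch = substr(all, i, 1)
        if (ch == "\"" && !inSq) {
          inDq = !inDq
          # a closing double quote ends the current string
          if (!inDq) { print cur ch; cur = ""; continue }
        } else if (ch == "\047" && !inDq) {     # \047 is the single quote
          inSq = !inSq
          # a closing single quote ends the current string
          if (!inSq) { print cur ch; cur = ""; continue }
        }
        # collect characters only while inside some quoted string
        if (inDq || inSq) cur = cur ch
      }
    }
  '
}

printf 'a "b\nc" d\n' | quoted_across_lines
# prints the two-line string:
# "b
# c"
```

Building the string character by character is slow on large files; the sketch favours clarity over speed.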

I really like Dave Sines' answer (https://stackoverflow.com/a/29564199/1745001) but thought it could be a bit more concise so I massaged it to this:

$ cat tst.awk
{
    rec = ""
    while (match($0,/['"]/)) {
        delim   = substr($0,RSTART,1)
        fldLgth = index(substr($0,RSTART+1),delim) + 1
        rec     = (rec ? rec " " : "") substr($0,RSTART,fldLgth)
        $0      = substr($0,RSTART+fldLgth)
    }
    print rec
}
$ awk -f tst.awk file
"jkfgh" "jkfd fdgj fd-"
"kfdjfdgfhbg" "fhfghg"
"dfgdf 'ffdg' gfd" "dgffd 'fdg'"
'dfj "jfdg" jhfgjd' 'hfgdh jfdhgd jkfghfd'

If you like that then please accept dave's answer and just refer to this as an alternative implementation.

Thanks a lot. does it work for only two consecutive ' ' or " " ? — h ketab, Apr 10 '15 at 13:38
@hketab what does that mean? If you have some other cases that you haven't captured in your posted sample input then edit your question to show those cases. — Ed Morton, Apr 10 '15 at 15:05
I've posted an `FPAT`-based answer; if you're up for looking at it, please let me know if there are any problems with it. — mklement0, Apr 10 '15 at 17:42
Nicely done; both solutions are POSIX-compliant, from what I can tell. — mklement0, Apr 10 '15 at 17:47

score 2 · Answer 3 · edited May 23 '17 at 12:29

To quote the (adapted) core of my answer at https://stackoverflow.com/a/29513125/45375, where you asked essentially the same question (only obscured by some misconceptions):

If you have GNU Awk, you can approximate recognition of quoted strings using the special FPAT variable, which, rather than defining a separator to split lines by, allows defining a regex that describes fields (and ignores tokens not recognized as such):

gawk -v FPAT="\"[^\"]*\"|'[^']*'" '{
  for(i=1;i<=NF;++i) printf "%s%s", $i, (i==NF ? "\n" : " ")
}' sample.txt

This will work with single- and double-quoted strings, but does not support embedded escaped quotes of the same type.

Explanation:

FPAT="\"[^\"]*\"|'[^']*'" defines fields to be either double- or single-quoted strings, even empty ones.
Note that this automatically excludes the UNquoted tokens on each input line - they will not be reflected in $1, ... and NF.
Therefore, the loop for(i=1;i<=NF;++i) is already limited to enumerating only the matching fields. Fields do include the enclosing quotes, as desired here.
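
For awk implementations without FPAT, the same regex can be applied in a loop with the standard match() function. This is only a sketch of that idea, not part of the gawk solution above: the helper name `quoted_fields` is hypothetical, and `\047` is the octal escape for a single quote so that the program fits inside a single-quoted shell string.

```shell
# Sketch only: quoted_fields is a hypothetical helper name.
quoted_fields() {
  awk '
    # Same regex as the FPAT value; \047 stands for the single quote.
    BEGIN { pat = "\"[^\"]*\"|\047[^\047]*\047" }
    {
      out = ""
      rest = $0
      # Repeatedly take the leftmost quoted string, then continue after it.
      while (match(rest, pat)) {
        out = out (out == "" ? "" : " ") substr(rest, RSTART, RLENGTH)
        rest = substr(rest, RSTART + RLENGTH)
      }
      # Like the FPAT loop, print nothing for lines without quoted strings.
      if (out != "") print out
    }
  '
}

quoted_fields <<'EOF'
jhfjhg "dfgdf 'ffdg' gfd" "dgffd 'fdg'"fgf
fgfdg 'dfj "jfdg" jhfgjd' 'hfgdh jfdhgd jkfghfd' hgjghj
EOF
# prints:
# "dfgdf 'ffdg' gfd" "dgffd 'fdg'"
# 'dfj "jfdg" jhfgjd' 'hfgdh jfdhgd jkfghfd'
```

The same limitation applies as with FPAT: embedded escaped quotes of the same type are not supported.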

Looks good `+1`. You could support empty fields too by just changing the `+`s to `*`s. FWIW I usually write the print in the loop as `printf "%s%s", $i, (i — Ed Morton, Apr 10 '15 at 17:47

score 0 · Answer 4 · answered Apr 12 '15 at 03:31

The true requirements are shrouded in a mist of confusion, but the topic of robustly and generically parsing whitespace-separated tokens that may be double- or single-quoted is an interesting one.

Even though it can be done with awk, it is cumbersome, as evidenced by the existing answers; awk's field-parsing features do not directly support quoted strings.

Here's a much simpler perl solution, which uses the Text::ParseWords module, a core module that ships with Perl (note that the module name is case-sensitive):

perl -MText::ParseWords -lne '
  my @flds = Text::ParseWords::parse_line("\\s+", 1, $_);
  print join(" ", grep /^["\047]/, @flds);
' sample.txt

Text::ParseWords::parse_line("\\s+", 1, $_) parses each input line ($_) into tokens, based on whitespace as the separator, recognizing both single- and double-quoted strings, with support for \-escaped embedded quotes of the same type; the 1 as the 2nd argument indicates that the quotes should be retained.
grep /^["\047]/, @flds matches and returns only those tokens that start with " or ' (' is represented as escape sequence \047, because a ' cannot be directly embedded in a single-quoted shell string).
print join(" ", ... joins the result tokens with a space as the separator and prints the result.

Caveat: This solution differs from the OP's desired sample output in one respect: "dgffd 'fdg'"fgf is recognized as a single token as a whole, not just the "dgffd 'fdg'" prefix.
If you really want only the prefix in this scenario, use the following as the Perl script's 2nd line, but note that doing so means that the extraction will malfunction with embedded escaped quotes:

print join(" ", map { s/^((["\047]).*\2).*/$1/r } grep /^["\047]/, @flds);

score 0 · Answer 5 · answered Apr 12 '15 at 15:34

Since a specific comment-question on your other question (implicitly) denied that it was only the first and last words that you wanted to exclude, and since none of your (limited) examples show embedded bare text which is not required:

BEGIN {
    FS = ""
}
{
    for (CharFromStart = 1; CharFromStart <= NF; CharFromStart++) {
        if ($CharFromStart ~ /"|'/) {
            break
        }
    }
    for (CharFromEnd = NF; CharFromEnd > 0; CharFromEnd--) {
        if ($CharFromEnd ~ /"|'/) {
            break
        }
    }
    if (CharFromStart <= CharFromEnd) {
        print ">" substr($0, CharFromStart, (CharFromEnd - CharFromStart + 1)) "<"
    }
    else {
        print "Move along please, nothing to see here"
    }
}

With some augmented test data:

gfdg "jkfgh" "jkfd fdgj fd-" ghjhgj
gfggf "kfdjfdgfhbg" "fhfghg" jhgj
jhfjhg "dfgdf 'ffdg' gfd" "dgffd 'fdg'"fgf
fgfdg 'dfj "jfdg" jhfgjd' 'hfgdh jfdhgd jkfghfd' hgjghj
jhfjhg "dfgdf 'ffdg' gfd" "dgffd 'fdg'" fgf
jhfjhg "dfgdf         'ffdg   ' gfd"        "        dgffd 'fdg'"fgf
kiuj jajdj "dfgdf         'ffdg   ' gfd"        "        dgffd 'fdg'" s fgf
dslkjflkdsj ldsk gfdkg ;kdsa;lfkdsl f ljflkdsjf l
ldsfl dsjfhkjds dshfjkhds kdskjfhdskjhf " dsflkdsjflk
' dlfkjdslfj kdsjflkdslj djlkfjdslkjf 
dskfjds dshfdkjsh dshjkjfhds "
"""

Gives:

>"jkfgh" "jkfd fdgj fd-"<
>"kfdjfdgfhbg" "fhfghg"<
>"dfgdf 'ffdg' gfd" "dgffd 'fdg'"<
>'dfj "jfdg" jhfgjd' 'hfgdh jfdhgd jkfghfd'<
>"dfgdf 'ffdg' gfd" "dgffd 'fdg'"<
>"dfgdf         'ffdg   ' gfd"        "        dgffd 'fdg'"<
>"dfgdf         'ffdg   ' gfd"        "        dgffd 'fdg'"<
Move along please, nothing to see here
>"<
>'<
>"<
>"""<

This works by setting FS, the built-in field-separator variable, to the empty string. In awks that support an empty FS (gawk and mawk do; POSIX leaves the behaviour unspecified), this causes each character on the line to be treated as an individual field.

One loop runs "up" the line, using $CharFromStart to test each character, until it finds the first quote or apostrophe. A second loop runs "down" the line from the end to find the last quote or apostrophe.

A quick check confirms that at least one was found; if so, the substring of the line from the first quote or apostrophe up to and including the last one is printed.

Where there is only one quote or apostrophe on the line, it will be printed, but it is simple to change that.

If the quotes or apostrophes are "unbalanced", the extraction is unaffected (unless you actually want to know about it). Embedded blanks, tabs and the like stay where they are, relative to the first quote or apostrophe.
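
The same first-to-last extraction can also be sketched without an empty FS, using a single greedy match(). This is an alternative formulation rather than the script above: the helper name `first_to_last_quote` is hypothetical, `\047` stands for the single quote, and lines with fewer than two quote characters are skipped instead of printed.

```shell
# Sketch only: first_to_last_quote is a hypothetical helper name.
first_to_last_quote() {
  awk '
    # q matches one quote character of either kind; \047 is the single quote.
    BEGIN { q = "[\"\047]" }
    # Leftmost-longest matching spans from the first quote to the last one.
    match($0, q ".*" q) { print substr($0, RSTART, RLENGTH) }
  '
}

first_to_last_quote <<'EOF'
jhfjhg "dfgdf 'ffdg' gfd" "dgffd 'fdg'"fgf
no quotes here
only one " on this line
EOF
# prints:
# "dfgdf 'ffdg' gfd" "dgffd 'fdg'"
```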

score 0 · Answer 6 · answered Sep 09 '15 at 09:06

A simple method

awk '{$1="";sub(/^ /,"");sub(/fgf/,"")}NR!=3{NF=NF-1}1' file
"jkfgh" "jkfd fdgj fd-"
"kfdjfdgfhbg" "fhfghg"
"dfgdf 'ffdg' gfd" "dgffd 'fdg'"
'dfj "jfdg" jhfgjd' 'hfgdh jfdhgd jkfghfd'

answered Sep 09 '15 at 09:06 by Claes Wikner
