How to define a space in a regular expression (in awk)?

Question

I want to print the text inside double quotes (" "). For example, I have the following strings:

gfdg "jkfgh" "jkfd fdgj fd-" ghjhgj
gfggf "kfdjfdgfhbg" "fhfghg" jhgj
jhfjhg "dfgdf" fgf
fgfdg "dfj jfdg jhfgjd" "hfgdh jfdhgd jkfghfd" hgjghj

And I want to print only the following:

"jkfgh" "jkfd fdgj fd-"
"kfdjfdgfhbg" "fhfghg"
"dfgdf"
"dfj jfdg jhfgjd" "hfgdh jfdhgd jkfghfd"

I have tried awk with the following regular expression:

awk '{for(i = 1; i <= NF; i++) if($i ~ /^\"[A-Za-z.$]*([A-Za-z.$][[:space:]]*[A-Za-z.$])*\"$/) print $i}' sample.txt

but it prints only the part before the first space, and it does not seem to recognize the spaces I have defined in my regular expression. My current output is:

"jkfgh"
"kfdjfdgfhbg" "fhfghg"
"dfgdf"
"dfj

As you can see, only the strings without any spaces are printed correctly.

I have also tried [[:blank:]], \t and also ' ', but none of them worked.

I would appreciate it if someone could tell me how to change this regular expression so that it includes spaces.

Do you want to print all fields except the first and the last for each line? — Tom Fenech, Apr 08 '15 at 11:33
i want to print only the texts inside " "s including spaces. — h ketab, Apr 08 '15 at 11:38
@hketab: I think you still have misconceptions; see my updated answer for a clarification, and, if you can use _GNU_ `awk`, a solution. You should also update your question to indicate that you want to extract _single_-quoted strings too, and that it is _incidental_ in the sample data that only the _interior_ tokens are quoted. — mklement0, Apr 10 '15 at 14:56
possible duplicate of [multiple field separator single quotes ' ' and double quotes " " in awk](http://stackoverflow.com/questions/29559774/multiple-field-separator-single-quotes-and-double-quotes-in-awk) — mklement0, Apr 10 '15 at 18:15

mklement0 · Answer 1 · 2021-05-12T14:08:15.407

The question's title is misleading and based on a fundamental misconception about awk.

The naïve answer is that a space can simply be represented as itself (a literal) in regular expressions in awk.
More generally, you can use [[:space:]] to match any whitespace character, including a space, a tab or a newline (GNU Awk also supports \s), and [[:blank:]] to match only a space or a tab.

However, the crux of the problem is that Awk, by default, splits each input line into fields on runs of whitespace. By definition, then, no input field can itself contain whitespace, and any attempt to match a space in a field value will invariably fail.
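The effect of default field splitting can be seen in a minimal sketch. The sample line is taken from the question; the helper names show_fields, count_fields_with_space and record_has_space are made up for illustration only:

```shell
# Print each whitespace-separated field of a line on its own line, in brackets.
show_fields() {
  printf '%s\n' "$1" | awk '{ for (i = 1; i <= NF; i++) printf "[%s]\n", $i }'
}

# Count the fields that contain a space; with default splitting this is always 0.
count_fields_with_space() {
  printf '%s\n' "$1" | awk '{ n = 0; for (i = 1; i <= NF; i++) if ($i ~ / /) n++; print n }'
}

# Matching against the whole record ($0) does see the space.
record_has_space() {
  printf '%s\n' "$1" | awk '{ print (($0 ~ / /) ? "yes" : "no") }'
}

line='fgfdg "dfj jfdg jhfgjd" hgjghj'
show_fields "$line"               # the quoted string is split into three fields
count_fields_with_space "$line"   # prints 0
record_has_space "$line"          # prints yes
```

So the regex itself is not the problem: a literal space matches fine against $0, but no individual field ever contains one.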

The input at hand has fields that are a mix of unquoted and quoted strings, but POSIX Awk has no support for recognizing quoted strings as fields.

@fedorqui has made a valiant attempt to work around the problem by splitting input into fields by double quotes, but it's no substitute for proper recognition of quoted strings, because it doesn't preserve the true field boundaries.

If you have GNU Awk, you can approximate recognition of quoted strings using the special FPAT variable, which, rather than defining a separator to split lines by, allows defining a regex that describes fields (and ignores tokens not recognized as such):

re='[[:alpha:]][[:alpha:] ]*[[:alpha:]]' # aux. shell variable
gawk -v FPAT="\"$re\"|'$re'" '{
  for(i=1;i<=NF;++i) printf "%s%s", $i, (i==NF ? "\n" : " ")
}' sample.txt

This will work with single- and double-quoted strings.

Explanation:

FPAT="\"$re\"|'$re'" defines fields to be either double- or single-quoted strings consisting only of letters and spaces, with a letter at either end, so the quoted content must be at least two letters long (as in the OP's code).
Note that this automatically excludes the UNquoted tokens on each input line - they will not be reflected in $1, ... and NF.
Therefore, the loop for(i=1;i<=NF;++i) is already limited to enumerating only the matching fields.

Note that the restrictions placed on the contents of the quoted strings in this case happen to sidestep a limitation inherent in this approach, namely the inability to deal with escaped, nested quotes (of the same type).

If this limitation is acceptable, you can use the following idiom to tokenize input that is a mix of barewords (unquoted tokens) and quoted strings:

gawk -v "FPAT=[^[:blank:]]+|\"[^\"]*\"|'[^']*'" ...
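If FPAT is not available, one possible portable workaround is to pull the quoted strings out of each line with match() and substr(), which are POSIX Awk features. This is only a minimal sketch that assumes no escaped or nested quotes; the function name extract_quoted and the variables q, re, rest and out are illustrative:

```shell
# Print only the double- or single-quoted strings of each input line,
# separated by single spaces. Lines without quoted strings print nothing.
extract_quoted() {
  awk -v q="'" '
  BEGIN {
    # Dynamic regex: a double-quoted or a single-quoted string, no nesting.
    re = "\"[^\"]*\"|" q "[^" q "]*" q
  }
  {
    rest = $0; out = ""
    # Repeatedly take the leftmost quoted string from what is left of the line.
    while (match(rest, re)) {
      out = (out == "" ? "" : out " ") substr(rest, RSTART, RLENGTH)
      rest = substr(rest, RSTART + RLENGTH)
    }
    if (out != "") print out
  }' "$@"
}

printf '%s\n' "a 'b c' \"d e\" f" | extract_quoted   # prints: 'b c' "d e"
```

Unlike FPAT, this does not populate $1, ... and NF; it only collects the quoted strings for printing.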

fedorqui · Answer 2 · 2015-04-08T12:16:29.413

You are only getting the strings without spaces because you loop over the fields, and fields are separated by whitespace. You therefore need an approach that handles spaces differently. Assuming there are no nested quotes, you can use, for example:

awk -F'"' '{for (i=2;i<NF;i+=2) printf "\"%s\"", $i; print ""}' file

That is, use " as the field separator and print the even-numbered fields.
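To see why the even-numbered fields are the quoted ones, here is a small sketch that numbers the pieces of one sample line from the question (the helper name number_fields is made up for illustration):

```shell
# Show how awk numbers the pieces of a line when " is the field separator.
number_fields() {
  printf '%s\n' "$1" | awk -F'"' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }'
}

number_fields 'jhfjhg "dfgdf" fgf'
# 1=[jhfjhg ]
# 2=[dfgdf]
# 3=[ fgf]
```

With balanced quotes, the odd-numbered fields hold the text outside the quotes and the even-numbered ones hold the text inside them.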

The same loop can be written a bit more elegantly by printing FS instead of a hard-coded quote:

awk -F'"' '{for (i=2;i<NF;i+=2) printf "%s%s%s", FS, $i, FS; print ""}' file

Note that with the previous commands the output has no space between the fields. If you need one, you can use:

awk -F'"' '{for (i=2;i<NF;i+=2) printf "%s%s%s%s", FS, $i, FS, (i>NF-2?"\n":" ")}' file

The expression (i>NF-2?"\n":" ") chooses the separator printed after each field: a newline after the last quoted field and a space otherwise. More idiomatically, you can also say (i>NF-2?RS:OFS), which relies on the default values of RS (record separator, newline) and OFS (output field separator, space).
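Written out with RS and OFS, the command could look like the following sketch. It assumes RS and OFS still have their default values, and the wrapper name print_quoted is made up for illustration:

```shell
# Print the double-quoted strings of each line, separated by OFS and
# terminated by RS (newline and space by default).
print_quoted() {
  awk -F'"' '{
    for (i = 2; i < NF; i += 2)
      printf "%s%s%s%s", FS, $i, FS, (i > NF - 2 ? RS : OFS)
  }' "$@"
}

printf '%s\n' 'jhfjhg "dfgdf" fgf' | print_quoted   # prints: "dfgdf"
```

Note that a line without any quotes produces no output at all here, not even an empty line, because the loop body never runs.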

Test

$ awk -F'"' '{for (i=2;i<NF;i+=2) printf "%s%s%s%s", FS, $i, FS, (i>NF-2?"\n":" ")}' file
"jkfgh" "jkfd fdgj fd-"
"kfdjfdgfhbg" "fhfghg"
"dfgdf"
"dfj jfdg jhfgjd" "hfgdh jfdhgd jkfghfd"

edited Apr 08 '15 at 12:16

answered Apr 08 '15 at 11:08


+ for the explanation; note that you're printing quotes in _both_ cases, (first, hard-coded into the format string, then with `FS`); there's no space between the output fields. If you append a space to the format string to fix that, you'll have to treat the last output field special to avoid a _trailing_ space. – mklement0 Apr 08 '15 at 11:54
@mklement0 fair point. I felt like golfing a little bit and added a solution with spaces in between strings. Thanks for the input! – fedorqui Apr 08 '15 at 12:17
yes I have problem with trailing space but not because of appending a space to the format string. I have changed the command to 'i<5' instead of 'i – h ketab Apr 08 '15 at 12:18
@hketab you can maybe change the `for` definition to `for (i=2;i<5 && i – fedorqui Apr 08 '15 at 12:22
@hketab: Not sure I fully understand, but if you're getting the right fields and your only problem is trailing spaces, a quick solution is to pipe Awk's output to `| sed 's/ *$//'`; a more efficient solution is to build the output as a _string_ in Awk first, then apply `sub(" *$", "", s)` to it, then print it. – mklement0 Apr 08 '15 at 12:47
I noticed that I have to consider single quotes `(' ')` as well. I have to use a regular expression including ", ' and also spaces. I tried \s as suggested in many websites for space, but did not work. any solution for this problem? – h ketab Apr 09 '15 at 09:22
@hketab since you are changing the requirements all the time, it might be best to start a brand new question. – fedorqui Apr 09 '15 at 09:55
i think from the beginning i asked for a regular expression including space. – h ketab Apr 09 '15 at 10:04
@hketab `i want to print only the texts inside " "s including spaces` http://stackoverflow.com/questions/29512854/how-to-define-a-space-in-a-regular-expression-in-awk/29512928#comment47182704_29512854 You should also read [How do I ask a good question](http://stackoverflow.com/help/how-to-ask). Good luck. – fedorqui Apr 09 '15 at 12:35
