-1

I have to handle pipe delimited flat files, in which each field comes within double quotes.

sample data:

"1193919"|"false"|""|"Mr. Andrew Christopher Alman"|""|""|"Mr."

I have written many gawk commands in my scripts. Now the issue is:

issue: Consider this row: "1193919|false||Mr. Andrew Christopher Alman"|""|"Mr." My script is taking the above as 6 different fields

"1193919
false
[null]
Mr. Andrew Christopher Alman"
[null]
"Mr."

But the data files are sent with the intent that "1193919|false||Mr. Andrew Christopher Alman" should be taken as one field, as surrounded by double quotes.

My thought: I was thinking to change the field separator from | to "|" This has few issues. The last and first fields will come as "1193919 and Mr." i dont want to use '["][|]["]|^["]|["]$' as field separator, because this will increase the number fields and my other codes will have to go though a major change.

I am asking for a solution something like: Use | as a field separator only if it is followed by " and preceded by ". But the field separator will be | and not "|"

issue 2:

"1193919""|"false"""|""|"Mr. Andrew Christopher Alman"

At the same time I want to report an error for "false""", something like /^"["]+ | ["]+["]$/ and not /^""$/ Good data should be in below format

"1193919"|"false"|""|"Mr. Andrew Christopher Alman"
DPR
  • 25
  • 6
  • good solutions below, but why send data with field delimters embedded in fields. Use a value for FS that isn't in the data would be another solution to this problem. Good luck. – shellter Sep 27 '16 at 18:29
  • Sometimes you just have data in that way, a data field may just have a delimiter as a data, and you just gotta handle it. I'm writing this scripts for handling huge amount of data(for a data warehouse). And I expect all sorts of data. Any thoughts on issue 2 solution? – DPR Sep 28 '16 at 06:20

2 Answers2

2

you can use gawk's FPAT variable to define quoted fields

$ gawk -v FPAT='[^|]*|"[^"]*"' '{print $1}'

and add your logic around the number of field etc.

karakfa
  • 66,216
  • 7
  • 41
  • 56
  • Thank you. But how can I overcome issue 2, as I have mentioned in the question? What is the actual syntax to convey the below Pseudocode. /^"["]+ | ["]+["]$/ and not /^""$/ – DPR Sep 28 '16 at 06:03
  • Could you please explain your solution? What I understand is, the data field can not be "|", or it can not be " within double quotes. But I want to write the data field can also be "|" if within double quotes. I want "|" should not be considered as a special character(field separator) when inside double quotes – DPR Sep 28 '16 at 07:45
  • Now I understand the solution. But I dont have FPAT option with gawk in my system. ANy alternative? – DPR Sep 28 '16 at 08:00
1

The main idea is to handle all irregularities before awk (because many irregular cases are possible and awk works best on regular files).

You can replace specific patterns with a unique symbol that doesn't occur within fields and then use it as a field delimiter:

sed 's/"|"/"\t"/g' file.txt |\
awk -F '\t' '{for(i = 1; i <= NF; i++){print i, $i} }'

I'd use something that is highly unlikely to occur in a text, e.g. vertical tab \v. If you are not sure about contents of the fields, then you can determine a symbol that is not present in the current chunk of data and process it with this symbol as a delimiter.

The same approach works for issue 2. If you know that some patterns are incorrect, then you can either exclude or fix them before processing, e.g. with

sed 's/\([^|"]\)"\+|/\1"|/g'
Andrey
  • 2,503
  • 3
  • 30
  • 39
  • Thanks Andrey. But I'm writing to handle huge amount of data for a data warehouse, and I have no idea, what wont be there in the data field. And any thoughts for the issue 2 as I've mentioned in the question.. Thank you – DPR Sep 28 '16 at 06:09
  • Thank you Andrey. Do you have any idea, what are other characters that are very unlikely to occur in a text? – DPR Sep 30 '16 at 06:27
  • It depends on the data. If your environment supports Unicode, you can play with non-characters, e.g. try `echo -e 'a\uffffb' | awk -F "$(echo -e "\uffff")" '{for(i = 1; i <= NF; i++){print i, $i} }'`. – Andrey Sep 30 '16 at 17:37