0

How do we build Normalized table from DeNormalized text file one?

Thanks for your replies/time.

We need to build a Normalized DB Table from DeNormalized text file. We explored couple of options such as unix shell , and PostgreSQL etc. I am looking learn better ideas for resolutions from this community.

The input text file is various length with comma delimited records. The content may look like this:

XXXXXXXXXX , YYYYYYYYYY, TTTTTTTTTTT, UUUUUUUUUU, RRRRRRRRR,JJJJJJJJJ
111111111111,   22222222222, 333333333333, 44444444, 5555555, 666666
EEEEEEEE,WWWWWW,QQQQQQQ,PPPPPPPP

We like to normalize as follows (Split & Pair):

XXXXXXXXXX , YYYYYYYYYY
TTTTTTTTTTT, UUUUUUUUUU
RRRRRRRRR,JJJJJJJJJ
111111111111,   22222222222
333333333333, 44444444
5555555, 666666
EEEEEEEE,WWWWWW
QQQQQQQ,PPPPPPPP

Do we need to go with text pre-process and Load approach?

If yes, what is the best way to pre-process?

Are there any single SQL/Function approach to get the above?

Thanks in helping.

2 Answers2

4

Using gnu awk (due to the RS)

awk '{$1=$1} NR%2==1 {printf "%s,",$0} NR%2==0' RS="[,\n]" file
XXXXXXXXXX,YYYYYYYYYY
TTTTTTTTTTT,UUUUUUUUUU
RRRRRRRRR,JJJJJJJJJ
111111111111,22222222222
333333333333,44444444
5555555,666666
EEEEEEEE,WWWWWW
QQQQQQQ,PPPPPPPP

{$1=$1} Cleans up and remove extra spaces
NR%2==1 {printf "%s,",$0} prints odd parts
NR%2==0 prints even part and new line
RS="[,\n]" sets the record to , or newline

Jotne
  • 40,548
  • 12
  • 51
  • 55
  • Thanks for your suggestion. I tried. It works. **Will RS =.. put ^M for some records at the end?** I have the same format with lot of trailing commas at the end of each records. Like 'XXXXXXXXXX , YYYYYYYYYY, TTTTTTTTTTT, UUUUUUUUUU, RRRRRRRRR,JJJJJJJJJ,,,,,,,,,,,,' How do I process this file using your awk command? – user2647763 - RIMD Dec 10 '13 at 02:42
  • It will not add `^M`. You can just test for `$0` to not get the extra commas like this: `awk '{$1=$1} NR%2==1 && $0 {printf "%s,",$0} NR%2==0 && $0' RS="[,\n]" file` – Jotne Dec 12 '13 at 11:45
1

Here is an update. Here is what I did in Linux server.

    sed -i 's/\,,//g' inputfile   <------ Clean up lot of trailing commas

    awk '{$1=$1} NR%2==1 {printf "%s,",$0} NR%2==0' RS="[,\n]" inputfile <----Jotne's idea

    dos2unix -q -n inputfile outputfle <------ to remove ^M in some records

    outputfile is ready to process as comma delimited format  

Any thoughts to improve above steps further?

Thanks in helping.