
I have a delimited file like the one below:

donaldtrump   23  hyd  tedcruz      25  hyd  james       27  hyd  

The first set of three fields should be one record, the second set of three fields another record, and so on. What is the best way to load this file into a Hive table with columns (emp_name, age, location)?

  • You need to do some pre-processing. What is the row delimiter? Is all the data on a single line? – Abhi May 06 '16 at 20:55

1 Answer


A very, very dirty way to do that could be:

  1. design a simple Perl script (or Python script, or sed command line) that takes source records from stdin, breaks them into N logical records, and pushes them to stdout (see the sketch after this list)
  2. tell Hive to use that script/command as a custom Map step, using the TRANSFORM syntax -- the manual covers it but is rather cryptic, so you are better off searching for worked examples (one is sketched below)
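For step 1, a minimal Python sketch, assuming the fields are separated by runs of whitespace and that every three consecutive fields form one logical record (the file name split_records.py is made up for this example):

    #!/usr/bin/env python
    # Pre-processing script for Hive TRANSFORM: reads raw lines from
    # stdin, emits one tab-delimited row per group of 3 fields on stdout.
    import sys

    FIELDS_PER_RECORD = 3  # emp_name, age, location

    for line in sys.stdin:
        fields = line.split()
        for i in range(0, len(fields) - FIELDS_PER_RECORD + 1, FIELDS_PER_RECORD):
            # Hive's TRANSFORM expects tab-separated columns on stdout
            print('\t'.join(fields[i:i + FIELDS_PER_RECORD]))

And for step 2, a sketch of the TRANSFORM call; the TRANSFORM/USING/AS syntax is standard Hive, but raw_staging (an assumed one-column staging table over the raw file) and emp (the assumed target table) are made-up names:

    -- Ship the script to the cluster nodes
    ADD FILE /path/to/split_records.py;

    -- Pipe each raw line through the script and land the
    -- resulting 3-column records in the target table
    INSERT OVERWRITE TABLE emp
    SELECT TRANSFORM (raw_line)
      USING 'python split_records.py'
      AS (emp_name STRING, age INT, location STRING)
    FROM raw_staging;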

Caveat: this "streaming" pattern is rather slow, because of the necessary serialization/deserialization to plain text. But once you have a working example, the development cost is minimal.

Additional caveat: of course, if source records must be processed in order -- because a logical record can spill onto the next row, for example -- then you have a big problem, because Hadoop may split the source file arbitrarily and feed the splits to different Mappers. And you have no criteria for a DISTRIBUTE BY clause in your example. Then, a very-very-very dirty trick would be to compress the source file with GZIP so that it is de facto un-splittable.

– Samson Scharfrichter