
I have a delimited file like the one below:

donaldtrump   23  hyd  tedcruz      25  hyd  james       27  hyd  

The first set of three fields should be one record, the second set of three fields another record, and so on. What is the best way to load this file into a Hive table with columns (emp_name, age, location)?

  • You need to do some pre-processing. What is the row delimiter? Is all the data on a single line? – Abhi May 06 '16 at 20:55

1 Answer


A very, very dirty way to do that could be:

  1. design a simple Perl script (or Python script, or sed command line) that takes source records from stdin, breaks them into N logical records, and pushes them to stdout (see the sketch after this list)
  2. tell Hive to use that script/command as a custom Map step, using the TRANSFORM syntax -- the manual covers it but is rather cryptic, so you are better off searching for worked examples (one is sketched below)
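For step 1, a minimal Python sketch, assuming the fields are separated by runs of whitespace and that every three consecutive fields form one logical record (the file name split_records.py is made up for this example):

    #!/usr/bin/env python
    # Pre-processing script for Hive TRANSFORM: reads raw lines from
    # stdin, emits one tab-delimited row per group of 3 fields on stdout.
    import sys

    FIELDS_PER_RECORD = 3  # emp_name, age, location

    for line in sys.stdin:
        fields = line.split()
        for i in range(0, len(fields) - FIELDS_PER_RECORD + 1, FIELDS_PER_RECORD):
            # Hive's TRANSFORM expects tab-separated columns on stdout
            print('\t'.join(fields[i:i + FIELDS_PER_RECORD]))

And for step 2, a sketch of the TRANSFORM call; the TRANSFORM/USING/AS syntax is standard Hive, but raw_staging (an assumed one-column staging table over the raw file) and emp (the assumed target table) are made-up names:

    -- Ship the script to the cluster nodes
    ADD FILE /path/to/split_records.py;

    -- Pipe each raw line through the script and land the
    -- resulting 3-column records in the target table
    INSERT OVERWRITE TABLE emp
    SELECT TRANSFORM (raw_line)
      USING 'python split_records.py'
      AS (emp_name STRING, age INT, location STRING)
    FROM raw_staging;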

Caveat: this "streaming" pattern is rather slow, because of the necessary serialization/deserialization to plain text. But once you have a working example, the development cost is minimal.

Additional caveat: of course, if source records must be processed in order -- because a logical record can spill onto the next row, for example -- then you have a big problem, because Hadoop may split the source file arbitrarily and feed the splits to different Mappers. And you have no criteria for a DISTRIBUTE BY clause in your example. Then, a very-very-very dirty trick would be to compress the source file with GZIP so that it is de facto un-splittable.

– Samson Scharfrichter