
I want to ingest large CSV files (up to 6 GB) on a regular basis into a Hadoop single node with 32 GB RAM. The key requirement is to register the data in HCatalog. (Please do not discuss the requirements, it is a functional demo.) Performance is not essential. The Hive tables shall be partitioned.

So far I have been using Pig. The lesson learned so far is that the main challenge is the heap: the generated MapReduce jobs fill it up quickly, and once the JVM spends about 98% of its time garbage collecting, the job fails with a GC overhead error.

One solution might be to chunk the large files into smaller pieces. However, I am also considering that a technology other than Pig might not fill up the heap as much. Any ideas on how to approach such a use case? Thanks.

Stefan Papp

1 Answer


The best approach for this is to use HiveQL's LOAD DATA instead of Pig's LOAD. It amounts to just a file transfer into the table's location in HDFS, so no MapReduce jobs are started and the heap is not under pressure. Since HCatalog shares the Hive metastore, a table created and loaded this way is registered for HCatalog automatically.
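For illustration, a minimal sketch of what that could look like. The table name, columns, partition column, and HDFS path are assumptions for the example, not taken from the question:

    -- Hypothetical partitioned table for the incoming CSV data
    -- (names and layout are assumed for illustration).
    CREATE TABLE IF NOT EXISTS staging_events (
        event_id STRING,
        payload  STRING
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- LOAD DATA only moves the file into the partition's directory in HDFS
    -- and registers the partition in the metastore; no MapReduce job runs.
    LOAD DATA INPATH '/incoming/events_2016-01-01.csv'
    INTO TABLE staging_events
    PARTITION (dt = '2016-01-01');

Because the move happens inside HDFS, even a 6 GB file is handled without loading its contents into the JVM heap.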

Stefan Papp