
I went through many pages on Stack Overflow about this, but I am still confused. Even if this is a duplicate or a similar question, please answer.

I want to compare one file against another in Pig, and I want one of the files to be in the distributed cache so that every mapper has it locally. How can I implement this in Pig?

Pooja3101
  • Can you clarify what you mean by "compare"? – reo katoa Feb 20 '14 at 15:05
  • Use a LOAD UDF (you'll probably have to write it, though) – Chris Gerken Feb 20 '14 at 15:25
  • possible duplicate of [Accesing file in Mapper through Distributed Cache](http://stackoverflow.com/questions/21882583/accesing-file-in-mapper-through-distributed-cache) – vefthym Feb 20 '14 at 15:32
  • Let's say I have a file A and a new file B, which has the same structure as A and contains updated versions of some of A's records, keyed on the 1st column. My idea is to put the old file in the cache so that every mapper has it locally, and compare it with the new one (which is split among the mappers) so that I can filter out the updated records. But I have no idea how to do this in Pig. Please help – Pooja3101 Feb 20 '14 at 15:40
  • Let's just say I want to add a file to the distributed cache in Pig and read from it. How can I do that? – Pooja3101 Feb 20 '14 at 15:48

1 Answer


Use the following in your Pig script:

set mapred.cache.files /new_file_location/new_file.txt#new_file.txt

This ships the file to the node where each mapper runs; the part after # is the symlink name for the local copy.
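
For the comparison described in the comments (old file A, new file B, matched on the 1st column), it may not be necessary to manage the cache by hand. A minimal sketch, assuming A is small enough to fit in memory: Pig's fragment-replicate join ships the small relation to every mapper and joins on the map side. The paths, the comma delimiter, and the two-column schema below are hypothetical placeholders, not taken from the question.

```pig
-- Hypothetical paths and schema; adjust to the real files.
old_recs = LOAD '/data/A.txt' USING PigStorage(',') AS (id:chararray, val:chararray);
new_recs = LOAD '/data/B.txt' USING PigStorage(',') AS (id:chararray, val:chararray);

-- 'replicated' performs a map-side join. The large relation goes first;
-- the relation listed last (old_recs) must fit in memory on each mapper.
joined = JOIN new_recs BY id LEFT OUTER, old_recs BY id USING 'replicated';

-- Keep rows of B that have no match in A, or whose value differs from A's.
changed = FILTER joined BY old_recs::id IS NULL OR new_recs::val != old_recs::val;

result = FOREACH changed GENERATE new_recs::id AS id, new_recs::val AS val;
STORE result INTO '/data/changed';
```

If the old file is too large to hold in memory, this approach does not apply and a regular join is the safer choice.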

jeff