
I know there are a lot of resources on using the distributed cache in Pig scripts with Java UDFs, but I haven't found anything that explains the same for Python UDFs. I also have not found any detailed explanation of distributed cache usage when writing Pig scripts.

I am not asking a specific question; I just want there to be a place where people like me can get their first Pig + Python + distributed cache example working. I am sorry if I am unknowingly asking the wrong question, but I will be very thankful for the help.

Thanks, r0ger22

Roger
  • Pig streaming functions are what you should search for. See http://arnon.me/2013/03/herding-apache-pig-pig-perl-python/ and https://wiki.apache.org/pig/PigStreamingFunctionalSpec – vijay kumar Jul 30 '15 at 12:45
  • Pig streaming functions are a way of launching Pig jobs like Hadoop Streaming. The SHIP feature in Pig will ship your Python code to all nodes (like the distributed cache). – vijay kumar Jul 30 '15 at 12:53
  • @ramisetty.vijay Ok, I got that. What I am looking for is (and perhaps I should have mentioned this in the question specifically; apologies for that): I have a config file which I must load into memory. Right now, what I am doing is keeping this file on my HDFS cluster, LOADing it in my Pig job, and then joining it whenever needed. But ideally I would want to not keep it on HDFS and instead SHIP it to all nodes (just like the Python code is shipped). Basically, I am having trouble shipping the config files. – Roger Jul 30 '15 at 21:26
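A minimal sketch of the shipping approach discussed in the comments above. The file names (udf.py, config.txt) and the key=value config format are assumptions for illustration; the SHIP behavior (shipped files land in each task's working directory) is what the comments describe.

```python
#!/usr/bin/env python
"""Streaming UDF sketch: Pig's SHIP clause copies the listed files into
each task's working directory, so a shipped config file can be opened by
its bare name. The file name (config.txt) and its key=value format are
assumptions for this example, not part of Pig itself."""
import os
import sys

def load_config(path="config.txt"):
    """Parse key=value pairs from the shipped config file into a dict,
    skipping blank lines and # comments."""
    config = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                config[key.strip()] = value.strip()
    return config

def main():
    config = load_config()
    # Pig streams tab-separated tuples on stdin; echo them back with a
    # hypothetical lookup applied to the first field.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        fields[0] = config.get(fields[0], fields[0])
        print("\t".join(fields))

# In a Pig task the shipped config.txt is always present in the cwd;
# the existence check just makes the script safe to run elsewhere.
if __name__ == "__main__" and os.path.exists("config.txt"):
    main()
```

On the Pig side, the streaming command would be defined with the SHIP clause listing both files, along the lines of (syntax per the Pig streaming spec linked above):

DEFINE my_stream `python udf.py` SHIP('udf.py', 'config.txt');
B = STREAM A THROUGH my_stream;

This way the config file travels with the job instead of being LOADed from HDFS and joined.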

0 Answers