Questions tagged [apache-pig]

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization which enables them to handle very large data sets.

Pig runs in two execution modes: local mode and MapReduce mode. Pig scripts can be run in two ways: interactive mode and batch mode.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs for which large-scale parallel implementations already exist (e.g. the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin which has the following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
  • Optimization opportunities. The declarative way in which tasks are encoded permits the system to optimize their execution plan automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.
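
As a rough illustration of the "data flow sequence" style described above, here is a minimal word-count sketch in Pig Latin. The input path `input.txt` and output directory `wordcount_out` are hypothetical names chosen for the example; the script assumes a plain-text input with one line per record.

```pig
-- Hypothetical input: a plain-text file, one line of text per record.
lines   = LOAD 'input.txt' AS (line:chararray);

-- Split each line into words and flatten the resulting bag into rows.
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group identical words together, then count the rows in each group.
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;

-- Hypothetical output directory; it must not already exist.
STORE counts INTO 'wordcount_out';
```

Each statement names an intermediate relation, so the script reads as a sequence of transformations. A script like this can typically be tried in local mode with `pig -x local`, or run against a cluster with `pig -x mapreduce`.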

5199 questions
1
vote
0 answers

How to determine number of dynamic partitions in Hive

I was executing an insert statement for a table that was partitioned and bucketed but during the run, it threw an error about the number of dynamic partitions -- namely, that there were not enough. So, I set as follows: set…
Anoop Mamgain
  • 187
  • 2
  • 3
  • 13
1
vote
1 answer

pig regex to extract data between tags

My text file (input): City,Description Chicago,One day car rental is $90 Dallas,One day car rental is $65 Output needed: City Costofrental Chicago, $90 Dallas, $65 I am using regex extract to get the cost ($) details but not…
piyer96
  • 11
  • 1
  • 5
1
vote
0 answers

parameter substitution PIG relation

I am facing a lot of difficulties trying to load certain directories and process them. The idea is that I want to process all unprocessed files. To do so, I store my process timestamp in HDFS every time I finish processing. That way it'll…
kenlz
  • 461
  • 7
  • 22
1
vote
1 answer

How To Pass CommandLine Argument Using MapReduce Native in Pig

I am invoking a MapReduce job from Apache Pig using NativeMapReduce (https://wiki.apache.org/pig/NativeMapReduce). My question is how to pass it arguments like on the command line. E.g., if I had a MapReduce class whose driver I am invoking from command…
Argho Chatterjee
  • 579
  • 2
  • 9
  • 26
1
vote
1 answer

How do I find the diff between two bags of pairs (A,B) where neither A nor B are unique?

Suppose I have a bag NEW that contains many pairs (A, B): Pair 1: { "A" : { "long" : someInteger1 }, "B" : { "int" : someInteger2 } } Pair 2: { "A" : { "long" : someInteger3 }, "B" : { "int" : someInteger4 } } ...... I have another bag OLD, which…
Brian Schmitz
  • 1,023
  • 1
  • 10
  • 19
1
vote
1 answer

PIG not reading file from hdfs when running from pig script

I am trying to load a file from HDFS using a Pig script: data = LOAD '/user/Z013W7X/typeahead/time_decayed_clickdata.tsv' using PigStorage('\t') as (keyword :chararray , search_count: double, clicks: double, cartadds: double); the path mentioned…
1
vote
0 answers

apache pig using mapreduce java getting exception

I am using pig-0.15 and hadoop 2.6. While connecting to HDFS using apache pig through mapreduce, I get the following exception: Exception in thread "main" java.lang.RuntimeException: Failed to create DataStorage public static void main(String[]…
kartik
  • 71
  • 1
  • 8
1
vote
0 answers

Resolving arrays of tuples in Pig

I am trying to transform ({(A1),(A2)},{(A1-002),(A2-046)},{(124,323)}) into: (A1,A1-002,124) (A1,A1-002,323) (A2,A2-046,124) (A2,A2-046,323) So that for each of the third elements, the first two elements are paired up in order. I originally…
1
vote
1 answer

want to aggregate the values from two files that are already parsed xml file using pig

The first file contains the following: cl_id date TM c_id c_val 10201 2015-4-15 01:00:00 56707065 0 10201 2015-4-15 01:00:00 56707066 1 10201 2015-4-15 01:00:00 56707067 200. Likewise there are multiple cl_id and for…
Deepak Patil
  • 99
  • 1
  • 11
1
vote
2 answers

Getting Name Value JSON in PIG

Hi guys, I just started using Pig and was wondering whether JsonLoader is capable of parsing every value inside a JSON object. For example: {"food":"Tacos", "person":"Alice", "amount":3}, and I need to get "food" stored as a relation in chararray and "Tacos" which…
kenlz
  • 461
  • 7
  • 22
1
vote
1 answer

pig script to sample 10 chunks of training data, pig script is jammed

BACKGROUND I have a binary classification task where the data is highly imbalanced. Specifically, there is far more data with label 0 than with label 1. To solve this problem, I plan to subsample the data with label 0 to roughly match…
xuan
  • 270
  • 1
  • 2
  • 15
1
vote
1 answer

Invoke pig with oozie - org.apache.pig.Main exit code [2]

I am trying to invoke a Pig action in Oozie and I am working with the following: Oozie v3.3.2, Pig v0.12.1-mapr, Hadoop v1.0.3, MapR M5. I am able to invoke a Java action using Oozie as of now. However, when I try to invoke a Pig action, it's failing…
thisdotnull
  • 812
  • 1
  • 7
  • 20
1
vote
1 answer

deploy Python pip package on Hadoop?

Write a Python UDF for Hadoop/Pig, and need to use some Python libraries like "request" which I installed locally by pip when doing local box UDF testing. Wondering how to deploy the pip package on Hadoop cluster so that no matter my Python UDF runs…
Lin Ma
  • 9,739
  • 32
  • 105
  • 175
1
vote
1 answer

java.lang.ClassCastException: java.lang.Boolean cannot be cast to org.apache.pig.data.Tuple

I am getting the following error while running a Pig script. My script runs fine in the Grunt shell; I only get this error when running it through 'time pig'. Pig version: Apache Pig version 0.11.0-cdh4.6.0 java.lang.ClassCastException:…
1
vote
1 answer

How to convert NaN values into zeroes in Pig

I am trying to convert NaN into zeroes using Pig scripting as below, but I keep getting an error message. Can someone share thoughts on how to handle NaNs in Pig? Any insights would be appreciated. Thank you. My input field xyz::abcd has…
Teja
  • 13,214
  • 36
  • 93
  • 155