Questions tagged [apache-pig]

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Pig runs in two execution modes: local mode and MapReduce mode. Pig scripts can be run in two ways: interactively or in batch mode.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs for which large-scale parallel implementations already exist (e.g. the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin which has the following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
  • Optimization opportunities. The declarative way in which tasks are encoded permits the system to optimize their execution plan automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.
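As a rough illustration of the extensibility point, here is a minimal sketch of a user-defined function written in Python. The file name `my_udfs.py`, the function `split_pair`, and the schema field names are all hypothetical. The `try`/`except` fallback is an assumption added only so the file can be imported and tested outside Pig; it mimics the decorator Pig supplies by attaching the schema string to the function.

```python
# Hypothetical file name: my_udfs.py
try:
    # Provided by Pig when the script runs via streaming_python (CPython).
    from pig_util import outputSchema
except ImportError:
    # Assumed stand-in for local testing outside Pig: it only records the
    # schema string on the function and does no validation.
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("pair:tuple(left:double, right:double)")
def split_pair(text):
    """Split a string such as '96.5@0.25' into a tuple of two floats.

    Returns None when the input is missing, has no '@' separator,
    or either half is not numeric.
    """
    if text is None:
        return None
    left, sep, right = text.partition("@")
    if not sep:
        return None
    try:
        return (float(left), float(right))
    except ValueError:
        return None
```

In a Pig script, a file like this would typically be made available with a `REGISTER 'my_udfs.py' USING streaming_python AS my_udfs;` statement (or `USING jython`) and then called as `my_udfs.split_pair(field)` inside a `FOREACH ... GENERATE`. The exact registration details depend on the Pig version and cluster setup.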

5199 questions
8
votes
1 answer

Exception in type casting chararray to double in PIG

I have a sample input as tab separated key, value pair as follows B_1001@2012-06-15 96.73429163933419@0.5511284347710459 B_1001@2012-06-18 187.4348199976547@0.5544551559243536 B_1002@2012-09-26 …
sudheer
  • 338
  • 1
  • 6
  • 17
8
votes
3 answers

How to flatten a group into a single tuple in Pig?

From this: (1, {(1,2), (1,3), (1,4)} ) (2, {(2,5), (2,6), (2,7)} ) ...How could we generate this? ((1,2),(1,3),(1,4)) ((2,5),(2,6),(2,7)) ...And how could we generate this? (1, 2, 3, 4) (2, 5, 6, 7) For a single row I know how to do. The problem…
user2730009
  • 117
  • 1
  • 1
  • 6
8
votes
2 answers

Transform bag of key-value tuples to map in Apache Pig

I am new to Pig and I want to convert a bag of tuples to a map with specific value in each tuple as key. Basically I want to change: {(id1, value1),(id2, value2), ...} into [id1#value1, id2#value2] I've been looking around online for a while, but I…
Penguinator
  • 641
  • 3
  • 11
  • 22
8
votes
1 answer

How to : Python UDF dictionary return schema in PIG

What is the output schema to return a dictionary from Python UDF while using Apache PIG. I have a dictionary of dictionaries, something like this: dict = {x:{a:1,b:2,c:3}, y:{d:1,e:3,f:9}} and my output schema looks…
8
votes
2 answers

Difference between PIG local and mapreduce mode

What is the actual difference between running PIG scripts locally and on mapreduce? I understand mapreduce mode is when you run it on a cluster that has hdfs installed. Does this mean local mode does not need HDFS and so even mapreduce jobs don't…
London guy
  • 27,522
  • 44
  • 121
  • 179
8
votes
2 answers

Storing results of UNION in PIG in a single file

I have a PIG script which produces four results, and I want to store all of them in a single file. I tried using UNION, however when I use UNION I get four files: part-m-00000, part-m-00001, part-m-00002, part-m-00003. Can't I get a single file? Here is…
Uno
  • 533
  • 10
  • 24
8
votes
4 answers

Generating all fields from an alias after a JOIN in Pig

I would like to perform the equivalent of "keep all a in A where a.field == b.field for some b in B" in Apache Pig. I am implementing it like so, AB_joined = JOIN A by field, B by field; A2 = FOREACH AB_joined GENERATE A::field as field, A::field2…
duckworthd
  • 14,679
  • 16
  • 53
  • 68
8
votes
1 answer

Projecting Grouped Tuples in Pig

I have a collection of tuples of the form (t,a,b) that I want to group by b in Pig. Once grouped, I want to filter out b from the tuples in each group and generate a bag of filtered tuples per group. As an example, assume we…
Chris
  • 3,109
  • 7
  • 29
  • 39
7
votes
3 answers

How do I make Hadoop find imported Python modules when using Python UDFs in Pig?

I am using Pig (0.9.1) with UDFs written in Python. The Python scripts import modules from the standard Python library. I have been able to run the Pig scripts that call the Python UDFs successfully in local mode, but when I run on the cluster it…
Ben Lever
  • 2,023
  • 7
  • 26
  • 34
7
votes
1 answer

Filter a string on the basis of a word

I have a Pig job in which I need to filter the data by finding a word in it. Here is the snippet: A = LOAD '/home/user/filename' USING PigStorage(','); B = FOREACH A GENERATE $27,$38; C = FILTER B BY ( $1 == '*Word*'); STORE C INTO '/home/user/out1'…
learner
  • 885
  • 3
  • 14
  • 28
7
votes
2 answers

Apache Pig permissions issue

I'm attempting to get Apache Pig up and running on my Hadoop cluster, and am encountering a permissions problem. Pig itself is launching and connecting to the cluster just fine- from within the Pig shell, I can ls through and around my HDFS…
Steven Bedrick
  • 663
  • 2
  • 8
  • 16
7
votes
1 answer

AWS EMR import external library from S3

I have setup a cluster using Amazon EMR. I have a python library (cloned from github and not available on pip) on S3. I want to submit a pig work that uses a udf which makes use of the library present in S3. I don't want to add the library to the…
7
votes
6 answers

Reference manual for Apache Pig Latin

Pig is a dataflow programming environment for processing very large files. Pig's language is called Pig Latin. Does anyone know of a good reference manual for PigLatin? I'm looking for something that includes all the syntax and commands descriptions…
Ori lahav
  • 125
  • 1
  • 2
7
votes
5 answers

How to change Tez job name when running query in HIVE

When I submit a Hive SQL using Tez like below: hive (default)> select count(*) from simple_data; In Resource Manager UI the job name shows something like HIVE-9d1906a2-25dd-4a7c-9ea3-bf651036c7eb Is there a way to change the job name…
khussain
  • 133
  • 2
  • 8
7
votes
2 answers

How can I debug a pig script

If a simple GROUP BY script in Pig, running over terabytes of data, gets stuck at, say, 70%, what can be done to diagnose the problem?
Manish
  • 186
  • 2
  • 2
  • 8