Questions tagged [apache-pig]

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Pig runs in two execution modes: local mode and MapReduce mode. Pig Latin statements can be run in two ways: interactive mode and batch mode.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:

Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, which makes them easy to write and understand.
Optimization opportunities. The declarative way in which tasks are encoded permits the system to optimize their execution plan automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility. Users can create their own functions to do special-purpose processing.
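
As a rough illustration of the extensibility point, a user-defined function can be written in Python and run by Pig through Jython. The following is a minimal sketch, not a definitive implementation: the file name `my_udfs.py` and the function `to_upper` are hypothetical, and the fallback definition of `outputSchema` is an assumption added only so that the file can also be run with plain Python outside Pig.

```python
# my_udfs.py (hypothetical file name, used here for illustration only)

# When Pig runs this file through its Jython engine, an `outputSchema`
# decorator is made available to the script. The fallback below is an
# assumption so that the same file can be executed with plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            # Record the schema string on the function for inspection.
            func.output_schema = schema
            return func
        return decorator


@outputSchema("upper:chararray")
def to_upper(text):
    """Return the upper-cased input, or None when the field is null."""
    if text is None:
        return None
    return text.upper()
```

In a Pig script, a file like this would typically be registered with `REGISTER 'my_udfs.py' USING jython AS my_udfs;` and then called as `my_udfs.to_upper(name)` inside a `FOREACH ... GENERATE` statement, where `name` stands for whatever field the script loads.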

5199 questions

votes

1 answer

Exception in type casting chararray to double in Pig

My sample input consists of tab-separated key-value pairs, as follows: B_1001@2012-06-15 96.73429163933419@0.5511284347710459 B_1001@2012-06-18 187.4348199976547@0.5544551559243536 B_1002@2012-09-26 …

hadoop mapreduce apache-pig

asked Sep 26 '13 at 14:26

sudheer

votes

3 answers

How to flatten a group into a single tuple in Pig?

From this: (1, {(1,2), (1,3), (1,4)} ) (2, {(2,5), (2,6), (2,7)} ) ... how could we generate this? ((1,2),(1,3),(1,4)) ((2,5),(2,6),(2,7)) ... And how could we generate this? (1, 2, 3, 4) (2, 5, 6, 7) I know how to do it for a single row. The problem…

hadoop apache-pig

asked Aug 31 '13 at 04:48

user2730009

votes

2 answers

Transform bag of key-value tuples to map in Apache Pig

I am new to Pig and I want to convert a bag of tuples to a map, using a specific value in each tuple as the key. Basically I want to change: {(id1, value1),(id2, value2), ...} into [id1#value1, id2#value2] I've been looking around online for a while, but I…

dictionary apache-pig

asked Jul 25 '13 at 02:23

Penguinator

votes

1 answer

How to: Python UDF dictionary return schema in Pig

What is the output schema for returning a dictionary from a Python UDF in Apache Pig? I have a dictionary of dictionaries, something like this: dict = {x:{a:1,b:2,c:3}, y:{d:1,e:3,f:9}} and my output schema looks…

python dictionary schema user-defined-functions apache-pig

asked Nov 12 '12 at 19:55

user1620334

votes

2 answers

Difference between Pig local and MapReduce mode

What is the actual difference between running Pig scripts locally and on MapReduce? I understand MapReduce mode is when you run it on a cluster that has HDFS installed. Does this mean local mode does not need HDFS and so even MapReduce jobs don't…

hadoop mapreduce hdfs apache-pig

asked Jul 26 '12 at 12:33

London guy

votes

2 answers

Storing results of UNION in Pig in a single file

I have a Pig script that produces four results, and I want to store all of them in a single file. I tried using UNION, but when I use UNION I get four files: part-m-00000, part-m-00001, part-m-00002, part-m-00003. Can't I get a single file? Here is…

hadoop apache-pig hdfs

asked Jun 08 '12 at 19:20

Uno

votes

4 answers

Generating all fields from an alias after a JOIN in Pig

I would like to perform the equivalent of "keep all a in A where a.field == b.field for some b in B" in Apache Pig. I am implementing it like so, AB_joined = JOIN A by field, B by field; A2 = FOREACH AB_joined GENERATE A::field as field, A::field2…

hadoop apache-pig

asked May 30 '12 at 23:23

duckworthd

votes

1 answer

Projecting Grouped Tuples in Pig

I have a collection of tuples of the form (t,a,b) that I want to group by b in Pig. Once grouped, I want to filter out b from the tuples in each group and generate a bag of filtered tuples per group. As an example, assume we…

apache-pig

asked May 29 '12 at 23:39

Chris

votes

3 answers

How do I make Hadoop find imported Python modules when using Python UDFs in Pig?

I am using Pig (0.9.1) with UDFs written in Python. The Python scripts import modules from the standard Python library. I have been able to run the Pig scripts that call the Python UDFs successfully in local mode, but when I run on the cluster it…

python hadoop jython apache-pig

asked Oct 20 '11 at 05:47

Ben Lever

votes

1 answer

Filter a string on the basis of a word

I have a Pig job in which I need to filter the data by finding a word in it. Here is the snippet: A = LOAD '/home/user/filename' USING PigStorage(','); B = FOREACH A GENERATE $27,$38; C = FILTER B BY ( $1 == '*Word*'); STORE C INTO '/home/user/out1'…

hadoop apache-pig

asked Sep 16 '11 at 13:58

learner

votes

2 answers

Apache Pig permissions issue

I'm attempting to get Apache Pig up and running on my Hadoop cluster, and am encountering a permissions problem. Pig itself is launching and connecting to the cluster just fine: from within the Pig shell, I can ls through and around my HDFS…

permissions hadoop apache-pig hdfs

asked Aug 25 '11 at 16:38

Steven Bedrick

votes

1 answer

AWS EMR import external library from S3

I have set up a cluster using Amazon EMR. I have a Python library (cloned from GitHub and not available on pip) on S3. I want to submit a Pig job that uses a UDF which makes use of the library stored on S3. I don't want to add the library to the…

python amazon-web-services amazon-s3 apache-pig amazon-emr

asked Aug 07 '16 at 02:42

Madhavan Malolan

votes

6 answers

Reference manual for Apache Pig Latin

Pig is a dataflow programming environment for processing very large files. Pig's language is called Pig Latin. Does anyone know of a good reference manual for Pig Latin? I'm looking for something that includes all the syntax and command descriptions…

apache-pig dataflow manual

asked Dec 15 '08 at 13:57

Ori lahav

votes

5 answers

How to change Tez job name when running query in Hive

When I submit a Hive SQL query using Tez like below: hive (default)> select count(*) from simple_data; in the Resource Manager UI the job name shows something like HIVE-9d1906a2-25dd-4a7c-9ea3-bf651036c7eb. Is there a way to change the job name…

hadoop hive apache-pig

asked Oct 29 '15 at 19:14

khussain

votes

2 answers

How can I debug a Pig script

If a simple GROUP BY script in Pig, run over terabytes of data, gets stuck at, say, 70%, what can be done to diagnose the problem?

hadoop apache-pig bigdata

asked May 12 '15 at 18:14

Manish
