Questions tagged [apache-pig]

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which enables them to handle very large data sets.

Pig runs in two execution modes: local mode and MapReduce mode. Pig scripts can be written in two modes: interactive mode and batch mode.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs for which large-scale parallel implementations already exist (e.g. the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin which has the following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks consisting of multiple interrelated data transformations are explicitly encoded as data-flow sequences, making them easy to write and understand.
  • Optimization opportunities. The declarative way in which tasks are encoded permits the system to optimize their execution plan automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.
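A minimal Pig Latin sketch illustrating these properties (the input path 'input.txt' is hypothetical):

```pig
-- load raw lines, split each into words, and count occurrences of each word
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
DUMP counts;
```

Each statement is one step in a data flow; the system plans and parallelizes the underlying MapReduce jobs automatically.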

5199 questions
6 votes · 2 answers

Cassandra's MapReduce support

I recently ran into a case where Cassandra fits in perfectly to store time-based events with custom TTLs per event type (the other solution would be to save it in Hadoop and do the bookkeeping manually (TTLs and stuff, IMHO a very complex idea) or…
Tobias
6 votes · 3 answers

Can I pass parameters to UDFs in Pig script?

I am relatively new to Pig script. I would like to know if there is a way of passing parameters to Java UDFs in Pig. Here is the scenario: I have a log file which has different columns (each representing a primary key in another table). My task is…
emkay
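One common approach (a sketch, not necessarily the accepted answer): Pig UDFs can receive parameters through their constructor, bound with DEFINE. The class name and paths here are hypothetical:

```pig
-- pass 'lookup_table' as a constructor argument to the UDF
DEFINE MyLookup com.example.udf.MyLookupUDF('lookup_table');

logs   = LOAD 'log.txt' USING PigStorage('\t') AS (id:chararray, value:chararray);
result = FOREACH logs GENERATE MyLookup(id), value;
```

The UDF's Java constructor receives the string argument and can store it for use in exec().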
6 votes · 4 answers

I have an Errno 13 Permission denied with subprocess in python

The line with the issue is ret=subprocess.call(shlex.split(cmd)) cmd = /usr/share/java -cp pig-hadoop-conf-Simpsons:lib/pig-0.8.1-cdh3u1-core.jar:lib/hadoop-core-0.20.2-cdh3u1.jar org.apache.pig.Main -param func=cat -param from =foo.txt -x…
wDroter
6 votes · 4 answers

Finding mean using Pig or Hadoop

I have a huge text file of the form (data is saved in directory data/ as data1.txt, data2.txt and so on): merchant_id, user_id, amount 1234, 9123, 299.2 1233, 9199, 203.2 1234, 0124, 230 and so on… What I want to do is, for each merchant, find the average…
frazman
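A sketch of the usual Pig approach to a per-group mean (schema and path inferred from the excerpt):

```pig
-- group by merchant and average the amount field with the built-in AVG
data = LOAD 'data' USING PigStorage(',')
       AS (merchant_id:chararray, user_id:chararray, amount:double);
grpd = GROUP data BY merchant_id;
avgs = FOREACH grpd GENERATE group AS merchant_id, AVG(data.amount) AS avg_amount;
DUMP avgs;
```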
6 votes · 1 answer

Group key value of map in pig

I am new to Pig script. Say we have a file [a#1,b#2,c#3] [a#4,b#5,c#6] [a#7,b#8,c#9] pig script A = LOAD 'txt' AS (in: map[]); B = FOREACH A GENERATE in#'a'; DUMP B; We know that we can retrieve the values by supplying the key. In the above example I…
Logan
6 votes · 1 answer

Pig: apply a FOREACH operator to each element within a bag

Example: I have a relation "class", with a nested bag of students: class: {teacher_name: chararray,students: {(firstname: chararray, lastname: chararray)} I want to perform an operation on each student, while leaving the global structure…
Zorglub
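Since Pig 0.10, a FOREACH may be nested inside another FOREACH's block, which transforms each element of the bag while leaving the outer structure intact (a sketch under that version assumption, using the built-in UPPER):

```pig
-- uppercase each student's first name without flattening the students bag
out = FOREACH class {
    renamed = FOREACH students GENERATE UPPER(firstname) AS firstname, lastname;
    GENERATE teacher_name, renamed AS students;
}
```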
6 votes · 1 answer

A way to read table data from Mysql to Pig

Everyone knows that Pig supports DBStorage, but it only supports storing results from Pig into MySQL, like this: STORE data INTO DBStorage('com.mysql.jdbc.Driver', 'dbc:mysql://host/db', 'INSERT ...'); But please show me the way to read…
phuongdo
6 votes · 3 answers

Flatten tuple like a bag

My dataset looks like the following: ( A, (1,2) ) ( B, (2,9) ) I would like to "flatten" the tuples in Pig, basically repeating each record for each value found in the inner-tuple, such that the expected output is: ( A, 1 ) ( A, 2 ) ( B, 2 ) ( B,…
syker
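One way to do this (a sketch): convert the inner tuple to a bag with the built-in TOBAG, then FLATTEN the bag so each element becomes its own record:

```pig
-- t:(a,b) becomes a bag {(a),(b)}, which FLATTEN expands into rows
data = LOAD 'input' AS (id:chararray, t:(a:int, b:int));
flat = FOREACH data GENERATE id, FLATTEN(TOBAG(t.a, t.b)) AS v;
DUMP flat;
-- yields (A,1), (A,2), (B,2), (B,9) for the sample input
```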
5 votes · 2 answers

How to store grouped records into multiple files with Pig?

After loading and grouping records, how can I store those grouped records into several files, one per group (=userid)? records = LOAD 'input' AS (userid:int, ...); grouped_records = GROUP records BY userid; I'm using Apache Pig version 0.8.1-cdh3u3…
thomers
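Piggybank's MultiStorage writes one output subdirectory per distinct value of a chosen field, which matches this use case (a sketch; paths and the second field are hypothetical, since the real schema is elided):

```pig
REGISTER piggybank.jar;

records = LOAD 'input' AS (userid:int, value:chararray);
-- '0' is the index of the split field (userid); each distinct userid
-- gets its own subdirectory under 'output'
STORE records INTO 'output'
    USING org.apache.pig.piggybank.storage.MultiStorage('output', '0');
```

Note that no explicit GROUP is needed; MultiStorage partitions the records itself.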
5 votes · 8 answers

How can I add row numbers for rows in PIG or HIVE?

I have a problem when adding row numbers using Apache Pig. The problem is that I have a STR_ID column and I want to add a ROW_NUM column for the data in STR_ID, which is the row number of the STR_ID. For example, here is the…
Breakinen
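In Pig 0.11 and later, the RANK operator prepends a row number to each tuple (a sketch under that version assumption; in Hive, the ROW_NUMBER() window function serves the same purpose):

```pig
-- RANK prepends a rank_data column containing 1, 2, 3, ...
data   = LOAD 'input' AS (str_id:chararray);
ranked = RANK data;
DUMP ranked;
```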
5 votes · 0 answers

Write data that can be read by ProtobufPigLoader from Elephant Bird

For a project of mine, I want to analyse around 2 TB of Protobuf objects. I want to consume these objects in a Pig script via the "elephant bird" library. However, it is not totally clear to me how to write a file to HDFS so that it can be consumed…
dmeister
5 votes · 1 answer

Pig Order By Query

grunt> dump jn; (k1,k4,10) (k1,k5,15) (k2,k4,9) (k3,k4,16) grunt> jn = group jn by $1; grunt> dump jn; (k4,{(k1,k4,10),(k2,k4,9),(k3,k4,16)}) (k5,{(k1,k5,15)}) Now, from here I want the following output…
simplfuzz
5 votes · 1 answer

File formats that can be read using Pig

What kind of file formats can be read using Pig? How can I store them in different formats? Say we have a CSV file and I want to store it as an MXL file; how can this be done? Whenever we use the STORE command, it makes a directory and stores the file as…
Chhaya Vishwakarma
5 votes · 1 answer

Using Distributed Cache with Pig on Elastic Map Reduce

I am trying to run my Pig script (which uses UDFs) on Amazon's Elastic Map Reduce. I need to use some static files from within my UDFs. I do something like this in my UDF: public class MyUDF extends EvalFunc { public DataBag exec(Tuple…
Vivek Pandey
5 votes · 2 answers

Using Pig/Hive for data processing instead of direct java map reduce code?

(Even more basic than Difference between Pig and Hive? Why have both?) I have a data processing pipeline written in several Java map-reduce tasks over Hadoop (my own custom code, derived from Hadoop's Mapper and Reducer). It's a series of basic…
ihadanny