Questions tagged [apache-pig]

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Pig runs in two execution modes: local mode and MapReduce mode. Pig scripts can be run in two ways: interactively (through the Grunt shell) or in batch mode (from a script file).

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs for which large-scale parallel implementations already exist (e.g. the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin which has the following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, which makes them easy to write and understand.
  • Optimization opportunities. The declarative way in which tasks are encoded permits the system to optimize their execution plan automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.
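As a rough illustration of the "data flow sequence" style described above, here is a minimal Pig Latin sketch. The file name `sales.tsv`, its two-column layout, the script name `example.pig`, and all alias names are assumptions made up for this example; they do not come from any question on this page.

```pig
-- Hypothetical input: a tab-separated file with two columns (region, amount).
-- Local mode (single JVM, local filesystem):  pig -x local example.pig
-- MapReduce mode (Hadoop cluster):            pig -x mapreduce example.pig

-- Step 1: load the raw rows and give the fields names and types.
sales = LOAD 'sales.tsv' USING PigStorage('\t')
        AS (region:chararray, amount:double);

-- Step 2: keep only the rows of interest.
big = FILTER sales BY amount > 100.0;

-- Step 3: collect the remaining rows into one bag per region.
by_region = GROUP big BY region;

-- Step 4: reduce each bag to a single total with the built-in SUM.
totals = FOREACH by_region GENERATE group AS region, SUM(big.amount) AS total;

-- Step 5: print the result; in a batch job, STORE would usually replace DUMP.
DUMP totals;
```

Each statement names an intermediate relation that the next statement consumes, so the script reads as a pipeline. The same statements can also be typed one at a time into the interactive shell instead of being saved to a file.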

5199 questions
7 votes, 2 answers

Hadoop Pig - Removing csv header

My csv files have header in the first line. Loading them into pig create a mess on any subsequent functions (like SUM). As of today I first apply a filter on the loaded data to remove the rows containing the headers : affaires = load…
Romain Jouin
  • 4,448
  • 3
  • 49
  • 79
7 votes, 1 answer

Pig 0.13 ERROR 2998: Unhandled internal error. org/apache/hadoop/mapreduce/task/JobContextImpl

Just installed Pig 0.13 and I am attempting to use it with Hadoop 1.1.2. (Pig documentation states Pig 0.13 is compatible with Hadoop 1.1.2). Per the Pig install instructions, I set $PIG_CLASSPATH to point at /etc/hadoop where core-site.xml,…
Jon Firuz
  • 115
  • 7
7 votes, 2 answers

Usage of Apache Pig rank function

Am using Pig 0.11.0 rank function and generating ranks for every id in my data. I need ranking of my data in a particular way. I want the rank to reset and start from 1 for every new ID. Is it possible to use the rank function directly for the…
Yash Sharma
  • 1,674
  • 2
  • 16
  • 23
7 votes, 2 answers

Apache Sqoop/Pig Consistent Data Representation/Processing

In our organization, we have been trying to use hadoop ecosystem based tools to implement ETLs lately. Although the ecosystem itself is quite big, we are using only a very limited set of tools at the moment. Our typical pipeline flow is as…
srikrishna
  • 238
  • 3
  • 11
7 votes, 1 answer

what is the distinction between an 'outer bag' and an 'inner bag' in pigLatin?

the manual/documentation uses the language of 'inner bag' and 'outer bag' extensively (say: http://pig.apache.org/docs/r0.11.1/basic.html ), and yet I haven't been able to pin out clearly the precise definition separating the terms. e.g. all…
Matt S.
  • 878
  • 10
  • 21
7 votes, 3 answers

Junit External Resource @Rule Order

I want to use multiple external resources in my test class, but I have a problem with ordering of external resources. Here is code snippet : public class TestPigExternalResource { // hadoop external resource, this should start first …
Salih Kardan
  • 579
  • 1
  • 6
  • 16
7 votes, 2 answers

pig to hadoop issue: Server IPC version 7 cannot communicate with client version 4

I am trying to get pig started and failing: $ pig 2013-05-10 18:03:22,972 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53 2013-05-10 18:03:22,972 [main] INFO org.apache.pig.Main - Logging…
barclay
  • 4,362
  • 9
  • 48
  • 68
7 votes, 1 answer

Exit pig shell command safely

When I enter some erroneous command in a Pig interactive shell environment, it enters into listening mode (>>) like below. How do I safely come out of this command, but still stay in the pig shell environment? Ctrl + C takes me out of the pig shell…
Sid
  • 217
  • 2
  • 11
7 votes, 4 answers

How can I add a header row to files created from Pig (Hadoop)?

I'm writing a pig latin script similar to the following: A = load 'data' using PigStorage('\t'); store A into my_data using PigStorage(); This outputs (Bob, 10, 4.0) (Jim, 11, 3.25) (Paul, 9, 2.75) I'd like to add a first header row to each file…
Ryan Guest
  • 6,080
  • 2
  • 33
  • 39
7 votes, 2 answers

Calculate count of distinct values of a field using pig script

For a file of the form A B user1 C D user2 A D user3 A D user1 I want to calculate the count of distinct values of field 3 i.e. count(distinct(user1, user2,user2,user1)) = 3 I am doing this using the following pig script A = load 'myTestData'…
Netra M
  • 71
  • 1
  • 1
  • 2
7 votes, 2 answers

Apache Pig: Load a file that shows fine using hadoop fs -text

I have files that are named part-r-000[0-9][0-9] and that contain tab separated fields. I can view them using hadoop fs -text part-r-00000 but can't get them loaded using pig. What I've tried: x = load 'part-r-00000'; dump x; x = load 'part-r-00000'…
exic
  • 2,220
  • 1
  • 22
  • 29
7 votes, 6 answers

Installing PIG on single node

I installed Hadoop (1.0.2) for a single node on Windows 7 with Cygwin, and it is working. However, I cannot get PIG (0.10.0) to see the Hadoop. 1) "Error: JAVA_HOME is not set." I added this line to pig (under bin): export…
cuneyt
  • 336
  • 5
  • 15
7 votes, 2 answers

Debugging in PIG UDF

I am new to Hadoop/PIG. I have a basic question. Do we have a Logging facility in PIG UDF? I have written a UDF which I need to verify I need to log certain statements to check the flow. Is there a Logging facility available? If yes where are the…
Uno
  • 533
  • 10
  • 24
6 votes, 1 answer

How to compute sum of a field in all the rows from an alias

What I want to do is to sum values of a field in all rows in an alias. This must be simple but somehow I can't find the answer. This is probably because what I want is a scalar value while PIG handles datasets? I guess I can create a row with a…
kee
  • 10,969
  • 24
  • 107
  • 168
6 votes, 1 answer

How can I partition a table with HIVE?

I've been playing with Hive for few days now but I still have a hard time with partition. I've been recording Apache logs (Combine format) in Hadoop for few months. They are stored in row text format, partitioned by date (via…
zzarbi
  • 1,832
  • 3
  • 15
  • 29