Questions tagged [apache-pig]

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization which enables them to handle very large data sets.

Pig runs in two execution modes: local mode and MapReduce mode. Pig scripts can be run in two ways: interactively (from the Grunt shell) or in batch mode (from a script file).

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs for which large-scale parallel implementations already exist (e.g. the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin which has the following key properties:

Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, which makes them easy to write and understand.
Optimization opportunities. The declarative way in which tasks are encoded permits the system to optimize their execution plan automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility. Users can create their own functions to do special-purpose processing.
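
As a rough illustration of the "data flow sequence" style described above, here is a minimal word-count sketch in Pig Latin. The file names `input.txt`, `wordcount_out`, and `wordcount.pig` are hypothetical placeholders, and the script is only an example under those assumptions, not an excerpt from the Pig documentation.

```pig
-- wordcount.pig: a minimal sketch; all paths are hypothetical placeholders.
-- Batch mode, local execution:      pig -x local wordcount.pig
-- Batch mode, MapReduce execution:  pig -x mapreduce wordcount.pig
-- Interactive mode: start the Grunt shell with `pig -x local` and type the statements.

-- Load each line of the input as a single chararray field.
lines = LOAD 'input.txt' USING TextLoader() AS (line:chararray);

-- Split each line into words and flatten the resulting bag into one word per row.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group identical words together, then count the rows in each group.
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;

-- Write the (word, count) pairs to the output directory.
STORE counts INTO 'wordcount_out';
```

Each statement names an intermediate relation and feeds it to the next step, which is what leaves Pig free to plan and parallelize the underlying jobs.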

5199 questions

votes

2 answers

Hadoop Pig - Removing csv header

My CSV files have a header in the first line. Loading them into Pig creates a mess in any subsequent functions (like SUM). As of today I first apply a filter on the loaded data to remove the rows containing the headers: affaires = load…

csv hadoop apache-pig

asked Mar 29 '15 at 22:24

Romain Jouin

votes

1 answer

Pig 0.13 ERROR 2998: Unhandled internal error. org/apache/hadoop/mapreduce/task/JobContextImpl

Just installed Pig 0.13 and I am attempting to use it with Hadoop 1.1.2. (Pig documentation states Pig 0.13 is compatible with Hadoop 1.1.2). Per the Pig install instructions, I set $PIG_CLASSPATH to point at /etc/hadoop where core-site.xml,…

hadoop apache-pig

asked Aug 03 '14 at 18:23

Jon Firuz

votes

2 answers

Usage of Apache Pig rank function

I am using the Pig 0.11.0 rank function and generating ranks for every id in my data. I need ranking of my data in a particular way. I want the rank to reset and start from 1 for every new ID. Is it possible to use the rank function directly for the…

apache-pig

asked Apr 10 '14 at 11:42

Yash Sharma

votes

2 answers

Apache Sqoop/Pig Consistent Data Representation/Processing

In our organization, we have been trying to use hadoop ecosystem based tools to implement ETLs lately. Although the ecosystem itself is quite big, we are using only a very limited set of tools at the moment. Our typical pipeline flow is as…

apache-pig sqoop

asked Feb 25 '14 at 23:30

srikrishna

votes

1 answer

what is the distinction between an 'outer bag' and an 'inner bag' in pigLatin?

The manual/documentation uses the language of 'inner bag' and 'outer bag' extensively (say: http://pig.apache.org/docs/r0.11.1/basic.html ), and yet I haven't been able to pin down clearly the precise definition separating the terms. e.g. all…

apache-pig

asked Oct 08 '13 at 01:27

Matt S.

votes

3 answers

Junit External Resource @Rule Order

I want to use multiple external resources in my test class, but I have a problem with the ordering of external resources. Here is a code snippet: public class TestPigExternalResource { // hadoop external resource, this should start first …

java hadoop junit apache-pig rule

asked Oct 04 '13 at 07:11

Salih Kardan

votes

2 answers

pig to hadoop issue: Server IPC version 7 cannot communicate with client version 4

I am trying to get pig started and failing: $ pig 2013-05-10 18:03:22,972 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53 2013-05-10 18:03:22,972 [main] INFO org.apache.pig.Main - Logging…

hadoop apache-pig

asked May 10 '13 at 22:15

barclay

votes

1 answer

Exit pig shell command safely

When I enter some erroneous command in a Pig interactive shell environment, it enters into listening mode (>>) like below. How do I safely come out of this command, but still stay in the pig shell environment? Ctrl + C takes me out of the pig shell…

hadoop apache-pig

asked Mar 17 '13 at 03:02

Sid

votes

4 answers

How can I add a header row to files created from Pig (Hadoop)?

I'm writing a Pig Latin script similar to the following: A = load 'data' using PigStorage('\t'); store A into 'my_data' using PigStorage(); This outputs (Bob, 10, 4.0) (Jim, 11, 3.25) (Paul, 9, 2.75) I'd like to add a first header row to each file…

hadoop apache-pig

asked Jan 07 '13 at 21:24

Ryan Guest

votes

2 answers

Calculate count of distinct values of a field using pig script

For a file of the form A B user1 C D user2 A D user3 A D user1 I want to calculate the count of distinct values of field 3, i.e. count(distinct(user1, user2, user3, user1)) = 3. I am doing this using the following pig script A = load 'myTestData'…

hadoop apache-pig

asked Oct 15 '12 at 11:25

Netra M

votes

2 answers

Apache Pig: Load a file that shows fine using hadoop fs -text

I have files that are named part-r-000[0-9][0-9] and that contain tab separated fields. I can view them using hadoop fs -text part-r-00000 but can't get them loaded using pig. What I've tried: x = load 'part-r-00000'; dump x; x = load 'part-r-00000'…

linux hadoop apache-pig cloudera

asked Sep 05 '12 at 17:34

exic

votes

6 answers

Installing PIG on single node

I installed Hadoop (1.0.2) for a single node on Windows 7 with Cygwin, and it is working. However, I cannot get PIG (0.10.0) to see the Hadoop. 1) "Error: JAVA_HOME is not set." I added this line to pig (under bin): export…

hadoop apache-pig

asked Jul 13 '12 at 11:46

cuneyt

votes

2 answers

Debugging in PIG UDF

I am new to Hadoop/PIG. I have a basic question. Do we have a logging facility in PIG UDF? I have written a UDF which I need to verify, so I need to log certain statements to check the flow. Is there a logging facility available? If yes where are the…

hadoop apache-pig hdfs

asked Jun 12 '12 at 21:17

Uno

votes

1 answer

How to compute sum of a field in all the rows from an alias

What I want to do is to sum values of a field in all rows in an alias. This must be simple but somehow I can't find the answer. This is probably because what I want is a scalar value while PIG handles datasets? I guess I can create a row with a…

hadoop apache-pig

asked Mar 27 '12 at 22:37

kee

votes

1 answer

How can I partition a table with HIVE?

I've been playing with Hive for a few days now but I still have a hard time with partitions. I've been recording Apache logs (Combined format) in Hadoop for a few months. They are stored in raw text format, partitioned by date (via…

hadoop mapreduce hive apache-pig

asked Mar 08 '12 at 23:36

zzarbi
