Questions tagged [apache-pig]

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Pig runs in two execution modes: local mode and MapReduce mode. Pig scripts can be run in two ways: interactive mode and batch mode.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs for which large-scale parallel implementations already exist (e.g. the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin which has the following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, which makes them easy to write and understand.
  • Optimization opportunities. The declarative way in which tasks are encoded permits the system to optimize their execution plan automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.

5199 questions
18
votes
4 answers

Computing median in map reduce

Can someone explain the computation of median/quantiles in map reduce? My understanding of Datafu's median is that the 'n' mappers sort the data and send the data to "1" reducer which is responsible for sorting all the data from n mappers and…
learner
  • 885
  • 3
  • 14
  • 28
17
votes
3 answers

select count distinct using pig latin

I need help with this pig script. I am just getting a single record. I am selecting 2 columns and doing a count(distinct) on another while also using a where like clause to find a particular description (desc). Here's my sql with pig I am trying to…
jdamae
  • 3,839
  • 16
  • 58
  • 78
17
votes
4 answers

Connection Error in Apache Pig

I am running Apache Pig .11.1 with Hadoop 2.0.5. Most simple jobs that I run in Pig work perfectly fine. However, whenever I try to use GROUP BY on a large dataset, or the LIMIT operator, I get these connection errors: 2013-07-29 13:24:08,591 [main]…
Andy Botelho
  • 741
  • 1
  • 7
  • 14
17
votes
2 answers

Pig: Get top n values per group

I have data that's already grouped and aggregated, it looks like so: user value count ---- -------- ------ Alice third 5 Alice first 11 Alice second 10 Alice fourth 2 ... Bob second 20 Bob third …
Hoff
  • 38,776
  • 17
  • 74
  • 99
17
votes
2 answers

How to force STORE (overwrite) to HDFS in Pig?

When developing Pig scripts that use the STORE command I have to delete the output directory for every run or the script stops and offers: 2012-06-19 19:22:49,680 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6000: Output Location Validation…
valid
  • 1,858
  • 1
  • 18
  • 28
16
votes
5 answers

Skipping the header while loading the text file using Piglatin

I have a text file and its first row contains the header. Now I want to do some operation on the data, but while loading the file using PigStorage it takes the HEADER too. I just want to skip the HEADER. Is it possible to do so (directly or through…
Pawan Kumar
  • 522
  • 4
  • 9
  • 29
16
votes
5 answers

Is there any Conditional IF like operator in Apache PIG?

I am writing a Pig script and want to execute a set of statements if one of the conditions is satisfied. I have set a variable and am checking its value. Suppose if flag==0 then A = LOAD 'file' using PigStorage() as…
Bhavesh Shah
  • 3,299
  • 11
  • 49
  • 73
15
votes
3 answers

How to use Cassandra's Map Reduce with or w/o Pig?

Can someone explain how MapReduce works with Cassandra .6? I've read through the word count example, but I don't quite follow what's happening on the Cassandra end vs. the "client"…
Brent
  • 23,354
  • 10
  • 44
  • 49
15
votes
1 answer

Find if a string is present inside another string in Pig

I want to find if a string contains another string in Pig. I found that there is a built-in index function, but it only searches for characters not strings. Is there any other alternative?
Sudar
  • 18,954
  • 30
  • 85
  • 131
14
votes
2 answers

STORE output to a single CSV?

Currently, when I STORE into HDFS, it creates many part files. Is there any way to store out to a single CSV file?
JasonA
  • 314
  • 2
  • 4
  • 11
14
votes
4 answers

How can I incorporate the current input filename into my Pig Latin script?

I am processing data from a set of files which contain a date stamp as part of the filename. The data within the file does not contain the date stamp. I would like to process the filename and add it to one of the data structures within the script.…
Kevin Fink
  • 151
  • 1
  • 1
  • 7
14
votes
6 answers

What is the best Pig plugin for Eclipse?

I'm about to start playing around with PIG-latin, and I was hoping to get some text highlighting and such for it in Eclipse. Doing a quick Google search, I saw a couple of Eclipse plugins for it. Are they all still in development? Which is the best?
Eli
  • 36,793
  • 40
  • 144
  • 207
14
votes
4 answers

Filtering null values with pig

It looks like a silly problem, but I can't find a way to filter null values from my rows. This is the result when I dump the object geoinfo: DUMP geoinfo; ([longitude#70.95853,latitude#30.9773]) ([longitude#-9.37944507,latitude#38.91780853]) …
Arian Pasquali
  • 432
  • 2
  • 6
  • 17
14
votes
2 answers

Define tuple datas in the pig script

I am currently debugging a Pig script. I'd like to define a tuple in the Pig file directly (instead of using the basic "Load" function). Is there a way to do it? I am looking for something like this: A= ('name#bob'','age#29';'name#paul','age#12') The…
romain-nio
  • 1,183
  • 9
  • 25
13
votes
1 answer

Join vs COGROUP in PIG

Are there any advantages (with respect to performance or the number of MapReduce jobs) when I use COGROUP instead of JOIN in Pig? http://developer.yahoo.com/hadoop/tutorial/module6.html talks about the difference in the type of output they produce. But, ignoring the…
raj
  • 3,769
  • 4
  • 25
  • 43