Questions tagged [apache-pig]

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization which enables them to handle very large data sets.

Pig runs in two execution modes: local mode and MapReduce mode. Pig scripts can be run in two ways: interactive mode (the Grunt shell) and batch mode (a script file).

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs for which large-scale parallel implementations already exist (e.g. the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin which has the following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, which makes them easy to write and understand.
  • Optimization opportunities. The declarative way in which tasks are encoded permits the system to optimize their execution plan automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.
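
As a rough illustration of the data flow style described above, here is a minimal Pig Latin sketch. The input path, output path, and field names are hypothetical and chosen only for the example; the script assumes a tab-separated input file with three columns.

```pig
-- Hypothetical input: tab-separated lines of (user, url, bytes).
-- The paths and field names below are assumptions, not part of any real data set.
logs = LOAD 'input/access_log.tsv' USING PigStorage('\t')
       AS (user:chararray, url:chararray, bytes:long);

-- Keep only the larger requests.
big = FILTER logs BY bytes > 1024L;

-- Group by user, then count requests and sum bytes per group.
by_user = GROUP big BY user;
totals  = FOREACH by_user GENERATE
              group          AS user,
              COUNT(big)     AS hits,
              SUM(big.bytes) AS total_bytes;

-- Write the result as comma-separated text.
STORE totals INTO 'output/user_totals' USING PigStorage(',');
```

Saved to a file (say `totals.pig`, a hypothetical name), the script can be run in batch mode with `pig -x local totals.pig` or `pig -x mapreduce totals.pig`. The same statements can also be typed one at a time at the `grunt>` prompt in interactive mode.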

5199 questions
13
votes
1 answer

Load only particular field in PIG?

This is my file: Col1, Col2, Col3, Col4, Col5 I need only Col2 and Col3. Currently I'm doing this: a = load 'input' as (Col1:chararray, Col2:chararray, Col3:chararray, …
ComputerFellow
  • 11,710
  • 12
  • 50
  • 61
13
votes
6 answers

How to perform a DISTINCT in Pig Latin on a subset of columns?

I would like to perform a DISTINCT operation on a subset of the columns. The documentation says this is possible with a nested foreach: You cannot use DISTINCT on a subset of fields; to do this, use FOREACH and a nested block to first select the…
Freerobots
  • 772
  • 1
  • 6
  • 20
13
votes
3 answers

Calculate Average using PIG

I am new to Pig and want to calculate the average of a single column of data that looks like 0 10.1 20.1 30 40 50 60 70 80.1 I wrote this pig script dividends = load 'myfile.txt' as (A); dump dividends grouped = group dividends by A; avg = foreach…
user1792899
12
votes
1 answer

In spark join, does table order matter like in pig?

Related to Spark - Joining 2 PairRDD elements When doing a regular join in pig, the last table in the join is not brought into memory but streamed through instead, so if A has small cardinality per key and B large cardinality, it is significantly…
ihadanny
  • 4,377
  • 7
  • 45
  • 76
12
votes
4 answers

How do I suppress the bloat of useless information when using the DUMP command while using grunt via 'pig -x local'?

I'm working with Pig Latin, using grunt, and every time I 'dump' stuff, my console gets clobbered with blah blah, blah non-info. Is there a way to suppress all that? grunt> A = LOAD 'testingData' USING PigStorage(':'); dump A; 2013-05-06…
Matt S.
  • 878
  • 10
  • 21
11
votes
1 answer

using PIG to load a file

I am very new to PIG and I am having what feels like a very basic problem. I have a line of code that reads: A = load 'Sites/trial_clustering/shortdocs/*' AS (word1:chararray, word2:chararray, word3:chararray, word4:chararray); where each…
YuliaPro
  • 305
  • 1
  • 7
  • 16
11
votes
2 answers

strsplit issue - Pig

I have the following tuple H1 and I want to STRSPLIT its $0 into a tuple. However, I always get an error message: DUMP H1: (item32;item31;,1) m = FOREACH H1 GENERATE STRSPLIT($0, ";", 50); ERROR 1000: Error during parsing. Lexical error at line 1,…
ohana
  • 285
  • 1
  • 5
  • 20
11
votes
5 answers

A way to export the results from Pig to a database

Is there a way to export the results from Pig directly to a database like mysql?
Christoph
  • 1,113
  • 5
  • 17
  • 35
11
votes
3 answers

ERROR 1066: Unable to open iterator for alias in Pig, Generic solution

A very common error message in Apache Pig is: ERROR 1066: Unable to open iterator for alias There are several questions where this error is mentioned, but none of them give a generic approach for dealing with it. Hence this question: What to do…
Dennis Jaheruddin
  • 21,208
  • 8
  • 66
  • 122
11
votes
2 answers

Counting elements for each group using Pig

I'm trying to group and count the frequency of terms for each group in Pig Latin, but I'm having some trouble figuring out how to do it. I have a collection of objects with the following schema: {cluster_id: bytearray,terms: chararray} And…
Arian Pasquali
  • 432
  • 2
  • 6
  • 17
11
votes
1 answer

Hadoop, Hive, Pig, HBase, Cassandra - when to use what?

First of all I am relatively new to Big Data and the Hadoop world and I have just started to experiment a little with the Hortonworks Sandbox (Pig and Hive so far). I was wondering in which cases could I use the above mentioned tools of Hadoop,…
Daniel
  • 2,409
  • 2
  • 26
  • 42
11
votes
6 answers

Error in pig while loading data

I am using ubuntu 12.02 32bit and have installed hadoop2.2.0 and pig 0.12 successfully. Hadoop runs properly on my system. However, whenever I run this command : data = load 'atoz.csv' using PigStorage(',') as (aa1:int, bb1:int, cc1:int,…
Hardik Barot
  • 775
  • 1
  • 4
  • 17
11
votes
7 answers

GUI for using Hadoop

Is there an easy way to use Hadoop other than with the command line? Which tools are you using and which one is the best?
Rami
  • 111
  • 1
  • 1
  • 4
11
votes
2 answers

Still getting "Unable to load realm info from SCDynamicStore" after bug fix

I installed Hadoop and Pig using brew install hadoop and brew install pig. I read here that you will get the Unable to load realm info from SCDynamicStore error message unless you add: export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK…
FilmiHero
  • 2,306
  • 7
  • 31
  • 46
11
votes
2 answers

Export from pig to CSV

I'm having a lot of trouble getting data out of pig and into a CSV that I can use in Excel or SQL (or R or SPSS etc etc) without a lot of manipulation ... I've tried using the following function: STORE pig_object INTO…
Saxivore
  • 123
  • 1
  • 1
  • 7