Questions tagged [apache-pig]

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization which enables them to handle very large data sets.

Pig runs in two execution modes: local mode and MapReduce mode. Pig Latin statements can be run in two ways: interactive mode and batch mode.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs for which large-scale parallel implementations already exist (e.g. the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin which has the following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write and understand.
  • Optimization opportunities. The declarative way in which tasks are encoded permits the system to optimize their execution plan automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.

5199 questions
6
votes
3 answers

What Are the Pros and Cons of Running a Job in Hadoop Using Various Languages?

So far I've used either Pig or Java MapReduce exclusively for running jobs against a Hadoop cluster. I recently tried Python MapReduce through Hadoop streaming, and that was pretty cool as well. All of these make sense…
Eli
  • 36,793
  • 40
  • 144
  • 207
6
votes
1 answer

Declaring a variable and schema in PIG

How do I declare a variable in Pig? Suppose I want an integer with the value 10: how can I declare it in a script? And how can a schema be reused?
Chhaya Vishwakarma
  • 1,407
  • 9
  • 44
  • 72
6
votes
2 answers

generating an id/counter for foreach in pig latin

I want some sort of unique identifier/line_number/counter to be generated/appended in my foreach construct while it iterates through the records. Is there a way to accomplish this without writing a UDF? B = foreach A generate a_unique_id,…
pranay
  • 444
  • 1
  • 7
  • 20
6
votes
3 answers

How Can I Load Every File In a Folder Using PIG?

I have a folder of files created daily that all store the same type of information. I'd like to make a script that loads the newest 10 of them, UNIONs them, and then runs some other code on them. Since pig already has an ls method, I was wondering…
Eli
  • 36,793
  • 40
  • 144
  • 207
6
votes
1 answer

How to Get Pig to Work with lzo Files?

So, I've seen a couple of tutorials for this online, but each seems to say to do something different. Also, none of them seems to specify whether you're trying to get things to work on a remote cluster, or to locally interact with a remote…
Eli
  • 36,793
  • 40
  • 144
  • 207
6
votes
1 answer

Running Pig query over data stored in Hive

I would like to know how to run Pig queries over data stored in Hive format. I have configured Hive to store compressed data (using this tutorial http://wiki.apache.org/hadoop/Hive/CompressedStorage). Before that I used to just use normal Pig load function…
wlk
  • 5,695
  • 6
  • 54
  • 72
6
votes
2 answers

How do you deal with empty or missing input files in Apache Pig?

Our workflow uses an AWS Elastic MapReduce cluster to run a series of Pig jobs to manipulate a large amount of data into aggregated reports. Unfortunately, the input data is potentially inconsistent, and can result in either no input files or 0 byte…
Chris Phillips
  • 11,607
  • 3
  • 34
  • 45
6
votes
1 answer

pig is not visible inside hue

I have a Hadoop cluster. Pig is installed, but the Pig editor is not visible inside Hue (3.7). How can I fix it?
rom
  • 3,592
  • 7
  • 41
  • 71
6
votes
3 answers

Convert "3" to 3 with PigLatin

I read in a CSV file that contains fields with numbers like this: "3". Can I convert these fields from "3" to 3 with Pig Latin? I need this in order to use the SUM() function. Thanks for your help!
Christoph
  • 1,113
  • 5
  • 17
  • 35
6
votes
2 answers

Pig: is it possible to use pytz or dateutils for Python udfs?

I am using datetime in some Python UDFs that I use in my Pig script. So far so good. I use pig 12.0 on Cloudera 5.5. However, I also need to use the pytz or dateutil packages, and they don't seem to be part of a vanilla Python install. Can I…
ℕʘʘḆḽḘ
  • 18,566
  • 34
  • 128
  • 235
6
votes
5 answers

Is there a canonical problem that provably can't be aided with map/reduce?

I'm trying to understand the boundaries of hadoop and map/reduce and it would help to know a non-trivial problem, or class of problems, that we know map/reduce can't assist in. It certainly would be interesting if changing one factor of the problem…
Steven Noble
  • 10,204
  • 13
  • 45
  • 57
6
votes
3 answers

Regexp matching in pig

Using apache pig and the text hahahah. my brother just didnt do anything wrong. He cheated on a test? no way! I'm trying to match "my brother just didnt do anything wrong." Ideally, I'd want to match anything beginning with "my brother just" and…
Neil Kodner
  • 2,901
  • 3
  • 27
  • 36
6
votes
3 answers

Getting an error on running HCatalog

A = LOAD 'eventnew.txt' USING HCatalogLoader(); 2015-07-08 19:56:34,875 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve HCatalogLoader using imports: [, java.lang., org.apache.pig.builtin.,…
HeadBanger'
  • 63
  • 1
  • 3
6
votes
1 answer

Does throwing an exception in an EvalFunc pig UDF skip just that line, or stop completely?

I have a User Defined Function (UDF) written in Java to parse lines in a log file and return information back to pig, so it can do all the processing. It looks something like this: public abstract class Foo extends EvalFunc { public Foo()…
Daniel Huckstep
  • 5,368
  • 10
  • 40
  • 56
6
votes
2 answers

HIVE Creating Table not null

This is my query in a DB2 database: CREATE TABLE MY_TABLE (COD_SOC CHAR(5) NOT NULL); Is it possible to reproduce the 'NOT NULL' in Hive? What about Pig?
Edge7
  • 681
  • 1
  • 15
  • 35