Questions tagged [apache-pig]

Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Pig runs in two execution modes: Local mode and MapReduce mode. Pig scripts can be run in two ways: Interactive mode (the Grunt shell) and Batch mode (a script file).

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs for which large-scale parallel implementations already exist (e.g. the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin which has the following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, which makes them easy to write and understand.
  • Optimization opportunities. The declarative way in which tasks are encoded permits the system to optimize their execution plan automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.
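
As a rough illustration of the "data flow sequence" style described above, here is a minimal Pig Latin sketch. The input path `visits.tsv`, its two-column layout, and all alias names are hypothetical and chosen only for this example; only built-in operators and functions (`LOAD`, `PigStorage`, `GROUP`, `FOREACH ... GENERATE`, `COUNT`, `ORDER`, `DUMP`) are used.

```pig
-- Hypothetical input: tab-separated lines of (user, url).
visits  = LOAD 'visits.tsv' USING PigStorage('\t')
          AS (user:chararray, url:chararray);

-- Each statement names one step of the data flow.
by_url  = GROUP visits BY url;
counts  = FOREACH by_url GENERATE group AS url, COUNT(visits) AS hits;
ordered = ORDER counts BY hits DESC;

-- Nothing executes until an output statement such as DUMP or STORE.
DUMP ordered;
```

Saved to a file, a script like this could be run in batch mode in either execution mode, for example with `pig -x local` for local mode; the same statements can also be typed one at a time in the interactive Grunt shell.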

5199 questions
1
vote
1 answer

How to self-join two bags?

I have some set of numbers that describes connections between the first set of integers and the second set of integers. For example: 1,2 3,4 5,6 5,7 6,8 I then load my data as follows, and group it: data = load 'data.csv' as integer_1,…
orange1
  • 2,871
  • 3
  • 32
  • 58
1
vote
1 answer

Reading a CSV File in Pig

I am using a Cloudera CDH3 cluster in pseudo-distributed mode. In CDH3 the Pig version is 0.8. I would like to read a CSV or Excel file using a Pig script. I downloaded piggybank-0.11.0.jar and put it in the /home/cloudera/ directory. My CSV file looks like this: id …
Surender Raja
  • 3,553
  • 8
  • 44
  • 80
1
vote
0 answers

Pig script join multiple keys with or

Hi, I have been trying to run this piece of Pig script: meddata = join meddata by (icddiag1 or icddiag2) left outer, codes by code; The intent is to join the meddata file if icddiag1 matches codes or if icddiag2 matches codes. I know I can…
Ramji
  • 11
  • 2
1
vote
1 answer

Removal Double Quote(") from CSV file using PIG

I am trying to remove double quotes (") from a file. Some of the fields have data like "Newyork,NY". Please advise me what to do. I have tried to delete (") from the CSV, but it does not work. The step-by-step code is given below: I am opening pig using pig -x…
dipayan
  • 72
  • 9
1
vote
1 answer

Pig - Remove embedded newlines and commas in gzip files

I have a gzip file with data field separated by commas. I am currently using PigStorage to load the file as shown below: A = load 'myfile.gz' USING PigStorage(',') AS (id,date,text); The data in the gzip file has embedded characters - embedded…
activelearner
  • 7,055
  • 20
  • 53
  • 94
1
vote
1 answer

Pig UDF in java :Error ---ERROR 1066: Unable to open iterator for alias

I am new to Pig. My input data is (message,NIL,2015-07-01,22:58:53.66,E,machine.com.name,12,0xd6,String,String ,0,0.0,key=value&key=123456789&key=value&key=US&key=COMPANY&key=MESSAGE&key=123456789&key=String&key=String&Key=String&Key=String) I…
Divya
  • 95
  • 1
  • 9
1
vote
0 answers

PIG latin-- Special Character handling - Å -- capital A with ring above

I want to check whether a column contains any of a set of special characters and then do further evaluation. I was able to do it for most of the special characters, but not for one particular character. Has anyone come across and handled this special character, Å? I even used…
mercuryman
  • 11
  • 3
1
vote
0 answers

Extending Apache PigStorage

I am working with data where entries can be split across two lines. I would like to extend PigStorage to support multiline entries but have questions about how that is done. Can I override PigStorage in a standard Java UDF, or do I have to modify…
kira_codes
  • 1,457
  • 13
  • 38
1
vote
1 answer

PIG : Columns into rows

I have a file that contains this: id_v^id_f^id_s1,id_s2,id_s3,id_s4 id_v1^id_f1^id_s2,id_s3,id_s4 id_v2^id_f2^id_s2,id_s1,id_s4 This file is a "^"-delimited CSV. I want to normalise it this way using Pig…
Firas kh
  • 13
  • 3
1
vote
2 answers

Pig Nested Example

I have written the following Pig script. How can I make this a nested one? input= LOAD '/path/to/input/data' USING PigStorage('\t') AS (id:chararray,category:chararray); grp= GROUP input BY category; grp_count= FOREACH grp generate group,…
biswadeep
  • 11
  • 2
1
vote
1 answer

JOIN condition in PIG Latin

SQL: SELECT m.x,m.y,n.a,n.b from mydata1 m,mydata2 n WHERE m.x=n.a AND m.y>= n.y Pig: A = LOAD 'mydata1' AS (x: int, y: datetime); B = LOAD 'mydata2' AS (a: int, b: datetime); I now need to join both the tables using the above SQL condition. How…
user131990
  • 85
  • 1
  • 7
1
vote
1 answer

Java dependencies in Pig UDF

I wrote a UDF which uses Joda Time. I included it as a dependency in pom.xml. When I run my Pig script I get the error: ERROR 2998: Unhandled internal error. org.joda.time.LocalDate.parse(Ljava/lang/String;)Lorg/joda/time/LocalDate; I am pretty new…
ManuelSchneid3r
  • 15,850
  • 12
  • 65
  • 103
1
vote
1 answer

Speed up Hive or Pig aggregation by using pre-sorted data

I want to speed up a simple Apache Hive (0.13.1) or Pig (version 0.12.0) aggregation job on Amazon EMR. My data is already sorted on the key that needs to be aggregated and I want the jobs to make use of that. Hive: [..some 'set' calls…
Daniel Naber
  • 1,594
  • 12
  • 19
1
vote
2 answers

Java wrong major version

On our Hadoop cluster my Pig UDF fails, complaining: [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1069: Problem resolving class version numbers for class I read writing a udf in pig kind of like tutorial and the problem seems…
ManuelSchneid3r
  • 15,850
  • 12
  • 65
  • 103
1
vote
1 answer

Hive not detecting timestamp format

I have a Pig script that: loads and transforms the data from a CSV; replaces some characters; calls a Java program (JAR) to convert the date-time in the CSV from 06/02/2015 18:52 to 2015-6-2 18:52 (mm/DD/yyyy to yyyy-MM-dd). REGISTER…