Questions tagged [apache-pig]

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization which enables them to handle very large data sets.

Pig runs in two execution modes: local mode and MapReduce mode. Pig scripts can be run in two ways: interactively (interactive mode) or from a script file (batch mode).

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs for which large-scale parallel implementations already exist (e.g. the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin which has the following key properties:

Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, which makes them easy to write and understand.
Optimization opportunities. The declarative way in which tasks are encoded permits the system to optimize their execution plan automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility. Users can create their own functions to do special-purpose processing.
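
As a rough illustration of the data flow style described above, here is a minimal Pig Latin sketch. The input path, output path, and field names are hypothetical and chosen only for this example; it assumes a tab-separated input file with two columns.

```pig
-- Hypothetical input: a tab-separated file with (id, category) rows.
records = LOAD 'input/data.tsv' USING PigStorage('\t')
          AS (id:chararray, category:chararray);

-- Each statement names one step of the data flow.
by_category = GROUP records BY category;

-- COUNT is a built-in function; 'group' holds the grouping key.
counts = FOREACH by_category GENERATE group AS category, COUNT(records) AS n;

-- Hypothetical output directory.
STORE counts INTO 'output/category_counts';
```

Assuming the script is saved as category_counts.pig (a made-up name), `pig -x local category_counts.pig` would run it in batch mode against the local file system, and `pig -x mapreduce category_counts.pig` would run it on a Hadoop cluster. Typing the same statements at the Grunt prompt corresponds to interactive mode.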

5199 questions

1 answer

How to self-join two bags?

I have some set of numbers that describes connections between the first set of integers and the second set of integers. For example: 1,2 3,4 5,6 5,7 6,8 I then load my data as follows, and group it: data = load 'data.csv' as integer_1,…

mapreduce apache-pig

asked Jul 16 '15 at 19:49

orange1

1 answer

Reading a CSV File in Pig

I am using a Cloudera CDH3 pseudo-distributed mode cluster. In CDH3 the Pig version is 0.8. I would like to read a CSV or Excel file using a Pig script. I downloaded piggybank-0.11.0.jar and put it in the /home/cloudera/ directory. My CSV file looks like this: id …

apache-pig

asked Jul 16 '15 at 17:18

Surender Raja

0 answers

Pig script join multiple keys with or

Hi, I have been trying to run this piece of Pig script: meddata = join meddata by (icddiag1 or icddiag2) left outer, codes by code; The intent is to join the meddata file if icddiag1 matches codes or if icddiag2 matches codes. I know I can…

join apache-pig

asked Jul 16 '15 at 03:56

Ramji

1 answer

Removing double quotes (") from a CSV file using Pig

I am trying to remove double quotes (") from a file. Some of the fields have data like "Newyork,NY". Please advise me what to do. I have tried to delete (") from the CSV, but it is not working. The step-by-step code is given below: I am opening pig using pig -x…

hadoop apache-pig

asked Jul 14 '15 at 14:24

dipayan

1 answer

Pig - Remove embedded newlines and commas in gzip files

I have a gzip file with data field separated by commas. I am currently using PigStorage to load the file as shown below: A = load 'myfile.gz' USING PigStorage(',') AS (id,date,text); The data in the gzip file has embedded characters - embedded…

regex csv apache-pig

asked Jul 13 '15 at 21:52

activelearner

1 answer

Pig UDF in Java: ERROR 1066: Unable to open iterator for alias

I am new to Pig. My input data is (message,NIL,2015-07-01,22:58:53.66,E,machine.com.name,12,0xd6,String,String ,0,0.0,key=value&key=123456789&key=value&key=US&key=COMPANY&key=MESSAGE&key=123456789&key=String&key=String&Key=String&Key=String) I…

java runtime-error apache-pig udf

asked Jul 10 '15 at 09:36

Divya

0 answers

PIG latin-- Special Character handling - Å -- capital A with ring above

I want to check whether a column contains any of a set of special characters and then do further evaluation. I was able to do this for most of the special characters, but not for one in particular. Has anyone come across and handled the special character Å? I even used…

regex hadoop apache-pig

asked Jul 10 '15 at 06:23

mercuryman

0 answers

Extending Apache PigStorage

I am working with data where entries can be split across two lines. I would like to extend PigStorage to support multiline entries but have questions about how that is done. Can I override PigStorage in a standard Java UDF, or do I have to modify…

java hadoop apache-pig user-defined-functions

asked Jul 07 '15 at 18:22

kira_codes

1 answer

PIG : Columns into rows

I have a file that contains this: id_v^id_f^id_s1,id_s2,id_s3,id_s4 id_v1^id_f1^id_s2,id_s3,id_s4 id_v2^id_f2^id_s2,id_s1,id_s4 This file is a "^"-delimited CSV. I want to normalise it this way using Pig…

apache-pig cloudera-cdh

asked Jul 07 '15 at 15:00

Firas kh

2 answers

Pig Nested Example

I have written the following Pig script. How can I make it a nested one? input= LOAD '/path/to/input/data' USING PigStorage('\t') AS (id:chararray,category:chararray); grp= GROUP input BY category; grp_count= FOREACH grp generate group,…

apache-pig

asked Jul 01 '15 at 17:42

biswadeep

1 answer

JOIN condition in PIG Latin

SQL: SELECT m.x,m.y,n.a,n.b from mydata1 m,mydata2 n WHERE m.x=n.a AND m.y>= n.y Pig: A = LOAD 'mydata1' AS (x: int, y: datetime); B = LOAD 'mydata2' AS (a: int, b: datetime); I now need to join both tables using the above SQL condition. How…

hadoop apache-pig

asked Jul 01 '15 at 11:15

user131990

1 answer

Java dependencies in Pig UDF

I wrote a UDF which uses Joda Time. I included it as a dependency in pom.xml. When I run my pig script I get the error ERROR 2998: Unhandled internal error. org.joda.time.LocalDate.parse(Ljava/lang/String;)Lorg/joda/time/LocalDate; I am pretty new…

java hadoop apache-pig dependency-management udf

asked Jun 27 '15 at 18:54

ManuelSchneid3r

1 answer

Speed up Hive or Pig aggregation by using pre-sorted data

I want to speed up a simple Apache Hive (0.13.1) or Pig (version 0.12.0) aggregation job on Amazon EMR. My data is already sorted on the key that needs to be aggregated and I want the jobs to make use of that. Hive: [..some 'set' calls…

hive apache-pig emr

asked Jun 27 '15 at 15:26

Daniel Naber

2 answers

Java wrong major version

On our Hadoop cluster my Pig UDF fails, complaining: [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1069: Problem resolving class version numbers for class. I read a "writing a UDF in Pig" kind of tutorial and the problem seems…

java hadoop apache-pig

asked Jun 26 '15 at 23:48

ManuelSchneid3r

1 answer

Hive not detecting timestamp format

I have a Pig script that: loads and transforms the data from a CSV; replaces some characters; calls a Java program (JAR) to convert the date-time in the CSV from 06/02/2015 18:52 to 2015-6-2 18:52 (mm/DD/yyyy to yyyy-MM-dd). REGISTER…

date hadoop hive apache-pig cloudera

asked Jun 25 '15 at 13:17

Santosh Sulibhavi
