Questions tagged [apache-pig]

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Pig runs in two execution modes: local mode and MapReduce mode. Pig Latin statements can be run in two ways: interactively, or in batch mode from a script file.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs for which large-scale parallel implementations already exist (e.g. the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin which has the following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, which makes them easy to write and understand.
  • Optimization opportunities. The declarative way in which tasks are encoded permits the system to optimize their execution plan automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.
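
The "data flow sequence" idea can be sketched outside Pig. The example below is plain Python, not Pig Latin, and only mimics the shape of a script in which each step takes a relation and produces a new one. The sample records, field layout, and function names are assumptions made for illustration; the comments name the Pig Latin operators each step loosely corresponds to.

```python
from collections import defaultdict

# Hypothetical sample records: (user, score, city).
records = [
    ("alice", 7, "LA"),
    ("alice", 9, "NYC"),
    ("bob", 3, "NYC"),
    ("bob", 4, "DC"),
]


def filter_step(rows, min_score):
    """Keep rows whose score is at least min_score (loosely, FILTER ... BY)."""
    return [row for row in rows if row[1] >= min_score]


def group_step(rows):
    """Collect rows under their user key (loosely, GROUP ... BY)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return dict(groups)


def aggregate_step(groups):
    """Compute the highest score per user (loosely, FOREACH ... GENERATE MAX)."""
    return {
        user: max(score for _, score, _ in rows)
        for user, rows in groups.items()
    }


# Each statement names an intermediate result, as a Pig Latin script would.
filtered = filter_step(records, min_score=4)
grouped = group_step(filtered)
max_scores = aggregate_step(grouped)
print(max_scores)
```

Because every step is a separate named transformation, an engine such as Pig is free to reorder or combine steps before running them, which is the optimization opportunity described above.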

5199 questions
1
vote
1 answer

Left outer join on more than 2 relations at a time in PIG

I am trying to perform a left outer join on more than 2 relations in a single statement in Pig. Is it possible? Regards, Harish
1
vote
0 answers

Pig-Attempt to access non existing field

Problem: Dumping filtered output throws an error and prints incorrect output with warnings: Error-attempt to access non-existing field in input Steps: Loaded a tab-delimited file into relation a: a = LOAD…
1
vote
1 answer

LOAD csv file in PigLatin

I'm trying to load a csv file in PigLatin. Record format is as follows: "ABBOTT,DEEDEE W",GRADES 9-12 TEACHER,"52,122.10",0,LBOE,ATLANTA INDEPENDENT SCHOOL SYSTEM,2010 I tried the following code: A = LOAD '/user/hduser/salaryTravel.csv' using…
Niyas
  • 505
  • 1
  • 6
  • 17
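
As a side note on the record format quoted in this question: the sample line contains commas inside double-quoted fields, so a plain comma split yields the wrong number of columns. The sketch below uses Python's standard `csv` module, not Pig, purely to show the difference on the exact record from the excerpt. It makes no claim about which Pig loader the asker used or should use.

```python
import csv
import io

# The sample record quoted in the question excerpt.
line = ('"ABBOTT,DEEDEE W",GRADES 9-12 TEACHER,"52,122.10",0,LBOE,'
        'ATLANTA INDEPENDENT SCHOOL SYSTEM,2010')

# A naive split treats the commas inside quotes as field separators.
naive_fields = line.split(",")

# A quote-aware CSV parser keeps each quoted value together as one field.
quoted_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), len(quoted_fields))
print(quoted_fields)
```

Any loader that splits on a bare comma delimiter without honoring quotes would be expected to run into the same column-count mismatch as the naive split here.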
1
vote
1 answer

Ingesting large files into Hive on a single node Hadoop

I want to ingest large CSV files (up to 6 GB) on a regular basis into a Hadoop single node with 32 GB RAM. The key requirement is to register the data in HCatalog. (Please do not discuss requirements, it is a functional demo). Performance is not…
Stefan Papp
  • 2,199
  • 1
  • 28
  • 54
1
vote
0 answers

Error writing to Hive table with HCatStorer()

I'm currently pulling data from a hive table over S3 with HCatLoader(), and attempting to write back out to a hive table over S3 with HCatStorer(). I'm using the default Hive install that comes baked into AWS EMR. HCatLoader works fine, and I can…
MattClark
  • 11
  • 2
1
vote
3 answers

Move file from local to HDFS

My environment uses Spark, Pig and Hive. I am having trouble writing code in Scala (or any other language compatible with my environment) that could copy a file from a local file system to HDFS. Does anyone have any advice on how I should…
Shakile
  • 343
  • 2
  • 5
  • 13
1
vote
2 answers

Pig-Is there any maximum number of columns for which the FILTER command can be applied?

I have an input file which contains 952 columns. I would like to have a pig script which will check that the schema has not been altered. If altered, my script should fail. This is important because if the columns are altered or missing, my other pig…
1
vote
1 answer

Unable to run Pig latin script on Apache Tez

I have a pseudo-distributed single cluster Ubuntu machine. I have written a simple pig latin script which runs fine while using mapreduce as execution mode. But when I use tez as execution mode using the -x switch, I get the following…
infiQuanta
  • 116
  • 8
1
vote
0 answers

pig script failed to validate: java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.DBStorage'

I'm trying to output a pig script with 3 fields to a PostgreSQL database. When I dump the output, the script works fine. However when I use the DBStorage() method: register /$directory/postgresql9.4-1201.jdbc41.jar; register…
zaralleru
  • 11
  • 2
1
vote
2 answers

Removing duplicates using PigLatin and retaining the last element

I am using PigLatin. And I want to remove the duplicates from the bags and want to retain the last element of the particular key. Input: User1 7 LA User1 8 NYC User1 9 NYC User2 3 NYC User2 4 DC Output: User1 9 NYC User2 4 DC Here…
Anil Savaliya
  • 129
  • 1
  • 1
  • 6
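
The behavior this question asks for can be pinned down with a small reference sketch. This is plain Python, not Pig Latin, and it assumes that "last element" means the last row seen for each key in input order (in the sample data that is also the row with the highest second field). The function name `keep_last_per_key` is hypothetical.

```python
# Sample input from the question excerpt: (user, value, city).
rows = [
    ("User1", 7, "LA"),
    ("User1", 8, "NYC"),
    ("User1", 9, "NYC"),
    ("User2", 3, "NYC"),
    ("User2", 4, "DC"),
]


def keep_last_per_key(records):
    """Return one row per key, keeping the last row seen in input order."""
    last_seen = {}
    for record in records:
        # A later row with the same key overwrites the earlier one.
        last_seen[record[0]] = record
    return list(last_seen.values())


deduplicated = keep_last_per_key(rows)
print(deduplicated)
```

On a distributed engine, input order is generally not guaranteed, so an equivalent script would normally need an explicit ordering field to decide which row counts as last.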
1
vote
1 answer

embedded pig error when running on pig 15 on Hadoop 2

Whenever I run any Apache Pig code from the terminal, everything goes well and I get the result. So I conclude that my installation of Pig 0.15.0 and Hadoop 2.7.0 is alright. The problem is when I run the PigServer from inside Java code: PigServer…
Abdulrahman
  • 433
  • 4
  • 11
1
vote
1 answer

Multiply after joining data in PIG

I am trying to multiply two fields and take their sum after joining three tables in Pig. However I keep on getting this error: (Name: Multiply Type: null Uid: null)incompatible types in Multiply…
harshvardhan.agr
  • 165
  • 1
  • 12
1
vote
0 answers

How do I flatten nested Avro records in a Pig query?

Avro schema looks like this: { "type" : "record", "name" : "name1", "fields" : [ { "name" : "f1", "type" : "string" }, { "name" : "f2", "type" : { "type" : "array", "items" : …
Vikas
  • 8,790
  • 4
  • 38
  • 48
1
vote
1 answer

Reading array of strings from file with Apache Pig

I'm storing a Hive table externally, and it's a pretty simple data structure. The table is created in Hive as (user string, names array<string>) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY '\001' STORED AS…
JayC
  • 238
  • 2
  • 9
1
vote
1 answer

ERROR 2999: Unexpected internal error. java.net.URISyntaxException: Relative path in absolute URI

pig -param CURR_TS=`date "+%F %H:%M:%S"` -f pig_script.pig After running this I am getting the error below: ERROR 2999: Unexpected internal error. java.net.URISyntaxException: Relative path in absolute URI: 04:36:33 I know the problem is with ":"…
Indrajeet Gour
  • 4,020
  • 5
  • 43
  • 70
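
One thing worth checking with the command in this question is how the shell splits it into arguments, since the `date` format contains a space and the backtick substitution is unquoted. The sketch below uses Python's `shlex.split`, which approximates POSIX shell word splitting, to compare the unquoted and quoted forms. The date portion of the timestamp is a made-up value; only the time `04:36:33` comes from the error message. Whether this fully explains the Pig error is an assumption, not something the excerpt confirms.

```python
import shlex

# Hypothetical timestamp; only the time part appears in the question's error.
timestamp = "2015-06-01 04:36:33"

# Unquoted: the space splits the value, so the time becomes its own argument.
unquoted_args = shlex.split(
    f"pig -param CURR_TS={timestamp} -f pig_script.pig"
)

# Quoted: the whole key=value pair stays together as a single argument.
quoted_args = shlex.split(
    f'pig -param "CURR_TS={timestamp}" -f pig_script.pig'
)

print(unquoted_args)
print(quoted_args)
```

In the unquoted form, `04:36:33` arrives as a stray positional argument, which matches the value shown at the end of the reported exception message.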