
I wrote a simple Pig program to analyze a small, modified version of the Google n-grams dataset on AWS. The data looks something like this:

I am 1936 942 90
I am 1945 811 5
I am 1951 47 12
very cool 1923 118 10
very cool 1980 320 100
very cool 2012 994 302
very cool 2017 1820 612

and has the form:

n-gram TAB year TAB occurrences TAB books NEWLINE

I wrote the following program to calculate the occurrences of an n-gram per book:

inp = LOAD <insert input path here> AS (ngram:chararray, year:int, occurences:int, books:int);
filter_input = FILTER inp BY (occurences >= 400) AND (books >= 8);
groupinp = GROUP filter_input BY ngram;
sum_occ = FOREACH groupinp GENERATE FLATTEN(group) AS firstcol, SUM(occurences) AS socc , SUM(books) AS nbooks;

DUMP sum_occ;

However, the DUMP command does not work and gives the following error:

892520 [main] INFO  org.apache.pig.tools.pigstats.ScriptState  - Pig features used in the script: GROUP_BY,FILTER
18/03/28 00:56:09 INFO pigstats.ScriptState: Pig features used in the script: GROUP_BY,FILTER
1892554 [main] INFO  org.apache.pig.data.SchemaTupleBackend  - Key [pig.schematuple] was not set... will not generate code.
18/03/28 00:56:09 INFO data.SchemaTupleBackend: Key [pig.schematuple] was not set... will not generate code.
1892555 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer  - {RULES_ENABLED=[ConstantCalculator, LoadTypeCastInserter, PredicatePushdownOptimizer, StreamTypeCastInserter], RULES_DISABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter]}
18/03/28 00:56:09 INFO optimizer.LogicalPlanOptimizer: {RULES_ENABLED=[ConstantCalculator, LoadTypeCastInserter, PredicatePushdownOptimizer, StreamTypeCastInserter], RULES_DISABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter]}
1892591 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher  - Tez staging directory is /tmp/temp383666093 and resources directory is /tmp/temp383666093
18/03/28 00:56:09 INFO tez.TezLauncher: Tez staging directory is /tmp/temp383666093 and resources directory is /tmp/temp383666093
1892592 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.plan.TezCompiler  - File concatenation threshold: 100 optimistic? false
18/03/28 00:56:09 INFO plan.TezCompiler: File concatenation threshold: 100 optimistic? false
1892593 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.AccumulatorOptimizerUtil  - Reducer is to run in accumulative mode.
18/03/28 00:56:09 INFO util.AccumulatorOptimizerUtil: Reducer is to run in accumulative mode.
1892606 [main] INFO  org.apache.pig.builtin.PigStorage  - Using PigTextInputFormat
18/03/28 00:56:09 INFO builtin.PigStorage: Using PigTextInputFormat
18/03/28 00:56:09 INFO input.FileInputFormat: Total input files to process : 1
1892626 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil  - Total input paths to process : 1
18/03/28 00:56:09 INFO util.MapRedUtil: Total input paths to process : 1
1892627 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil  - Total input paths (combined) to process : 1
18/03/28 00:56:09 INFO util.MapRedUtil: Total input paths (combined) to process : 1
18/03/28 00:56:09 INFO hadoop.MRInputHelpers: NumSplits: 1, SerializedSize: 408
1892653 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Local resource: joda-time-2.9.4.jar
18/03/28 00:56:09 INFO tez.TezJobCompiler: Local resource: joda-time-2.9.4.jar
1892653 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Local resource: pig-0.17.0-core-h2.jar
18/03/28 00:56:09 INFO tez.TezJobCompiler: Local resource: pig-0.17.0-core-h2.jar
1892653 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Local resource: antlr-runtime-3.4.jar
18/03/28 00:56:09 INFO tez.TezJobCompiler: Local resource: antlr-runtime-3.4.jar
1892653 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Local resource: automaton-1.11-8.jar
18/03/28 00:56:09 INFO tez.TezJobCompiler: Local resource: automaton-1.11-8.jar
1892709 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - For vertex - scope-239: parallelism=1, memory=1536, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1229m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA 
18/03/28 00:56:09 INFO tez.TezDagBuilder: For vertex - scope-239: parallelism=1, memory=1536, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1229m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA 
1892709 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Processing aliases: filter_input,groupinp,inp
18/03/28 00:56:09 INFO tez.TezDagBuilder: Processing aliases: filter_input,groupinp,inp
1892709 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Detailed locations: inp[1,6],inp[-1,-1],filter_input[2,15],groupinp[3,11]
18/03/28 00:56:09 INFO tez.TezDagBuilder: Detailed locations: inp[1,6],inp[-1,-1],filter_input[2,15],groupinp[3,11]
1892709 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Pig features in the vertex: 
18/03/28 00:56:09 INFO tez.TezDagBuilder: Pig features in the vertex: 
1892744 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Set auto parallelism for vertex scope-240
18/03/28 00:56:09 INFO tez.TezDagBuilder: Set auto parallelism for vertex scope-240
1892744 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - For vertex - scope-240: parallelism=1, memory=3072, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2458m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA 
18/03/28 00:56:09 INFO tez.TezDagBuilder: For vertex - scope-240: parallelism=1, memory=3072, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2458m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA 
1892744 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Processing aliases: sum_occ
18/03/28 00:56:09 INFO tez.TezDagBuilder: Processing aliases: sum_occ
1892744 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Detailed locations: sum_occ[5,10]
18/03/28 00:56:09 INFO tez.TezDagBuilder: Detailed locations: sum_occ[5,10]
1892745 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Pig features in the vertex: GROUP_BY
18/03/28 00:56:09 INFO tez.TezDagBuilder: Pig features in the vertex: GROUP_BY
1892762 [main] ERROR org.apache.pig.tools.grunt.Grunt  - ERROR 2017: Internal error creating job configuration.
18/03/28 00:56:09 ERROR grunt.Grunt: ERROR 2017: Internal error creating job configuration.
Details at logfile: /mnt/var/log/pig/pig_1522196676602.log

How do I fix this?

thegreatcoder
  • What do the logs at `/mnt/var/log/pig/pig_1522196676602.log` say? As an aside, the error isn't with `DUMP`; it's just that no execution actually takes place until you say `DUMP`, as that's the first time you explicitly tell Pig to return something. – Ben Watson Mar 28 '18 at 08:44
  • What separates the words within the n-grams "I am" and "very cool": a space or a tab? If it is a tab, your data will not be loaded correctly. – nobody Mar 28 '18 at 14:40
  • @Ben Watson: How do I navigate to the log? Is it stored locally? Or on the bucket in the cloud? Please guide me. – thegreatcoder Mar 28 '18 at 15:44
  • @VK_217 It's a space. – thegreatcoder Mar 28 '18 at 15:45
  • Should be local. – Ben Watson Mar 28 '18 at 15:46
  • Checked the logs and it said something like this: `2018-03-27 22:54:02,506 ERROR org.apache.pig.tools.grunt.Grunt (main): ERROR 1000: Error during parsing. Lexical error at line 10, column 0. Encountered: after : "" 2018-03-27 22:54:02,559 INFO org.apache.pig.Main (main): Pig script completed in 10 seconds and 445 milliseconds (10445 ms)` – thegreatcoder Mar 28 '18 at 16:14
  • What's the statement at line 10 in your Pig script? – nobody Mar 28 '18 at 16:26
  • There is no tenth line. And no space after the last DUMP sum_occ; – thegreatcoder Mar 28 '18 at 16:28

3 Answers


If you are running an old version of Pig, updating it should solve the ERROR 2017 problem.

Pig scripts are lazily evaluated: nothing executes until a DUMP or STORE statement, so you will not find out what is wrong with the earlier statements until then.

When you run your code again, it will throw the following error:

ERROR 1025: Invalid field projection. Projected field [occurences] does not exist in schema: group:chararray,filter_input:bag{:tuple(ngram:chararray,year:int,occurences:int,books:int)}.

After the GROUP, the original fields live inside the bag named filter_input, so they have to be projected through it. Changing the line below from

sum_occ = FOREACH groupinp GENERATE FLATTEN(group) AS firstcol, SUM(occurences) AS socc , SUM(books) AS nbooks;

to

sum_occ = FOREACH groupinp GENERATE FLATTEN(group) AS firstcol, SUM(filter_input.occurences) AS socc, SUM(filter_input.books) AS nbooks;

will solve this error.
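As a rough sanity check of what the corrected FOREACH should return, here is a plain-Python sketch (not Pig) that mimics the FILTER, GROUP and SUM steps on the sample rows from the question. The function name `sum_per_ngram` and the `ROWS` list are made up for illustration, and the thresholds are simply copied from the script.

```python
from collections import defaultdict

# Sample rows from the question: (ngram, year, occurrences, books).
ROWS = [
    ("I am", 1936, 942, 90),
    ("I am", 1945, 811, 5),
    ("I am", 1951, 47, 12),
    ("very cool", 1923, 118, 10),
    ("very cool", 1980, 320, 100),
    ("very cool", 2012, 994, 302),
    ("very cool", 2017, 1820, 612),
]


def sum_per_ngram(rows, min_occurrences=400, min_books=8):
    """Mimic FILTER -> GROUP BY ngram -> SUM(occurrences), SUM(books)."""
    totals = defaultdict(lambda: [0, 0])
    for ngram, _year, occurrences, books in rows:
        # Same condition as the FILTER statement in the Pig script.
        if occurrences >= min_occurrences and books >= min_books:
            totals[ngram][0] += occurrences
            totals[ngram][1] += books
    return {ngram: (socc, nbooks) for ngram, (socc, nbooks) in totals.items()}


if __name__ == "__main__":
    for ngram, (socc, nbooks) in sorted(sum_per_ngram(ROWS).items()):
        print(ngram, socc, nbooks)
```

If DUMP sum_occ gives different totals for these rows, the input is probably not being split into columns the way the schema expects.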

hprakash

I don't have enough reputation to comment, so I'm writing it here. My guess is that you have an unclosed quote. What do you have in the "insert input path here" part? Is the path enclosed in single quotes?

Koji
  • It is within single quotes. I just removed the path name as I did not want to mention it explicitly in a public forum. – thegreatcoder Mar 30 '18 at 17:45
  • The pathname is there in the original code, enclosed within single quotes. – thegreatcoder Mar 30 '18 at 17:46
  • Hmm, then unless you have extra comments (-- or /* */) not shared in the description, I don't have an answer. – Koji Mar 30 '18 at 19:05
  • If you do have a comment in your script, you may want to check https://issues.apache.org/jira/browse/PIG-4818 and make sure you don't have quotes in your comments. – Koji Mar 30 '18 at 19:24

I don't have enough reputation to comment, so I'm posting here. Are you writing the above Pig statements in a script, or running them individually from the grunt shell? Also, can you briefly explain the logic behind the sum_occ relation?

Rajnil Guha