
I am running Gobblin to move data from Kafka to S3 using a 3-node EMR cluster. The cluster runs Hadoop 2.6.0, and I also built Gobblin against 2.6.0.

It seems like the MapReduce job runs successfully. On HDFS I see the metrics and working directories; metrics has some files, but the working directory is empty. The S3 bucket should have the final output but has no data. At the end the log says:

Output task state path /gooblinOutput/working/GobblinKafkaQuickStart_mapR3/output/job_GobblinKafkaQuickStart_mapR3_1460132596498 does not exist
Deleted working directory /gooblinOutput/working/GobblinKafkaQuickStart_mapR3

Here are the final logs:

2016-04-08 16:23:26 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1366 -      Job job_1460065322409_0002 running in uber mode : false
2016-04-08 16:23:26 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1373 -  map 0% reduce 0%
2016-04-08 16:23:32 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1373 -  map 10% reduce 0%
2016-04-08 16:23:33 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1373 -  map 40% reduce 0%
2016-04-08 16:23:34 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1373 -  map 60% reduce 0%
2016-04-08 16:23:36 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1373 -  map 80% reduce 0%
2016-04-08 16:23:37 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1373 -  map 100% reduce 0%
2016-04-08 16:23:38 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1384 -      Job job_1460065322409_0002 completed successfully
2016-04-08 16:23:38 UTC INFO  [main] org.apache.hadoop.mapreduce.Job  1391 -      Counters: 30
    File System Counters
     FILE: Number of bytes read=0
    FILE: Number of bytes written=1276095
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=28184
    HDFS: Number of bytes written=41960
    HDFS: Number of read operations=60
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=11
Job Counters
    Launched map tasks=10
    Other local map tasks=10
    Total time spent by all maps in occupied slots (ms)=1828125
    Total time spent by all reduces in occupied slots (ms)=0
    Total time spent by all map tasks (ms)=40625
    Total vcore-seconds taken by all map tasks=40625
    Total megabyte-seconds taken by all map tasks=58500000
Map-Reduce Framework
    Map input records=10
    Map output records=0
    Input split bytes=2150
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=296
    CPU time spent (ms)=10900
    Physical memory (bytes) snapshot=2715054080
    Virtual memory (bytes) snapshot=18852671488
    Total committed heap usage (bytes)=4729077760
File Input Format Counters
    Bytes Read=6444
File Output Format Counters
    Bytes Written=0
2016-04-08 16:23:38 UTC INFO  [TaskStateCollectorService STOPPING]    gobblin.runtime.TaskStateCollectorService  101 - Stopping the    TaskStateCollectorService
2016-04-08 16:23:38 UTC WARN  [TaskStateCollectorService STOPPING] gobblin.runtime.TaskStateCollectorService  123 - Output task state path /gooblinOutput/working/GobblinKafkaQuickStart_mapR3/output/job_GobblinKafkaQuickStart_mapR3_1460132596498 does not exist
2016-04-08 16:23:38 UTC INFO  [main] gobblin.runtime.mapreduce.MRJobLauncher  443 - Deleted working directory /gooblinOutput/working/GobblinKafkaQuickStart_mapR3
2016-04-08 16:23:38 UTC INFO  [main] gobblin.util.ExecutorsUtils  125 - Attempting to shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@6c257d54[Shutting down, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 1]
2016-04-08 16:23:38 UTC INFO  [main] gobblin.util.ExecutorsUtils  144 - Successfully shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@6c257d54[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
2016-04-08 16:23:38 UTC INFO  [main] gobblin.runtime.app.ServiceBasedAppLauncher  158 - Shutting down the application
2016-04-08 16:23:38 UTC INFO  [MetricsReportingService STOPPING] gobblin.util.ExecutorsUtils  125 - Attempting to shutdown ExecutorService: java.util.concurrent.Executors$DelegatedScheduledExecutorService@5584dbb6
2016-04-08 16:23:38 UTC INFO  [MetricsReportingService STOPPING] gobblin.util.ExecutorsUtils  144 - Successfully shutdown ExecutorService: java.util.concurrent.Executors$DelegatedScheduledExecutorService@5584dbb6
2016-04-08 16:23:38 UTC WARN  [Thread-7] gobblin.runtime.app.ServiceBasedAppLauncher  153 - ApplicationLauncher has already stopped
2016-04-08 16:23:38 UTC WARN  [Thread-4] gobblin.metrics.reporter.ContextAwareReporter  116 - Reporter MetricReportReporter has already been stopped.
2016-04-08 16:23:38 UTC WARN  [Thread-4] gobblin.metrics.reporter.ContextAwareReporter  116 - Reporter MetricReportReporter has already been stopped.

Here are my config files:

FILE 1: gobblin-mapreduce.properties

# Thread pool settings for the task executor
taskexecutor.threadpool.size=2
taskretry.threadpool.coresize=1
taskretry.threadpool.maxsize=2

# File system URIs
fs.uri=hdfs://{host}:8020
writer.fs.uri=${fs.uri}
state.store.fs.uri=s3a://{bucket}/gobblin-mapr/

# Writer related configuration properties
writer.destination.type=HDFS
writer.output.format=AVRO
writer.staging.dir=${env:GOBBLIN_WORK_DIR}/task-staging
writer.output.dir=${env:GOBBLIN_WORK_DIR}/task-output

# Data publisher related configuration properties 
data.publisher.type=gobblin.publisher.BaseDataPublisher
data.publisher.final.dir=${env:GOBBLIN_WORK_DIR}/job-output
data.publisher.replace.final.dir=false

# Directory where job/task state files are stored
state.store.dir=${env:GOBBLIN_WORK_DIR}/state-store

# Directory where error files from the quality checkers are stored
qualitychecker.row.err.file=${env:GOBBLIN_WORK_DIR}/err

# Directory where job locks are stored
job.lock.dir=${env:GOBBLIN_WORK_DIR}/locks

# Directory where metrics log files are stored
metrics.log.dir=${env:GOBBLIN_WORK_DIR}/metrics

# Interval of task state reporting in milliseconds
task.status.reportintervalinms=5000

# MapReduce properties
mr.job.root.dir=${env:GOBBLIN_WORK_DIR}/working


# s3 bucket configuration

data.publisher.fs.uri=s3a://{bucket}/gobblin-mapr/
fs.s3a.access.key={key}
fs.s3a.secret.key={key}

FILE 2: kafka-to-s3.pull

job.name=GobblinKafkaQuickStart_mapR3
job.group=GobblinKafka_mapR3
job.description=Gobblin quick start job for Kafka
job.lock.enabled=false

kafka.brokers={kafka-host}:9092
topic.whitelist={topic_name}

source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
extract.namespace=gobblin.extract.kafka

writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=txt

data.publisher.type=gobblin.publisher.BaseDataPublisher

mr.job.max.mappers=10
bootstrap.with.offset=latest

metrics.reporting.file.enabled=true
metrics.enabled=true
metrics.reporting.file.suffix=txt

Running commands:

export GOBBLIN_WORK_DIR=/gooblinOutput
bin/gobblin-mapreduce.sh --conf /home/hadoop/gobblin-files/gobblin-dist/kafkaConf/kafka-to-s3.pull --logdir /home/hadoop/gobblin-files/gobblin-dist/logs

Not sure what's going on. Can someone please help?


1 Answer

There were two issues:

1. I had data.publisher.final.dir=${env:GOBBLIN_WORK_DIR}/job-output. It should have been an S3 path, something like s3a://dev.com/gobblin-mapr6/.

2. I had somehow gotten special characters added to topic.whitelist, so Gobblin was not able to recognize the topic.
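For reference, a minimal sketch of what the corrected properties could look like. The bucket, path, and topic names here are placeholders, not the actual values from the job above:

```properties
# Publish final output directly to S3 instead of under the HDFS work dir
# (bucket and path are placeholders)
data.publisher.final.dir=s3a://dev.com/gobblin-mapr6/
data.publisher.fs.uri=s3a://dev.com/

# The whitelist must contain only the plain topic name -- no stray
# whitespace or invisible characters pasted in from elsewhere
topic.whitelist=my_topic
```

If the whitelist value was copied from a rich-text source, retyping it by hand is a simple way to rule out hidden characters.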
