I am running Gobblin to move data from Kafka to S3 on a 3-node EMR cluster. The cluster runs Hadoop 2.6.0, and I also built Gobblin against 2.6.0.
The map-reduce job appears to run successfully. On HDFS I see a metrics directory and a working directory; metrics has some files, but the working directory is empty. The S3 bucket should contain the final output, but it has no data. At the end the job logs:
Output task state path /gooblinOutput/working/GobblinKafkaQuickStart_mapR3/output/job_GobblinKafkaQuickStart_mapR3_1460132596498 does not exist Deleted working directory /gooblinOutput/working/GobblinKafkaQuickStart_mapR3
Here are the final logs:
2016-04-08 16:23:26 UTC INFO [main] org.apache.hadoop.mapreduce.Job 1366 - Job job_1460065322409_0002 running in uber mode : false
2016-04-08 16:23:26 UTC INFO [main] org.apache.hadoop.mapreduce.Job 1373 - map 0% reduce 0%
2016-04-08 16:23:32 UTC INFO [main] org.apache.hadoop.mapreduce.Job 1373 - map 10% reduce 0%
2016-04-08 16:23:33 UTC INFO [main] org.apache.hadoop.mapreduce.Job 1373 - map 40% reduce 0%
2016-04-08 16:23:34 UTC INFO [main] org.apache.hadoop.mapreduce.Job 1373 - map 60% reduce 0%
2016-04-08 16:23:36 UTC INFO [main] org.apache.hadoop.mapreduce.Job 1373 - map 80% reduce 0%
2016-04-08 16:23:37 UTC INFO [main] org.apache.hadoop.mapreduce.Job 1373 - map 100% reduce 0%
2016-04-08 16:23:38 UTC INFO [main] org.apache.hadoop.mapreduce.Job 1384 - Job job_1460065322409_0002 completed successfully
2016-04-08 16:23:38 UTC INFO [main] org.apache.hadoop.mapreduce.Job 1391 - Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=1276095
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=28184
HDFS: Number of bytes written=41960
HDFS: Number of read operations=60
HDFS: Number of large read operations=0
HDFS: Number of write operations=11
Job Counters
Launched map tasks=10
Other local map tasks=10
Total time spent by all maps in occupied slots (ms)=1828125
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=40625
Total vcore-seconds taken by all map tasks=40625
Total megabyte-seconds taken by all map tasks=58500000
Map-Reduce Framework
Map input records=10
Map output records=0
Input split bytes=2150
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=296
CPU time spent (ms)=10900
Physical memory (bytes) snapshot=2715054080
Virtual memory (bytes) snapshot=18852671488
Total committed heap usage (bytes)=4729077760
File Input Format Counters
Bytes Read=6444
File Output Format Counters
Bytes Written=0
2016-04-08 16:23:38 UTC INFO [TaskStateCollectorService STOPPING] gobblin.runtime.TaskStateCollectorService 101 - Stopping the TaskStateCollectorService
2016-04-08 16:23:38 UTC WARN [TaskStateCollectorService STOPPING] gobblin.runtime.TaskStateCollectorService 123 - Output task state path /gooblinOutput/working/GobblinKafkaQuickStart_mapR3/output/job_GobblinKafkaQuickStart_mapR3_1460132596498 does not exist
2016-04-08 16:23:38 UTC INFO [main] gobblin.runtime.mapreduce.MRJobLauncher 443 - Deleted working directory /gooblinOutput/working/GobblinKafkaQuickStart_mapR3
2016-04-08 16:23:38 UTC INFO [main] gobblin.util.ExecutorsUtils 125 - Attempting to shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@6c257d54[Shutting down, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 1]
2016-04-08 16:23:38 UTC INFO [main] gobblin.util.ExecutorsUtils 144 - Successfully shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@6c257d54[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
2016-04-08 16:23:38 UTC INFO [main] gobblin.runtime.app.ServiceBasedAppLauncher 158 - Shutting down the application
2016-04-08 16:23:38 UTC INFO [MetricsReportingService STOPPING] gobblin.util.ExecutorsUtils 125 - Attempting to shutdown ExecutorService: java.util.concurrent.Executors$DelegatedScheduledExecutorService@5584dbb6
2016-04-08 16:23:38 UTC INFO [MetricsReportingService STOPPING] gobblin.util.ExecutorsUtils 144 - Successfully shutdown ExecutorService: java.util.concurrent.Executors$DelegatedScheduledExecutorService@5584dbb6
2016-04-08 16:23:38 UTC WARN [Thread-7] gobblin.runtime.app.ServiceBasedAppLauncher 153 - ApplicationLauncher has already stopped
2016-04-08 16:23:38 UTC WARN [Thread-4] gobblin.metrics.reporter.ContextAwareReporter 116 - Reporter MetricReportReporter has already been stopped.
2016-04-08 16:23:38 UTC WARN [Thread-4] gobblin.metrics.reporter.ContextAwareReporter 116 - Reporter MetricReportReporter has already been stopped.
Here are my conf files:
FILE 1: gobblin-mapreduce.properties
# Thread pool settings for the task executor
taskexecutor.threadpool.size=2
taskretry.threadpool.coresize=1
taskretry.threadpool.maxsize=2
# File system URIs
fs.uri=hdfs://{host}:8020
writer.fs.uri=${fs.uri}
state.store.fs.uri=s3a://{bucket}/gobblin-mapr/
# Writer related configuration properties
writer.destination.type=HDFS
writer.output.format=AVRO
writer.staging.dir=${env:GOBBLIN_WORK_DIR}/task-staging
writer.output.dir=${env:GOBBLIN_WORK_DIR}/task-output
# Data publisher related configuration properties
data.publisher.type=gobblin.publisher.BaseDataPublisher
data.publisher.final.dir=${env:GOBBLIN_WORK_DIR}/job-output
data.publisher.replace.final.dir=false
# Directory where job/task state files are stored
state.store.dir=${env:GOBBLIN_WORK_DIR}/state-store
# Directory where error files from the quality checkers are stored
qualitychecker.row.err.file=${env:GOBBLIN_WORK_DIR}/err
# Directory where job locks are stored
job.lock.dir=${env:GOBBLIN_WORK_DIR}/locks
# Directory where metrics log files are stored
metrics.log.dir=${env:GOBBLIN_WORK_DIR}/metrics
# Interval of task state reporting in milliseconds
task.status.reportintervalinms=5000
# MapReduce properties
mr.job.root.dir=${env:GOBBLIN_WORK_DIR}/working
# s3 bucket configuration
data.publisher.fs.uri=s3a://{bucket}/gobblin-mapr/
fs.s3a.access.key={key}
fs.s3a.secret.key={key}
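A note on how I read the publisher settings above (this is my assumption from the docs, not something I have verified): I expect the final directory to resolve against the publisher file system URI, roughly like this:

```properties
# Assumed resolution of the publish location (not verified):
#   data.publisher.fs.uri    = s3a://{bucket}/gobblin-mapr/
#   data.publisher.final.dir = ${GOBBLIN_WORK_DIR}/job-output = /gooblinOutput/job-output
# so I would expect published files to land under:
#   s3a://{bucket}/gobblin-mapr/gooblinOutput/job-output/...
```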
FILE 2: kafka-to-s3.pull
job.name=GobblinKafkaQuickStart_mapR3
job.group=GobblinKafka_mapR3
job.description=Gobblin quick start job for Kafka
job.lock.enabled=false
kafka.brokers={kafka-host}:9092
topic.whitelist={topic_name}
source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
extract.namespace=gobblin.extract.kafka
writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=txt
data.publisher.type=gobblin.publisher.BaseDataPublisher
mr.job.max.mappers=10
bootstrap.with.offset=latest
metrics.reporting.file.enabled=true
metrics.enabled=true
metrics.reporting.file.suffix=txt
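One thing I am unsure about: if I understand the Gobblin Kafka source correctly, `bootstrap.with.offset=latest` makes the very first run begin at the newest offsets, so a run against a topic with no new messages would pull zero records, which would be consistent with the `Map output records=0` counter in the logs above. A variation I could test (value per the Gobblin docs, behavior assumed) would be:

```properties
# Start the first run from the earliest available offsets instead of the newest,
# so messages already sitting in the topic are pulled rather than only new arrivals.
bootstrap.with.offset=earliest
```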
Running commands:
export GOBBLIN_WORK_DIR=/gooblinOutput
Command: bin/gobblin-mapreduce.sh --conf /home/hadoop/gobblin-files/gobblin-dist/kafkaConf/kafka-to-s3.pull --logdir /home/hadoop/gobblin-files/gobblin-dist/logs
Not sure what's going on. Can someone please help?