
I'm working on a Hadoop program that is scheduled to run once a day. It takes a bunch of JSON documents, and each document has a timestamp showing when it was added. My program should only process the documents that have been added since its last run. So I need to keep a piece of state: a timestamp recording the last time my Hadoop job ran. I was thinking of storing this state in SQL Server and querying it in the driver program of my job. Is that a good solution, or is there a better one?

P.S. My Hadoop job is running on HDInsight. Given that, is it still possible to query SQL Server from my driver program?
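For illustration, here is a minimal sketch of what reading and writing that timestamp over JDBC from the driver might look like. The table, columns and connection string are hypothetical placeholders, and the SQL Server JDBC driver would need to be on the driver program's classpath; this is a sketch of the idea, not a tested HDInsight setup.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class LastRunState {

    // Hypothetical connection string; a real setup would point at the actual
    // SQL Server / Azure SQL instance reachable from the HDInsight cluster.
    private static final String JDBC_URL =
            "jdbc:sqlserver://<server>:1433;databaseName=<db>;user=<user>;password=<password>";

    // Reads the timestamp of the last successful run (hypothetical job_state table).
    public static Timestamp readLastRun(String jobName) throws Exception {
        String sql = "SELECT last_run_ts FROM job_state WHERE job_name = ?";
        try (Connection conn = DriverManager.getConnection(JDBC_URL);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, jobName);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getTimestamp(1) : null;   // null => first run
            }
        }
    }

    // Stores the start time of the current run once the job has finished successfully.
    public static void writeLastRun(String jobName, Timestamp runStart) throws Exception {
        String sql = "UPDATE job_state SET last_run_ts = ? WHERE job_name = ?";
        try (Connection conn = DriverManager.getConnection(JDBC_URL);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setTimestamp(1, runStart);
            ps.setString(2, jobName);
            ps.executeUpdate();
        }
    }
}

If the driver records the run's start time as the new state and filters documents with a timestamp greater than the previous state and no later than that start time, documents added while the job is running simply fall into the next run's window.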

HHH
  • We solved exactly the same problem for our Hadoop jobs in AWS S3. Is your file system HDFS? If yes, then identifying and reading only the files that were not processed last time may not be that easy. – Manjunath Ballur Dec 02 '15 at 12:49
  • My input is coming from a NoSQL database, not HDFS. Could you kindly explain a bit about your approach? My main concern is about the records that get added while the job is running. I don't know what happens to them. Do they get processed in the current job, the next job, or maybe both?! – HHH Dec 02 '15 at 17:28
  • I have updated the answer with the solution we had implemented for our workflows in S3. – Manjunath Ballur Dec 04 '15 at 13:28

3 Answers

1

We solved this problem for our workflows running in AWS (Amazon Web Services), for data stored in S3.

Our setup:

  • Data store: AWS S3
  • Data ingestion mechanism: Flume
  • Workflow management: Oozie
  • Storage for file status: MySQL

Problem:

We were ingesting data into Amazon S3 using Flume. All the ingested data went into the same folder (S3 is a key/value store and has no concept of folders; here "folder" means all the data shared the same key prefix, e.g. /tmp/1.txt, /tmp/2.txt, where /tmp/ is the key prefix).

We had an ETL workflow scheduled to run once an hour. Since all the data was ingested into the same folder, the workflow had to distinguish between processed and unprocessed files.

For example, say the data ingested in the first hour is:

/tmp/1.txt
/tmp/2.txt

When the workflow starts for the first time, it should process data from "1.txt" and "2.txt" and mark them as Processed.

If the data ingested in the second hour is:

/tmp/3.txt
/tmp/4.txt
/tmp/5.txt

Then, the total data in the folder after 2 hours will be:

/tmp/1.txt
/tmp/2.txt
/tmp/3.txt
/tmp/4.txt
/tmp/5.txt

Since, "1.txt" and "2.txt" were already processed and marked as Processed, during the second run, the job should just process "3.txt", "4.txt" and "5.txt".

Solution:

We developed a library (let's call it FileManager) for managing the list of processed files. We plugged this library into the Oozie workflow as a Java action; it was the first step in the workflow.

The library also took care of ignoring files that Flume was still writing to. While Flume is writing data into a file, the file carries a "_current" suffix, so such files were skipped until they were completely written.

The ingested files were generated with a timestamp as a suffix, e.g. "hourly_feed.1234567", so the file names sorted in ascending order of creation time.

To get the list of unprocessed files, we used S3's ability to list keys starting after a marker (for example, if you have 10,000 files in a folder and specify the name of the 5,000th file as the marker, S3 returns files 5,001 to 10,000).
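A rough sketch of that listing call with the AWS SDK for Java (v1); the bucket name, prefix and the "_current" filtering are illustrative, not the exact code we used:

import java.util.ArrayList;
import java.util.List;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class S3MarkerListing {

    // Lists keys under a prefix that sort after the given marker (the last processed key).
    public static List<String> listNewKeys(String bucket, String prefix, String marker) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        List<String> keys = new ArrayList<>();

        ObjectListing listing = s3.listObjects(new ListObjectsRequest()
                .withBucketName(bucket)
                .withPrefix(prefix)
                .withMarker(marker));   // S3 returns only keys lexicographically after the marker

        while (true) {
            for (S3ObjectSummary summary : listing.getObjectSummaries()) {
                // Skip files Flume is still writing to (the "_current" suffix mentioned above)
                if (!summary.getKey().endsWith("_current")) {
                    keys.add(summary.getKey());
                }
            }
            if (!listing.isTruncated()) {
                break;
            }
            listing = s3.listNextBatchOfObjects(listing);   // fetch the next page of results
        }
        return keys;
    }
}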

Each file was in one of the following three states:

  1. SUCCESS - Files which were successfully processed
  2. ERROR - Files which were picked up for processing but failed with an error. These files need to be picked up again for processing
  3. IN_PROGRESS - Files which have been picked up for processing and are currently being processed by a job

For each file, we stored the following details in the MySQL DB:

  • File Name
  • Last Modified Time - We used this to handle some corner cases
  • Status of the file (IN_PROGRESS, SUCCESS, ERROR)

The FileManager exposed the following interfaces (a rough sketch follows the list):

  • GetLatestFiles: Returns the list of the latest unprocessed files
  • UpdateFileStatus: Updates the status of the files after they have been processed
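The sketch below is only an illustration of that contract; the exact names and signatures are assumptions, not the library's real API:

import java.util.List;

// Hypothetical sketch of the FileManager contract described above.
public interface FileManager {

    enum FileStatus { IN_PROGRESS, SUCCESS, ERROR }

    // Returns the latest unprocessed files: files ingested after the last SUCCESS
    // marker plus any files left in ERROR state, marking them all IN_PROGRESS.
    List<String> getLatestFiles();

    // Updates the status of the given files after the job run (SUCCESS or ERROR).
    void updateFileStatus(List<String> fileNames, FileStatus status);
}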

The following steps were used to identify the files that had not yet been processed (a sketch of steps 1 and 3 follows the list):

  1. Query the database (MySQL) for the most recent file with status SUCCESS (query: order by created desc).
  2. If the first step returns a file, query S3 with the marker set to that file's name. This returns all the files ingested after the last successfully processed file.
  3. Also query the DB for any files in ERROR status. These files need to be re-processed, because the previous workflow run did not process them successfully.
  4. Return the combined list of files obtained from steps 2 and 3 (before returning them, mark their status as IN_PROGRESS).
  5. After the job completes successfully, update the status of all the processed files to SUCCESS. If there was an error in processing, update their status to ERROR so that they are picked up again next time.
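A minimal JDBC sketch of steps 1 and 3, assuming a file_status table with columns file_name, last_modified, status and created (the schema and names are hypothetical):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class FileStatusQueries {

    // Step 1: the name of the most recently created file with status SUCCESS, or null if none.
    public static String lastSuccessfulFile(Connection conn) throws Exception {
        String sql = "SELECT file_name FROM file_status "
                   + "WHERE status = 'SUCCESS' ORDER BY created DESC LIMIT 1";
        try (PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getString(1) : null;
        }
    }

    // Step 3: files stuck in ERROR state that must be re-processed.
    public static List<String> errorFiles(Connection conn) throws Exception {
        String sql = "SELECT file_name FROM file_status WHERE status = 'ERROR'";
        List<String> files = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                files.add(rs.getString(1));
            }
        }
        return files;
    }
}

The result of step 2 (the S3 marker listing, as in the earlier sketch) is then merged with the ERROR files and the whole set is marked IN_PROGRESS before being handed to the workflow.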

We used Oozie for workflow management. The Oozie workflow had the following steps:

  1. Step 1: Fetch the next set of files to be processed, mark each of them as IN_PROGRESS, and pass them to the next stage
  2. Step 2: Process the files
  3. Step 3: Update the status of the processing (SUCCESS or ERROR)

De-duplication: When you implement such a library, there is a possibility of duplicate records (in some corner cases, the same file may be picked up twice for processing). We implemented de-duplication logic to remove duplicate records.

Manjunath Ballur
0

You can rename the result documents using the date and time; then your program can select which documents to process based on the document name.
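For example, if each document name started with a date/time prefix (an assumed convention, e.g. "2015-12-02T10-00_doc1.json"), the driver could pick new documents by comparing names; a minimal sketch:

import java.util.List;
import java.util.stream.Collectors;

public class NameBasedSelection {

    // Returns the documents whose date-prefixed names sort after the last processed name.
    public static List<String> selectNew(List<String> documentNames, String lastProcessedName) {
        return documentNames.stream()
                .filter(name -> name.compareTo(lastProcessedName) > 0)
                .sorted()
                .collect(Collectors.toList());
    }
}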

rainforc
0

Having the driver program check the last-run timestamp is a good approach, but for storing the last-run timestamp you can use a small file in HDFS instead.
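A minimal sketch of that idea using the Hadoop FileSystem API; the state-file path is an assumption, and on HDInsight the default FileSystem is typically Azure blob storage rather than HDFS, but the API calls are the same:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRunState {

    // Hypothetical location of the state file.
    private static final Path STATE_FILE = new Path("/state/my_job/last_run_ts");

    // Reads the last-run timestamp (epoch millis) from the state file; 0 means first run.
    public static long readLastRun(Configuration conf) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        if (!fs.exists(STATE_FILE)) {
            return 0L;
        }
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(STATE_FILE), StandardCharsets.UTF_8))) {
            return Long.parseLong(reader.readLine().trim());
        }
    }

    // Overwrites the state file with the start time of the current run.
    public static void writeLastRun(Configuration conf, long runStartMillis) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(STATE_FILE, true)) {   // true => overwrite
            out.writeBytes(Long.toString(runStartMillis));
        }
    }
}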