We had solved this problem for our workflows running in AWS (Amazon Web Services), with the data stored in S3.
Our setup:
- Data store: AWS S3
- Data ingestion mechanism: Flume
- Workflow management: Oozie
- Storage for file status: MySQL
Problem:
We were ingesting data into Amazon S3 using Flume. All the ingested data landed in the same folder (S3 is a key/value store and has no real concept of folders; here, "folder" means all the data shared the same key prefix, e.g. /tmp/1.txt and /tmp/2.txt, where /tmp/ is the prefix).
We had an ETL workflow scheduled to run once an hour. Since all the data was ingested into the same folder, we had to distinguish between processed and unprocessed files.
For example, suppose the data ingested in the first hour is:
/tmp/1.txt
/tmp/2.txt
When the workflow starts for the first time, it should process data from "1.txt" and "2.txt" and mark them as Processed.
If the data ingested in the second hour is:
/tmp/3.txt
/tmp/4.txt
/tmp/5.txt
Then, the total data in the folder after 2 hours will be:
/tmp/1.txt
/tmp/2.txt
/tmp/3.txt
/tmp/4.txt
/tmp/5.txt
Since, "1.txt" and "2.txt" were already processed and marked as Processed, during the second run, the job should just process "3.txt", "4.txt" and "5.txt".
Solution:
We developed a library (let's call it FileManager) to manage the list of processed files. We plugged this library into the Oozie workflow as a Java action; it was the first step in the workflow.
This library also took care of ignoring files that were still being written by Flume. While Flume is writing data into a file, the file carries a "_current" suffix, so such files were skipped until they were completely written.
The ingested files were generated with a timestamp suffix, e.g. "hourly_feed.1234567", so the file names were in ascending order of creation.
To get the list of unprocessed files, we used S3's feature of listing keys with a marker (for example, if a folder has 10,000 files and you specify the name of the 5,000th file as the marker, S3 returns files 5,001 through 10,000).
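A minimal sketch of such a marker-based listing, assuming the AWS SDK for Java (v1); the bucket name, prefix and marker values are illustrative, and the "_current" filter from above is included:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

import java.util.ArrayList;
import java.util.List;

public class S3MarkerListing {

    // Lists keys under the given prefix that sort after 'lastProcessedKey',
    // skipping files Flume is still writing to (suffix "_current").
    public static List<String> listUnprocessed(AmazonS3 s3, String bucket,
                                               String prefix, String lastProcessedKey) {
        List<String> keys = new ArrayList<>();
        ListObjectsRequest request = new ListObjectsRequest()
                .withBucketName(bucket)
                .withPrefix(prefix)
                .withMarker(lastProcessedKey); // only keys after the marker are returned

        ObjectListing listing = s3.listObjects(request);
        while (true) {
            for (S3ObjectSummary summary : listing.getObjectSummaries()) {
                String key = summary.getKey();
                if (!key.endsWith("_current")) { // ignore files still being written by Flume
                    keys.add(key);
                }
            }
            if (!listing.isTruncated()) {
                break;
            }
            listing = s3.listNextBatchOfObjects(listing); // results arrive in pages of up to 1,000 keys
        }
        return keys;
    }

    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // Bucket, prefix and marker below are illustrative values only.
        List<String> files = listUnprocessed(s3, "my-bucket", "tmp/", "tmp/hourly_feed.1234567");
        files.forEach(System.out::println);
    }
}
```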
Each of the files could be in one of the following three states:
- SUCCESS - Files that were successfully processed
- ERROR - Files that were picked up for processing but hit an error; these need to be picked up again in a later run
- IN_PROGRESS - Files that have been picked up and are currently being processed by a job
For each file, we stored the following details in the MySQL DB (a possible table layout is sketched after the list):
- File Name
- Last Modified Time - We used this to handle some corner cases
- Status of the file (IN_PROGRESS, SUCCESS, ERROR)
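A minimal sketch of that table, created over JDBC; the table name, column names and types are assumptions based on the fields above and on the "order by created" query used later:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class FileStatusSchema {

    // Hypothetical table layout for tracking file status.
    private static final String CREATE_TABLE =
            "CREATE TABLE IF NOT EXISTS file_status ("
          + "  file_name     VARCHAR(255) NOT NULL PRIMARY KEY,"
          + "  last_modified TIMESTAMP    NOT NULL,"
          + "  status        ENUM('IN_PROGRESS','SUCCESS','ERROR') NOT NULL,"
          + "  created       TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP"
          + ")";

    public static void main(String[] args) throws Exception {
        // JDBC URL and credentials are placeholders.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/filemanager", "user", "password");
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(CREATE_TABLE);
        }
    }
}
```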
The FileManager exposed the following interfaces:
- GetLatestFiles: Returns the list of the latest unprocessed files
- UpdateFileStatus: Updates the status of the files after they have been processed
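In Java terms, the contract might look roughly like this; only the two operation names come from the description above, while the method signatures and the FileStatus enum are assumptions:

```java
import java.util.List;

// Sketch of the FileManager contract; parameter and type names are assumptions.
public interface FileManager {

    enum FileStatus { IN_PROGRESS, SUCCESS, ERROR }

    // Returns the latest unprocessed files (new files after the last SUCCESS
    // marker, plus any files left in ERROR state) and marks them IN_PROGRESS.
    List<String> getLatestFiles();

    // Called at the end of the workflow to record the outcome of processing.
    void updateFileStatus(List<String> files, FileStatus status);
}
```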
Following are the steps we used to identify the files that had not yet been processed (a sketch of this lookup follows the list):
- Query the database (MySQL) to find the last file with status SUCCESS (query: order by created desc).
- If the first step returns a file, query S3 with the marker set to that last successfully processed file. This returns all the files ingested after it.
- Also query the DB to check if there are any files in ERROR status. These files need to be re-processed, because a previous workflow run did not process them successfully.
- Return the list of files obtained from Steps 2 and 3 (Before returning them, mark their status as IN_PROGRESS).
- After the job completes successfully, update the status of all the processed files to SUCCESS. If there was an error in processing the files, update their status to ERROR (so that they can be picked up for processing next time).
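Putting steps 1 to 4 together, the lookup might be sketched as follows; it reuses the hypothetical table layout and the S3MarkerListing helper from the earlier sketches:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

import com.amazonaws.services.s3.AmazonS3;

// Sketch of the unprocessed-file lookup; table and column names are assumptions.
public class UnprocessedFileLookup {

    public List<String> getLatestFiles(Connection conn, AmazonS3 s3,
                                       String bucket, String prefix) throws Exception {
        // Step 1: last successfully processed file (most recently created SUCCESS row).
        String lastSuccess = null;
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT file_name FROM file_status WHERE status = 'SUCCESS' "
              + "ORDER BY created DESC LIMIT 1");
             ResultSet rs = ps.executeQuery()) {
            if (rs.next()) {
                lastSuccess = rs.getString(1);
            }
        }

        // Step 2: everything S3 lists after that marker (a null marker on the
        // very first run simply lists all files under the prefix).
        Set<String> candidates = new LinkedHashSet<>(
                S3MarkerListing.listUnprocessed(s3, bucket, prefix, lastSuccess));

        // Step 3: files from earlier runs that ended in ERROR and need reprocessing.
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT file_name FROM file_status WHERE status = 'ERROR'");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                candidates.add(rs.getString(1));
            }
        }

        // Step 4: mark everything we are about to hand out as IN_PROGRESS.
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO file_status (file_name, last_modified, status) "
              + "VALUES (?, NOW(), 'IN_PROGRESS') "
              + "ON DUPLICATE KEY UPDATE status = 'IN_PROGRESS', last_modified = NOW()")) {
            for (String file : candidates) {
                ps.setString(1, file);
                ps.addBatch();
            }
            ps.executeBatch();
        }
        return new ArrayList<>(candidates);
    }
}
```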
We used Oozie for workflow management. The Oozie workflow had the following steps (a sketch of the first step appears after the list):
- Step 1: Fetch the next set of files to be processed, mark each of them as IN_PROGRESS and pass them to the next stage
- Step 2: Process the files
- Step 3: Update the status of the processing (SUCCESS or ERROR)
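As a rough sketch, Step 1 could be wired up as an Oozie Java action declared with <capture-output/> so the file list can be handed to the next stage; the property name and the FileManager wiring here are assumptions:

```java
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.List;
import java.util.Properties;

// Hypothetical main class for the "fetch files" Java action.
public class FetchFilesAction {

    public static void main(String[] args) throws Exception {
        FileManager fileManager = createFileManager();      // wiring to MySQL and S3 omitted here
        List<String> files = fileManager.getLatestFiles();  // also marks them IN_PROGRESS

        // With <capture-output/>, Oozie exposes the path of an output properties
        // file through this system property; a later action can read the values
        // via ${wf:actionData('fetch-files')['filesToProcess']}.
        String outputFile = System.getProperty("oozie.action.output.properties");
        Properties props = new Properties();
        props.setProperty("filesToProcess", String.join(",", files));
        try (OutputStream os = new FileOutputStream(outputFile)) {
            props.store(os, null);
        }
    }

    private static FileManager createFileManager() {
        // Placeholder: construct the MySQL/S3-backed implementation here.
        throw new UnsupportedOperationException("wire up a FileManager implementation");
    }
}
```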
De-duplication:
When you implement such a library, there is a possibility of duplicate records (in some corner cases, the same file may be picked up twice for processing). We had implemented de-duplication logic to remove the duplicate records.
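The exact de-duplication keying is not described above; as a minimal sketch, assuming each record carries some unique identifier, duplicates could be dropped like this:

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.function.Function;

// Hypothetical record-level de-duplication: keep the first occurrence of each key.
public class Deduplicator {

    public static <T> Collection<T> dedupe(List<T> records, Function<T, String> keyFn) {
        LinkedHashMap<String, T> unique = new LinkedHashMap<>();
        for (T record : records) {
            unique.putIfAbsent(keyFn.apply(record), record); // later duplicates are ignored
        }
        return unique.values();
    }
}
```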