
My context is:

10 CSV files are uploaded to my server during the night.

My process is:

  • Ingestion:

    • Put the files on HDFS
    • Create ORC Hive tables and load the data into them (a rough shell sketch of this step follows the list)
  • Processing:

    • Spark processing: transformations, cleaning, joins, ...
    • a lot of chained steps (Spark jobs)
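
For reference, the plain "Cron + sh + dfs put" version of the ingestion step could be scripted roughly as below; the paths, columns, and table names are all placeholders:

```sh
#!/bin/sh
# Hypothetical layout -- adjust paths, schema and names to your data.
# 1. Stage the night's CSV files on HDFS.
hdfs dfs -mkdir -p /data/staging/events
hdfs dfs -put /incoming/*.csv /data/staging/events/

# 2. Expose the CSVs to Hive, then copy them into an ORC table.
beeline -u jdbc:hive2://localhost:10000 -e "
CREATE EXTERNAL TABLE IF NOT EXISTS staging_events (
  id INT, name STRING, created STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/staging/events';

CREATE TABLE IF NOT EXISTS events_orc STORED AS ORC
AS SELECT * FROM staging_events WHERE 1 = 0;

INSERT INTO TABLE events_orc SELECT * FROM staging_events;
"
```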

I am looking for best practices to automate the first part and trigger the second part.

  • Cron, sh, hdfs dfs -put
  • Oozie?
  • Apache NiFi?
  • Flume?
  • Talend :(

I also looked at https://kylo.io/. It seems perfect, but I think it is still too young to put into production.

Thanks in advance.

1 Answer

Both Oozie and NiFi will work here, in combination with Flume, Hive and Spark actions.

So your (Oozie or NiFi) workflow should work like this (a plain-shell sketch of the same steps follows the list):

  1. A cron job (or time-based schedule) initiates the workflow.

  2. The first step in the workflow is a Flume process that loads the data into the desired HDFS directories. You can do this without Flume, using just HDFS commands, but Flume will help keep your solution scalable in the future.

  3. A Hive action to create/update the table.

  4. Spark actions to execute your custom Spark programs.
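
If you prototype this with plain cron + shell before moving it into Oozie or NiFi, the driver could look like the sketch below; the paths, JDBC URL, and Spark class names are all hypothetical:

```sh
#!/bin/sh
# ingest_and_process.sh -- hypothetical nightly driver.
# Crontab entry (step 1), e.g. every night at 02:00:
#   0 2 * * * /opt/etl/ingest_and_process.sh >> /var/log/etl/run.log 2>&1
set -e                      # stop the chain on the first failing step
RUN_DATE=$(date +%Y%m%d)

# Step 2: land the files on HDFS (a Flume agent could replace this).
hdfs dfs -mkdir -p /data/staging/$RUN_DATE
hdfs dfs -put /incoming/*.csv /data/staging/$RUN_DATE/

# Step 3: Hive action -- create/refresh the ORC table.
beeline -u jdbc:hive2://localhost:10000 \
        --hivevar run_date=$RUN_DATE -f /opt/etl/load_orc.hql

# Step 4: chained Spark actions, each consuming the previous output.
spark-submit --class com.example.Clean     /opt/etl/jobs.jar $RUN_DATE
spark-submit --class com.example.Transform /opt/etl/jobs.jar $RUN_DATE
spark-submit --class com.example.Join      /opt/etl/jobs.jar $RUN_DATE
```

In Oozie, the same chain becomes workflow actions connected by ok/error transitions; in NiFi, it becomes processors connected by success/failure relationships.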

Make sure you take care of error handling in the workflow, with proper logging and notifications, so that you can operationalize the workflow in production.
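
As a minimal shell-level illustration of that last point, a wrapper could keep a per-run log and send a notification on failure (the script path, log location, and address are placeholders):

```sh
#!/bin/sh
# Hypothetical wrapper: keep a per-run log, notify on failure.
LOG=/var/log/etl/run_$(date +%Y%m%d).log

if ! /opt/etl/ingest_and_process.sh >> "$LOG" 2>&1; then
  mail -s "ETL workflow FAILED $(date +%F)" ops@example.com < "$LOG"
  exit 1
fi
```

Oozie gives you this natively through error transitions, kill nodes and email actions; NiFi through failure relationships and the PutEmail processor.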
