We have been using the Cascading framework to build our ETL jobs.
Cascading gives us:
- Optimized joins
- Jobs that run in parallel
- Checkpoints
- Developers can work in their favorite language (Java, Ruby, Scala, Clojure)
- Unit testing
Now we have two options for converting some jobs from X ETL (which is costly) into Hadoop jobs:
- Cascading workflows
- Talend jobs
My questions are:
- Talend uses Pig, Hive, etc. as components to create a job. Does that give us any performance benefit, or does Talend make any improvements of its own on top of them?
- As far as Talend is concerned, do we need to worry about unit testing (which the Cascading framework provides)?
- If we choose Talend over Cascading for creating jobs (converting X ETL to Hadoop jobs), is that a good option?
Converting X ETL to Cascading workflows will require us to re-create all the components available in X ETL, but that will be a one-time activity. We also need to consider the other features that Talend Studio provides, such as:
a. Data quality. b. Data profiling. c. Data lineage, etc.
- As far as maintainability is concerned, Cascading jobs are pretty well managed. Can anyone give some information on how Talend does here?
Bottom line: I am creating a conversion tool from X ETL to Hadoop jobs, and I need to choose between the Cascading framework and Talend.