
Existing process: raw structured data are copied into a staging layer in Redshift, and an ETL tool such as Informatica or Talend then does the incremental load into the fact and dimension tables of the data mart / data warehouse. All joins happen within the database layer (the ETL tool pushes the queries into the DB).

- Can Spark replace the ETL tool, do the same processing, and load the data into Redshift?
- What are the advantages and disadvantages of this architecture?
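For illustration, here is a minimal sketch of the kind of Spark job being asked about, assuming the spark-redshift connector and hypothetical S3 paths, table and column names; it is only one possible shape of such a job, not a recommendation:

```scala
import org.apache.spark.sql.SparkSession

object StagingToMart {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("staging-to-mart").getOrCreate()

    // Read the raw/staged data (hypothetical S3 locations).
    val orders    = spark.read.parquet("s3a://my-bucket/staging/orders/")
    val customers = spark.read.parquet("s3a://my-bucket/staging/customers/")

    // Do the join in Spark instead of pushing it into Redshift.
    val factOrders = orders
      .join(customers, Seq("customer_id"), "left")
      .select("order_id", "customer_id", "order_ts", "amount", "customer_segment")

    // Load the result into the Redshift fact table via the spark-redshift connector,
    // which stages the rows in S3 and issues a COPY behind the scenes.
    factOrders.write
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://example-cluster:5439/dw?user=etl&password=...")
      .option("dbtable", "mart.fact_orders")
      .option("tempdir", "s3a://my-bucket/redshift-temp/")
      .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-copy")
      .mode("append")
      .save()

    spark.stop()
  }
}
```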

Dipankar
    Watch this video from Spark Summit https://www.youtube.com/watch?v=vZhSbs1xLx4 :) – T. Gawęda Nov 25 '16 at 22:23
    My understanding of Redshift data integration is that it is ELT - extract, load, transform. For this you don't need an ETL tool, just a scheduling tool to run and manage your SQL statements. I don't know what Apache Spark does. To me the only downsides are: 1. You don't have a GUI to see all your nice mappings; 2. Performing external functions (zipping, FTPing, calling web services) is often difficult to do from within a database. – Nick.Mc Nov 26 '16 at 01:07

2 Answers

I have worked extensively on projects migrating existing ETL jobs to Spark over the last 4 years.

The problems with the ETL jobs were as follows:

  1. They didn't give us a strict SLA. The jobs shared the same resource pool, so prioritizing was hard; everyone marked their jobs as business critical.

  2. Another important problem was cost: the ETL-based jobs were expensive because we were paying the ETL vendor.

  3. Scale was another important issue. We required ETL at a gigantic scale, which we found too expensive.

Thus, we migrated all the ETLs to Spark jobs. With Spark and Hadoop both being open source, we had no additional cost beyond compute.

Spark's SQL support has improved dramatically over time. You can run ML/graph queries and normal ETL on the same DataFrame. Spark joins are fast and can be optimized for different datasets, and you get more fine-grained control over your transformations and joins.
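As a small illustration of that control (with hypothetical paths and column names), a broadcast hint forces a map-side join for a small dimension, and the same DataFrame can also be queried with plain SQL:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("etl-joins").getOrCreate()

// Hypothetical inputs: a large fact-like dataset and a small dimension.
val events  = spark.read.parquet("s3a://my-bucket/staging/events/")
val dimUser = spark.read.parquet("s3a://my-bucket/dims/users/")

// Broadcast the small dimension so the join avoids a full shuffle.
val enriched = events.join(broadcast(dimUser), Seq("user_id"), "left")

// The same DataFrame can also be used from SQL.
enriched.createOrReplaceTempView("enriched_events")
val daily = spark.sql(
  """SELECT event_date, country, COUNT(*) AS events
    |FROM enriched_events
    |GROUP BY event_date, country""".stripMargin)
```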

We started with a long-running cluster that supported Spark and other big-data tools. We unified the platform so that all customers could use it, and slowly migrated all the ETL jobs to Spark jobs.

We do use Redshift for reporting, but all the heavy lifting (finding insights in the data, doing the joins, managing incoming data and merging it with the existing snapshot) is done in Spark, as sketched below.
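A minimal sketch of that incremental merge, assuming hypothetical paths, a customer_id business key, and an updated_at column for picking the latest version of each record:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val spark = SparkSession.builder().appName("snapshot-merge").getOrCreate()

// Existing snapshot plus the newly arrived incremental batch (hypothetical paths).
val snapshot = spark.read.parquet("s3a://my-bucket/snapshots/customers/current/")
val incoming = spark.read.parquet("s3a://my-bucket/incoming/customers/2016-11-26/")

// Union the two sets and keep only the most recent row per business key.
val latestPerKey = Window.partitionBy("customer_id").orderBy(col("updated_at").desc)
val merged = snapshot.union(incoming)
  .withColumn("rn", row_number().over(latestPerKey))
  .filter(col("rn") === 1)
  .drop("rn")

// Write the new snapshot; only the reporting-ready results are later loaded into Redshift.
merged.write.mode("overwrite").parquet("s3a://my-bucket/snapshots/customers/2016-11-26/")
```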

We were able to save millions of dollars by moving away from existing ETL jobs and migrating them to Spark.

My two pennies on this: Spark, Hive and the rest of the Hadoop/big-data stack will eventually outrun traditional ETL tools. I am not saying ETL tools will be eviscerated, but the open-source solutions will definitely become the dominant force in this domain.

Avishek Bhattacharya

May I know the reason for replacing Informatica with Spark? The Informatica BDM 10.1 edition comes with a Spark execution engine, which converts Informatica mappings into their Spark equivalent (Scala code) and executes it on the cluster. Also, in my opinion, Spark is more suitable for data that is not changed at intermediate steps, whereas in ETL the data changes from transformation to transformation.

Volamr
    a) Informatica is expensive. b) With a huge volume of data (big data), moving the data from storage to a processing server (Informatica) is expensive in terms of processing time. c) Informatica may use a Spark engine to solve that problem and process data where it resides, but then you are paying Informatica just for the UI and rapid development environment (consider Apache NiFi). d) Spark can do more than ETL for you; it is also useful if you have use cases for machine learning, real-time streaming, etc. I would love to hear what the drawbacks of using Spark as a replacement for Informatica are. – Dipankar Nov 29 '16 at 15:10