
I am gathering comparisons of different ETL tools (Informatica, DataStage, Ab Initio) with respect to their usability and performance. I have worked on Informatica and Ab Initio, and with help from the web I was able to find the key factors and differences between those two. However, I am unable to find anything useful on DataStage vs. Ab Initio. What I have found so far is below:

1. DataStage supports one type of parallelism, whereas Ab Initio supports three (data, component, pipeline); a sketch of the three styles follows after this list.

2. Debugging is a lot easier in Ab Initio, as it has an error port on all components.

3. Ab Initio handles massive volumes better than DataStage.
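
To make sure I am comparing like for like, here is my rough, tool-agnostic Python sketch of what I understand the three parallelism styles to mean; the data and transform are made up purely for illustration:

    from concurrent.futures import ThreadPoolExecutor
    from threading import Thread
    from queue import Queue

    records = list(range(100))          # made-up input data

    def transform(r):                   # made-up transform
        return r * 2

    # 1) Data parallelism: the same transform runs on partitions of the data.
    with ThreadPoolExecutor(max_workers=4) as ex:   # 4 "partitions"
        data_parallel = list(ex.map(transform, records))

    # 2) Pipeline parallelism: a downstream stage consumes rows while the
    #    upstream stage is still producing them.
    q = Queue()

    def producer():
        for r in records:
            q.put(r)
        q.put(None)                     # end-of-stream marker

    def consumer(out):
        while (r := q.get()) is not None:
            out.append(transform(r))

    out = []
    stages = [Thread(target=producer), Thread(target=consumer, args=(out,))]
    for s in stages: s.start()
    for s in stages: s.join()

    # 3) Component parallelism: independent branches of the graph run at once.
    evens, odds = [], []
    branches = [
        Thread(target=lambda: evens.extend(r for r in records if r % 2 == 0)),
        Thread(target=lambda: odds.extend(r for r in records if r % 2 == 1)),
    ]
    for b in branches: b.start()
    for b in branches: b.join()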

Can anyone help me gather more of the differences: architecture-wise, performance-wise, or anything else?

KeenLearner

1 Answer


I don't know anything about Ab Initio, so I can only comment on your points.

1) Technically, on paper, DataStage supports two types of data streaming: data pipelining (think server jobs) and parallel partitioning (parallel jobs), with repartitioning abilities and more, whereby you can mix the two concepts in a single job. But there is FAR more capability than that in the hands of a developer who knows what they're doing. Component parallelism can easily be achieved/replicated in DataStage (making assumptions on how Ab Initio implements it in the background) via sequence jobs, which orchestrate server/parallel jobs. With a sequence job you can run multiple parallel/server jobs at the same time, each processing its own data stream. You can even run a single parallel job design as many concurrent instances (a multi-instance job), with each instance running against different metadata.

So if you use parallel jobs like that, not only do you achieve component parallelism, but each component runs as its own parallel, partitioned stream. This allows for massive data processing capability; see point three.
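
As a minimal sketch of what I mean, here is one way to launch several instances of a single multi-instance job from the dsjob command line, wrapped in Python (this assumes the dsjob client is on your PATH; the project, job, parameter, and invocation names are hypothetical):

    import subprocess

    PROJECT = "MyProject"            # hypothetical project name
    JOB = "LoadCustomers"            # hypothetical multi-instance job name

    # Start one instance of the job per region; "job.invocationid" is the
    # dsjob syntax for addressing an instance of a multi-instance job.
    procs = []
    for region in ("EMEA", "APAC", "AMER"):
        cmd = [
            "dsjob", "-run",
            "-mode", "NORMAL",
            "-param", f"pRegion={region}",   # hypothetical job parameter
            "-wait", "-jobstatus",
            PROJECT, f"{JOB}.{region}",
        ]
        procs.append(subprocess.Popen(cmd))

    # Wait for all instances; with -jobstatus, dsjob's exit code reflects
    # each job's finishing status.
    for p in procs:
        p.wait()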

2) There are several tools and methods on hand for debugging a job:

- the job run logs, which detail each stage's log output within a job;
- the Peek stage, which writes data samples into that same log, so you can see actual data values right in the log;
- an IDE-style breakpoint debugger: set a breakpoint in the job, the run stops there, and you can inspect the record then and there;
- developer best practices, such as building the job up in phases, bigger and bigger, to lower the chance of hard-to-solve bugs.

You can also disable the rolling up of log messages into summaries, so you get verbose logs per stage in a job's log; each stage outputs its own log entries.
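
As a small illustration, here is how you might pull those logs from the command line with dsjob (again wrapped in Python; the project and job names are hypothetical):

    import subprocess

    PROJECT, JOB = "MyProject", "LoadCustomers"   # hypothetical names

    # Summary of the log events for the job's most recent run.
    subprocess.run(["dsjob", "-logsum", PROJECT, JOB], check=True)

    # Full detail for a single log event (event id 0, purely as an example).
    subprocess.run(["dsjob", "-logdetail", PROJECT, JOB, "0"], check=True)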

3) Information Server (specifically DataStage) has already moved into the big data and cloud data processing space (greatly so from version 11.7.0.1, released just yesterday). It's capable of handling incredibly massive volumes of data (structured and unstructured), on premise and in the cloud. Whether your data sits in a traditional database, an unstructured source, Amazon S3, or Hive (on Hadoop), it can be crunched hard in massively parallel streams. Couple its range of connectivity and parallel processing capabilities with the engine's linear scalability, and you can configure Information Server (i.e. DataStage) to run as a grid, allowing for truly great volume processing power. I'm not sure Ab Initio is capable of that.
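
To give a flavour of that scalability, here is a minimal sketch of a two-node parallel engine configuration file, the mechanism that spreads a job's partitions across machines; the same file format scales up to grid setups. The hostnames and paths below are hypothetical:

    # The parallel engine reads this file via the APT_CONFIG_FILE
    # environment variable; hostnames and paths here are hypothetical.
    config = """\
    {
        node "node1"
        {
            fastname "etlhost1"
            pools ""
            resource disk "/ibm/ds/data1" {pools ""}
            resource scratchdisk "/ibm/ds/scratch1" {pools ""}
        }
        node "node2"
        {
            fastname "etlhost2"
            pools ""
            resource disk "/ibm/ds/data2" {pools ""}
            resource scratchdisk "/ibm/ds/scratch2" {pools ""}
        }
    }
    """

    with open("twonode.apt", "w") as f:
        f.write(config)
    # Point APT_CONFIG_FILE at twonode.apt and the job's stages run with
    # two partitions, one per node.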

On a sort of side note, if I may: I feel folks make a mistake when they look at DataStage as a standalone tool to be compared with the likes of Ab Initio or Informatica. DataStage is but one component of a tool suite, IBM Information Server (which holds many tools). When you look at it that way, there is nothing out there that compares, I think.

Nothing else manages metadata and data lineage and shares them between so many tools, roles, and functions in a business to build a holistic picture for the business. For example, if you want to check jobs in or out, you use a separate tool that comes included with DataStage called Information Server Manager (for inter-environment deployment, package deployment, version control, and so on), which integrates with standalone version control systems. And if you use the (VERY) new Flow Designer (the web-based version of DataStage), you can actually commit to a Git repository.

That's not even touching on how, with the functions of other tools in the suite, you can expose DataStage jobs as web services, or set up real-time processing using DataStage and the Data Replication tool.

These are just some examples (of a great number) of why it is beneficial to look at Information Server as a whole for ETL, rather than just DataStage.

  • Thank you @sam, very informative; much of the detail is new to me :) – KeenLearner Jun 15 '18 at 19:55
  • You're totally welcome. I hope some of the concepts and terms made sense. I'd be happy to elaborate more if you'd like. –  Jun 15 '18 at 20:09