10

I have worked with Teradata. I have never touched Hadoop, but since yesterday I have been doing some research on it. From their descriptions, the two seem quite interchangeable, but some papers say they serve different purposes. Everything I have found is vague, and I am confused.

Does anybody have experience with both of them? What are the significant differences between them?

Simple example: I want to build an ETL process that will transform billions of rows of raw data and organize them into a DWH, and then run some resource-expensive analysis on them. Why use TD? Why Hadoop? Or why not?

Brian Tompsett - 汤莱恩
John
  • Is your raw data structured or unstructured? What is the arrival rate of this data? Can you explain what expensive analysis means? What is your service level expectation for this analysis to be completed within? Does your company have an existing base of individuals with skills in SQL, R, SAS, and/or predictive modeling? There are significant differences between the two. It boils down to understanding whether your business problem can be solved by the traditional RDBMS paradigm from data modeling -> ETL -> Analytics with SQL or if you need something more that MapReduce can provide. – Rob Paller Feb 01 '13 at 14:51
  • The raw data is structured. It arrives as a couple of big chunks every day. Expensive analysis: CPU-expensive, with some query-expensive prearrangement of the data (an ETL of structured data into abstract data for algorithms, so to speak), but these analyses will run externally in specific applications, so that is not relevant. The essence of my question is this: Teradata is really expensive. Can I substitute Hadoop for Teradata in industries like banking and get the same performance, without serious risks (additional implementation costs, or even some unpredictable failure, etc.)? – John Feb 01 '13 at 16:21
  • The two work together. There are areas where Teradata is recommended and areas where Hadoop is. Teradata is now moving to a [Unified Architecture](http://www.teradata.com/newsrelease.aspx?id=20511) so that Hadoop and Teradata can be integrated and complement each other. – Raniendu Singh Feb 13 '13 at 09:57

4 Answers

9

I think the article titled 'MapReduce and Parallel DBMSs: Friends or Foes' does quite a good job of describing the situations where each technology works best. In a nutshell, Hadoop is excellent for storing unstructured data and running parallel transformations to 'sanitize' incoming data, whereas DBMSs excel at executing complex queries quickly.
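
A minimal sketch of that division of labour, in plain Python with neither Hadoop nor Teradata involved. A per-record cleaning function is applied in parallel (the kind of independent, record-at-a-time transformation Hadoop handles well), and the cleaned rows are then loaded into an in-memory SQLite database that stands in for the DBMS and answers a join-plus-aggregate query. The record layout, the `sanitize` function, and the table names are all hypothetical and chosen only for illustration.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical raw input: messy "account_id;amount" strings.
RAW_LINES = [
    " a1 ; 10.50 ",
    "A2;7",
    "garbage line",
    "a1;4.50",
    "a3;not-a-number",
]


def sanitize(line):
    """Clean one raw record; return (account_id, amount) or None if unusable."""
    parts = [p.strip() for p in line.split(";")]
    if len(parts) != 2:
        return None
    try:
        return parts[0].lower(), float(parts[1])
    except ValueError:
        return None


def sanitize_all(lines):
    # Each record is independent, so the work can be spread over workers.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return [row for row in pool.map(sanitize, lines) if row is not None]


def totals_by_segment(rows):
    # SQLite stands in for the DBMS side: join and aggregate in one query.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE txn (account_id TEXT, amount REAL)")
    conn.execute(
        "CREATE TABLE account (account_id TEXT PRIMARY KEY, segment TEXT)"
    )
    conn.executemany("INSERT INTO txn VALUES (?, ?)", rows)
    conn.executemany(
        "INSERT INTO account VALUES (?, ?)",
        [("a1", "retail"), ("a2", "corporate")],
    )
    result = conn.execute(
        """
        SELECT a.segment, SUM(t.amount)
        FROM txn AS t JOIN account AS a ON a.account_id = t.account_id
        GROUP BY a.segment
        ORDER BY a.segment
        """
    ).fetchall()
    conn.close()
    return result


clean_rows = sanitize_all(RAW_LINES)
segment_totals = totals_by_segment(clean_rows)
```

The cleaning step needs no schema and tolerates bad input, while the query step relies on a schema and lets the engine decide how to execute the join.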

ryanbwork
4

Hadoop, Hadoop with Extensions, RDBMS Feature/Property Comparison

I am not an expert in this area, but the Coursera course Introduction to Data Science has a lecture titled "Comparing MapReduce and Databases", as well as a lecture on parallel databases within the MapReduce section of the course.

Here is a summary from these lectures of the comparison of MapReduce vs. RDBMS (not necessarily parallel RDBMS). One point to remember is that the comparison is different if you include extensions to Hadoop like Pig, Hive, etc. I will put in parentheses the MapReduce extensions that add some of this functionality.

Some functionality/properties that RDBMSs have but native MapReduce does not:

  • Declarative query languages (Pig, Hive)
  • Schemas (Hive, Pig, DryadLINQ, Hadapt)
  • Logical Data Independence
  • Indexing (HBase)
  • Algebraic Optimization (Pig, Dryad, Hive)
  • Caching/Materialized Views
  • ACID/Transactions

Properties that MapReduce has (relative to a regular RDBMS, not necessarily a parallel RDBMS):

  • High Scalability
  • Fault-tolerance
  • “One-person deployment”
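
To make the "declarative query language" point concrete, here is a small sketch that computes the same per-key totals twice: once with hand-written map, shuffle, and reduce steps, and once with a single SQL statement. It runs in plain Python with SQLite standing in for the RDBMS. Nothing here uses Hadoop itself, and the `RECORDS` data and all function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical (branch, amount) records.
RECORDS = [("north", 5), ("south", 3), ("north", 7), ("east", 1)]


def map_phase(records):
    # Map: emit one (key, value) pair per input record.
    for branch, amount in records:
        yield branch, amount


def shuffle_phase(pairs):
    # Shuffle: group all values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reduce: collapse each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}


def totals_mapreduce(records):
    return reduce_phase(shuffle_phase(map_phase(records)))


def totals_sql(records):
    # Declarative version: say what is wanted, not how to compute it.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (branch TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT branch, SUM(amount) FROM sales GROUP BY branch"
    ).fetchall()
    conn.close()
    return dict(rows)


mr_totals = totals_mapreduce(RECORDS)
sql_totals = totals_sql(RECORDS)
```

In the first version the programmer spells out every step; in the second the engine is free to choose the execution plan, which is what makes algebraic optimization and indexing possible.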
Yaniv
3

I've been asked this question several times. The answer I usually give is a car analogy (which is pretty silly because I'm not a car person, but it seems to work):

  • Teradata is the car/dbms for the masses - it is reliable, mature, works well and is there when you need it. It is difficult (compared to Hadoop) to customise and add functionality to the base product.
  • Hadoop is the car/dbms for the enthusiast - it isn't as reliable or mature, it works well so long as you attend to it. It is easy (compared to Teradata) to customise and add functionality to the base product.

Put another way, Teradata is the reliable workhorse where you put your mission-critical processes (operational reporting, enterprise reporting, decision support, etc.). Hadoop is the place where you can do a lot of this stuff, but don't be surprised if you come in one morning and find that your regulatory reports can't be produced because someone applied a patch or you've suddenly got a "too many small files" problem.
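
As a rough illustration of why "too many small files" hurts: the HDFS NameNode keeps an in-memory object for every file and every block, and a commonly quoted rule of thumb is about 150 bytes per object. The sketch below takes that figure and a 128 MiB block size as assumptions (real numbers depend on the Hadoop version and configuration) and compares storing 1 TiB as 100 KiB files with storing it as 1 GiB files. The function name and constants are hypothetical.

```python
import math

BYTES_PER_OBJECT = 150        # assumption: rule-of-thumb heap per file/block object
BLOCK_SIZE = 128 * 1024 ** 2  # assumption: 128 MiB HDFS block size


def namenode_bytes(num_files, file_size):
    """Estimate NameNode heap for num_files files of file_size bytes each."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    # One object for the file itself plus one per block it occupies.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


TOTAL = 1024 ** 4  # 1 TiB of data in both cases

small_file_size = 100 * 1024  # 100 KiB files
large_file_size = 1024 ** 3   # 1 GiB files

small = namenode_bytes(TOTAL // small_file_size, small_file_size)
large = namenode_bytes(TOTAL // large_file_size, large_file_size)
```

Under these assumptions the small-file layout needs roughly 3 GB of NameNode heap, while the large-file layout needs about 1.4 MB for the same amount of data.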

To loop back to the analogy: if you don't want to be too techy and the manufacturer's product (DBMS and/or car) works for you out of the box, Teradata is a good option. On the other hand, if you like to tinker under the hood, swap out the carburettor (or whatever), adjust the gear ratios, tweak the fuel-air mixture depending on whether you are country or city driving, bolt on a turbocharger, and/or your family complain about how long you spend in the garage on weekends, Hadoop is the place for you.

IMHO, most, if not all, organisations need both. I hope this helps :-)

GMc
1

To begin with, vanilla Apache Hadoop is 100% open source. If you need commercial support along with consultancy, there are companies like Cloudera, MapR, Hortonworks, etc.

Hadoop is backed by a growing community that fixes bugs and makes improvements on a consistent basis. Hadoop's storage model, HDFS, is based on Google's GFS architecture, which is proven to handle large quantities of data. Furthermore, Hadoop's analysis model, MapReduce, is based on Google's MapReduce model.

Hadoop is used by tech giants like Facebook, Yahoo, Twitter, eBay, etc. to store and analyze their high volumes of data, in real time as well as passively.

For your question about ETL systems, read these slides.

OK, now why Hadoop?

  1. Open Source
  2. Proven Storage and Analysis model for Large Quantities of data
  3. Minimal hardware requirements to set up and run.

OK, now why TD?

  1. Commercial Support
shazin
  • OK, now, in your good answer, I am only missing "OK, now why Teradata?" – John Jan 31 '13 at 10:18
  • Two nitpicks: there is a lot of commercial support for Hadoop as well, and Hadoop MapReduce is for offline batch analytics, not for real-time queries. – Thomas Jungblut Jan 31 '13 at 10:33
  • Yes, I did mention the commercial support Hadoop has, and by real-time queries I meant using HBase on top of Hadoop HDFS, not the MapReduce model on top of HDFS. – shazin Jan 31 '13 at 11:55