
I am going to deal with a huge amount of data in my project. I have read about Big Data concepts but never used them yet. Even after reading all those Big Data documents, I am still not sure whether my requirements need Big Data or whether they are better handled with a traditional relational database.

Here is some information about my DB.

My main DB is a repository for different data sources. Each of these data sources deals with the same kind of data (data in the same domain), but some sources contain extra fields that are not available in others, and some contain fewer. In other words, some of the fields in these data sources are the same, but some are different. So my core DB should contain all those fields. The total number of fields in my core DB would be approximately 2000, and it may contain 10 to 20 million records.

The DB operations happening on my core DB will be data insertion and reading (searching). Since it deals with a huge amount of data, I was thinking of using Big Data concepts. But I am still not sure whether this suits Big Data, because some of my data has similar characteristics (same fields) and some contains extra information. And I need all kinds of searching to be fast in my DB. Thanks.
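To make the modeling question concrete, here is a minimal sketch (hypothetical table and field names, using SQLite purely for illustration) of one common way to handle sources that share some fields but not others: the shared fields become real columns, and source-specific extras go into a JSON text column, so no single table needs all ~2000 columns.

```python
# Sketch: hybrid schema for heterogeneous sources (hypothetical names).
# Shared fields are real columns; source-specific extras are stored as a
# JSON blob, so each source only fills in the fields it actually has.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE core_records (
        id      INTEGER PRIMARY KEY,
        source  TEXT NOT NULL,   -- which feed the record came from
        name    TEXT,            -- field shared by all sources
        amount  REAL,            -- field shared by all sources
        extras  TEXT             -- JSON blob of source-specific fields
    )
""")

# Source A has an extra 'region' field; source B has 'priority' instead.
rows = [
    ("A", "widget", 9.5, json.dumps({"region": "EU"})),
    ("B", "gadget", 3.2, json.dumps({"priority": 1})),
]
conn.executemany(
    "INSERT INTO core_records (source, name, amount, extras) VALUES (?, ?, ?, ?)",
    rows,
)

# Shared fields are queried relationally; extras are decoded per record.
for source, name, extras in conn.execute(
    "SELECT source, name, extras FROM core_records WHERE amount > 5"
):
    print(source, name, json.loads(extras))   # prints: A widget {'region': 'EU'}
```

The same idea carries over to MySQL (a JSON column type) or to a document store, which is why the column-count concern by itself does not force a NoSQL choice.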

  • You still have to answer: what is more important, writing into the DB or reading/searching it? How will data from your core DB flow to target audiences/solutions/tools, and what are those targets? Is it transactional and live? Data size? – Murali Mopuru Mar 18 '15 at 10:25
  • "What is more important, writing into the DB or reading/searching it?" --> I didn't get what you mean here. "How will data from your core DB flow to target audiences/solutions/tools?" --> I have to create lots of applications centered around this core DB. Yes, it should be real-time data. Data size is 10 to 20 million records. – Dev Mar 19 '15 at 08:48
  • 1
    This is NOT a HUGE amount of data. But if you think that you need a table with 2000 attributes then you really need to rethink your design, regardless what platform you implement it on. – symcbean Mar 19 '15 at 10:21
  • Yes, it's not like all 2000 fields are in one table. But my concern is: should I go for NoSQL? There will be applications created based on analysis of this data. Is a relational DB good for that, or do I need to go with NoSQL? – Dev Mar 23 '15 at 03:31
  • 10-20 million rows? That is not BIG Data. That is tiny. I have worked on MySQL tables of 750 000 000 rows and 1 TB in size and performance was good. – Namphibian Mar 23 '15 at 21:36
  • For more advice, please provide some examples of the type of searches you need to perform and partition tolerance you need, e.g. single-datacenter, multi-datacenter, etc. The specific use cases will drive the best selection. – Grokify Mar 25 '15 at 16:39

1 Answer


Relational databases like MySQL can handle billions of rows / records so the decision will depend on your use case(s). For Big Data NoSQL systems, it is very important to understand how the strengths and limitations of each system map to your use case(s) as they can behave very differently.

For example, one team moved from MySQL to Redis because they needed to store the equivalent of 359 billion rows, far more than the 950 million rows they had been storing in MySQL.

Given that you say you have fast-searching requirements, it is important to understand what kinds of searches you need, as different databases support different kinds of searches, and some supported searches may have limited functionality. If you have search requirements that go beyond the core data store's functionality, a full-text search solution is often added alongside it, for example using Cassandra for the data store and Elasticsearch for the search component.
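To illustrate the division of labor between a data store and a search component, here is a toy Python sketch (not any real product's API): the primary store answers lookups by id, while a separate inverted index answers term searches. This is, in miniature, the split that a pairing like Cassandra plus Elasticsearch formalizes at scale.

```python
# Illustrative only: a toy inverted index next to a primary store,
# showing why a dedicated search component is often added.
from collections import defaultdict

store = {                      # primary store: id -> record text
    1: "fast insertion of sensor data",
    2: "searching across heterogeneous fields",
    3: "fast full text searching",
}

index = defaultdict(set)       # search component: term -> ids
for doc_id, text in store.items():
    for term in text.split():
        index[term].add(doc_id)

def search(term):
    """Return the ids of records whose text contains the term."""
    return sorted(index.get(term, set()))

print(search("fast"))          # -> [1, 3]
print(search("searching"))     # -> [2, 3]
```

A real search engine adds tokenization, ranking, and distribution on top of this idea, but the architectural point is the same: writes go to the store and are mirrored into the index, and term queries never scan the store itself.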

To provide some background for this decision, it's useful to consider your requirements with respect to the CAP theorem, which states that a distributed computer system can provide some, but not all, of the following guarantees (from Wikipedia):

  • Consistency (all nodes see the same data at the same time)
  • Availability (a guarantee that every request receives a response about whether it succeeded or failed)
  • Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

http://en.wikipedia.org/wiki/CAP_theorem

Graphically, you can see how different database solutions including MySQL and NoSQL solutions map out here:

(Image: a diagram mapping database systems, including MySQL and various NoSQL solutions, onto the three corners of the CAP theorem.)

If you provide more information on your use case(s), you can get more detailed responses.

Grokify