
Which NoSQL database or other technology should I use for my scenario? We are looking to replace our OLAP cubes, built on SQL Server Analysis Services, with an open source technology because the data is getting too large to manage and queries take too long to return. We have followed every rule in the book, sharding the data and optimizing the cube design with aggregations and partitions, and still some of our distinct count queries take 1-2 minutes :( Our fact table is roughly 250 GB, with 10-12 dimensions connected to it in a star schema.

So we decided to try open source technologies such as Hadoop, HBase, or NoSQL databases to see whether they can handle our OLAP scenarios with minimal setup and onboarding.

Our main requirements for the new technology are:

  1. It must return distinct count queries near-instantly (< 2 seconds).

  2. It must support the concept of measures and dimensions (as in OLAP).

  3. It must support a SQL-like query language, as many of our developers are SQL experts.

  4. It must be possible to connect Excel/Tableau to visualize the data.

With so many new technologies and tools in the open source world today, I am hoping you can point me in the right direction.
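For readers unfamiliar with the query shape in question, here is a minimal sketch of a distinct count over a star schema, using Python's built-in `sqlite3` on a toy dataset. The table and column names (`fact_visits`, `dim_region`, `user_id`, `amount`) are hypothetical and not taken from the actual cube. It also illustrates why distinct counts are harder to pre-aggregate than sums: per-region distinct counts do not add up to the overall distinct count.

```python
import sqlite3

# Toy star schema: one fact table and one dimension (all names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, region_name TEXT);
    CREATE TABLE fact_visits (user_id INTEGER, region_id INTEGER, amount REAL);
    INSERT INTO dim_region VALUES (1, 'EU'), (2, 'US');
    INSERT INTO fact_visits VALUES
        (101, 1, 5.0), (101, 1, 7.5), (102, 1, 3.0),
        (101, 2, 4.0), (103, 2, 9.0);
""")

# One additive measure (SUM) and one non-additive measure (COUNT DISTINCT),
# sliced by the region dimension.
rows = conn.execute("""
    SELECT d.region_name,
           COUNT(DISTINCT f.user_id) AS unique_users,
           SUM(f.amount)             AS total_amount
    FROM fact_visits AS f
    JOIN dim_region AS d ON d.region_id = f.region_id
    GROUP BY d.region_name
    ORDER BY d.region_name
""").fetchall()
per_region = {name: (users, total) for name, users, total in rows}

# The overall distinct count cannot be derived by summing the per-region counts,
# because user 101 appears in both regions.
overall_users = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM fact_visits"
).fetchone()[0]
summed_users = sum(users for users, _ in per_region.values())

print(per_region, overall_users, summed_users)
```

Because of this non-additivity, an engine has to either scan the underlying rows or keep a mergeable summary per partition, which is one reason these queries tend to be the slow ones.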

user330612
  • I'm far from an expert in NoSQL technologies, but as far as I know, the point of a NoSQL database is not to analyze data but only to store and retrieve it easily. Analyzing the data requires a data-processing engine such as Apache Spark. Big data processing is in fact more a constant succession of long batch jobs (minutes to hours) than real-time analysis with quick queries. As for pure performance on distinct count queries, and assuming 250 GB of RAM is an option, MongoDB can be used as a pure in-memory database. – GaelFG Jan 27 '15 at 08:46
  • @GaelFG There are also NoSQL technologies that focus on data analysis, like Hadoop+HBase or Neo4j. That's the problem with the term NoSQL: it's such a wide field that any generalization is a gross oversimplification. The only statement you can make about NoSQL in general is "technologies which store data without using SQL". – Philipp Jan 27 '15 at 09:04
  • Does MongoDB support SQL queries? Are there visualization tools like Tableau that can connect to MongoDB instances or clusters out of the box, without writing much code such as a driver? What if we don't have machines with 250 GB of RAM? Does it support the concept of dimensions and measures, allowing data to be sliced across multiple dimensions? – user330612 Jan 27 '15 at 15:59

2 Answers


Note: I'm from the Apache Kylin team.

Please see the answers below, which may give you some ideas:

It has to get blazing fast or instantaneous results for distinct count queries (< 2 secs).

--Luke: Our current statistics show a 90th-percentile query latency of less than 5 s. For < 2 s on distinct count, how much data will you have? Is an approximate result OK?
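To make the "approximate result" trade-off concrete, here is a minimal sketch of HyperLogLog, one common technique for approximate distinct counting. It illustrates the general idea only and does not describe how any particular engine implements it; the function name `hll_estimate` and the default precision `p=12` are choices made for this example.

```python
import hashlib
import math


def hll_estimate(values, p=12):
    """Estimate the number of distinct values using 2**p small registers."""
    m = 1 << p
    registers = [0] * m
    for v in values:
        # Derive a 64-bit hash from the value's string form.
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                   # first p bits select a register
        rest = h & ((1 << (64 - p)) - 1)      # remaining 64-p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)          # bias correction, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # For small cardinalities, fall back to linear counting on empty registers.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw


true_count = 50_000
estimate = hll_estimate(range(true_count))
print(f"true={true_count} estimate={estimate:.0f}")
```

With 4096 registers the typical relative error is around 1.04 / sqrt(4096), roughly 1.6%. Because each register only keeps a maximum, two sketches can be merged by taking the register-wise maximum, which is what makes this kind of distinct count pre-aggregatable across partitions.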

Supports the concept of measures and dimensions (like in OLAP).

--Luke: Kylin is a pure OLAP engine with dimension (hierarchies are also supported) and measure (Sum/Count/Min/Max/Avg/DistinctCount) definitions.

Support SQL like query language as many of our developers are SQL experts.

--Luke: Kylin supports an ANSI SQL interface (most SELECT functions).

Ability to connect Excel/Tableau to visualize the data.

--Luke: Kylin has an ODBC driver that works very well with Tableau; Excel/PowerBI support is coming soon.

Please let us know if you have more questions.

Thanks.

LukeHan

Looks like Kylin (http://www.kylin.io/) is my answer. It meets all my requirements and more. I'm going to give it a try now! :)

user330612