
I am doing some research into distributed databases. I have concluded that the hardest part is distributed queries over multiple records, where the query is quite generic. For example, imagine a transaction table with a "DateTransaction" field. This query (pseudo-SQL) would be quite tricky:

SELECT *
FROM Transactions
WHERE DateTransaction >= '2016-01-01'
  AND DateTransaction <= '2017-09-01'  -- from 1 Jan 2016 through 1 Sep 2017 inclusive
ORDER BY DateTransaction DESC
OFFSET 0 ROWS
FETCH NEXT 100 ROWS ONLY  -- take records 0 through 99 inclusive

Hand-rolling a somewhat efficient solution for a query like this, where the data is split over N nodes, isn't difficult. It would be a bit more difficult with a large OFFSET, but pagination would likely be done by adjusting the date range anyway.
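
For concreteness, the hand-rolled scatter-gather would look roughly like the sketch below. It is only an illustration and assumes each node's data is visible to a coordinator as a table named transactions_node1, transactions_node2, etc. (for example via PostgreSQL foreign tables); each node filters, sorts and limits locally, and the coordinator merges the partial results:

-- Hypothetical coordinator-side query; transactions_node1..3 are assumed
-- to be foreign/remote tables, one per node.
SELECT *
FROM (
    -- Each branch is answered by one node: filter on the date range,
    -- sort, and return at most OFFSET + LIMIT (= 100) rows.
    (SELECT * FROM transactions_node1
     WHERE DateTransaction BETWEEN '2016-01-01' AND '2017-09-01'
     ORDER BY DateTransaction DESC LIMIT 100)
    UNION ALL
    (SELECT * FROM transactions_node2
     WHERE DateTransaction BETWEEN '2016-01-01' AND '2017-09-01'
     ORDER BY DateTransaction DESC LIMIT 100)
    UNION ALL
    (SELECT * FROM transactions_node3
     WHERE DateTransaction BETWEEN '2016-01-01' AND '2017-09-01'
     ORDER BY DateTransaction DESC LIMIT 100)
) AS merged
ORDER BY DateTransaction DESC  -- final merge sort on the coordinator
LIMIT 100;                     -- keep only the overall top 100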

But anyway, I want to avoid hand-rolling anything! I am looking for a third-party solution (ideally free) that would let me efficiently run queries like the one in that example over several nodes/machines. Atomicity isn't necessary. Does anyone know where to get started?

Someone

2 Answers


What you need is a NewSQL database. There are many free, open-source NewSQL databases that can satisfy your requirements, such as TiDB and CockroachDB.

If you prefer PostgreSQL syntax or your business logic already uses PostgreSQL, you can try CockroachDB. Otherwise, I suggest TiDB. TiDB supports distributed transactions and distributed joins, is compatible with MySQL syntax, and has many successful use cases.
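
For example, the query from the question would look roughly like this on either database (table and column names are taken from the question; the Id and Amount columns and the index name are made up for illustration). Both TiDB and CockroachDB speak ordinary SQL, so the main work is adding a secondary index on DateTransaction and using LIMIT/OFFSET:

-- Illustrative schema; both TiDB and CockroachDB accept an inline
-- secondary index definition like this.
CREATE TABLE Transactions (
    Id              BIGINT PRIMARY KEY,
    DateTransaction DATE NOT NULL,
    Amount          DECIMAL(12, 2),
    INDEX idx_date (DateTransaction)  -- makes the date range scan cheap
);

SELECT *
FROM Transactions
WHERE DateTransaction >= '2016-01-01'
  AND DateTransaction <= '2017-09-01'
ORDER BY DateTransaction DESC
LIMIT 100 OFFSET 0;  -- the database distributes the work across nodes itself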

Disclaimer: I work at PingCAP which develops TiDB.

Lilian Lee

One idea is to partition your dataset by that date range, so that a given query only has to touch a few servers. You might want to check how table partitioning works in different databases, and plan to partition your data accordingly.
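
A minimal sketch of what that could look like with MySQL-style range partitioning (table and column names from the question; the yearly split and the extra columns are just an illustration). A query bounded on DateTransaction then only scans the matching partitions:

-- Illustrative only: range-partition by year so partition pruning keeps
-- a date-bounded query on the relevant partitions.
CREATE TABLE Transactions (
    Id              BIGINT NOT NULL,
    DateTransaction DATE NOT NULL,
    Amount          DECIMAL(12, 2),
    PRIMARY KEY (Id, DateTransaction)  -- MySQL requires the partition key in every unique key
)
PARTITION BY RANGE (YEAR(DateTransaction)) (
    PARTITION p2015 VALUES LESS THAN (2016),
    PARTITION p2016 VALUES LESS THAN (2017),
    PARTITION p2017 VALUES LESS THAN (2018),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- The query from the question then reads only p2016 and p2017.
SELECT *
FROM Transactions
WHERE DateTransaction >= '2016-01-01'
  AND DateTransaction <= '2017-09-01'
ORDER BY DateTransaction DESC
LIMIT 100 OFFSET 0;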