
It's actually an interview question I've been thinking about for two months, and I can't find a suitable architecture.

The problem

We want to build a small analytics system for fraud detection on orders.

The system has the following requirements:

  • Not allowed to use any technology from the market (MySQL, Redis, Hadoop, S3, etc.)
  • Needs to scale as the data volume grows
  • Just a bunch of machines, with disks and decent amount of memory
  • 10M Writes/Day

The system needs to provide the following API (a minimal sketch follows the list):

  • /insertOrder(order): Order
    Add an order to the storage. The order can be considered a blob of 1-10 KB in size, with orderId, beginTime, and finishTime as distinguished fields
  • /getLongestNOrdersByDuration(n: int, startTime: datetime, endTime: datetime): Order[]
    Retrieve the longest N orders that started between startTime and endTime,
    as measured by duration (finishTime - beginTime)
  • /getShortestNOrdersByDuration(n: int, startTime: datetime, endTime: datetime): Order[]
    Retrieve the shortest N orders that started between startTime and endTime,
    as measured by duration (finishTime - beginTime)
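
For concreteness, here is a minimal single-node sketch of this API in Python. Everything here is an illustrative assumption, not a proposed solution: a real design would have to partition the index by time range across machines and persist segments to disk to meet the scaling requirement.

```python
import bisect
import heapq
from dataclasses import dataclass


@dataclass
class Order:
    order_id: str
    begin_time: int   # epoch millis
    finish_time: int  # epoch millis
    blob: bytes       # opaque 1-10 KB payload

    @property
    def duration(self) -> int:
        return self.finish_time - self.begin_time


class OrderStore:
    """In-memory index kept sorted by begin_time (single node only)."""

    def __init__(self) -> None:
        self._orders: list[Order] = []  # sorted by begin_time
        self._keys: list[int] = []      # parallel list of begin_times

    def insert_order(self, order: Order) -> Order:
        i = bisect.bisect_right(self._keys, order.begin_time)
        self._keys.insert(i, order.begin_time)
        self._orders.insert(i, order)
        return order

    def _started_between(self, start_time: int, end_time: int) -> list[Order]:
        # Binary-search the sorted index for the [start_time, end_time] slice.
        lo = bisect.bisect_left(self._keys, start_time)
        hi = bisect.bisect_right(self._keys, end_time)
        return self._orders[lo:hi]

    def get_longest_n_orders_by_duration(self, n, start_time, end_time):
        # nlargest scans the slice once, keeping only n candidates in memory.
        return heapq.nlargest(n, self._started_between(start_time, end_time),
                              key=lambda o: o.duration)

    def get_shortest_n_orders_by_duration(self, n, start_time, end_time):
        return heapq.nsmallest(n, self._started_between(start_time, end_time),
                               key=lambda o: o.duration)
```

The open question is how to distribute this, e.g. range-partition by beginTime so each node owns a time slice, fan the query out to the slices overlapping [startTime, endTime], and merge the per-node top-N results.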
Andriy Shevchenko
  • Can you comment further on what you mean by "Not allowed to use any technology from the market"? I would take a look at the concepts of stream processing and event-driven solutions. An order could be considered an event stored in an event store. Stream processing allows you to continuously analyse events as they occur, triggering other events for interested parties in case of fraudulent attempts. – KDW Sep 23 '21 at 17:05
  • I am not sure if you are allowed to consider solutions like Kafka? It is able to fulfil the design constraints of scalability, a bunch of machines and 10M writes/day (depending on the hardware specs and number of nodes in your cluster of course). If not allowed to evaluate such solutions, I wonder why not? – KDW Sep 23 '21 at 17:08
  • @KDW it's an artificial restriction, since the problem is an interview question and tests the ability to build the system up ourselves – Andriy Shevchenko Sep 23 '21 at 21:09

1 Answer


Look at using the Druid database if you have time-series data:

  • It should scale well as the volume of data grows
  • Time-duration queries can be answered efficiently (a sketch of such a query follows below)

https://druid.apache.org/ - Druid has been used as an analytics database at scale in Fortune 500 companies.
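
For illustration only (the question's constraints rule Druid out, as noted in the comments), here is roughly what the longest-N query could look like against Druid's SQL endpoint. This is a hedged sketch: the datasource name orders, the column names orderId/beginTime/finishTime, the assumption that the two time columns are ingested as epoch-millisecond longs with __time set to beginTime, and the broker address are all assumptions.

```python
import requests

# Assumption: a Druid broker running locally on its default port (8082).
DRUID_SQL_URL = "http://localhost:8082/druid/v2/sql"

# Assumption: datasource "orders" with __time = beginTime, and
# beginTime/finishTime stored as epoch-millisecond long columns.
query = """
SELECT "orderId", "beginTime", "finishTime"
FROM "orders"
WHERE "__time" >= TIMESTAMP '2021-09-01 00:00:00'
  AND "__time" <= TIMESTAMP '2021-09-30 23:59:59'
ORDER BY ("finishTime" - "beginTime") DESC
LIMIT 10
"""

resp = requests.post(DRUID_SQL_URL, json={"query": query})
resp.raise_for_status()
print(resp.json())  # the 10 longest orders in the time window
```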

Kris
  • The description mentions "Not allowed to use any technology from the market", therefore using Druid is not acceptable. – Leonid Dashko Jun 06 '23 at 06:43