I am working on a data-driven analytical software project that produces reports and recommendations on financial data (transactions). The data consists of 1.7 billion records, with 200k new records added every day. Each record describes a transaction with a fairly small amount of data (from_account, to_account, amount, time_stamp, etc.).

Once written, the data does not need to be changed (so it's essentially a WORM paradigm), but the queries can become quite complex. Some of the queries implement AML (Anti-Money-Laundering) logic, which looks for relationships between accounts, such as "U-Turn" transactions:

A->B->C->D->A

I need to run multiple queries to detect such patterns, and the query times are pretty fast as long as each account has a 'normal' number of transactions. The problem appears when account C (in the above example) suddenly has millions of transactions and the query takes 60 seconds or more instead of 0.5 seconds.
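
To make the pattern concrete: the check is roughly equivalent to a four-way self-join over the transactions. The sketch below uses a throwaway SQLite schema purely for illustration (it is not my actual stack or schema):

```python
# Illustration only, not my real schema: the U-Turn check as a four-way
# self-join. When the intermediate account (C) is "hot", the number of
# candidate rows multiplies at that join step, which is where the
# 60-second runtimes come from.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tx (
        from_account TEXT,
        to_account   TEXT,
        amount       REAL,
        time_stamp   TEXT
    )
""")

U_TURN_SQL = """
SELECT t1.from_account, t1.to_account, t2.to_account, t3.to_account
FROM tx t1
JOIN tx t2 ON t2.from_account = t1.to_account   -- A->B then B->C
JOIN tx t3 ON t3.from_account = t2.to_account   -- C->D
JOIN tx t4 ON t4.from_account = t3.to_account   -- D->?
WHERE t4.to_account = t1.from_account           -- ...and back to A
  AND t1.from_account = ?
"""

rows = conn.execute(U_TURN_SQL, ("A",)).fetchall()
print(rows)  # empty here; the point is the query shape, not the data
```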

I am inclined to use Neo4j to search for relationships between accounts, but I am not sure whether the searches will be fast enough. Other solutions could be in-memory DBs such as MemSQL, Redis, or Aerospike; I am also looking at HBase/Hadoop, CouchDB, and MongoDB.

Which stack would provide the fastest query results currently possible?

Steven M
  • It depends on your data. The advantage of Neo4j is that as your data grows, you can use indexes into your graph to find specific nodes and then specify path traversals between those nodes. So even with billions of nodes you can start with one and traverse the links and your performance remains the same (broad strokes, here) – Brian Underwood Jun 27 '16 at 10:37
  • Seems like you need a DB for analytics. Take a look at HP Vertica (it has a Community Edition with a 1 TB limit) or ClickHouse from Yandex. Vertica is pretty fast compared with MySQL. – Maxim Jun 27 '16 at 11:46
  • @BrianUnderwood - So finding a node and traversing through its relationships would still be fast - does that hold true if a node in the graph turns out to have millions of relationships? – Steven M Jun 27 '16 at 14:19
  • When you talk about speed, it's hard to beat memory-based databases like memcached and Redis. – Revln9 Jun 27 '16 at 14:29
  • If account C has millions of transactions, don't read all of them. Sub-sample them down to a thousand or so. – Mike Dunlavey Jun 27 '16 at 14:46
  • If you have thousands of relationships for a node, that can certainly be a problem in Neo4j. I believe the makers of Neo4j are working on ways to handle "supernodes" with many relationships, but I don't think they're quite there yet. I would try benchmarking an example. – Brian Underwood Jun 27 '16 at 17:56

2 Answers


Each class of database has its strengths, and for the AML use case you describe, a graph database such as Neo4j would be the correct choice.

A document store such as Couchbase or MongoDB would make little sense, and a key-value store such as Aerospike or Redis would only make sense if there were a constant path length you were interested in that could be pre-computed. Since you are trying to find all paths that start at a given node and end in it, regardless of the number of edges, that's not likely.
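
To make that concrete, here is a rough sketch of what the U-Turn search from the question could look like as a single Cypher traversal via the official Python driver. The Account label, the SENT relationship type and the id property are assumptions about how the data would be modelled, not a prescription:

```python
# Sketch only: find 4-hop "U-Turn" cycles (A->B->C->D->A) starting from
# one account. Labels, relationship types and property names are assumed.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

U_TURN_CYPHER = """
MATCH (a:Account {id: $account_id})
MATCH path = (a)-[:SENT]->(b:Account)-[:SENT]->(c:Account)
                -[:SENT]->(d:Account)-[:SENT]->(a)
WHERE b <> c AND c <> d AND b <> d
RETURN path
LIMIT 100
"""

def find_u_turns(account_id):
    # The traversal starts at one indexed Account node, so the work scales
    # with that node's neighbourhood rather than with the 1.7B records.
    with driver.session() as session:
        result = session.run(U_TURN_CYPHER, account_id=account_id)
        return [record["path"] for record in result]
```

Whether this stays fast when one of the intermediate accounts has millions of relationships is exactly what the comments above suggest benchmarking.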

Ronen Botzer

I'd suggest choosing a memory-based database on an appropriately sized machine (8 or 16 GB of RAM). For the analytics writes, try job queues (e.g. RabbitMQ), at least for loading the initial 1.7 billion records. Redis or memcached can handle your daily writes (200k) without any problem or even tuning, especially if you don't really need transactions (read about the batch method for Redis; see the sketch below).
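
As a rough illustration of that batch idea with a recent redis-py client (the key layout here is made up; you'd shape it around the lookups you actually need):

```python
# Sketch only: batched writes to Redis via a pipeline, so a day's 200k
# inserts go over the wire in large batches instead of one round-trip each.
# Key names (tx:<id>, account:<id>:in/out) are purely illustrative.
import redis

r = redis.Redis(host="localhost", port=6379)

def load_transactions(transactions, batch_size=10_000):
    pipe = r.pipeline(transaction=False)  # plain batching, no MULTI/EXEC
    for i, tx in enumerate(transactions, 1):
        key = "tx:{}".format(tx["id"])
        pipe.hset(key, mapping=tx)
        # per-account lists so one account's transactions can be fetched later
        pipe.rpush("account:{}:out".format(tx["from_account"]), key)
        pipe.rpush("account:{}:in".format(tx["to_account"]), key)
        if i % batch_size == 0:
            pipe.execute()  # flush one batch in a single round-trip
    pipe.execute()
```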

Here is an interesting post about how Instagram used Redis to map over 300 million pictures to their users:

http://instagram-engineering.tumblr.com/post/12202313862/storing-hundreds-of-millions-of-simple-key-value

http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram

But be aware that this kind of in-memory database is a key-value store, which means no complex data and no complex queries.

An alternative would be to try Elasticsearch; it has some benefits that could make the task easier. Verizon uses it to store over 500 billion records. That doesn't mean everybody could do that, but at least it shows that it can be done.
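
Here's a rough sketch of what indexing the transactions could look like with a recent Python client (index name, settings and document shape are just examples):

```python
# Sketch only: bulk-index transactions into Elasticsearch. A longer
# refresh_interval keeps indexing throughput high during big loads
# (that's what the article linked below is about).
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

es.indices.create(index="transactions", settings={"refresh_interval": "30s"})

def index_transactions(transactions):
    actions = (
        {"_index": "transactions", "_source": tx}  # tx is one record as a dict
        for tx in transactions
    )
    helpers.bulk(es, actions)
```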

See this link for Elasticsearch:

https://sematext.com/blog/2013/07/08/elasticsearch-refresh-interval-vs-indexing-performance/

I've heard that HBase/Hadoop and CouchDB perform well on large data sets, but I can't give you much more information as I don't really use them.

Hope this helps!

Revln9
  • Isn't Elasticsearch just Lucene on steroids? I am asking because this problem is sort of about nodes and how they are related. That's why I am looking into Neo4j, which also has Lucene under the hood. – Steven M Jun 27 '16 at 15:24
  • Well, yes, Elasticsearch is based on Lucene, but that doesn't mean it performs like Lucene. You're talking about joins! If you want joins, Neo4j is definitely the way to go; what I don't quite like about Neo4j is that, apart from their website, nobody else talks about dealing with real-life problems with Neo4j. One last link for Elasticsearch I just found which could make your life easier: http://blog.comperiosearch.com/blog/2014/01/30/elasticsearch-indexing-sql-databases-the-easy-way/ https://github.com/jprante/elasticsearch-jdbc Good luck! – Revln9 Jun 27 '16 at 15:41