I am working on a data-driven analytical software project that produces reports and recommendations from financial transaction data. The dataset consists of 1.7 billion records, with roughly 200k new records added every day. Each record describes a single transaction and is fairly small (from_account, to_account, amount, time_stamp, etc.).
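For context, each record is roughly this shape; the "etc." covers a few more fields, and the types here are my assumption, not the actual schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)  # frozen: records are write-once, never updated
class Transaction:
    from_account: str
    to_account: str
    amount: float          # a Decimal may be more appropriate for money
    time_stamp: datetime
```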
Once written, the data never changes (essentially a WORM paradigm), but the queries can become quite complex. Some of them implement AML (Anti-Money-Laundering) logic, which looks for relationships between accounts, such as "U-turn" transaction chains:
A->B->C->D->A
I need to run multiple queries to detect such patterns, and the query times are fast as long as each account has a 'normal' number of transactions. The problem appears when an account like C (in the example above) suddenly has millions of transactions: the query then runs for 60 seconds or more instead of 0.5 seconds. A sketch of the kind of query I have in mind follows below.
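To make the access pattern concrete, here is a minimal sketch of how I imagine expressing the U-turn search in Neo4j via its Python driver. The `Account` label, `TRANSFER` relationship type, `id` property, and the degree cap are all my assumptions, not an existing schema; the cap is one way I can think of to skip supernode accounts like C:

```python
from neo4j import GraphDatabase

# Hypothetical connection details for illustration only.
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

# 4-hop cycle A->B->C->D->A. The size(...) degree predicate is one way I
# imagine skipping supernodes like C, whose millions of edges are exactly
# what blows the match up from 0.5s to 60s+.
CYCLE_QUERY = """
MATCH (a:Account)-[:TRANSFER]->(b:Account)-[:TRANSFER]->(c:Account)
      -[:TRANSFER]->(d:Account)-[:TRANSFER]->(a)
WHERE a <> c AND b <> d
  AND size((c)-[:TRANSFER]->()) < $max_degree
RETURN a.id AS a, b.id AS b, c.id AS c, d.id AS d
LIMIT 100
"""

def find_u_turns(max_degree=10_000):
    with driver.session() as session:
        result = session.run(CYCLE_QUERY, max_degree=max_degree)
        return [record.data() for record in result]
```

Capping the degree obviously trades completeness for latency; accounts above the cap would need a separate (perhaps offline) pass.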
I am inclined to use Neo4j to search for relationships between accounts, but I am not sure the searches will be fast enough. Other candidates are in-memory databases such as MemSQL, Redis, or Aerospike; I am also looking at HBase/Hadoop, CouchDB, and MongoDB.
Which stack would currently provide the fastest query results for this kind of workload?