2

We have a large dataset of historical transactions, and a system that must check each new transaction against every historical transaction in this dataset.

This involves running an algorithm on each historical transaction that produces a matching score against the new transaction. It means going through the transactions sequentially; we can't use indexing or hashing to try to reduce the number of transactions that need to be checked.
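To make the workload concrete, here is a minimal sketch of that scan, assuming a hypothetical Transaction record and matchScore method (both invented purely for illustration; they are not the real matching algorithm):

```java
import java.util.List;

// Illustrative sketch only: Transaction and matchScore are hypothetical
// stand-ins for the real domain objects and matching algorithm.
public class MatchScanner {

    public record Transaction(String id, long timestampMillis, long amountCents, String counterparty) {}

    /** Hypothetical scoring algorithm: every historical record has to be scored. */
    static double matchScore(Transaction candidate, Transaction historical) {
        double score = 0.0;
        if (candidate.counterparty().equals(historical.counterparty())) {
            score += 0.5;
        }
        score += 0.5 / (1.0 + Math.abs(candidate.amountCents() - historical.amountCents()));
        return score;
    }

    /** Sequential scan: no index or hash can prune the candidate set. */
    static double bestScore(Transaction candidate, List<Transaction> history) {
        double best = 0.0;
        for (Transaction historical : history) {
            best = Math.max(best, matchScore(candidate, historical));
        }
        return best;
    }
}
```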

A couple of other points: transactions are only ever added to the dataset and are never evicted, and we already distribute the processing by splitting the dataset across workers on different servers.

At the moment the system uses a Java Collection class to cache the transaction dataset in memory. This is mainly because of speed requirements, as it provides fast sequential access to the transactions.

What I'd like to know is: are there any caching systems, such as EHCache, that would help us distribute the dataset across different servers while still providing fast sequential access to the records in the cache?
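To show the access pattern we'd want such a system to support, here is a rough scatter/gather sketch reusing the hypothetical Transaction and bestScore from the sketch above; the local threads stand in for workers on separate servers and this is not any particular product's API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: shows the scatter/gather shape of the workload.
// In production each partition would live on a different server; only the
// per-partition best scores would travel over the network.
public class PartitionedScan {

    static double bestScoreAcrossPartitions(MatchScanner.Transaction candidate,
                                            List<List<MatchScanner.Transaction>> partitions)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        try {
            List<Future<Double>> results = new ArrayList<>();
            for (List<MatchScanner.Transaction> partition : partitions) {
                // Each partition is scanned sequentially and independently.
                results.add(pool.submit(() -> MatchScanner.bestScore(candidate, partition)));
            }
            double best = 0.0;
            for (Future<Double> result : results) {
                best = Math.max(best, result.get());
            }
            return best;
        } finally {
            pool.shutdown();
        }
    }
}
```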

user1232555
  • 1,099
  • 3
  • 11
  • 18
  • Can those "computational compares" be reduced to simple byte operations? If so then maybe you could just dump them on disk and memory-map the files in chunks? (A rough sketch of this idea appears after these comments.) – the8472 May 19 '15 at 23:12
  • "running into problems with the number/size of the transactions and the JVM garbage collection" what does this mean? If the GC is collecting your objects, then my definition there's no reference to them – Steve Kuo May 20 '15 at 00:11
  • Is this a caching (eviction) or distributed computation (latency) problem? It sounds like the latter, where you're on-heap storage is for speed but causes GC pressure. Depending on your computation you could move to an embedded database (leveraging prefetching), off-heap binary computations, precomputed indexes, etc. If the computation is not truly O(n), then you may have modeling solutions. A good answer requires more detail. – Ben Manes May 20 '15 at 06:46
  • I can see from the comments that my question could have been better worded. I will update the question to try and make it clearer what the problem is. – user1232555 May 20 '15 at 14:13
  • Don't discount the speed of disk prefetching for sequential reads. H2 is fast, embedded, often used for unit tests, and can be persisted. You might try a quick prototype to see if it's fast enough - serialization will probably be the real bottleneck to resolve. You might then explore an in-memory data grid which can distribute the calculation with many storage modes. If you want to reduce memory waste, you could intelligently prefetch from redis as you process the result stream. – Ben Manes May 21 '15 at 01:57
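The memory-mapping idea from the first comment could look roughly like this, assuming the transactions can be serialized to fixed-size binary records (the record layout and sizes here are invented purely for illustration):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only: assumes each transaction has been serialized to a
// fixed-size binary record, so the file can be scanned off-heap without
// creating a Java object per transaction (and without GC pressure).
public class MappedHistoryScan {

    static final int RECORD_SIZE = 32;        // hypothetical record size in bytes
    static final long CHUNK_SIZE = 1L << 30;  // map 1 GiB at a time; a multiple of RECORD_SIZE,
                                              // so records never straddle a chunk boundary

    static void scan(Path historyFile) throws IOException {
        try (FileChannel channel = FileChannel.open(historyFile, StandardOpenOption.READ)) {
            long fileSize = channel.size();
            byte[] record = new byte[RECORD_SIZE];
            for (long position = 0; position < fileSize; position += CHUNK_SIZE) {
                long length = Math.min(CHUNK_SIZE, fileSize - position);
                MappedByteBuffer chunk = channel.map(FileChannel.MapMode.READ_ONLY, position, length);
                while (chunk.remaining() >= RECORD_SIZE) {
                    chunk.get(record);        // sequential read served by the OS page cache
                    // ... compute the matching score from the raw bytes here ...
                }
            }
        }
    }
}
```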

1 Answer

0

Reinventing the wheel is so tempting! When Oracle has an in-memory database, why can't we do the same... Let me try too. What about hashing each transaction's array of bytes and keeping those hashes? And when two hashes collide, go to the real database and double-check the whole array. So tempting...
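A rough sketch of this hash-then-verify idea, assuming transactions are available as byte arrays and that the backing store can return a full record by id (the Store interface is invented for this sketch):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: keep a compact hash per historical byte array; when a
// candidate's hash matches, fall back to the backing store to compare the
// full arrays and rule out collisions.
public class HashThenVerify {

    /** Hypothetical backing store, named purely for this sketch. */
    public interface Store {
        byte[] loadFullRecord(long recordId);
    }

    private final Map<Integer, List<Long>> hashToRecordIds = new HashMap<>();

    /** Register a historical transaction's bytes under its store id. */
    public void add(long recordId, byte[] transactionBytes) {
        hashToRecordIds.computeIfAbsent(Arrays.hashCode(transactionBytes), h -> new ArrayList<>())
                       .add(recordId);
    }

    /** Returns the id of an exact byte-for-byte match, or -1 if there is none. */
    public long findExactMatch(byte[] candidateBytes, Store store) {
        List<Long> candidates = hashToRecordIds.getOrDefault(Arrays.hashCode(candidateBytes), List.of());
        for (long recordId : candidates) {
            // Possible collision: confirm against the full record in the store.
            if (Arrays.equals(store.loadFullRecord(recordId), candidateBytes)) {
                return recordId;
            }
        }
        return -1;
    }
}
```

As the comment below points out, this only finds exact byte-for-byte duplicates; it does not help with the scored fuzzy matching the question describes.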

Alex
  • 4,457
  • 2
  • 20
  • 59
  • It's a complex computation on each historical transaction to determine a match score, so hashing the bytes won't work. I could use an embedded DB for this, but I thought its performance would be too slow given that there would have to be a call out to it for each record. – user1232555 May 20 '15 at 14:18
  • Please explain. Here are my thoughts: each transaction is unique by its array of bytes. What computations are you talking about? How is this array created? The array shouldn't change. Or does it? In that case it's possible that records could somehow change and not be unique anymore. – Alex May 20 '15 at 14:27