We have a large dataset of historical transactions, and a system that must check each new transaction against every historical transaction in this dataset.
This involves running an algorithm on each historical transaction that produces a matching score against the new transaction. This means scanning the transactions sequentially; we can't use indexing or hashing to reduce the number of transactions that need to be checked.
A couple of other points: transactions are only ever added to the dataset and are never evicted. Also, we already distribute the processing by splitting the dataset across workers on different servers.
Just now the system caches the transaction dataset in memory using a Java Collection class. This is mainly because of speed requirements, as it provides fast sequential access to the transactions.
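For context, here's roughly what the current approach looks like. The `Transaction` type and `matchScore` function are simplified placeholders (our real scoring algorithm is domain-specific); the point is the full sequential scan over an in-memory collection:

```java
import java.util.ArrayList;
import java.util.List;

public class SequentialMatcher {
    // Illustrative transaction record; the real one has many more fields.
    record Transaction(String id, double amount) {}

    private final List<Transaction> history = new ArrayList<>();

    public void add(Transaction t) {
        // Transactions are only ever appended, never evicted.
        history.add(t);
    }

    // Placeholder scoring function standing in for the real algorithm.
    static double matchScore(Transaction a, Transaction b) {
        return 1.0 / (1.0 + Math.abs(a.amount() - b.amount()));
    }

    // Full sequential scan: no index or hash can prune the candidate
    // set, so every historical transaction must be scored.
    public double bestScore(Transaction incoming) {
        double best = 0.0;
        for (Transaction t : history) {
            best = Math.max(best, matchScore(incoming, t));
        }
        return best;
    }

    public static void main(String[] args) {
        SequentialMatcher m = new SequentialMatcher();
        m.add(new Transaction("t1", 100.0));
        m.add(new Transaction("t2", 250.0));
        System.out.println(m.bestScore(new Transaction("new", 100.0)));
    }
}
```

Each worker holds its own partition of `history` and runs this same scan over its slice.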
What I'd like to know is: are there any caching systems, such as EHCache, that would help us distribute the dataset across different servers while still providing fast sequential access to the records in the cache?