I just switched from Oracle to using Cassandra 2.0 with Datastax driver and I'm having difficulty structuring my model for this big data approach. I have a Persons table with UUID and serialized Persons. These Persons have lists of addresses, names, identifications, and DOBs. For each of these lists I have an additional table with a compound key on each value in the respective list and the additional person_UUID column. This model feels too relational to me, but I don't know how else to structure it so that I can have index(am able to search by) on address, name, identification, and DOB. If Cassandra supported indexes on lists I would have just the one Persons table containing indexed lists for each of these.
In my application we receive transactions, which can contain within them 0 or more of each of those address, name, identification, and DOB. The persons are scored based on which person matched which criteria. A single person with the highest score is matched to a transaction. Any additional address, name, identification, and DOB data from the transaction that was matched is then added to that person.
The problem I'm having is that this matching is taking too long and the processing is falling far behind. This is caused by having to loop through result sets performing additional queries since I can't make complex queries in Cassandra, and I don't have sufficient memory to just do a huge select all and filter in java. For instance, I would like to select all Persons having at least two names in common with the transaction (names can have their order scrambled, so there is no first, middle, last; that would just be three names) but this would require a 'group by' which Cassandra does not support, and if I just selected all having any of the names in common in order to filter in java the result set is too large and i run out of memory.
I'm currently searching by only Identifications and Addresses, which yield a smaller result set (although it could still be hundreds) and for each one in this result set I query to see if it also matches on names and/or DOB. Besides still being slow this does not meet the project's requirements as a match on Name and DOB alone would be sufficient to link a transaction to a person if no higher score is found.
I know in Cassandra you should model your tables by the queries you do, not by the relationships of the entities, but I don't know how to apply this while maintaining the ability to query individually by address, name, identification, and DOB.
Any help or advice would be greatly appreciated. I'm very impressed by Cassandra but I haven't quite figured out how to make it work for me.
Tables:
- Persons [UUID | serialized_Person]
- addresses [address | person_UUID]
- names [name | person_UUID]
- identifications [identification | person_UUID]
- DOBs [DOB | person_UUID]
I did a lot more reading, and I'm now thinking I should change these tables around to the following:
- Persons [UUID | serialized_Person]
- addresses [address | Set of person_UUID]
- names [name | Set of person_UUID]
- identifications [identification | Set of person_UUID]
- DOBs [DOB | Set of person_UUID]
But I'm afraid of going beyond the max storage for a set(65,536 UUIDs) for some names and DOBs. Instead I think I'll have to do a dynamic column family with the column names as the Person_UUIDs, or is a row with over 65k columns very problematic as well? Thoughts?
It looks like you can't have these dynamic column families in the new version of Cassandra, you have to alter the table to insert the new column with a specific name. I don't know how to store more than 64k values for a row then. With a perfect distribution I will run out of space for DOBs with 23 million persons, I'm expecting to have over 200 million persons. Maybe I have to just have multiple set columns?
- DOBs [DOB | Set of person_UUID_A | Set of person_UUID_B | Set of person_UUID_C]
and I just check size and alter table if size = 64k? Anything better I can do?
I guess it's just CQL3 that enforces this and that if I really wanted I can still do dynamic columns with the Cassandra 2.0?
Ugh, this page from Datastax doc seems to say I had it right the first way...: When to use a collection