How to optimize Cassandra model while still supporting querying by contents of lists

Question

I just switched from Oracle to using Cassandra 2.0 with Datastax driver and I'm having difficulty structuring my model for this big data approach. I have a Persons table with UUID and serialized Persons. These Persons have lists of addresses, names, identifications, and DOBs. For each of these lists I have an additional table with a compound key on each value in the respective list and the additional person_UUID column. This model feels too relational to me, but I don't know how else to structure it so that I can have index(am able to search by) on address, name, identification, and DOB. If Cassandra supported indexes on lists I would have just the one Persons table containing indexed lists for each of these.

In my application we receive transactions, which can contain within them 0 or more of each of those address, name, identification, and DOB. The persons are scored based on which person matched which criteria. A single person with the highest score is matched to a transaction. Any additional address, name, identification, and DOB data from the transaction that was matched is then added to that person.

The problem I'm having is that this matching is taking too long and the processing is falling far behind. This is caused by having to loop through result sets performing additional queries since I can't make complex queries in Cassandra, and I don't have sufficient memory to just do a huge select all and filter in java. For instance, I would like to select all Persons having at least two names in common with the transaction (names can have their order scrambled, so there is no first, middle, last; that would just be three names) but this would require a 'group by' which Cassandra does not support, and if I just selected all having any of the names in common in order to filter in java the result set is too large and i run out of memory.

I'm currently searching by only Identifications and Addresses, which yield a smaller result set (although it could still be hundreds) and for each one in this result set I query to see if it also matches on names and/or DOB. Besides still being slow this does not meet the project's requirements as a match on Name and DOB alone would be sufficient to link a transaction to a person if no higher score is found.

I know in Cassandra you should model your tables by the queries you do, not by the relationships of the entities, but I don't know how to apply this while maintaining the ability to query individually by address, name, identification, and DOB.

Any help or advice would be greatly appreciated. I'm very impressed by Cassandra but I haven't quite figured out how to make it work for me.

Tables:

Persons [UUID | serialized_Person]
addresses [address | person_UUID]
names [name | person_UUID]
identifications [identification | person_UUID]
DOBs [DOB | person_UUID]

I did a lot more reading, and I'm now thinking I should change these tables around to the following:

Persons [UUID | serialized_Person]
addresses [address | Set of person_UUID]
names [name | Set of person_UUID]
identifications [identification | Set of person_UUID]
DOBs [DOB | Set of person_UUID]

But I'm afraid of going beyond the max storage for a set(65,536 UUIDs) for some names and DOBs. Instead I think I'll have to do a dynamic column family with the column names as the Person_UUIDs, or is a row with over 65k columns very problematic as well? Thoughts?

It looks like you can't have these dynamic column families in the new version of Cassandra, you have to alter the table to insert the new column with a specific name. I don't know how to store more than 64k values for a row then. With a perfect distribution I will run out of space for DOBs with 23 million persons, I'm expecting to have over 200 million persons. Maybe I have to just have multiple set columns?

DOBs [DOB | Set of person_UUID_A | Set of person_UUID_B | Set of person_UUID_C]

and I just check size and alter table if size = 64k? Anything better I can do?

I guess it's just CQL3 that enforces this and that if I really wanted I can still do dynamic columns with the Cassandra 2.0?

Ugh, this page from Datastax doc seems to say I had it right the first way...: When to use a collection

Don Branson · Answer 1 · 2014-04-17T17:55:05.913

This answer is not very specific, but I'll come back and add to it when I get a chance.

First thing - don't serialize your Persons into a single column. This complicates searching and updating any person info. OTOH, there are people that know what they're saying that disagree with this view. ;)

Next, don't normalize your data. Disk space is cheap. So, don't be afraid to write the same data to two places. You code will need to make sure that the right thing is done.

Those items feed into this: If you want queries to be fast, consider what you need to make that query fast. That is, create a table just for that query. That may mean writing data to multiple tables for multiple queries. Pick a query, and build a table that holds exactly what you need for that query, indexed on whatever you have available for the lookup, such as an id.

So, if you need to query by address, build a table (really, a column family) indexed on address. If you need to support another query based on identification, index on that. Each table may contain duplicate data. This means when you add a new user, you may be writing the same data to more than one table. While this seems unnatural if relational databases are the only kind you've ever used, but you get benefits in return - namely, horizontal scalability thanks to the CAP Theorem.

Edit:

The two column families in that last example could just hold identifiers into another table. So, voilà you have made an index. OTOH, that means each query takes two reads. But, still will be a performance improvement in many cases.

Edit:

Attempting to explain the previous edit:

Say you have a users table/column family:

CREATE TABLE users (
    id uuid PRIMARY KEY,
    display_name text,
    avatar text
);

And you want to find a user's avatar given a display name (a contrived example). Searching users will be slow. So, you could create a table/CF that serves as an index, let's call it users_by_name:

CREATE TABLE users_by_name (
    display_name text PRIMARY KEY,
    user_id uuid
}

The search on display_name is now done against users_by_name, and that gives you the user_id, which you use to issue a second query against users. In this case, user_id in users_by_name has the value of the primary key id in users. Both queries are fast.

Or, you could put avatar in users_by_name, and accomplish the same thing with one query by using more disk space.

CREATE TABLE users_by_name (
    display_name text PRIMARY KEY,
    avatar text
}

Thanks for the quick reply. Instead of having a Person table I could just have a UUID in each individual table, but this would result in having to query all tables again when match is chosen and doesn't seem to be a performance issue thus far. I'm not sure I followed the rest, making a separate table for each query is what I thought I did. And I didn't follow your edit at all. — jwalk, Apr 17 '14 at 17:37
If what you're saying is that if I want to find all Persons with the same name and DOB, I should make a column family for that specific goal, I'm not sure how to do that. Lets say my person has 4 names and 3 DOBs, do I then insert name1-dob1, name1-dob2, name1-dob3, name2-dob1...etc.? And everytime I add an additional name or DOB I have to query for all current and insert many more such? Sorry if I misunderstood. — jwalk, Apr 17 '14 at 17:50
I've edited to provide some specifics around a simpler case. I'll need to take some time to read your particular case and expand on this answer. — Don Branson, Apr 17 '14 at 17:56
OK, thanks, I understand now, this relates to the initial "don't need a Person table". But in these selects I actually don't need to query the Person table, I just use the Persons UUID column which is in the individual tables, so doing this would only save me one query per transaction, after I had already selected my match, and it would cost me many more to maintain it on updates. — jwalk, Apr 17 '14 at 18:33

How to optimize Cassandra model while still supporting querying by contents of lists

1 Answers1