I'm trying to make an application using the Python driver for Cassandra in which I want to store car
objects in the database.
I've been reading a lot about Cassandra partition keys, clustering keys, indexes etc. but I can't figure out what would be the best schema for my situation.
Situation: The 'Car' objects contain 20 fields, of which I want to perform a single query for car_uuid
(unique per car
), factory_uuid
(which factory made the car
) and customer_uuid
(who bought the car
, this value is never null
).
- Based on what I read about partition keys, having a high cardinality partition key is preferred, hence it would make sense to choose
car_uuid
for this due to eachcar
having a uniquecar_uuid
. However if I now want to query for all cars from a certainfactory_uuid
I get an error that I need to enable ALLOW FILTERING sincecar_uuid
is the first partition key and should always be specified when trying to query on the secondary partition key (if I have understood it correctly).
Most of my queries will either be getting all cars made by a certain factory (factory_uuid
), or retrieving information about a single car (using car_uuid
)
- So I'm thinking of setting
factory_uuid
as first partition key as, for my particular application, I know I'll always have thefactory_uuid
when querying for a specificcar_uuid
. However now the main partition key does not have as high cardinality, as there are only that many factories, so in theory my data will not be a properly distributed anymore. This raised some question:- To what extend will I notice that impact of using a low(er) cardinality primary partition key?
- In some posts people made a suggestion to use lookup tables. Would this be a solution for my situation, and which would be better: using
car_uuid
in the main table andfactory_uuid
in the lookup table as primary partition keys? (My intuition says that one of these tables will then still have this low cardinality primary partition key problem, is this true?) - Just to be sure, as far as I understand, indexes are not not desirable to use for
factory_uuid
since, while trying to get all cars for a certainfactory_uuid
, Cassandra will go over all nodes to check for matching results, which will have major performance implications on such a frequent query.
Any suggestions / inputs are welcomed!