0

I'm trying to make an application using the Python driver for Cassandra in which I want to store car objects in the database.

I've been reading a lot about Cassandra partition keys, clustering keys, indexes etc. but I can't figure out what would be the best schema for my situation.

Situation: The 'Car' objects contain 20 fields, of which I want to perform a single query for car_uuid (unique per car), factory_uuid (which factory made the car) and customer_uuid (who bought the car, this value is never null).

  • Based on what I read about partition keys, having a high cardinality partition key is preferred, hence it would make sense to choose car_uuid for this due to each car having a unique car_uuid. However if I now want to query for all cars from a certain factory_uuid I get an error that I need to enable ALLOW FILTERING since car_uuid is the first partition key and should always be specified when trying to query on the secondary partition key (if I have understood it correctly).

Most of my queries will either be getting all cars made by a certain factory (factory_uuid), or retrieving information about a single car (using car_uuid)

  • So I'm thinking of setting factory_uuid as first partition key as, for my particular application, I know I'll always have the factory_uuid when querying for a specific car_uuid. However now the main partition key does not have as high cardinality, as there are only that many factories, so in theory my data will not be a properly distributed anymore. This raised some question:
    1. To what extend will I notice that impact of using a low(er) cardinality primary partition key?
    2. In some posts people made a suggestion to use lookup tables. Would this be a solution for my situation, and which would be better: using car_uuid in the main table and factory_uuid in the lookup table as primary partition keys? (My intuition says that one of these tables will then still have this low cardinality primary partition key problem, is this true?)
    3. Just to be sure, as far as I understand, indexes are not not desirable to use for factory_uuid since, while trying to get all cars for a certain factory_uuid, Cassandra will go over all nodes to check for matching results, which will have major performance implications on such a frequent query.

Any suggestions / inputs are welcomed!

Erick Ramirez
  • 13,964
  • 1
  • 18
  • 23
Leander
  • 854
  • 4
  • 17

0 Answers0