Is it faster to do two lookups on primary keys or one lookup on a non-primary key?

Question

I'm creating a table in cassandra for users. Users have both a unique user_id and a unique display_name.

My table currently resembles:

create table user (user_id text primary key,
    display_name text,
    joined timestamp,
    last_seen timestamp,
    ...);

When a new user creates an account I need to check if the display name they enter is already someone else's. What's the fastest way I can do this?

This project is primarily a learning project for me, I want to experiment with some NoSQL concepts in a semi-real-world situation.

My own thoughts are that using two tables like this:

create table user (user_id text primary key,
    display_name text,
    joined timestamp,
    last_seen timestamp,
    ...);

create table user_by_display_name (display_name text primary key, user_id text);

And then looking up in user_by_display_name for existence of a username will be faster than looking up in user where display_name = ?. Is this a correct assumption?

I feel that two lookups on primary keys, user_by_display_name where display_name = ? and then another lookup using the user_id in user might be slower than the single user where display_name = ? lookup if I need the information about that user.

Make user_id PK and use a unique constraint on the display_name. — Gurwinder Singh, Nov 13 '17 at 08:11
@GurwinderSingh I don't think cassandra has constraints, and I thought constraints were computationally expensive anyway? — theonlygusti, Nov 13 '17 at 08:29

score 0 · Answer 1 · answered Nov 13 '17 at 18:24

Your assumption is correct, you will want to denormalize as described. You design your queries first and then build the tables.

Your table should look like

CREATE TABLE users (
  user_id text, 
  display_name text,
  last_seen timestamp,
  PRIMARY KEY(user_id)
WITH ...);

CREATE TABLE users_by_display_name (
  display_name text, 
  user_id text, 
  last_seen timestamp,
  PRIMARY KEY(display_name);

This way you do 1 lookup. Writes are cheap, reads are expensive. Avoid reads when possible. That said, your partitions are 1 row and you're doing a lookup in a Key value database. Your second select wouldn't be terribly expensive either.

You could create a second table as described. Depending on the version of cassandra you're running alternatively you could also create a MATERIALIZED VIEW from your user table to achieve this same lookup. These are something you wouldn't want to use if you're updating the last_seen property frequently as they'll do delete's when rows are updated. So maybe limit it to just display->userId.

Once you have the lookup setup you still have to worry about race conditions where two users steal the same username. In this case you might consider Light weight transactions. That said, you want to avoid doing these frequently as they are expensive.

LWT do come with some issues though, like they're not "true" transactions and they're slower than a standard read/write. Just make sure you know what you're getting.

what command should I use to perform the lookup for maximum performance? — theonlygusti, Nov 13 '17 at 20:50
If you're looking to check if the user_name exists it would be: `SELECT FROM users_by_display_name WHERE display_name = 'x';` Its a key value lookup and its 1 row it will be fast. Alternatively why do you need a users table that has a primary key of user_id? The user will be logging in with a display name anyways. You could let cassandra create a guid. — Highstead, Nov 14 '17 at 16:07
Just going to update this, apparently `MATERIALIZED VIEWS` are deprecated in 4.0... so maybe don't use those. — Highstead, Nov 14 '17 at 19:38

Is it faster to do two lookups on primary keys or one lookup on a non-primary key?

1 Answers1