0

given i have the following Objects to persist in Cassandra:

Array of Foo:

{
    "id":1,
    "name": "this is a name",
    "bundleFields" : [
        "bundleByMe",
        "me2",
        "me also",
    ]
},
{
    "id":2,
    "name": "anotherName",
    "bundleFields" : [
        "bundleByMe",
        "me2",
        "me also",
    ]
},
{
    "id":3,
    "name": "thridName",
    "bundleFields" : [
        "differentBundleCriteria"
    ]
}

I wanna query something like SELECT * FROM FOO where bundleFields = ["...", "..."].

This obviously does not work, since queries by list<> are not possible (no Primarykey).

This is the Schema i currently have:

CREATE TABLE IF NOT EXISTS Foo (
    id int,
    name varchar,
    bundleFields list<varchar>,
    PRIMARY KEY(id)
);

The only solution i can imagine is another table where the PRIMARY KEY contains the concatenated values of the bundleFields-Array, which would allow a lookup by the bundleString:

CREATE TABLE IF NOT EXISTS fooByBundleString (
    bundleString varchar,
    fooId int,
    PRIMARY KEY(bundleString)
);

Is this the recomended approach to this problem in cassandra.

The idea of having to serialize/deserialize the bundleFields-array does not feel "right" to me.

Thanks for advice!


Edit: As @rs_atl suggested the correct DDL for table fooByBundleString should be (note additional fooId in PRIMARY KEY):

CREATE TABLE IF NOT EXISTS fooByBundleString (
    bundleString varchar,
    fooId int,
    PRIMARY KEY(bundleString, fooId)
);

to create a covering-Index, since otherwise it would not be possible to store the same bundleString for different fooId's.

crushervx
  • 587
  • 1
  • 7
  • 18
  • BTW: Is the technique i describe here, what people call an inverted index (or some kind of)? – crushervx Jun 03 '14 at 14:38
  • [Storing a list of values in Cassandra](http://stackoverflow.com/a/17469642/3404097) – philipxy Jun 04 '14 at 11:19
  • Thanks @philip. I already know how to store collections in cql3. But it does not solve my problem. (`Foo` contains a list `bundleFields`). I want to query a Cassandra-Table __by__ a collection (`bundleFields`). Since this is not possible, because collections can not be part of the primary key, i'm asking for a datamodel that enables such queries. – crushervx Jun 05 '14 at 10:20

1 Answers1

1

Creating an index as you've described is the correct solution. However it should be a covering index, meaning you'll want to duplicate any values you actually need returned in your query. Otherwise you'll end up doing a distributed join in your application, which is very expensive. In general, prefer denormalized data models to normalized relational models. This is essentially the same thing you have to do in your favorite RDBMS to make queries fast. The difference is you have to manage the index in your application; Cassandra won't do it for you.

rs_atl
  • 8,935
  • 1
  • 23
  • 28
  • Thanks @rs_atl I've edited my question with an updated DDL to have a covering-index. So concatenating bundleString-array in application-code and storing it in separate lookup-table is the way you suggest to go?! Thanks. I'll accept then. – crushervx Jun 05 '14 at 10:28
  • These http://stackoverflow.com/a/20070716/432710 http://stackoverflow.com/a/20434244/432710 answers look if they want to achieve the same as i'm trying. The suggested solutions look like what you suggest. – crushervx Jun 05 '14 at 10:42
  • 1
    Yes, your schema is correct. This will enable you to query all fooId's for a given bundleString. – rs_atl Jun 05 '14 at 18:29