I'm reading the following article: Elements of Scale: Composing and Scaling Data platforms

I'm stuck on understanding the following sentences:

A secondary index is an index that isn’t on the primary key. This means the data will not be partitioned by the values in the index. Directed routing via a hash function is no longer an option. We have to broadcast requests to all machines.

Can anyone explain why this is the case? I am new to data platforms, but I have followed the article up to this point.

Specifically, why can't we look up values in the secondary index for their primary key, then look up their location via a hash function on that primary key? Why broadcast requests to all machines?

Thank you for your time

Nth.gol

2 Answers

In the examples they give, the data has been distributed over 4 nodes. Each node has a secondary index, but it only covers the records stored on that node. No single node's secondary index covers the records on every node, so a caller who wants to search by the indexed value has to send the request to all nodes.

E.g., with just 2 nodes:

Node 1 has (1,a) (2,a) (3,b)

Node 2 has (100,a) (105,c)

Node 1 has a primary index 1,2,3. And a secondary index a,a,b

Node 2 has a primary index 100,105. And a secondary index a,c

A caller wanting to search for 'c' would need to broadcast to both nodes to search the two secondary indexes.
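The two-node example above can be sketched in a few lines of Python. This is only an illustration: the `Node` class and the `search` function are names I made up, not anything from the article or a real database.

```python
from collections import defaultdict

class Node:
    """One storage node: rows keyed by primary key, plus a LOCAL secondary index."""
    def __init__(self, rows):
        self.rows = dict(rows)              # primary key -> value
        self.secondary = defaultdict(list)  # value -> primary keys on this node only
        for pk, value in self.rows.items():
            self.secondary[value].append(pk)

    def find_by_value(self, value):
        # Only sees rows stored on this node.
        return [(pk, value) for pk in self.secondary.get(value, [])]

node1 = Node([(1, "a"), (2, "a"), (3, "b")])
node2 = Node([(100, "a"), (105, "c")])
cluster = [node1, node2]

def search(value):
    """Scatter-gather: every node is asked, because any of them may hold a match."""
    results = []
    for node in cluster:
        results.extend(node.find_by_value(value))
    return results

print(search("c"))  # [(105, 'c')]
print(search("a"))  # [(1, 'a'), (2, 'a'), (100, 'a')]
```

Nothing about the value 'c' tells the caller that node 2 is the one to ask, which is why `search` has to loop over the whole cluster.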

If however you maintain a complete copy of the secondary index a,a,a,b,c somewhere you could query that index and then go directly to the target node. But this has a lot more complications in practice than you might expect.
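As a rough sketch of that alternative, assuming the complete index is simply a dictionary held in one place (the node labels and the `global_index` name are hypothetical):

```python
# Same rows as in the two-node example; "node1"/"node2" are just labels.
nodes = {
    "node1": {1: "a", 2: "a", 3: "b"},
    "node2": {100: "a", 105: "c"},
}

# Complete secondary index: value -> list of (node label, primary key).
global_index = {}
for name, rows in nodes.items():
    for pk, value in rows.items():
        global_index.setdefault(value, []).append((name, pk))

def search(value):
    """Return the matching rows and the set of nodes that had to be contacted."""
    hits = global_index.get(value, [])
    contacted = {name for name, _ in hits}
    rows = [(pk, nodes[name][pk]) for name, pk in hits]
    return rows, contacted

print(search("c"))  # ([(105, 'c')], {'node2'})
```

Here a search for 'c' touches only node 2. The catch is that `global_index` now has to be kept in step with every write on every node.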

Edit 22 June. When you try to maintain a secondary index on a third node, you have the following complications to consider.

  1. Insert/edit operations now involve 2 or even 3 nodes so you need to implement a two phase commit protocol of some kind, or alternative schemes.

  2. As more nodes are involved, overall reliability may decrease, because the MTBF of a system that needs all of its nodes working is lower than that of any single node.

  3. You need to consider what happens with network partitioning.

  4. Maintenance operations might be harder. For example, how do you effectively validate that an index is correct without generating too much network traffic?

  5. How will updates edit the index node? Are clients responsible for this, or do the main storage nodes update index nodes?
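To make points 1 and 4 concrete, here is a minimal sketch of a write that touches a data node and a separate index node with no commit protocol in between. Everything here (`data_node`, `index_node`, the `index_available` flag, `IndexNodeDown`) is invented for illustration, and the failure is simulated.

```python
class IndexNodeDown(Exception):
    """Simulated failure of the separate index node."""

data_node = {}   # primary key -> value
index_node = {}  # value -> set of primary keys

def write(pk, value, index_available=True):
    # Step 1: store the row on the data node.
    data_node[pk] = value
    # Step 2: update the index on a different node. This step can fail
    # independently, and nothing here rolls step 1 back.
    if not index_available:
        raise IndexNodeDown("index node unreachable")
    index_node.setdefault(value, set()).add(pk)

def index_is_consistent():
    # Validation by full rebuild: has to read every row on the data node.
    rebuilt = {}
    for pk, value in data_node.items():
        rebuilt.setdefault(value, set()).add(pk)
    return rebuilt == index_node

write(105, "c")
try:
    write(106, "c", index_available=False)
except IndexNodeDown:
    pass

print(index_is_consistent())  # False: row 106 exists but the index does not know it
```

A two-phase commit (or some alternative scheme) is what closes the gap between step 1 and step 2. Note also that the consistency check has to read every row, which is the network traffic concern in point 4.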

A good place to learn more is to review the CAP theorem, look into two-phase commit schemes, and potentially look at some of the IEEE papers published in the distributed systems journal.

rlb

Taking Cassandra as an example, data is written to a set of replica nodes determined by the hash of the partition key (defined in the table schema; it is generally the first part of the primary key).

A secondary index covers data that is not in that partition key. Assuming the index entry is written to the same node that holds the original row, you can't find the nodes holding a particular indexed value by hashing that value as if it were a new 'key'. The entry lives on whichever node the hash of the original partition key (the primary data) selected, and that hash has nothing to do with the indexed value.
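A small sketch of why hashing the indexed value goes wrong. The hash here is a toy (sum of the key's bytes modulo the node count) chosen so the result is easy to follow; it is a stand-in and not Cassandra's real partitioner, and the function names are my own.

```python
NUM_NODES = 4

def node_for(key):
    """Toy stable hash: sum of the key's bytes modulo the node count."""
    return sum(str(key).encode()) % NUM_NODES

# Rows are placed by hashing the PRIMARY key only.
nodes = [{} for _ in range(NUM_NODES)]
for pk, value in [(1, "a"), (2, "a"), (3, "b"), (100, "a"), (105, "c")]:
    nodes[node_for(pk)][pk] = value

def lookup_by_pk(pk):
    # Directed routing: the hash names exactly one node.
    return nodes[node_for(pk)].get(pk)

def naive_lookup_by_value(value):
    # Wrong on purpose: routes by hashing the indexed value.
    target = nodes[node_for(value)]
    return sorted(pk for pk, v in target.items() if v == value)

def broadcast_lookup_by_value(value):
    # Correct: ask every node.
    return sorted(pk for node in nodes for pk, v in node.items() if v == value)

print(lookup_by_pk(105))               # c
print(naive_lookup_by_value("a"))      # [1, 100]  (misses row 2)
print(broadcast_lookup_by_value("a"))  # [1, 2, 100]
print(naive_lookup_by_value("c"))      # []        (misses row 105)
```

The naive version only finds a row when the hash of the value happens to land on the same node as the hash of that row's primary key, which is a coincidence and not something you can rely on.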

Derek Troy-West