
I get an error when trying to unload or count data from AWS Keyspaces using DSBulk.

Error:

Operation COUNT_20221021-192729-813222 failed: Token metadata not present.

Command line:

$ dsbulk count/unload -k my_best_storage -t book_awards -f ./dsbulk_keyspaces.conf

Config:

datastax-java-driver {
  basic.contact-points = [ "cassandra.us-east-2.amazonaws.com:9142" ]

  advanced.auth-provider {
    class = PlainTextAuthProvider
    username = "aw.keyspaces-at-XXX"
    password = "XXXX"
  }

  basic.load-balancing-policy {
    local-datacenter = "us-east-2"
  }

  basic.request {
    consistency = LOCAL_QUORUM
    default-idempotence = true
  }

  advanced {
    request {
      log-warnings = true
    }

    ssl-engine-factory {
      class = DefaultSslEngineFactory
      truststore-path = "./cassandra_truststore.jks"
      truststore-password = "XXX"
      hostname-validation = false
    }

    metadata {
      token-map.enabled = false
    }
  }
}

The dsbulk load (loading) operation works fine, though.

  • In short, AWS Keyspaces is not Cassandra at all (DynamoDB via proxy with very limited CQL support) so Cassandra tools aren't guaranteed to work. :( – Hades Architect Oct 23 '22 at 19:08

2 Answers


I suspect the problem here is that your cluster is using the proprietary com.amazonaws.cassandra.DefaultPartitioner partitioner, which most open-source tools and drivers don't recognise.

The DataStax Bulk Loader (DSBulk) tool uses the Cassandra Java driver under the hood to connect to Cassandra clusters. The Java driver uses the partitioner to determine which nodes own which token ranges. Only the following Cassandra partitioners are supported:

  • Murmur3Partitioner
  • RandomPartitioner
  • ByteOrderedPartitioner

Since the Java driver doesn't know about DefaultPartitioner, it doesn't have a map of token range owners (token metadata) and so can't determine how to "split" the Cassandra ring to query the nodes.

As you already figured out, this doesn't affect the load command because it simply sends writes to coordinators and lets the coordinators figure out how the data is partitioned. But for the unload and count commands, which require reads, the Java driver can't determine which coordinators to pick for sub-range queries when the partitioner is unsupported.
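
You can see this for yourself outside of DSBulk by connecting with the plain Java driver (which DSBulk embeds) and inspecting the token map directly. This is just a minimal sketch, assuming the driver config shown in the question is on the classpath as application.conf; it is not something DSBulk runs for you.

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.metadata.TokenMap;
import java.util.Optional;

public class CheckTokenMap {
    public static void main(String[] args) {
        // Picks up contact points, auth and SSL from application.conf
        // (the datastax-java-driver block shown in the question).
        try (CqlSession session = CqlSession.builder().build()) {
            Optional<TokenMap> tokenMap = session.getMetadata().getTokenMap();
            if (!tokenMap.isPresent()) {
                // The condition DSBulk trips over: with an unknown partitioner
                // there is no token-range-to-node mapping to plan range reads.
                System.out.println("Token metadata not present");
            } else {
                System.out.println("Token ranges known: "
                        + tokenMap.get().getTokenRanges().size());
            }
        }
    }
}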

As a possible workaround, you can try disabling token-awareness with:

$ dsbulk count [...]
  --driver.advanced.metadata.token-map.enabled false

but I don't have an AWS Keyspaces cluster I could test against, and I'm doubtful it will work. In any case, you're welcome to try.

There is an outstanding DSBulk feature request to provide the ability to completely disable token-awareness (internal ticket ID DAT-622) but it is unassigned at the time of writing so I'm not in a position to provide any expectation on when it will be prioritised. Cheers!

Erick Ramirez
  • $ dsbulk count [...] --driver.advanced.metadata.token-map.enabled false; this property just duplicates what's already in my config file: advanced { metadata { token-map.enabled = false } } – Mindaugas K. Oct 24 '22 at 06:49
  • Ah, I see it now in your config. Like I said, I didn't think it would work, so unfortunately I don't think there's any other solution if you're using Amazon's `DefaultPartitioner`. Cheers! – Erick Ramirez Oct 24 '22 at 07:20

Amazon Keyspaces now supports multiple partitioners, including Murmur3Partitioner. See the AWS Keyspaces documentation on working with partitioners to update your partitioner. You will also want to set token-map.enabled to true:

metadata {
  token-map.enabled = true
}
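
For reference, here is a rough sketch of checking and switching the partitioner via the Java driver. The CQL statements are the ones described in the Keyspaces documentation on partitioners, so please verify them against the current docs before running; as I understand it, the change applies account-wide per Region.

import com.datastax.oss.driver.api.core.CqlSession;

public class SwitchPartitioner {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Check which partitioner the account is currently using.
            String current = session.execute("SELECT partitioner FROM system.local")
                    .one().getString("partitioner");
            System.out.println("Current partitioner: " + current);

            // Switch to Murmur3Partitioner (statement per the Keyspaces docs;
            // verify against the current documentation before running).
            session.execute("UPDATE system.local SET partitioner = "
                    + "'org.apache.cassandra.dht.Murmur3Partitioner' WHERE key = 'local'");
        }
    }
}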

Additionally, if you are using VPC endpoints, you will need the following permissions so that the driver can see the available peers.

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Sid":"ListVPCEndpoints",
         "Effect":"Allow",
         "Action":[
            "ec2:DescribeNetworkInterfaces",
            "ec2:DescribeVpcEndpoints"
         ],
         "Resource":"*"
      }
   ]
}

I would also recommend increasing the connection pool size for the data load process.

advanced.connection.pool.local.size = 3

Finally, I would recommend using AWS Glue instead of DSBulk. DSBulk is a single-process tool and will not scale for larger data loads. Additionally, learning Glue will be helpful in managing other aspects of the data lifecycle. See my example on how to unload/export data using AWS Glue.
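
As a very rough illustration of the Glue approach (in Java, since that came up in the comments): a minimal Spark job that reads the table through the Spark Cassandra Connector and writes it to S3. The bucket/path is hypothetical, and in a real Glue job you would wire this through Glue's job context and the Keyspaces connection settings rather than a bare main().

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ExportBookAwards {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("export-book-awards")
                .getOrCreate();

        // Read the Keyspaces table via the Spark Cassandra Connector
        // (contact points, auth and SSL are configured separately).
        Dataset<Row> df = spark.read()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "my_best_storage")
                .option("table", "book_awards")
                .load();

        // Hypothetical destination bucket; Parquet keeps the export compact.
        df.write().parquet("s3://my-export-bucket/exports/book_awards/");
    }
}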

MikeJPR
  • Nice! It's just that the Scala is a little complicated; Java syntax would be preferable. – Mindaugas K. Nov 30 '22 at 08:19
  • Haha, I agree. I'm new to Scala as well; hopefully the examples help. For Java it will be similar configuration, settings, and best practices. You will just need to compile a JAR and deploy it to S3. – MikeJPR Nov 30 '22 at 14:34