Get distinct partition keys from C* table

Question

cqlsh doesn't allow nested queries so I cant export selected data to csv.. I'm trying to export the selected data (about 200,000 rows with a single column) from cassandra using:

echo "SELECT distinct imei FROM listener.snapshots;" > select.cql bin/cqlsh -f select.cql > output.txt

and it just stuck forever without any error, and the file isn't growing.

if I use strace on the last line I got many rows like:

select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000})  = 0 (Timeout)

and --debug give me only:

cqlsh --debug -f select.cql > output.txt

Using CQL driver: <module 'cassandra' from '/usr/share/dse/resources/cassandra/bin/../lib/cassandra-driver-internal-only-2.5.1.zip/cassandra-driver-2.5.1/cassandra/__init__.py'>

what is wrong? Is there better way to get distinct partition keys from large C* table?

Try strace to see where it gets stuck. cqlsh is probably going to set a limit to your query by default so it might not be the best way to do this. — phact, Dec 24 '15 at 15:17
Maybe try cqlsh --debug. I'm not sure why there aren't file descriptors in your strace select https://gist.github.com/tonyc/1384523 — phact, Dec 24 '15 at 17:01

score 5 · Accepted Answer · 2015-12-29T14:51:57.180

I used capture:

cqlsh> CAPTURE 'temp.csv'                                              
Now capturing query output to 'temp.csv'.
cqlsh> SELECT distinct imei FROM listener.snapshots;
---MORE---
---MORE---
---MORE---
---MORE---
.
.
.
cqlsh> 
cqlsh>

And press enter until it finished.

Even faster option is to use paging:

cqlsh> PAGING off
Disabled Query paging.
cqlsh> CAPTURE 'temp.csv'                                              
Now capturing query output to 'temp.csv'.
cqlsh> SELECT distinct imei FROM listener.snapshots;

It would immediately extract the data to the file (if you get a OperationTimedOut you should edit the timeout settings in cassandra.yaml).

I cant believe that it is the fasts way there is... I know I can export data using spark by using CassandraSQLContext but its not so fast when I need to create the rdd querying C* for distinct column out of very large table(2B rows~), and print them to file:

    val conf = new SparkConf().setAppName("ExtractDistinctImeis")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val connector = CassandraConnector(conf)
    val cc = new CassandraSQLContext(sc)

    val snapshots_imeis = cc.sql("select distinct imei from listener.snapshots").map(row => row(0).toString)

    val imeis = snapshots_imeis.collect 

    def printToFile(f: java.io.File)(op: java.io.PrintWriter => Unit) {
        val p = new java.io.PrintWriter(f)
        try { op(p) } finally { p.close() }
    }

    printToFile(new File("/path/to/file.txt")) { p => imeis.foreach(p.println) }

It took 3.5 hours with spark! With capture I manage to get my file after 3 min/3 sec.

score 0 · Answer 2 · answered Dec 26 '15 at 21:46

0

Usually you would use the cqlsh "COPY ... TO ..." command to export data from a table to a csv file.

See the documentation here.

I'm not sure why what you're trying is getting stuck. To debug I'd suggest not using DISTINCT and adding a LIMIT clause. It may be your table is large and so cqlsh is paging the results, but there is no one there to press return to get to the next page of results, so it waits forever.

answered Dec 26 '15 at 21:46

Jim Meyer

9,275
1
24
49

how can I use 'copy to' without nested queries? I want to get distinct partition keys of the table, not all of them, so- "COPY listener.snapshots (imei) TO 'temp.csv';" isn't enough. – Dec 27 '15 at 12:43
you were right about why it got stck- I tried with limit and it worked. Still does not know how to get distinct partition keys. – Dec 27 '15 at 13:30
I'd use "sort -u" to remove duplicates after you have created the csv file. – Jim Meyer Dec 27 '15 at 13:31
but what if the table is very large? exporting it all to csv and then sorting, is the only way to get the distinct partitions keys of it? there is no better and simple way to do it? – Dec 27 '15 at 13:40
1

It looks like cqlsh has an option to disable paging. Try adding "PAGING OFF" as the first line of your cqlsh command file. – Jim Meyer Dec 27 '15 at 13:44
1

Don't put parenthesis around OFF. It's just "PAGING OFF;" with no parenthesis. This should generate the message, "Disabled Query paging." – Jim Meyer Dec 27 '15 at 13:59
I edit my file to: "PAGING OFF; SELECT distinct imei FROM listener.snapshots;" and when I use it it shows "select.cql:3:OperationTimedOut: errors={}" so it didn't work. :/ any ideas? – Dec 27 '15 at 14:26

Get distinct partition keys from C* table

2 Answers2

Linked