How to include SQL select statement in dsbulk unload command

Question

I have a huge orderhistory table in cassandra having data from 2013, But I want only last 12 months of orderhistory data to be unloaded, I use the below command to do it which unloads all the data starting from 2013 and stores in the path data/json/customer_data/orderhistory/data. How do I modify the below statement such that each time I run this it should select only last 12months of data?

dsbulk unload -k customer_data -t crawlsiteidentifiedpages -h '172.xx.xx.xxx' \
  -c json -url data/json/customer_data/orderhistory/data

score 1 · Answer 1 · answered Jun 21 '22 at 09:36

1

You need to remove options -k and -t, and instead use the -query option as described in documentation, like:

dsbulk unload -query 'select * from ks.table where <your condition>'

To make sure that unload is parallelized, make sure that your condition includes part like and token(pkcol) > :start and token(pkcol) <= :end where pkcol is the name of the partition column (if you have multiple partition columns, specify them comma-separated).

answered Jun 21 '22 at 09:36

Alex Ott

80,552
8
87
132

Hello @Alex , I tried but not working: dsbulk unload -query "select * from ks.table where token(url) > :start and token(url) <= :end and lastupdatetime >= toTimestamp('2021-06-01 08:21:55.717') ALLOW FILTERING" -h '172.xx.xx.xxx, 172.xx.xx.xxx' -c json -url data/json/ks/table/data – Rahul Diggi Jul 08 '22 at 14:28
what doesn't work? please update your post with error – Alex Ott Jul 08 '22 at 14:35
I get this error: failed: Unknown CQL3 function toTimestamp called. I am Unable to figure out the exact syntax to fetch last 12 months, Help would be highly appreciated – Rahul Diggi Jul 08 '22 at 15:20

score 1 · Answer 2 · answered Jun 21 '22 at 09:54

Instead of -t crawlsiteidentifiedpages you should use -query and provide the SELECT query, e.g.:

-query "SELECT * FROM crawlsiteidentifiedpages WHERE token(pk) > :start and token(pk) <= :end and date > maxTimeuuid('2021-06-21+0000') ALLOW FILTERING"

A few remarks:

I assume your table has one partition key column pk and one clustering column date of type timeuuid – please adjust the actual query accordingly.
The WHERE restriction token(pk) > :start and token(pk) <= :end allows DSBulk to parallelize the operation and improves performance.
The WHERE restriction date > maxTimeuuid('2021-06-21+0000') is where the magic happens and allows you to select only the last 12 months of data.
Unfortunately, you also need to add ALLOW FILTERING to this type of query, otherwise Cassandra will reject the query.

Thanks you very much, I need another help, Each time when I run this command, It creates 2 json files however small or big the data is. My question is, How do I have control over the number of files?Suppose I want to have 5 files instead of 2? — Rahul Diggi, Jun 22 '22 at 16:05

How to include SQL select statement in dsbulk unload command

2 Answers2