1

I have a huge orderhistory table in cassandra having data from 2013, But I want only last 12 months of orderhistory data to be unloaded, I use the below command to do it which unloads all the data starting from 2013 and stores in the path data/json/customer_data/orderhistory/data. How do I modify the below statement such that each time I run this it should select only last 12months of data?

dsbulk unload -k customer_data -t crawlsiteidentifiedpages -h '172.xx.xx.xxx' \
  -c json -url data/json/customer_data/orderhistory/data
Alex Ott
  • 80,552
  • 8
  • 87
  • 132
Rahul Diggi
  • 288
  • 2
  • 16

2 Answers2

1

You need to remove options -k and -t, and instead use the -query option as described in documentation, like:

dsbulk unload -query 'select * from ks.table where <your condition>'

To make sure that unload is parallelized, make sure that your condition includes part like and token(pkcol) > :start and token(pkcol) <= :end where pkcol is the name of the partition column (if you have multiple partition columns, specify them comma-separated).

Alex Ott
  • 80,552
  • 8
  • 87
  • 132
  • Hello @Alex , I tried but not working: dsbulk unload -query "select * from ks.table where token(url) > :start and token(url) <= :end and lastupdatetime >= toTimestamp('2021-06-01 08:21:55.717') ALLOW FILTERING" -h '172.xx.xx.xxx, 172.xx.xx.xxx' -c json -url data/json/ks/table/data – Rahul Diggi Jul 08 '22 at 14:28
  • what doesn't work? please update your post with error – Alex Ott Jul 08 '22 at 14:35
  • I get this error: failed: Unknown CQL3 function toTimestamp called. I am Unable to figure out the exact syntax to fetch last 12 months, Help would be highly appreciated – Rahul Diggi Jul 08 '22 at 15:20
1

Instead of -t crawlsiteidentifiedpages you should use -query and provide the SELECT query, e.g.:

-query "SELECT * FROM crawlsiteidentifiedpages WHERE token(pk) > :start and token(pk) <= :end and date > maxTimeuuid('2021-06-21+0000') ALLOW FILTERING"

A few remarks:

  1. I assume your table has one partition key column pk and one clustering column date of type timeuuid – please adjust the actual query accordingly.
  2. The WHERE restriction token(pk) > :start and token(pk) <= :end allows DSBulk to parallelize the operation and improves performance.
  3. The WHERE restriction date > maxTimeuuid('2021-06-21+0000') is where the magic happens and allows you to select only the last 12 months of data.
  4. Unfortunately, you also need to add ALLOW FILTERING to this type of query, otherwise Cassandra will reject the query.
adutra
  • 4,231
  • 1
  • 20
  • 18
  • Thanks you very much, I need another help, Each time when I run this command, It creates 2 json files however small or big the data is. My question is, How do I have control over the number of files?Suppose I want to have 5 files instead of 2? – Rahul Diggi Jun 22 '22 at 16:05
  • You need to add -maxConcurrentFiles 1 – adutra Jun 24 '22 at 08:40