Questions tagged [data-compaction]

13 questions
1 vote, 2 answers

Directory size increased after compaction using pyspark

I wrote a file compactor using PySpark. It works by reading all the content of a directory into a Spark DataFrame and then performing a repartition action to reduce the number of files. The number of desired files is…
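A minimal sketch of the compactor described above, with hypothetical function names and a hypothetical 128 MiB per-file target (neither is from the question). Note the comment on `repartition()`: the shuffle it performs can destroy any existing sort order, which makes columnar encodings less effective and is one common reason a "compacted" directory ends up larger.

```python
import math

# Hypothetical sizing helper: how many output files to ask repartition() for,
# given the directory's total size and a desired per-file size.
def target_partitions(total_bytes: int, target_file_bytes: int = 128 * 1024 * 1024) -> int:
    return max(1, math.ceil(total_bytes / target_file_bytes))

def compact_directory(spark, in_path: str, out_path: str, total_bytes: int) -> None:
    # Read everything, then repartition down to the computed file count.
    # Caveat: repartition() shuffles rows, which can break any sort order and
    # make dictionary/run-length encodings in Parquet less effective -- a
    # plausible cause of the directory growing after "compaction".
    df = spark.read.parquet(in_path)
    n = target_partitions(total_bytes)
    df.repartition(n).write.mode("overwrite").parquet(out_path)
```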
1 vote, 0 answers

Using multiple TTL values in Cassandra table

What are the disadvantages of using multiple TTL values (one at table level and another on specific rows to override the TTL for those rows) in a Cassandra table? Will it result in incomplete data cleanup? Since TWCS is being used, we may never get…
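For reference, the two TTL layers the question mentions look like this in CQL (table and column names are hypothetical):

```sql
-- Table-level default TTL, in seconds.
CREATE TABLE sensor_data (
    sensor_id text,
    ts timestamp,
    value double,
    PRIMARY KEY (sensor_id, ts)
) WITH default_time_to_live = 7776000    -- 90 days
  AND compaction = {'class': 'TimeWindowCompactionStrategy'};

-- Row-level override: USING TTL wins over the table default for this write.
INSERT INTO sensor_data (sensor_id, ts, value)
VALUES ('s1', toTimestamp(now()), 1.0)
USING TTL 86400;   -- 1 day
```

The TWCS concern the question alludes to is real: per-row TTLs that differ from the table default can leave SSTables whose contents expire at mixed times, so whole-SSTable drops are delayed.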
0 votes, 2 answers

Data in hive table is changed after running a compaction in pyspark

Following a previously asked question (link added). In short: I wrote a file compactor in Spark. It works by reading all files under a directory into a DataFrame, performing coalesce over the DataFrame (with the desired number of files),…
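A compaction step should never change the data, only the file layout, so one sanity check is to compare the rows before and after as multisets (order-independent, duplicates counted). A minimal sketch; the function name is hypothetical and the inputs are plain collected rows:

```python
from collections import Counter

def rows_unchanged(before_rows, after_rows) -> bool:
    """Compare two datasets as multisets of rows (order ignored), e.g. the
    collected output of the same Hive query before and after compaction."""
    return Counter(map(tuple, before_rows)) == Counter(map(tuple, after_rows))
```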
0 votes, 0 answers

Can I run compaction in multiple graph spaces in the NebulaGraph database?

I'm running Nebula Graph database on AWS with the Twitter dataset (3 graph spaces), and each space has a data volume of around 500GB. I know that the compaction process is quite time-consuming. Can I run compaction for all 3 graph spaces at the same…
randomv
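In NebulaGraph, manual compaction is submitted per graph space, so running it for several spaces concurrently means issuing the job in each space. A sketch in nGQL (the space name is hypothetical):

```
USE twitter_space_1;   -- hypothetical space name
SUBMIT JOB COMPACT;    -- returns a job id for this space
SHOW JOBS;             -- check job status and progress
```

Whether running all three at once is wise depends on disk and CPU headroom: compaction is I/O-heavy, and concurrent jobs on shared storage will compete.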
0 votes, 0 answers

Kafka - consuming messages from topic while removing duplicates

I'm going to consume a Kafka topic with log.cleanup.policy=compact. There will be many consumers/producers concurrently reading/writing the topic. I want to be sure that the consumers, while reading messages from the topic, skip all the duplicates,…
freedev
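The read-side effect of log compaction can be sketched in a few lines: for each key, only the most recent value survives, ordered by last write. A minimal simulation (function name hypothetical):

```python
def compacted_view(records):
    """Simulate the eventual effect of cleanup.policy=compact: for each key,
    only the most recent value survives, in order of last write."""
    latest = {}
    for key, value in records:
        latest.pop(key, None)   # drop the stale position for this key
        latest[key] = value     # re-insert so ordering follows the last write
    return list(latest.items())
```

One caveat worth knowing: compaction is asynchronous and only eventually removes older duplicates, so a consumer that must never see duplicates still needs its own keyed state (e.g. a dict like the one above) while reading.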
0 votes, 1 answer

Kafka - changing log.cleanup.policy to existing topic

I have a Kafka topic that receives many, many messages. Many of them share the same key, and I'm interested only in the latest message per key. Looking around, this topic seems perfect for the config log.cleanup.policy=compact. Can I change the existing Kafka…
freedev
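Topic-level configs can be altered on an existing topic without recreating it; a sketch with a hypothetical broker address and topic name:

```shell
# Hypothetical broker address and topic name.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name my-topic \
  --alter --add-config cleanup.policy=compact
```

Two things to keep in mind: compacted topics require every message to have a key, and the cleaner works in the background, so older duplicates disappear gradually rather than immediately.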
0 votes, 1 answer

Does etcd's storage footprint grow linearly with respect to keys and values?

I noticed that, when running some stress tests on a Kubernetes cluster, etcd snapshot sizes didn't increase much, even as I added more and more stuff to my cluster. I collected snapshots via: etcdctl --endpoints="https://localhost:2379"…
jayunit100
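Part of the likely explanation: etcd's MVCC store keeps every revision of a key until the history is compacted, after which the freed pages are reused (and returned to the OS only on defragmentation). So the on-disk footprint tracks live keys plus uncompacted history, not total writes, and plateaus once compaction keeps pace. The manual equivalents, with a hypothetical endpoint and revision:

```shell
# Drop all revisions older than the given revision number (hypothetical value).
etcdctl --endpoints="https://localhost:2379" compaction 123456
# Defragment the backend database file to actually shrink it on disk.
etcdctl --endpoints="https://localhost:2379" defrag
```

Kubernetes control planes typically run etcd with auto-compaction enabled, which would produce exactly the flat snapshot sizes observed.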
0 votes, 1 answer

RocksDB: notification when all compaction jobs are done

I use RocksDB's bulk loading mechanism to load a bunch of SST files generated by offline Spark tasks. To avoid a large amount of disk I/O during the loading and compacting process from affecting online read requests, I want to finish offline…
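RocksDB exposes the integer properties `rocksdb.compaction-pending` and `rocksdb.num-running-compactions` (in C++ there is also the push-style `EventListener::OnCompactionCompleted`). A binding-agnostic polling sketch, where the caller supplies a property getter; the function name and settle thresholds are hypothetical:

```python
import time

def wait_for_compaction(get_int_property, poll_secs=1.0, settle_checks=3):
    """Poll RocksDB's compaction properties until no compaction is pending
    or running for several consecutive checks, then return True."""
    quiet = 0
    while quiet < settle_checks:
        pending = get_int_property("rocksdb.compaction-pending")
        running = get_int_property("rocksdb.num-running-compactions")
        quiet = quiet + 1 if (pending == 0 and running == 0) else 0
        if quiet < settle_checks:
            time.sleep(poll_secs)
    return True
```

The repeated "quiet" checks guard against the window where one compaction has finished but its output has not yet triggered the next level's compaction.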
0 votes, 1 answer

CouchDB 3.2 disable auto compaction for a specific database

How can I disable auto compaction in CouchDB 3.2? I want to preserve all the history for a specific database, or completely disable auto compaction. Note: the CouchDB 3.2 configuration has changed from 2.0.
Zeta
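In CouchDB 3.x, auto-compaction is driven by the "smoosh" daemon, which schedules work through named channels. A sketch of disabling it globally via `local.ini` (an assumption: this clears the channel lists for the whole node, not a single database):

```ini
; local.ini -- sketch, assuming the 3.x "smoosh" auto-compaction daemon
[smoosh]
; emptying the channel lists disables scheduled database/view compaction
db_channels =
view_channels =
```

This is node-wide; keeping auto-compaction for other databases while exempting one would need a different approach.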
0 votes, 1 answer

How to free disk space in Cassandra when a lot of tombstones have accumulated under SizeTieredCompactionStrategy

I am running cqlsh version 5.0.1 on a 6-node cluster. Recently I did a major data cleanup in a table that uses SizeTieredCompactionStrategy in order to free some disk space, but that didn't happen. The issue I am facing is that…
Yash Tandon
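Under STCS, deleted data is purged only when the SSTables containing both the data and its tombstones are compacted together, and only after gc_grace_seconds has passed. Two `nodetool` commands that can force the issue (keyspace and table names are hypothetical):

```shell
# Rewrite SSTables, dropping data shadowed by droppable tombstones.
nodetool garbagecollect my_keyspace my_table

# Or force a major compaction, merging everything into few large SSTables.
# Use with care under STCS: the resulting giant SSTable rarely compacts again.
nodetool compact my_keyspace my_table
```

Also check gc_grace_seconds on the table: tombstones younger than it cannot be dropped by any compaction.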
0 votes, 1 answer

HBase: major compaction config does not take effect

I have set the config: hbase.offpeak.start.hour: 18, hbase.offpeak.end.hour: 22, hbase.hregion.majorcompaction: 86400000. But HBase still runs major compactions at random times, like 9:00, 13:55, and so on. Can you tell me how to configure HBase to do major…
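The "random" times are most likely hbase.hregion.majorcompaction.jitter (default 0.5), which deliberately spreads the 24-hour period across regions; the offpeak hours only adjust the compaction selection ratio during that window, they do not schedule major compactions. A sketch of the relevant hbase-site.xml entries:

```xml
<!-- hbase-site.xml sketch: fixed 24h major-compaction period, no jitter -->
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>86400000</value>
</property>
<property>
  <name>hbase.hregion.majorcompaction.jitter</name>
  <value>0</value>
</property>
<property>
  <name>hbase.offpeak.start.hour</name>
  <value>18</value>
</property>
<property>
  <name>hbase.offpeak.end.hour</name>
  <value>22</value>
</property>
```

With jitter at 0, every region compacts exactly on the configured period; many operators instead disable periodic majors entirely and trigger them from cron during off-peak hours.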
0 votes, 1 answer

How to remove old revisions of the documents in a couchdb database?

I have a very large database with some GB of data, and when I try to compact it, it takes more than 12 hours. Is there any other way to delete old revisions? Does _revs_limit help with this? I can see that the revs limit of all databases is set…
Rahib Rasheed
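`_revs_limit` is a per-database setting exposed as an HTTP endpoint; lowering it bounds how many entries survive in each document's revision tree, but the old revision bodies are still reclaimed only by compaction. A sketch with a hypothetical server and database name:

```shell
# Inspect and lower the revision limit for one database.
curl -X GET http://localhost:5984/mydb/_revs_limit
curl -X PUT http://localhost:5984/mydb/_revs_limit -d '5'
```

So a lower limit makes the eventual compaction cheaper and the database smaller afterwards, but it does not replace the compaction run itself.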
-1 votes, 1 answer

Which compaction strategy is recommended for a table with minimal updates

I am looking for a compaction strategy for data with the following characteristics: we don't need the data after 60-90 days (in extreme scenarios maybe 180 days); ideally only inserts happen and updates never happen, but it is realistic to expect…
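For insert-mostly, time-bounded data like this, the usual fit is TimeWindowCompactionStrategy paired with a table-level TTL, so whole SSTables expire and are dropped without rewriting. A CQL sketch with hypothetical table names and window sizes:

```sql
CREATE TABLE events (
    id uuid,
    ts timestamp,
    payload text,
    PRIMARY KEY (id, ts)
) WITH default_time_to_live = 7776000          -- 90 days
  AND compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 7                -- roughly 12-13 windows per TTL
  };
```

TWCS tolerates the occasional update, but frequent overwrites of old rows spread a partition's data across windows and undercut the whole-SSTable expiry that makes the strategy attractive.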