
I've observed a behaviour with Fuseki: even after dropping graphs from a Fuseki dataset (using the DROP GRAPH command), the size of the "run/databases" folder does not decrease. I recently read about the backup-and-restore mechanism to solve this issue, and wanted to know whether any alternative approach is available. Also, does this size issue happen in Fuseki 3.x versions? I've observed it in Fuseki 2.4.0.
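
For reference, the DROP GRAPH requests look roughly like this against the dataset's SPARQL Update endpoint (the host, port, dataset name and graph URI below are placeholders, not my actual setup):

```bash
# Drop one named graph via the dataset's SPARQL Update endpoint
# (host, port, dataset name and graph URI are placeholders).
curl -X POST \
     --data-urlencode 'update=DROP GRAPH <http://example.org/graph1>' \
     'http://localhost:3030/mydataset/update'

# Or drop every named graph in one request:
curl -X POST \
     --data-urlencode 'update=DROP ALL' \
     'http://localhost:3030/mydataset/update'
```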

Thanks in advance!

Bhavya
  • It depends on the TDB version you chose, TDB1 or TDB2, but the node table isn't pruned directly during DELETE. So reloading the data results in the most compact disk-space usage (see the sketch after these comments). For example, `tdbloader2` creates a very compact B+ tree; later changes make it more fragmented on disk. By the way, the latest version is `4.2.0` ... – UninformedUser Nov 12 '21 at 20:19
  • I seem to be having the same issue: my DB from 2.3 was at 50 GB, and after I reloaded into 4.3 it blew up to 180 GB – Jeff Dec 15 '21 at 01:50
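
To make the reload route from the comment above concrete, here is a minimal sketch assuming a TDB1 database and a stopped Fuseki (paths are placeholders; the TDB2 command-line equivalents are `tdb2.tdbdump` and `tdb2.tdbloader`):

```bash
# Stop Fuseki first, then dump the existing TDB1 database to N-Quads.
tdbdump --loc=run/databases/mydataset > mydataset.nq

# Rebuild into a fresh directory; tdbloader2 writes compact B+ tree indexes.
tdbloader2 --loc run/databases/mydataset-new mydataset.nq

# Swap the directories and restart Fuseki (keep the old copy until verified).
mv run/databases/mydataset run/databases/mydataset-old
mv run/databases/mydataset-new run/databases/mydataset
```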

1 Answer


This answer relates to the current Apache Jena Fuseki - version 4.2.0.

TDB2 has a compaction tool, `tdb2.compact` (only run this when Fuseki is not running).

Or, depending on your setup, `curl -XPOST http://server:port/$/compact/<datasetname>` will compact a database in a running server.
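
For example, a sketch against a local server with a dataset named `/ds` (port, dataset name and directory layout shown are illustrative):

```bash
# Trigger compaction of dataset /ds on a running server (asynchronous admin task).
curl -X POST 'http://localhost:3030/$/compact/ds'

# Check the admin task list until the Compact task shows as finished.
curl 'http://localhost:3030/$/tasks'
```

After compaction the TDB2 dataset directory typically holds more than one generation, e.g. `Data-0001` (old, no longer used) alongside `Data-0002` (current); the superseded `Data-NNNN` directory can be deleted or archived to reclaim the space.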

AndyS
  • Thanks @AndyS - compaction helped me decrease the database size. Now I have two databases: the compact one, which is live, and another one, which is not going to be used anymore and can be deleted. So every time I perform a compact operation, do I need to delete the old database folder manually, or is there a command to take care of this deletion? Also, I've upgraded Fuseki from a 2.x to a 4.x version and observed that each update operation (including DROP GRAPH) gradually increases the database size. Why is this happening in 4.x and not in 2.x? – Bhavya Nov 15 '21 at 11:33
  • It happens in all versions. I can't explain why you didn't see it in 2.4.0 unless that version was not providing full transaction features (2.4.0 was a long time ago). You can delete the old database, or zip it up and delete it. (On MS Windows, it won't free the disk space completely because of a long-term Java/Windows issue in the JDK.) – AndyS Nov 15 '21 at 15:51
  • I get an error when running `curl -XPOST http://server:port/$/compact/` on 4.3: `:: Task : 1 : Compact 19:01:54 INFO Server :: [Task 1] starts : Compact 19:01:54 INFO Compact :: [33772] >>>> Start compact /ds 19:01:58 WARN Compact :: [33772] **** Exception in compact org.apache.jena.tdb2.TDBException: NodeTableTRDF/Read` – Jeff Dec 15 '21 at 01:55
  • Not enough of the error and stack trace is showing. Look down the stack trace for the "caused by". There may be several; the last, furthest down, is usually the most relevant. – AndyS Dec 15 '21 at 08:32
  • Thanks @AndyS: Here's one `Caused by: org.apache.thrift.protocol.TProtocolException: Unrecognized type 0` Here's the full print out: https://gist.github.com/jeffreycwitt/e7c270aae46f403845c87aa57e4b82af – Jeff Dec 15 '21 at 13:37
  • Looks like the database is corrupt in some way. Not related to the compaction (this ticket), but already bad. Try a backup, but I think another process interfered at some time in the past. – AndyS Dec 15 '21 at 20:59
  • @AndyS Rebuilt it more slowly and succeeded! Many thanks. – Jeff Dec 18 '21 at 18:01
  • @AndyS I've tried the backup-and-restore method to clear the unused space in Fuseki. It works as expected when there are only a few datasets in Fuseki, but when I tried the same for about 200+ datasets using an iterative script, it always gets stuck after completing the backup of around 20+ datasets (nearly 1-1.5 GB). Any specific reason for this? Is there any constraint on the total folder size that the run/backup folder can accommodate? – Bhavya Jan 13 '22 at 03:18
  • There are no built-in limits. Obvious OS limits apply, see `ulimit` settings, but generally these cause errors, not stalling. Does it always stop at the same place? Is there enough disk space? Are you trying to do them in parallel or one at a time (a one-at-a-time sketch follows these comments)? The other cause of stalling is that the JVM is trying to GC and is very close to running out of memory; this shows up as high CPU with multiple threads at 100%. – AndyS Jan 13 '22 at 10:13
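
A minimal sketch of running the backups one at a time through the admin protocol, waiting for each asynchronous task to finish before starting the next (the server URL, dataset names and the exact `taskId`/`finished` JSON fields are assumptions to check against your Fuseki version):

```bash
#!/usr/bin/env bash
# Back up datasets one at a time via the Fuseki admin protocol, waiting for
# each asynchronous backup task to finish before starting the next.
set -e
SERVER='http://localhost:3030'   # placeholder server URL

for ds in dataset1 dataset2 dataset3; do
  echo "Backing up $ds ..."
  # POST /$/backup/{name} starts an async task; extract the task id from the JSON reply.
  task=$(curl -s -X POST "$SERVER/\$/backup/$ds" | sed -n 's/.*"taskId"[^"]*"\([^"]*\)".*/\1/p')

  # Poll the task description until it reports a finish time.
  until curl -s "$SERVER/\$/tasks/$task" | grep -q '"finished"'; do
    sleep 5
  done
done
```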