I am trying to understand the discrepancy between the size of my raw S3 file and the storage volume Neptune reports after I load it. I am testing with a small slice of my original graph (~15%, vertices only): the raw CSV is 3.1 GB uncompressed, but once loaded into Neptune the reported storage is 59.6 GB. I understand that roughly 10 GB is dynamically allocated as a baseline, but even so, 50+ GB seems excessive given the size of my initial dataset. This is a brand new cluster.
For this test, each vertex has just 4 properties (2 strings, 2 integers), all single cardinality. There are 90 million vertices and no edges, since I am only measuring the change in volume. My real scenario is 600+ million vertices and probably 2x that many edges. When we load the entire dataset we approach 2 TB of storage, and performance issues start to arise (reads have to go to volume storage because the data no longer fits in the cache).
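For context, here is the rough per-vertex math I am working from (a quick Python sketch; it assumes the ~10 GB baseline can simply be subtracted and that the remainder is attributable to the vertices and their properties):

```python
# Back-of-the-envelope estimate from the numbers above.
# Assumption: the ~10 GB baseline allocation is subtracted and the rest
# of the reported storage belongs to the 90M vertices and their 4 properties.

raw_csv_gb = 3.1          # uncompressed CSV in S3
reported_gb = 59.6        # storage reported by the Neptune cluster
baseline_gb = 10.0        # assumed dynamically allocated baseline
vertices = 90_000_000     # vertices loaded (4 single-cardinality properties each)

attributable_gb = reported_gb - baseline_gb
neptune_bytes_per_vertex = attributable_gb * 1024**3 / vertices
csv_bytes_per_vertex = raw_csv_gb * 1024**3 / vertices

print(f"~{neptune_bytes_per_vertex:.0f} bytes/vertex in Neptune "
      f"vs ~{csv_bytes_per_vertex:.0f} bytes/vertex in the raw CSV")
# -> roughly 590 bytes/vertex in Neptune vs ~37 bytes/vertex in the CSV,
#    i.e. about a 16x expansion for this test slice
```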
Is there documentation, similar to what DynamoDB provides, on how to estimate storage size for properties, vertices, edges, etc.? I want to take these estimates into account when designing a new data model or data-fetching strategy.
Dynamo link: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/CapacityUnitCalculations.html
Thanks!