
I am converting 100 CSV files into DataFrames and storing them in an HDFStore.

What are the pros and cons of

a - storing each CSV file in its own HDFStore file (100 files in total)?

b - storing all the CSV files as separate items in a single HDFStore?

Performance aside, I am asking because I am having stability issues: my HDFStore files often get corrupted. So, for me, there is a risk associated with a single HDFStore. However, I am wondering whether there are benefits to having a single store.

Anto
Ginger

1 Answer

These are the differences:

multiple files

  1. with multiple files, a failure during a write (e.g. a power failure) can corrupt only the one file being written
  2. you can parallelize writing across multiple files (note: never, ever try to parallelize writes to a single file, as this will corrupt it!)
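A minimal sketch of the multiple-files approach, assuming pandas with PyTables installed. The helper names (`store_path_for`, `convert_one`, `convert_all`), the `"data"` key, and the `.h5` suffix are hypothetical choices for illustration, not anything from the question. Each worker process writes only to its own file, so no two processes ever touch the same store:

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
from typing import List

import pandas as pd


def store_path_for(csv_path: Path, out_dir: Path) -> Path:
    """Return the HDF5 file that will hold the data from csv_path."""
    return out_dir / (csv_path.stem + ".h5")


def convert_one(csv_path: Path, out_dir: Path) -> Path:
    """Read one CSV and write it to its own HDF5 file (requires PyTables)."""
    target = store_path_for(csv_path, out_dir)
    df = pd.read_csv(csv_path)
    # mode="w" recreates the file, so a damaged file left by an
    # earlier interrupted run is replaced rather than appended to.
    df.to_hdf(str(target), key="data", mode="w")
    return target


def convert_all(csv_dir: Path, out_dir: Path, workers: int = 4) -> List[Path]:
    """Convert every CSV in csv_dir, one output file per input file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_paths = sorted(csv_dir.glob("*.csv"))
    # One task per file: the processes never share a store.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_one, csv_paths, [out_dir] * len(csv_paths)))
```

On platforms that start worker processes with spawn (e.g. Windows), call `convert_all(...)` from inside an `if __name__ == "__main__":` block. If a write is interrupted, only that one file needs to be regenerated from its CSV.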

single file

  1. grouping of logical sets

IMHO the advantages of multiple files outweigh those of a single file, as you can easily replicate the grouping properties by using subdirectories.
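To illustrate how subdirectories can stand in for grouping, here is a rough sketch that maps the same logical `(group, name)` pair either to a file under a subdirectory or to a hierarchical key in a single store. The function names, the `"data"` key, and the example group names are assumptions made for this illustration:

```python
from pathlib import Path

import pandas as pd


def file_for(root: Path, group: str, name: str) -> Path:
    """Multiple files: the group becomes a subdirectory."""
    return root / group / (name + ".h5")


def key_for(group: str, name: str) -> str:
    """Single file: the group becomes part of a hierarchical key."""
    return "/" + group + "/" + name


def read_from_files(root: Path, group: str, name: str) -> pd.DataFrame:
    """Load one DataFrame from its own file (requires PyTables)."""
    # Assumes each file was written with key="data".
    return pd.read_hdf(str(file_for(root, group, name)), key="data")
```

For example, `file_for(Path("stores"), "equities", "prices")` gives `stores/equities/prices.h5`, while the single-store equivalent would be the key `/equities/prices` inside one HDFStore.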

Jeff