
I want to train my word2vec models on the HPC cluster provided by my university. However, I have been told that, to optimize storage on the cluster, I must convert my data to HDF5 and upload that instead. My data consists of txt files (the txt files I want to train word2vec on). How am I supposed to convert txt files into HDF5?

I have been browsing the documentation but cannot seem to find a tool for txt files. Should I write a custom script?
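
For what it's worth, the route I am leaning towards is a small conversion script with h5py. A minimal sketch (assuming h5py is available on the cluster; the paths and dataset name are only illustrative), reading each txt file into one string and storing the collection as a variable-length UTF-8 string dataset:

    import glob
    import h5py

    # Gather the plain-text documents (the path pattern is illustrative).
    txt_files = sorted(glob.glob("corpus/*.txt"))

    # Read each file into memory as a single string.
    documents = []
    for path in txt_files:
        with open(path, encoding="utf-8") as f:
            documents.append(f.read())

    # Write all documents into one variable-length UTF-8 string dataset.
    with h5py.File("corpus.h5", "w") as h5:
        str_dt = h5py.string_dtype(encoding="utf-8")
        h5.create_dataset("documents", data=documents, dtype=str_dt)

As far as I understand, variable-length string data in HDF5 tends not to compress much, so the resulting file may not be dramatically smaller than the raw txt files.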

Perl Del Rey
  • Your focus should be on the data in your text file. What kind of data do you have, and how do you want to organize it? That will help you define the HDF5 schema/structure you need. In HDF5, groups and datasets organize your data. Once you have these defined, you will use them to save and access the data. – kcw78 Feb 18 '20 at 22:05
  • @kcw78 I have a huge number of txt files, each one a document that I want to train my word2vec model on. Would it be a good approach to read all the txt files into a list of strings and create an HDF5 dataset of strings from them (see the sketch after these comments for reading such a dataset back)? – Perl Del Rey Feb 19 '20 at 07:51
  • I work with scientific data (mostly floats), so can't comment about string data. Variable length strings take special handling when loading into HDF5. You may not see much reduction in file size. Before making a big effort, I suggest some tests with a subset of your data. – kcw78 Feb 19 '20 at 15:10
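
Following up on the comments: here is a minimal sketch of reading the dataset back and training word2vec with gensim (assuming gensim 4.x and h5py 3.x are available on the cluster; the hyperparameters are only illustrative):

    import h5py
    from gensim.models import Word2Vec
    from gensim.utils import simple_preprocess

    # Read the documents back; .asstr() decodes h5py 3.x byte strings to str.
    with h5py.File("corpus.h5", "r") as h5:
        documents = h5["documents"].asstr()[:]

    # Tokenize each document into a list of lowercase word tokens.
    sentences = [simple_preprocess(doc) for doc in documents]

    # Train word2vec on the tokenized documents.
    model = Word2Vec(sentences=sentences, vector_size=100, window=5,
                     min_count=2, workers=4)
    model.save("word2vec.model")

simple_preprocess is just a convenient tokenizer here; any tokenization that yields one list of words per document would work for Word2Vec.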

0 Answers