
I have a CSV file with more than 13 million rows that I want to convert to HDF5. I can run this code:

import vaex as vx

df_chunk = vx.from_csv(r'df.csv', nrows=20_000_000)

but if I then run the following code:

df_chunk.export(r'df.hdf5')

I get this error:

AttributeError: 'DataFrameArrays' object has no attribute 'dtype'

The same error happens when I run:

df_chunk = vx.from_csv(r'df.csv', convert='True', nrows=20_000_000)

Can you tell me what's wrong, or how I can solve this? Thanks.

SophieLD

2 Answers


I downgraded my Python version to 3.7, reinstalled the new version of Vaex (4.0), and ran the code again; everything worked without error. Thank you for all the attention and help I have gotten.
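
For reference, a minimal sketch of the conversion that then ran cleanly (assuming `import vaex as vx`, as in the question):

import vaex as vx

df_chunk = vx.from_csv(r'df.csv', nrows=20_000_000)  # read the CSV (Vaex uses pandas here)
df_chunk.export(r'df.hdf5')  # write the data out as HDF5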

SophieLD

The error message (object has no attribute 'dtype') is interesting. dtype is a NumPy thing (it describes the data type of a NumPy array). Maybe that's a clue.

I am not familiar with vaex, so I read their documentation. :-)

I noticed you didn't use the seperator parameter (note: that spelling is from the docs). If your values really are comma-separated, you need seperator=",".

If that doesn't work, this might help. The vaex 4.0.0-dev0 documentation shows other ways to read a CSV file and create an HDF5 file. Have you tried vx.from_ascii()? The docs show this method:

ds = vx.from_ascii("table.csv", seperator=",", names=["x", "y", "z"])

Adding the names= parameter might help with the dtype message (if compound arrays are being used). Using that example, this might work (you will have to fill in the column names in the list):

df_chunk = vx.from_ascii('df.csv', seperator=",", names=[--add your column names here--], nrows=20_000_000)  
df_chunk.export('df.hdf5')

Note: I removed the r from the filename strings ('df.csv' instead of r'df.csv'). Not sure if that matters for this case.

kcw78
  • Thank you for your suggestion, but unfortunately the function I need to use is vaex.from_csv, not from_ascii. I think the problem is that from_csv can only convert 5 million rows by default; my file is too big, so I need to figure out how to convert it in pieces and then combine them. – SophieLD Mar 14 '21 at 09:53
  • I did a little more reading. `vaex.from_csv` uses Pandas in the background. 13e6 rows is NOT a big dataset for Pandas or HDF5. How much RAM do you have? If memory is an issue, try the `chunk_size=` argument. If the CSV file is too large to fit into RAM, with `chunk_size=` Vaex will read the CSV in chunks and convert each chunk to a temporary HDF5 file (then concatenate them into a single HDF5 file at the end); see the sketch after these comments. – kcw78 Mar 14 '21 at 14:57
  • Thank you for trying to help me. I think I finally found the reason: the Python version I was using, 3.8, is not compatible with the latest version of Vaex. I tested with Python 3.7 and got Vaex 4.0 installed, then ran the code, and finally everything works! – SophieLD Mar 14 '21 at 22:52
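
A minimal sketch of the chunked conversion described in the comment above, assuming Vaex 4.0 and `import vaex as vx`; the 5-million-row chunk size is an arbitrary example, so tune it to your available RAM:

import vaex as vx

# With convert=True and chunk_size set, Vaex reads the CSV in chunks,
# writes each chunk to a temporary HDF5 file, and concatenates them
# into a single HDF5 file cached on disk (as df.csv.hdf5).
df = vx.from_csv('df.csv', convert=True, chunk_size=5_000_000)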