
I would like to join thousands of dataframes into one Vaex dataframe, following the documentation here: https://vaex.readthedocs.io/en/latest/api.html?highlight=concat#vaex.concat

I do:

df_vaex = vaex.DataFrame()
for i,file in enumerate(files):
    df = pd.read_pickle(file)
    df_vx = vaex.from_pandas(df=df, copy_index=False)
    df_vaex.concat(df_vx)
    if i%100 == 0:
        print(i)

This does not work. How can I read and concatenate dataframes in vaex?

I get an error saying that vaex does not have the method concat: AttributeError: 'DataFrame' object has no attribute 'concat'


Second try, following the first comment:

for i,file in enumerate(files):
    df = pd.read_pickle(file)
    df_vaex_total = vaex.from_pandas(df=df, copy_index=False)
    if i == 0:
        pass
    else:
        print(type(df_vaex_total)) # it's equal to <class 'vaex.dataframe.DataFrameLocal'>
        print(type(df_vx)) # it's equal to <class 'vaex.dataframe.DataFrameLocal'>
        
        df_vaex_total = pd.concat([df_vaex_total, df_vx])
        
    if i%10 == 0:
        print(i)

error: TypeError: cannot concatenate object of type '<class 'vaex.dataframe.DataFrameLocal'>'; only Series and DataFrame objs are valid


1 Answer


If you want to use vaex to concat dataframes, you need to do it in the following way:

  • read in all dataframes first
  • create a list of dataframes
  • use df_final = vaex.concat(list_of_dataframes)

So your code would look something like this:

list_of_dataframes = []

for i, file in enumerate(files):
    pdf = pd.read_pickle(file)
    df = vaex.from_pandas(pdf)
    list_of_dataframes.append(df)

df_final = vaex.concat(list_of_dataframes)
  • Thanks. But that means that all the dataframes have to be in memory at once, and in my case that is not possible. That is the very reason I am using Vaex, i.e. all the dataframes together do not fit in the memory of the computer (120GB). – JFerro Oct 22 '22 at 21:14
  • 1
    Well in that case, convert each individual pickle file to hdf5 via vaex. So instead of using concat, in the loop do `df.export_hdf5(f'part_{i}.hdf5')`. After the loop is finished you can open all hdf5 files as a single dataframe either via `df = vaex.open('part*.hdf5')` or via `df = vaex.open_many(list_of_paths)` – Joco Oct 23 '22 at 01:26
  • I guess I should open another question since the original files are text files. I know I cannot read all of them into one pandas dataframe. So I have to read every one individually and save it as hdf5. Is it possible to read space-separated CSV values and concatenate vaex dfs on the fly? It's to avoid making the loop two times. – JFerro Oct 23 '22 at 12:03
  • if the original data is in CSV format, with the latest version of vaex you can lazily read csv files! Regardless of how big the file is it will not be read in memory. So for simple things, you can just work with it, but if you want better performance, you can convert to hdf5 right away ,no loops needed. If the CSV is not too funky, you can do something like `df = vaex.open(my_file.csv, convert=True)` – Joco Oct 23 '22 at 16:28