Questions tagged [vaex]

Vaex is a Python library for lazy, out-of-core DataFrames (similar to pandas), used to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, and standard deviation on an N-dimensional grid for more than a billion (10^9) rows per second. Visualization is done using histograms, density plots, and 3D volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero-memory-copy policy, and lazy computations for best performance (no memory is wasted).

181 questions
2
votes
1 answer

vaex - create a dataframe from a list of lists

In Vaex's docs, I cannot find a way to create a dataframe from a list of lists. In pandas I would simply do pd.DataFrame([['A',1,3], ['B',2,4]]). How can this be done in Vaex?
shamalaia
  • 2,282
  • 3
  • 23
  • 35
2
votes
0 answers

Converting a sparse matrix to HDF5 takes too much time even in Vaex, and memory crashes

I have a dataframe that contains text data and numerical features. I have vectorized the text data, and I plan to concatenate it with the remaining numerical data for running machine learning algorithms. I have vectorized the text data using TF-IDF as shown…
P H
  • 294
  • 1
  • 3
  • 16
2
votes
0 answers

groupby on a very large (+10 GB) dataset with Python libraries: pandas, vaex and dask

I have more than 10 GB of transaction data. I used Dask to read the data, select the columns I am interested in, and group by the columns I wanted. All this was incredibly fast, but computing wasn't working well and debugging was hard. I then decided…
amiraghrs
  • 21
  • 2
2
votes
1 answer

Performance Tips for using Vaex

I am using Vaex and looking for performance tips. My use-case is as follows: I have a large dataframe - let's call it large_df (only a few columns but tens of millions of rows, and in production the dataset will be >10x as large). One of the columns…
Josh Reback
  • 529
  • 5
  • 16
2
votes
1 answer

Plot large data with vaex

I've been struggling to create a plot of a CSV with millions of lines. I am trying to use the vaex module but I'm stuck: import vaex # converts and reads large csv into hdf5 format df = vaex.open("mydir/cov2.csv", …
Ricardo Guerreiro
  • 497
  • 1
  • 4
  • 17
2
votes
2 answers

vaex: shift column by n steps

I'm preparing a big multivariate time series data set for a supervised learning task and I would like to create time shifted versions of my input features so my model also infers from past values. In pandas there's the shift(n) command that lets you…
sobek
  • 1,386
  • 10
  • 28
2
votes
0 answers

How to json normalize columns in vaex?

Given a nested json, is there a way to load and flatten it in vaex? This is a way to do it in pandas: import pandas as pd from pandas.io.json import json_normalize df = pd.read_json(input_file) df = pd.concat([df, json_normalize(df['eventData'])],…
scc
  • 10,342
  • 10
  • 51
  • 65
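vaex has no JSON normalizer of its own, so one hedged approach is to flatten in pandas first (using the top-level `pd.json_normalize`, which replaced the deprecated `pandas.io.json` import) and then convert; the records below are hypothetical stand-ins for the question's `eventData` column:

```python
import pandas as pd

# Hypothetical nested records shaped like the question's input.
records = [
    {'id': 1, 'eventData': {'type': 'click', 'meta': {'x': 10}}},
    {'id': 2, 'eventData': {'type': 'scroll', 'meta': {'x': 20}}},
]
df = pd.DataFrame(records)

# json_normalize flattens nested dicts, joining keys with dots;
# concat the result back onto the non-nested columns.
flat = pd.concat(
    [df.drop(columns='eventData'), pd.json_normalize(df['eventData'].tolist())],
    axis=1,
)
print(flat.columns.tolist())
```

The flattened frame can then be passed to `vaex.from_pandas(flat)`.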
2
votes
1 answer

Workflow for modifying an hdf5 file in vaex

As sort of follow on to my previous question [1], is there a way to open a hdf5 dataset in vaex, perform operations and then store the results to the same dataset? I tried the following: import vaex as vx vxframe = vx.open('somedata.hdf5') vxframe…
sobek
  • 1,386
  • 10
  • 28
2
votes
1 answer

Columns not showing in Hdf5 file

I have a large data set (1.3 billion rows) that I want to visualize with Vaex. Since the data set was very big in CSV (around 130 GB in 520 separate files), I merged them into an HDF5 file with the pandas dataframe.to_hdf function (format: table, appended for…
Olca Orakcı
  • 372
  • 3
  • 12
1
vote
1 answer

Most efficient way of computing pairwise cosine similarity for large DataFrame

I have a 300,000-row pd.DataFrame comprised of multiple columns, out of which one is a 50-dimensional numpy array of shape (1,50), like so: ID Array1 1 [2.4252 ... 5.6363] 2 [3.1242 ... 9.0091] 3 …
Johnny
  • 117
  • 10
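The standard trick, shown here on a smaller stand-in matrix: normalize the rows once, after which a single matrix product yields every pairwise cosine similarity:

```python
import numpy as np

# Stand-in for the 300,000 x 50 embedding matrix (smaller here).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))

# Normalize each row to unit length; dot products of unit vectors
# are exactly the cosine similarities.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = Xn @ Xn.T
print(sim.shape)
```

The full result is O(n^2) memory, so for 300k rows compute it in row blocks (`sim_block = Xn[i:j] @ Xn.T`) rather than all at once.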
1
vote
1 answer

very large JSON handling in Python

I have a very large JSON file (~30 GB, 65e6 lines) that I would like to process using some dataframe structure. This dataset of course does not fit into memory, and therefore I ultimately want to use an out-of-memory solution like dask or vaex. I…
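If the file is newline-delimited JSON, one hedged sketch is to stream it in bounded chunks with pandas; each chunk could then be exported to HDF5/Arrow and the pieces opened together with vaex afterwards (the file below is a tiny invented stand-in):

```python
import json
import os
import tempfile
import pandas as pd

# Tiny newline-delimited JSON file standing in for the ~30 GB input.
path = os.path.join(tempfile.mkdtemp(), 'big.jsonl')
with open(path, 'w') as f:
    for i in range(5):
        f.write(json.dumps({'id': i, 'value': i * 2}) + '\n')

# Stream in chunks: only chunksize records are in memory at a time.
total = 0
for chunk in pd.read_json(path, lines=True, chunksize=2):
    total += len(chunk)
print(total)
```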
1
vote
1 answer

Multi-columns filter VAEX dataframe, apply expression and save result

I want to use VAEX for lazy work with my dataframe. After a quick start with exporting a big CSV and some simple filters and extract(), I have an initial df for my work with 3 main columns: cid1, cid2, cval1. Each combination of cid1 and cid2 is a workset with…
Jahspear
  • 151
  • 11
1
vote
1 answer

An accurate progress bar for loading files and transforming data using Vaex and Pandas

I am looking for a method to include a progress bar that shows the remaining time when loading a file with Vaex (big data files) or transforming big data with pandas. I have checked this thread…
1
vote
1 answer

Columns not recognized when importing HDF5 file

I am trying to import an HDF5 file in Python. I do not have details on how the file was written, therefore I tried vaex and pandas to open it. How can I specify my columns so that they are recognized? I tried to check the structure of the file…
luki
  • 111
  • 5
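HDF5 is only a container format, and different writers lay columns out differently (vaex expects its own column layout, while files written by pandas' to_hdf use the PyTables format), so inspecting the file with h5py usually answers why columns aren't recognized. A sketch with an invented file:

```python
import os
import tempfile
import h5py
import numpy as np

# Write a small file, then walk it the way you would an unknown layout.
path = os.path.join(tempfile.mkdtemp(), 'unknown.h5')  # hypothetical file
with h5py.File(path, 'w') as f:
    f.create_dataset('table/columns/x/data', data=np.arange(5))

# visititems calls back on every group and dataset in the file.
datasets = []
with h5py.File(path, 'r') as f:
    f.visititems(lambda name, obj: datasets.append(name)
                 if isinstance(obj, h5py.Dataset) else None)
print(datasets)
```

Once the dataset paths are known, the columns can be read with h5py directly, or re-exported into whatever layout the target library expects.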
1
vote
1 answer

How to write a large .txt file to a csv for Big Query dump?

I have a dataset that is 86 million rows x 20 columns with a header, and I need to convert it to a CSV in order to dump it into Big Query (adding multiple tags from that). The logical solution is reading the .txt file with pd.read_csv, but I don't…
birdman
  • 249
  • 1
  • 13
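A chunked rewrite keeps memory bounded regardless of row count; a sketch with an invented tab-separated stand-in (the real file's delimiter is an assumption):

```python
import os
import tempfile
import pandas as pd

tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, 'data.txt')
dst = os.path.join(tmpdir, 'data.csv')

# Tiny tab-separated stand-in for the 86M-row file.
with open(src, 'w') as f:
    f.write('a\tb\n1\tx\n2\ty\n3\tz\n')

# Stream: only chunksize rows are in memory; write the header once,
# then append each subsequent chunk.
first = True
for chunk in pd.read_csv(src, sep='\t', chunksize=2):
    chunk.to_csv(dst, mode='w' if first else 'a', header=first, index=False)
    first = False

with open(dst) as f:
    n_lines = sum(1 for _ in f)
print(n_lines)  # header + 3 data rows
```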