
I have a very large dataset comprising several dozen samples, with several hundred subsamples within each sample. I need to get the mean, standard deviation, confidence intervals, etc. However, I'm running into a (suspected) massive performance problem that causes the code to never finish executing. I'll begin by explaining what my actual code does (I'm not sure how much of the actual code I can share, as it is part of an active research project. I hope to open-source it, but that will depend on the IP rules in the agreement), and then I'll share some code that replicates the problem and should hopefully allow somebody a bit more well-versed in Vaex to tell me what I'm doing wrong!

My code currently calls the unique() method on my large Vaex dataframe to get a list of samples, and loops through that list of unique samples. On each iteration, it uses the sample number to build an expression representing that sample (so: df[df["sample"] == i]) and calls unique() on that subset to get a list of subsamples. It then uses another for loop to repeat the process, creating an expression for each subsample and getting the statistical results for that subsample. This isn't the exact code but, in concept, it works like the code block below:

means = {}

list_of_samples = df["sample"].unique()

for sample_number in list_of_samples:
    sample = df[ df["sample"] == sample_number ]
    list_of_subsamples = sample["subsample"].unique()
    means[sample_number] = {}

    for subsample_number in list_of_subsamples:
        subsample = sample[ sample["subsample"] == subsample_number ]
        means[sample_number][subsample_number] = subsample["value"].mean()

If I try to run this code, it hangs on the line means[sample_number][subsample_number] = subsample["value"].mean() and never completes (not within around an hour, at least), so something is clearly wrong there. To try to diagnose the issue, I have tested the mean function by itself, and in expressions without the looping and the rest of the machinery. If I run:

mean = df["value"].mean()

it successfully gives me the mean for the entire "value" column within about 45 seconds. However, if I instead run:

sample = df[ df["sample"] == 1 ]
subsample = sample[ sample["subsample"] == 1 ]
mean = subsample["value"].mean()

The program just hangs. I've left it for an hour and still not gotten a result!

How can I fix this, and what am I doing wrong, so that I can avoid this mistake in the future? If my reading of some discussions regarding Vaex is correct, I think I might be able to fix this using Vaex "selections", but I've tried to read the documentation on those and can't wrap my head around how I would properly use them here (my best, untested guess is shown below). Any help from a more experienced Vaex user would be greatly appreciated!
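
For reference, here is my untested guess at how selections might apply here, using toy arrays with the same column names as my real dataframe; I'm not at all sure this is the right approach:

import numpy as np
import vaex

# Toy data standing in for the real dataframe (same column names)
df = vaex.from_arrays(
    sample=np.array([1, 1, 1, 2]),
    subsample=np.array([1, 1, 2, 1]),
    value=np.array([0.1, 0.2, 0.3, 0.4]),
)

# Register a named selection instead of building a filtered dataframe
df.select((df["sample"] == 1) & (df["subsample"] == 1), name="s1_ss1")

# Aggregation methods accept the selection by name, so the filter and the
# mean should be evaluated together in a single pass over the data
mean = df.mean(df["value"], selection="s1_ss1")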

Edit: In case anyone finds this in the future, I was able to fix it by using the groupby method. I'm still really curious what was going wrong here, but I'll have to wait until I have more time to investigate it.

  • Hm, that is strange. At first glance your example should work. May I ask: how do you read the data? Is it HDF5 or something else? Also, can you try with a small subset of the data to see if it is something to do with usage, or with the volume of the data? It might also be worth opening a GitHub issue if you suspect a bug. – Joco Oct 12 '21 at 14:39
  • Hey Joco. I have it in a .arrow table, with caching enabled. It is all being read from an NVMe drive. – Cian Hughes Oct 13 '21 at 16:54
  • As for testing with a small subset: I've tried this with only part of the source data, and while it is slower than I'd expect (as in, it's still slower than getting the mean of the entire dataframe), it's still quick. It could be because of the laziness of Vaex. Perhaps it's looping through the entire dataframe 3 times (once for each filter) instead of just the subsections. For now, in case anyone else finds this issue: I have solved the problem by rewriting the code to use the groupby function instead. – Cian Hughes Oct 13 '21 at 17:01

1 Answer


Looping can be slow, especially if you have many groups; it's more efficient to rely on the built-in grouping:

import vaex

df = vaex.example()

df.groupby(by='id', agg="mean")
# for more complex cases, can use by=['sample', 'sub_sample']
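
Applied to the question's dataframe (a sketch: this assumes columns literally named "sample", "subsample" and "value" as described in the question, so adjust to your actual schema), a single groupby replaces both nested loops and computes all of the per-group statistics in one pass:

stats = df.groupby(
    by=["sample", "subsample"],
    agg={
        "value_mean": vaex.agg.mean("value"),    # per-group mean
        "value_std": vaex.agg.std("value"),      # per-group standard deviation
        "value_count": vaex.agg.count("value"),  # group size, useful for confidence intervals
    },
)

The result is a small in-memory dataframe with one row per (sample, subsample) pair, which is much cheaper than re-evaluating a filtered expression for every subsample.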
SultanOrazbayev