
I have a dataset consisting of more than 300M records, each with around 800 features. I have split the dataset into 1000 CSV files (each around 2.5 GB). I want to use UMAP to reduce the 800-dimensional space to a lower-dimensional space (e.g., 10 dimensions). Since I cannot load the whole dataset into memory, I was wondering whether there is any batch-learning approach for UMAP that receives each of my CSV files separately and outputs a single UMAP model.

  • I do not know of any, but the standard approach is to sub-sample your data and fit a UMAP model on the subsample (`fit_transform`), then map each batch through the fitted model (`transform`); see the sketch after this comment. Example here: [motionmapperpy](https://github.com/bermanlabemory/motionmapperpy/blob/master/motionmapperpy/motionmapper.py) – Luka Jun 29 '23 at 22:28
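A minimal sketch of the workflow the comment describes, assuming the `umap-learn` package; the file paths, sampling fraction (`frac=0.001`), and output locations are hypothetical placeholders, not a tested pipeline:

```python
import glob

import numpy as np
import pandas as pd
import umap  # pip install umap-learn

csv_files = sorted(glob.glob("data/part_*.csv"))  # hypothetical file layout

# 1. Build a training subsample: draw a small fraction of rows from each
#    file so the sample covers the whole dataset but still fits in memory.
sample_parts = []
for path in csv_files:
    df = pd.read_csv(path)
    # ~0.1% of 300M rows is ~300k rows; tune this to your memory budget.
    sample_parts.append(df.sample(frac=0.001, random_state=42))
sample = pd.concat(sample_parts, ignore_index=True)

# 2. Fit a single UMAP model on the subsample.
reducer = umap.UMAP(n_components=10)
reducer.fit(sample.to_numpy())

# 3. Project every file through the fitted model (transform) and save
#    the low-dimensional embeddings one batch at a time.
for i, path in enumerate(csv_files):
    batch = pd.read_csv(path).to_numpy()
    embedding = reducer.transform(batch)  # shape: (n_rows_in_file, 10)
    np.save(f"embeddings/part_{i:04d}.npy", embedding)
```

The key point is that only the subsample is held in memory during fitting; each 2.5 GB file is then loaded, transformed, and written out independently.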

0 Answers