  • I want to cluster a big data set (more than 1M records).
  • I want to use DBSCAN or HDBSCAN for this clustering task.

When I try to use either of those algorithms, I get a memory error.

  • Is there a way to fit a big data set in parts (e.g. loop over the data and refit every 1,000 records)?
  • If not, is there a better way to cluster a big data set without upgrading the machine's memory?
Boom

1 Answer


If the number of features in your dataset is not too large (below 20-25), you can consider using BIRCH. It's an incremental method designed for large datasets: it processes instances one at a time (or in batches), summarizing them into a compact Clustering Feature (CF) tree, so it only needs a small amount of data in memory at any point rather than the whole dataset.
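A minimal sketch of this chunked approach, assuming scikit-learn's `Birch` implementation; the chunk size, `threshold`, and the synthetic data are illustrative placeholders for reading your real records from disk in batches:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)

# n_clusters=None keeps the raw CF-tree subclusters; pass an int (or a
# clusterer) to get a final global clustering step instead.
birch = Birch(n_clusters=None, threshold=0.5)

# Feed the data in chunks so only one chunk is in memory at a time.
# In practice each chunk would be read from disk / a database.
for _ in range(10):
    chunk = rng.normal(size=(1000, 5))  # placeholder for a real batch
    birch.partial_fit(chunk)            # incrementally updates the CF-tree

# New points can then be assigned to the learned subclusters.
labels = birch.predict(rng.normal(size=(100, 5)))
print(labels.shape)
```

This directly addresses the "fit in parts" idea from the question: `partial_fit` is called once per batch, so memory stays bounded by the chunk size plus the CF-tree, not the full 1M+ records.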

Benjamin