I am running a BERTopic model on tweets, and I have 140k tweets to analyze. So far, if I run it on more than 15k rows, I get the error below. I have joblib 1.2.0 and loky 3.3.0 installed, and I'm using Miniconda with Python 3.9, on an M2 MacBook running Ventura 13.2.1.
Code:

from umap import UMAP
from bertopic import BERTopic
# from hdbscan import HDBSCAN

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=100, low_memory=True)
# hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

# Initiate BERTopic
# (when the custom clusterer above is enabled, also pass hdbscan_model=hdbscan_model here)
topic_model = BERTopic(umap_model=umap_model, language="english", calculate_probabilities=True)

# Run BERTopic model
dfred = df[0:20000]
topics, probabilities = topic_model.fit_transform(dfred['text'])
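One variant I'm considering (untested): since the traceback dies inside HDBSCAN's joblib/loky workers, disabling its multiprocessing via core_dist_n_jobs=1 might avoid the worker crash entirely. A sketch of the kwargs I'd pass (the HDBSCAN/BERTopic calls themselves are commented out here):

```python
# Sketch of a single-process HDBSCAN configuration. core_dist_n_jobs
# controls how many joblib workers hdbscan spawns for core-distance
# computation; 1 keeps everything in the main process.
hdbscan_kwargs = dict(
    min_samples=10,
    gen_min_span_tree=True,
    prediction_data=True,
    core_dist_n_jobs=1,  # no loky worker processes
)

# hdbscan_model = HDBSCAN(**hdbscan_kwargs)
# topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model,
#                        language="english", calculate_probabilities=True)
```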
Error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'loky'
Traceback (most recent call last):
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'joblib'
/Users/x/miniconda3/envs/spyderenv/bin/python: Error while finding module specification for 'loky.backend.popen_loky_posix' (ModuleNotFoundError: No module named 'loky')
(the loky and joblib errors above repeat several more times)
Traceback (most recent call last):
Traceback (most recent call last):
Cell In[43], line 3
topics, probabilities = topic_model.fit_transform(dfred['text'])
File ~/miniconda3/envs/spyderenv/lib/python3.9/site-packages/bertopic/_bertopic.py:359 in fit_transform
documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)
File ~/miniconda3/envs/spyderenv/lib/python3.9/site-packages/bertopic/_bertopic.py:2903 in _cluster_embeddings
self.hdbscan_model.fit(umap_embeddings, y=y)
File ~/miniconda3/envs/spyderenv/lib/python3.9/site-packages/hdbscan/hdbscan_.py:1190 in fit
) = hdbscan(clean_data, **kwargs)
File ~/miniconda3/envs/spyderenv/lib/python3.9/site-packages/hdbscan/hdbscan_.py:822 in hdbscan
(single_linkage_tree, result_min_span_tree) = memory.cache(
File ~/miniconda3/envs/spyderenv/lib/python3.9/site-packages/joblib/memory.py:349 in __call__
return self.func(*args, **kwargs)
File ~/miniconda3/envs/spyderenv/lib/python3.9/site-packages/hdbscan/hdbscan_.py:325 in _hdbscan_boruvka_kdtree
alg = KDTreeBoruvkaAlgorithm(
File hdbscan/_hdbscan_boruvka.pyx:392 in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__
File hdbscan/_hdbscan_boruvka.pyx:426 in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds
File ~/miniconda3/envs/spyderenv/lib/python3.9/site-packages/joblib/parallel.py:1098 in __call__
self.retrieve()
File ~/miniconda3/envs/spyderenv/lib/python3.9/site-packages/joblib/parallel.py:975 in retrieve
self._output.extend(job.get(timeout=self.timeout))
File ~/miniconda3/envs/spyderenv/lib/python3.9/site-packages/joblib/_parallel_backends.py:567 in wrap_future_result
return future.result(timeout=timeout)
File ~/miniconda3/envs/spyderenv/lib/python3.9/concurrent/futures/_base.py:446 in result
return self.__get_result()
File ~/miniconda3/envs/spyderenv/lib/python3.9/concurrent/futures/_base.py:391 in __get_result
raise self._exception
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {EXIT(1)}
I have tried uninstalling and reinstalling both joblib and loky, as well as trying different versions of each. I also set low_memory=True on the UMAP model. I understand I might not be able to run the model on all 140k tweets at once, but I would like to be able to process more rows per run.
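If fitting on everything at once turns out to be impossible, would a fit-then-transform split be a sane fallback? A minimal sketch, assuming BERTopic's fit/transform API (the BERTopic calls are commented out; chunked is a helper I wrote, not part of any library):

```python
def chunked(seq, size):
    """Yield successive slices of seq with at most `size` items each."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

# docs = df['text'].tolist()
# topic_model.fit(docs[:15000])             # fit on a sample size that runs reliably
# for batch in chunked(docs[15000:], 5000): # assign topics to the rest in batches
#     topics, probs = topic_model.transform(batch)
```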