
I am running a BERTopic model on tweets; I have 140k tweets to analyze. So far, whenever I run it on more than about 15k rows, I get the error below. I have joblib 1.2.0 and loky 3.3.0 installed, and I'm using Miniconda with Python 3.9, on a MacBook with an M2 chip running Ventura 13.2.1.

Code:

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric='cosine', random_state=100, low_memory=True)
# I previously also passed a custom clusterer via hdbscan_model=hdbscan_model:
# hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

Initialize BERTopic:

topic_model = BERTopic(umap_model=umap_model, language="english", calculate_probabilities=True)

Run the BERTopic model:

dfred = df[0:20000]
topics, probabilities = topic_model.fit_transform(dfred['text'])

Error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'loky'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'joblib'
/Users/x/miniconda3/envs/spyderenv/bin/python: Error while finding module specification for 'loky.backend.popen_loky_posix' (ModuleNotFoundError: No module named 'loky')
/Users/x/miniconda3/envs/spyderenv/bin/python: Error while finding module specification for 'loky.backend.popen_loky_posix' (ModuleNotFoundError: No module named 'loky')
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'loky'
/Users/x/miniconda3/envs/spyderenv/bin/python: Error while finding module specification for 'loky.backend.popen_loky_posix' (ModuleNotFoundError: No module named 'loky')
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'loky'
Traceback (most recent call last):

  Cell In[43], line 3
    topics, probabilities = topic_model.fit_transform(dfred['text'])

  File ~/miniconda3/envs/spyderenv/lib/python3.9/site-packages/bertopic/_bertopic.py:359 in fit_transform
    documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)

  File ~/miniconda3/envs/spyderenv/lib/python3.9/site-packages/bertopic/_bertopic.py:2903 in _cluster_embeddings
    self.hdbscan_model.fit(umap_embeddings, y=y)

  File ~/miniconda3/envs/spyderenv/lib/python3.9/site-packages/hdbscan/hdbscan_.py:1190 in fit
    ) = hdbscan(clean_data, **kwargs)

  File ~/miniconda3/envs/spyderenv/lib/python3.9/site-packages/hdbscan/hdbscan_.py:822 in hdbscan
    (single_linkage_tree, result_min_span_tree) = memory.cache(

  File ~/miniconda3/envs/spyderenv/lib/python3.9/site-packages/joblib/memory.py:349 in __call__
    return self.func(*args, **kwargs)

  File ~/miniconda3/envs/spyderenv/lib/python3.9/site-packages/hdbscan/hdbscan_.py:325 in _hdbscan_boruvka_kdtree
    alg = KDTreeBoruvkaAlgorithm(

  File hdbscan/_hdbscan_boruvka.pyx:392 in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__

  File hdbscan/_hdbscan_boruvka.pyx:426 in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds

  File ~/miniconda3/envs/spyderenv/lib/python3.9/site-packages/joblib/parallel.py:1098 in __call__
    self.retrieve()

  File ~/miniconda3/envs/spyderenv/lib/python3.9/site-packages/joblib/parallel.py:975 in retrieve
    self._output.extend(job.get(timeout=self.timeout))

  File ~/miniconda3/envs/spyderenv/lib/python3.9/site-packages/joblib/_parallel_backends.py:567 in wrap_future_result
    return future.result(timeout=timeout)

  File ~/miniconda3/envs/spyderenv/lib/python3.9/concurrent/futures/_base.py:446 in result
    return self.__get_result()

  File ~/miniconda3/envs/spyderenv/lib/python3.9/concurrent/futures/_base.py:391 in __get_result
    raise self._exception

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {EXIT(1)}

I have tried uninstalling and reinstalling both joblib and loky, as well as trying different versions of each, and I set the UMAP model to low_memory=True. I understand I might not be able to run the model on all the data at once, but I would like to be able to process more rows per run.
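In the meantime, the only fallback I have is fitting on slices small enough not to crash the workers. A minimal sketch of the batching logic I mean (the helper name and batch size are my own, and each batch would of course produce an independent topic model, with topics not comparable across batches, which is why I'd prefer a single fit):

```python
def make_batches(n_rows, batch_size):
    """Yield (start, stop) index pairs covering n_rows in batch_size chunks."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

# e.g. iterate the 140k tweets in 15k-row slices that currently still fit:
# for start, stop in make_batches(len(df), 15_000):
#     topics, probabilities = topic_model.fit_transform(df['text'][start:stop])
batches = list(make_batches(140_000, 15_000))  # 10 slices, last one 5k rows
```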

TSD