I am trying to use Python's multiprocessing library to speed up some code I have. I have a dictionary whose values need to be updated based on the result of a loop. The current code looks like this:
def get_topic_count():
    topics_to_counts = {}
    for news in tqdm.tqdm(RawNews.objects.all().iterator()):
        for topic in Topic.objects.filter(is_active=True):
            if topic.name not in topics_to_counts.keys():
                topics_to_counts[topic.name] = 0
            if topic.name.lower() in news.content.lower():
                topics_to_counts[topic.name] += 1
    for key, value in topics_to_counts.items():
        print(f"{key}: {value}")
I believe the worker function should look like this:
def get_topic_count_worker(news, topics_to_counts, lock):
    for topic in Topic.objects.filter(is_active=True):
        if topic.name not in topics_to_counts.keys():
            lock.acquire()
            topics_to_counts[topic.name] = 0
            lock.release()
        if topic.name.lower() in news.content.lower():
            lock.acquire()
            topics_to_counts[topic.name] += 1
            lock.release()
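From what I understand, a plain dict passed through starmap gets pickled separately for each task, so the workers would only ever update copies and the parent would never see the counts. A variant I'm considering has each worker return its own Counter for the parent to merge, with the topic names passed in so the workers never touch the database (count_topics_in_news is a placeholder name; untested sketch):

import collections

def count_topics_in_news(content_lower, topic_names):
    # Runs in a worker process; returns a local Counter instead of
    # mutating shared state, so no lock is needed.
    counts = collections.Counter()
    for name in topic_names:
        if name.lower() in content_lower:
            counts[name] += 1
    return counts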
However, I'm having some trouble writing the main function. Here's what I have so far, but I keep getting a "process killed" message; I believe it's using too much memory.
def get_topic_count_master():
    topics_to_counts = {}
    raw_news = RawNews.objects.all().iterator()
    lock = multiprocessing.Lock()
    args = []
    for news in tqdm.tqdm(raw_news):
        args.append((news, topics_to_counts, lock))
    with multiprocessing.Pool() as p:
        p.starmap(get_topic_count_worker, args)
    for key, value in topics_to_counts.items():
        print(f"{key}: {value}")
Any guidance here would be appreciated!
Update: There are about 1.6 million records that it needs to go through. How would I chunk this properly?
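In case batching at the application level is the right approach, this is the helper I had in mind for slicing the iterator into fixed-size chunks, where each batch would become one task (batched is a placeholder name; needs Python 3.8+ for the walrus operator):

import itertools

def batched(iterable, batch_size=1000):
    # Yield successive lists of batch_size items from any iterator.
    iterator = iter(iterable)
    while batch := list(itertools.islice(iterator, batch_size)):
        yield batch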
Update 2: Here's some sample data:
Update 3: Here is the relation in the RawNews table:
topics = models.ManyToManyField('Topic', blank=True)