I am trying to use Python's multiprocessing library to speed up some code I have. I have a dictionary whose values need to be updated based on the result of a loop. The current code looks like this:

def get_topic_count():
    topics_to_counts = {}
    for news in tqdm.tqdm(RawNews.objects.all().iterator()):
        for topic in Topic.objects.filter(is_active=True):
            if topic.name not in topics_to_counts.keys():
                topics_to_counts[topic.name] = 0
            if topic.name.lower() in news.content.lower():
                topics_to_counts[topic.name] += 1
    for key, value in topics_to_counts.items():
        print(f"{key}: {value}")

I believe the worker function should look like this:

def get_topic_count_worker(news, topics_to_counts, lock):
    for topic in Topic.objects.filter(is_active=True):
        if topic.name not in topics_to_counts.keys():
            lock.acquire()
            topics_to_counts[topic.name] = 0
            lock.release()
        if topic.name.lower() in news.content.lower():
            lock.acquire()
            topics_to_counts[topic.name] += 1
            lock.release()
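
From what I've read, a plain dict and a plain multiprocessing.Lock can't be passed to pool workers as-is: the lock fails to pickle, and updates a child process makes to an ordinary dict are never seen by the parent. So I assume the shared state would have to come from a multiprocessing.Manager. A minimal sketch of what I mean (untested):

import multiprocessing

def make_shared_state():
    # The Manager proxies the dict and the lock into the worker
    # processes; updates made by workers are visible in the parent.
    manager = multiprocessing.Manager()
    topics_to_counts = manager.dict()
    lock = manager.Lock()
    return manager, topics_to_counts, lock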

However, I'm having some trouble writing the main function. Here's what I have so far, but I keep getting a "process killed" message; I believe it's using too much memory.

def get_topic_count_master():
    topics_to_counts = {}
    raw_news = RawNews.objects.all().iterator()
    lock = multiprocessing.Lock()
    args = []
    for news in tqdm.tqdm(raw_news):
        args.append((news, topics_to_counts, lock))
    with multiprocessing.Pool() as p:
        p.starmap(get_topic_count_worker, args)
    for key, value in topics_to_counts.items():
        print(f"{key}: {value}")

Any guidance here would be appreciated!

Update: There are about 1.6 million records that it needs to go through. How would I chunk this properly?
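
Here's the shape I'm considering for the chunking (untested): pass only primary keys to the pool so the args stay small, fetch the ~100 active topic names once up front, let each worker fetch its own row, and tally in the parent so no shared dict or lock is needed. I understand that with the fork start method the workers inherit the parent's database connection, so django.db.connections.close_all() may be needed before the pool starts.

import multiprocessing
from functools import partial

import tqdm

def count_topics_in_news(topic_names, news_id):
    # Workers receive only an integer pk and fetch their own row, so
    # the parent never materializes 1.6 million model instances.
    content = RawNews.objects.get(pk=news_id).content.lower()
    return [name for name in topic_names if name.lower() in content]

def get_topic_count_master():
    # Fetch the active topic names once instead of querying per news item.
    topic_names = list(Topic.objects.filter(is_active=True)
                       .values_list("name", flat=True))
    ids = RawNews.objects.values_list("pk", flat=True).iterator(chunk_size=2000)
    topics_to_counts = {name: 0 for name in topic_names}
    worker = partial(count_topics_in_news, topic_names)
    with multiprocessing.Pool() as pool:
        # chunksize hands the ids to workers in batches rather than
        # one task at a time.
        for matched in tqdm.tqdm(pool.imap_unordered(worker, ids, chunksize=500)):
            for name in matched:
                topics_to_counts[name] += 1
    for key, value in topics_to_counts.items():
        print(f"{key}: {value}")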

Update 2: Here's some sample data:

(Sample data was attached as a screenshot; it isn't reproduced here.)

Update 3: Here is the relation in the RawNews table:

topics = models.ManyToManyField('Topic', blank=True)
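
(For what it's worth: if the counts only needed to cover news already tagged through that relation, a single aggregate query would do it; my loop instead matches topic names inside content. A sketch, assuming Django's default reverse name for the relation:)

from django.db.models import Count

# 'rawnews' is Django's default reverse query name for the
# ManyToManyField declared on RawNews.
for t in Topic.objects.filter(is_active=True).annotate(n=Count("rawnews")):
    print(f"{t.name}: {t.n}")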
  • Can you please share some dummy data (4-5 records) so I can simulate this on my machine? – HarshIT May 31 '22 at 08:55
  • It looks like you're making queries in a loop; can you try reducing the queries by making a join query? – HarshIT May 31 '22 at 09:08
  • Could you explain? – bballboy8 May 31 '22 at 09:12
  • I simulated a similar situation on my machine by creating 1.6 million dummy records in JSON variables and performing similar operations to yours. It worked fine for me and finished in a couple of minutes. Looking at the code above, it looks like ORM (sqlalchemy) code. You are querying your tables in loops (in the get_topic_count_worker function), and that is the root cause. – HarshIT May 31 '22 at 09:14
  • The Topic table is only about 100 records; the RawNews table has 1.6 million. Regardless, what would be the better way to approach this? – bballboy8 May 31 '22 at 09:16
  • Instead of querying in loops, you should make a JOIN query (https://www.tutorialspoint.com/sqlalchemy/sqlalchemy_orm_working_with_joins.htm) or call a stored procedure (https://discuss.dizzycoding.com/stored-procedures-with-sqlalchemy/). – HarshIT May 31 '22 at 09:16
  • Can you share those two tables' schema and relationship so I can help you with query optimisation? – HarshIT May 31 '22 at 09:33
  • Can you please share a few records (4-5) of the Topics table? I just want to see the column names. – HarshIT May 31 '22 at 11:06
  • I simulated your scenario. Here is a sample DB: https://tmpfiles.org/dl/303864/news_db.db, and here is the query I used: https://pastebin.com/aaj8t7qu. Does this work for you? – HarshIT May 31 '22 at 13:44
  • The issue happens during the loop that counts each topic, since we want to count occurrences of each topic name in each raw news item's content. That's the part that needs to go into a multiprocessing block. – bballboy8 May 31 '22 at 14:16
  • I recommend optimising your query so that you won't need to process the data after the query (like the looping above). If you still want to use multiprocessing, note that Python is poor at multithreading and multiprocessing; consider using a producer-consumer architecture (like RabbitMQ). – HarshIT May 31 '22 at 16:16
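
Reading the comments' suggestion in terms of the ORM used in the question, the counting can be pushed into the database entirely: roughly one COUNT query per active topic (about 100 queries total) instead of streaming 1.6 million rows into Python. A sketch, untested:

def get_topic_count_via_db():
    # The substring match (ILIKE '%name%') and the counting both run
    # inside the database; Python issues only ~100 small queries.
    for topic in Topic.objects.filter(is_active=True):
        n = RawNews.objects.filter(content__icontains=topic.name).count()
        print(f"{topic.name}: {n}")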

1 Answer


The problem was related to a restriction on the database. Multiprocessing does speed up the job, but the database allows only 100 concurrent connections. Either increase that connection limit or cap the number of worker processes at a number below 100.
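
For illustration, a minimal sketch of capping the pool (50 is an arbitrary value below the 100-connection limit; use whatever your database allows):

import multiprocessing

# Keep concurrent workers below the database's 100-connection cap.
with multiprocessing.Pool(processes=50) as pool:
    pool.starmap(get_topic_count_worker, args)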
