I am using Python 2.7 (Anaconda distribution) on Windows 8.1 Pro. I have a database of articles with their respective topics.

I am building an application that queries textual phrases in my database and associates article topics with each queried phrase. The topics are assigned based on how relevant the phrase is to each article.

The bottleneck seems to be Python's socket communication with the local MongoDB instance (localhost).

Here are my cProfile outputs:

    topics_fit (PhraseVectorizer_1_1.py:668)
    function called 1 times

         1930698 function calls (1929630 primitive calls) in 148.209 seconds

    Ordered by: cumulative time, internal time, call count
    List reduced from 286 to 40 due to restriction <40>

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.224    1.224  148.209  148.209 PhraseVectorizer_1_1.py:668(topics_fit)
    206272    0.193    0.000  146.780    0.001 cursor.py:1041(next)
      601    0.189    0.000  146.455    0.244 cursor.py:944(_refresh)
      534    0.030    0.000  146.263    0.274 cursor.py:796(__send_message)
      534    0.009    0.000  141.532    0.265 mongo_client.py:725(_send_message_with_response)
      534    0.002    0.000  141.484    0.265 mongo_client.py:768(_reset_on_error)
      534    0.019    0.000  141.482    0.265 server.py:69(send_message_with_response)
      534    0.002    0.000  141.364    0.265 pool.py:225(receive_message)
      535    0.083    0.000  141.362    0.264 network.py:106(receive_message)
     1070    1.202    0.001  141.278    0.132 network.py:127(_receive_data_on_socket)
     3340  140.074    0.042  140.074    0.042 {method 'recv' of '_socket.socket' objects}
      535    0.778    0.001    4.700    0.009 helpers.py:88(_unpack_response)
      535    3.828    0.007    3.920    0.007 {bson._cbson.decode_all}
       67    0.099    0.001    0.196    0.003 {method 'sort' of 'list' objects}
    206187    0.096    0.000    0.096    0.000 PhraseVectorizer_1_1.py:705(<lambda>)
    206187    0.096    0.000    0.096    0.000 database.py:339(_fix_outgoing)
    206187    0.074    0.000    0.092    0.000 objectid.py:68(__init__)
     1068    0.005    0.000    0.054    0.000 server.py:135(get_socket)
  1068/534    0.010    0.000    0.041    0.000 contextlib.py:21(__exit__)
      1068    0.004    0.000    0.041    0.000 pool.py:501(get_socket)
       534    0.003    0.000    0.028    0.000 pool.py:208(send_message)
       534    0.009    0.000    0.026    0.000 pool.py:573(return_socket)
       567    0.001    0.000    0.026    0.000 socket.py:227(meth)
      535    0.024    0.000    0.024    0.000 {method 'sendall' of '_socket.socket' objects}
      534    0.003    0.000    0.023    0.000 topology.py:134(select_server)
   206806    0.020    0.000    0.020    0.000 collection.py:249(database)
   418997    0.019    0.000    0.019    0.000 {len}
      449    0.001    0.000    0.018    0.000 topology.py:143(select_server_by_address)
      534    0.005    0.000    0.018    0.000 topology.py:82(select_servers)
     1068/534    0.001    0.000    0.018    0.000 contextlib.py:15(__enter__)
      534    0.002    0.000    0.013    0.000 thread_util.py:83(release)
   207307    0.010    0.000    0.011    0.000 {isinstance}
      534    0.005    0.000    0.011    0.000 pool.py:538(_get_socket_no_auth)
      534    0.004    0.000    0.011    0.000 thread_util.py:63(release)
      534    0.001    0.000    0.011    0.000 mongo_client.py:673(_get_topology)
      535    0.003    0.000    0.010    0.000 topology.py:57(open)
   206187    0.008    0.000    0.008    0.000 {method 'popleft' of 'collections.deque' objects}
      535    0.002    0.000    0.007    0.000 topology.py:327(_apply_selector)
      536    0.003    0.000    0.007    0.000 topology.py:286(_ensure_opened)
     1071    0.004    0.000    0.007    0.000 periodic_executor.py:50(open)

In particular, {method 'recv' of '_socket.socket' objects} dominates: roughly 140 of the 148 seconds are spent waiting on the socket.

Following the suggestions in "What can I do to improve socket performance in Python 3?", I tried gevent.

I added this snippet at the very beginning of my script (before any other imports, so that pymongo picks up the patched socket module):

    from gevent import monkey
    monkey.patch_all()

This resulted in even slower performance...

*** PROFILER RESULTS ***
topics_fit (PhraseVectorizer_1_1.py:671)
function called 1 times

         1956879 function calls (1951292 primitive calls) in 158.260 seconds

   Ordered by: cumulative time, internal time, call count
   List reduced from 427 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  158.170  158.170 hub.py:358(run)
        1    0.000    0.000  158.170  158.170 {method 'run' of 'gevent.core.loop' objects}
      2/1    1.286    0.643  158.166  158.166 PhraseVectorizer_1_1.py:671(topics_fit)
   206272    0.198    0.000  156.670    0.001 cursor.py:1041(next)
      601    0.192    0.000  156.203    0.260 cursor.py:944(_refresh)
      534    0.029    0.000  156.008    0.292 cursor.py:796(__send_message)
      534    0.012    0.000  150.514    0.282 mongo_client.py:725(_send_message_with_response)
      534    0.002    0.000  150.439    0.282 mongo_client.py:768(_reset_on_error)
      534    0.017    0.000  150.437    0.282 server.py:69(send_message_with_response)
  551/535    0.002    0.000  150.316    0.281 pool.py:225(receive_message)
  552/536    0.079    0.000  150.314    0.280 network.py:106(receive_message)
1104/1072    0.815    0.001  150.234    0.140 network.py:127(_receive_data_on_socket)
2427/2395    0.019    0.000  149.418    0.062 socket.py:381(recv)
  608/592    0.003    0.000   48.541    0.082 socket.py:284(_wait)
      552    0.885    0.002    5.464    0.010 helpers.py:88(_unpack_response)
      552    4.475    0.008    4.577    0.008 {bson._cbson.decode_all}
     3033    2.021    0.001    2.021    0.001 {method 'recv' of '_socket.socket' objects}
      7/4    0.000    0.000    0.221    0.055 hub.py:189(_import)
        4    0.127    0.032    0.221    0.055 {__import__}
       67    0.104    0.002    0.202    0.003 {method 'sort' of 'list' objects}
  536/535    0.003    0.000    0.142    0.000 topology.py:57(open)
  537/536    0.002    0.000    0.139    0.000 topology.py:286(_ensure_opened)
1072/1071    0.003    0.000    0.138    0.000 periodic_executor.py:50(open)
  537/536    0.001    0.000    0.136    0.000 server.py:33(open)
  537/536    0.001    0.000    0.135    0.000 monitor.py:69(open)
    20/19    0.000    0.000    0.132    0.007 topology.py:342(_update_servers)
        4    0.000    0.000    0.131    0.033 hub.py:418(_get_resolver)
        1    0.000    0.000    0.122    0.122 resolver_thread.py:13(__init__)
        1    0.000    0.000    0.122    0.122 hub.py:433(_get_threadpool)
   206187    0.081    0.000    0.101    0.000 objectid.py:68(__init__)
   206187    0.100    0.000    0.100    0.000 database.py:339(_fix_outgoing)
   206187    0.098    0.000    0.098    0.000 PhraseVectorizer_1_1.py:708(<lambda>)
        1    0.073    0.073    0.093    0.093 threadpool.py:2(<module>)
     2037    0.003    0.000    0.092    0.000 hub.py:159(get_hub)
        2    0.000    0.000    0.090    0.045 thread.py:39(start_new_thread)
        2    0.000    0.000    0.090    0.045 greenlet.py:195(spawn)
        2    0.000    0.000    0.090    0.045 greenlet.py:74(__init__)
        1    0.000    0.000    0.090    0.090 hub.py:259(__init__)
     1102    0.004    0.000    0.078    0.000 pool.py:501(get_socket)
     1068    0.005    0.000    0.074    0.000 server.py:135(get_socket)
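
If I read the gevent profile correctly, monkey-patching by itself cannot help here: topics_fit still runs in a single greenlet, so the queries go out one at a time and the recv wait simply moves into gevent's hub (note socket.py:284(_wait) above). Getting actual overlap would require fanning the independent per-phrase queries out over several greenlets, along these lines (untested sketch; the pool sizes are guesses):

    from gevent import monkey
    monkey.patch_all()  # must run before pymongo is imported

    from gevent.pool import Pool
    from pymongo import MongoClient

    client = MongoClient(maxPoolSize=20)  # pool size is a guess
    db_wiki = client.wiki

    def query_phrase(phrase):
        # same per-phrase text query as in topics_fit below
        return phrase, list(db_wiki.categories.find(
            {"$text": {"$search": "\"" + phrase + "\""}},
            {"score": {"$meta": "textScore"}}))

    phrases = ["some phrase", "another phrase"]  # stand-in for the vocabulary keys
    pool = Pool(20)  # up to 20 queries in flight at once
    results = dict(pool.imap_unordered(query_phrase, phrases))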

This performance is unacceptable for my application; I need it to be much faster. The timings above are for a subset of only ~20 documents, and I need to process a few tens of thousands.

Any ideas on how to speed it up?

Much appreciated.

Edit: Code snippet that I profiled:

    # also tried monkey patching all here, see the profiler output above

    from collections import OrderedDict
    from pymongo import MongoClient

    def topics_fit(self):

        client = MongoClient()
        # tried motor for multithreading - also slow
        # client = motor.motor_tornado.MotorClient()

        # initialize DB cursor
        db_wiki = client.wiki

        # initialize topic feature dictionaries
        self.topics = OrderedDict()
        self.topic_mapping = OrderedDict()

        vocabulary_keys = self.vocabulary.keys()

        num_categories = 0

        for phrase in vocabulary_keys:

            phrase_tokens = phrase.split()

            # only multi-word phrases are queried
            if len(phrase_tokens) > 1:

                # exact-phrase text search for the current phrase
                AND_phrase = "\"" + phrase + "\""

                cursor = db_wiki.categories.find(
                    {"$text": {"$search": AND_phrase}},
                    {"score": {"$meta": "textScore"}})
                cursor = list(cursor)

                if cursor:
                    # keep the categories of the best-scoring article only
                    cursor.sort(key=lambda k: k["score"], reverse=True)
                    added_categories = cursor[0]["category_ids"]
                    for added_category in added_categories:
                        if added_category not in self.topics:
                            self.topics[added_category] = num_categories
                            if self.vocabulary[phrase] not in self.topic_mapping:
                                self.topic_mapping[self.vocabulary[phrase]] = [num_categories, ]
                            else:
                                self.topic_mapping[self.vocabulary[phrase]].append(num_categories)
                            num_categories += 1
                        else:
                            if self.vocabulary[phrase] not in self.topic_mapping:
                                self.topic_mapping[self.vocabulary[phrase]] = [self.topics[added_category], ]
                            else:
                                self.topic_mapping[self.vocabulary[phrase]].append(self.topics[added_category])
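
Note that only cursor[0] is used per phrase, while the first profile shows roughly 206,000 documents being pulled over the socket for only 67 phrases with non-empty results. Letting MongoDB sort by textScore and return just the best match, with a projection limited to the fields actually used, should shrink what travels over the wire; an untested sketch of that variant of the query:

    # untested: server-side sort on textScore, return only the top document
    best = list(db_wiki.categories
                .find({"$text": {"$search": AND_phrase}},
                      {"score": {"$meta": "textScore"}, "category_ids": 1})
                .sort([("score", {"$meta": "textScore"})])
                .limit(1))
    if best:
        added_categories = best[0]["category_ids"]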

Edit 2: output of index_information():

    {u'_id_': {u'ns': u'wiki.categories',
               u'key': [(u'_id', 1)],
               u'v': 1},
     u'article_title_text_article_body_text_category_names_text':
         {u'default_language': u'english',
          u'weights': SON([(u'article_body', 1),
                           (u'article_title', 1),
                           (u'category_names', 1)]),
          u'key': [(u'_fts', u'text'), (u'_ftsx', 1)],
          u'v': 1,
          u'language_override': u'language',
          u'ns': u'wiki.categories',
          u'textIndexVersion': 2}}
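
A quick way to double-check that each query actually uses this text index is Cursor.explain() in pymongo; a minimal sketch:

    # untested: inspect the query plan for one sample phrase
    plan = db_wiki.categories.find(
        {"$text": {"$search": "\"some phrase\""}},
        {"score": {"$meta": "textScore"}}).explain()
    print plan  # the plan should reference the text index listed above
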
  • Have you added the relevant indices in MongoDB? Can you reduce the number of calls made to the database? – Alex Dec 01 '15 at 15:18
  • I have created indices for all my text fields (I followed this: https://docs.mongodb.org/manual/tutorial/create-text-index-on-multiple-fields/). Do I need to re-create the text indices every time I open the client? I did it once, after I created the database. I have to associate a topic with each unique phrase I generate from my textual documents. Unless there is a method to batch-query multiple phrases, I don't see how I can reduce the number of calls. – El Brutale Dec 01 '15 at 15:50
  • I have added my code, hope it helps. I have tried the motor library for Python, which was just as slow: it spent 145.634 seconds on hub.py:336(wait)... – El Brutale Dec 01 '15 at 16:02

0 Answers