
I wrote a Python script that queries this endpoint with SPARQL to retrieve information about genes. This is how the script works:

Get genes
Foreach gene:
    Get proteins
        Foreach protein
            Get the protein function
            .....
    Get Taxons
    ....
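Each step above issues its own SPARQL query per entity. A minimal sketch of what one such per-gene lookup might look like (the endpoint, predicate, and function names here are illustrative placeholders, not the actual script):

```python
# Illustrative sketch: one query is built and sent per gene, so every
# call makes a fresh HTTP round trip to the endpoint.
# The predicate URI is a placeholder, not from the real script.

def build_protein_query(gene_uri):
    """Build the per-gene query that fetches its proteins."""
    return """
        SELECT ?protein WHERE {{
            <{gene}> <http://example.org/encodesProtein> ?protein .
        }}
    """.format(gene=gene_uri)

def get_proteins(endpoint, gene_uri):
    # In the real script this goes through SPARQLWrapper, roughly:
    #   sparql = SPARQLWrapper(endpoint)
    #   sparql.setQuery(build_protein_query(gene_uri))
    #   return sparql.query().convert()
    # Each query() call goes through urllib2 and opens a new connection.
    pass
```

With ~600,000 genes, this pattern means hundreds of thousands of separate HTTP requests, which is where the connection-setup overhead in the profile below comes from.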

But the script takes too long to execute. I profiled it with pyinstrument and got the following results:

  39.481 <module>  extracting_genes.py:10
  `- 39.282 _main  extracting_genes.py:750
     |- 21.629 create_prot_func_info_dico  extracting_genes.py:613
     |  `- 21.609 get_prot_func_info  extracting_genes.py:216
     |     `- 21.596 query  build/bdist.linux-x86_64/egg/SPARQLWrapper/Wrapper.py:780
     |        `- 21.596 _query  build/bdist.linux-x86_64/egg/SPARQLWrapper/Wrapper.py:750
     |           `- 21.588 urlopen  urllib2.py:131
     |              `- 21.588 open  urllib2.py:411
     |                 `- 21.588 _open  urllib2.py:439
     |                    `- 21.588 _call_chain  urllib2.py:399
     |                       `- 21.588 http_open  urllib2.py:1229
     |                          `- 21.588 do_open  urllib2.py:1154
     |                             |- 11.207 request  httplib.py:1040
     |                             |  `- 11.207 _send_request  httplib.py:1067
     |                             |     `- 11.205 endheaders  httplib.py:1025
     |                             |        `- 11.205 _send_output  httplib.py:867
     |                             |           `- 11.205 send  httplib.py:840
     |                             |              `- 11.205 connect  httplib.py:818
     |                             |                 `- 11.205 create_connection  socket.py:541
     |                             |                    `- 9.552 meth  socket.py:227
     |                             `- 10.379 getresponse  httplib.py:1084
     |                                `- 10.379 begin  httplib.py:431
     |                                   `- 10.379 _read_status  httplib.py:392
     |                                      `- 10.379 readline  socket.py:410
     |- 6.045 create_gene_info_dico  extracting_genes.py:323
     |  `- 6.040 ...
     |- 3.957 create_prots_info_dico  extracting_genes.py:381
     |  `- 3.928 ...
     |- 3.414 create_taxons_info_dico  extracting_genes.py:668
     |  `- 3.414 ...
     |- 3.005 create_prot_parti_info_dico  extracting_genes.py:558
     |  `- 2.999 ...
     `- 0.894 create_prot_loc_info_dico  extracting_genes.py:504
        `- 0.893 ...

Basically I'm executing multiple queries many times (60,000+), so as I understand it, the connection is opened and the response read for every single query, which slows down the execution.

Does anyone have an idea how to tackle this issue?

Bilal
  • Please show your queries, perhaps it is possible to reduce the number of them. It seems that urllib2 doesn't support persistent connections. – Stanislav Kralin Jul 30 '18 at 10:51
  • Why should this be a problem with connection pooling? It's just an HTTP request sent to a Virtuoso triple store. The computation of the queries themselves takes some time, as does sending the result set. – UninformedUser Jul 30 '18 at 12:07
  • @StanislavKralin I have 8 queries which are executed multiple times. I've tried to combine them, but when I did, I got a big query that was too complicated to handle. – Bilal Jul 30 '18 at 12:24
  • @AKSW Because I run this script on a database that has almost 600,000 genes, and each gene has proteins and taxons... it takes a lot of time to get the desired results, so I'm trying to optimize the script to produce the output as quickly as possible. – Bilal Jul 30 '18 at 12:27

1 Answer


As @Stanislav mentioned, urllib2, which is used by SPARQLWrapper, doesn't support persistent connections, but I found a way to keep the connection alive, using the setUseKeepAlive() function defined in SPARQLWrapper/Wrapper.py.

I had to install the keepalive package first:

pip install keepalive

It reduced the execution time by almost 40%.

def get_all_genes_uri(endpoint, the_offset):
    sparql = SPARQLWrapper(endpoint)
    sparql.setUseKeepAlive() # <--- Added this line
    sparql.setQuery("""
        #My_query
    """)
    ....

And got the following results:

  24.673 <module>  extracting_genes.py:10
  `- 24.473 _main  extracting_genes.py:750
     |- 12.314 create_prot_func_info_dico  extracting_genes.py:613
     |  `- 12.068 get_prot_func_info  extracting_genes.py:216
     |     |- 11.428 query  build/bdist.linux-x86_64/egg/SPARQLWrapper/Wrapper.py:780
     |     |  `- 11.426 _query  build/bdist.linux-x86_64/egg/SPARQLWrapper/Wrapper.py:750
     |     |     `- 11.353 urlopen  urllib2.py:131
     |     |        `- 11.353 open  urllib2.py:411
     |     |           `- 11.339 _open  urllib2.py:439
     |     |              `- 11.338 _call_chain  urllib2.py:399
     |     |                 `- 11.338 http_open  keepalive/keepalive.py:343
     |     |                    `- 11.338 do_open  keepalive/keepalive.py:213
     |     |                       `- 11.329 _reuse_connection  keepalive/keepalive.py:264
     |     |                          `- 11.280 getresponse  httplib.py:1084
     |     |                             `- 11.262 begin  httplib.py:431
     |     |                                `- 11.207 _read_status  httplib.py:392
     |     |                                   `- 11.204 readline  socket.py:410
     |     `- 0.304 __init__  build/bdist.linux-x86_64/egg/SPARQLWrapper/Wrapper.py:261
     |        `- 0.292 resetQuery  build/bdist.linux-x86_64/egg/SPARQLWrapper/Wrapper.py:301
     |           `- 0.288 setQuery  build/bdist.linux-x86_64/egg/SPARQLWrapper/Wrapper.py:516
     |- 4.894 create_gene_info_dico  extracting_genes.py:323
     |  `- 4.880 ...
     |- 2.631 create_prots_info_dico  extracting_genes.py:381
     |  `- 2.595 ...
     |- 1.933 create_taxons_info_dico  extracting_genes.py:668
     |  `- 1.923 ...
     |- 1.804 create_prot_parti_info_dico  extracting_genes.py:558
     |  `- 1.780 ...
     `- 0.514 create_prot_loc_info_dico  extracting_genes.py:504
        `- 0.510 ...

Honestly, the execution time is still not as fast as I'd like; I'll see if there is something else I can do.

Bilal
  • The only other thing you could do is to show the queries and hope that there is some potential for optimization. For example, SPARQL 1.1 provides the VALUES clause, maybe something you could use when you're running the same query multiple times with e.g. just a different entity in a specific place of a triple pattern. – UninformedUser Jul 30 '18 at 15:21
  • You might encourage the endpoint owners/operators to upgrade from their existing Virtuoso (Open Source Edition `7.10.3211`, built Feb 23 2015), to a current VOS 7.2 (the `develop/7` branch is recommended; the `stable/7` branch is next-best), Enterprise Edition 7.2, or even Enterprise Edition 8.1, any of which will substantially improve performance, among other benefits. Increasing Virtuoso's available RAM from the current 7 GB would also be very beneficial! – TallTed Jul 30 '18 at 21:51
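To illustrate the VALUES suggestion from the comments: instead of one query per gene, a batch of gene URIs can be bound in a single query, so one round trip replaces many. A minimal sketch under assumed names (the predicate URI and batch size are placeholders, not from the original script):

```python
def build_batched_protein_query(gene_uris):
    """Build one query that fetches proteins for a whole batch of genes.

    SPARQL 1.1 VALUES binds ?gene to each listed URI in turn, so a
    single request replaces len(gene_uris) separate per-gene queries.
    The predicate URI below is a placeholder.
    """
    values = " ".join("<{0}>".format(uri) for uri in gene_uris)
    return """
        SELECT ?gene ?protein WHERE {{
            VALUES ?gene {{ {values} }}
            ?gene <http://example.org/encodesProtein> ?protein .
        }}
    """.format(values=values)

def batches(items, size=500):
    """Yield fixed-size chunks of a list, for one query per chunk."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

With, say, 500 genes per batch, ~600,000 per-gene requests would shrink to ~1,200 batched ones; the practical batch size depends on the endpoint's query-length and result-size limits.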