
I developed a piece of software with PyQt and SQLite to manage scientific articles. Each article is stored in the SQLite database and comes from a particular journal.

Sometimes, I need to perform some verifications on the articles of a journal. So I build two lists: one containing the DOIs of the articles (a DOI is just a unique ID for an article), and one containing booleans, True if the article is ok, False if it is not:

def listDoi(self, journal_abb):
    """Function to get the DOIs from the database.
    Also returns a list of booleans to check if the data are complete"""

    list_doi = []
    list_ok = []

    query = QtSql.QSqlQuery(self.bdd)
    query.prepare("SELECT * FROM papers WHERE journal=?")
    query.addBindValue(journal_abb)
    query.exec_()

    while query.next():
        record = query.record()
        list_doi.append(record.value('doi'))

        if record.value('graphical_abstract') != "Empty":
            list_ok.append(True)
        else:
            list_ok.append(False)

    return list_doi, list_ok

This function returns the two lists. The lists can contain ~2000 items each. After that, to check whether an article is ok, I just check whether it is in both lists.

EDIT: I also need to check if an article is only in list_doi.
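Both checks can be sketched over a single dict keyed by DOI (the DOIs and flags below are made up for illustration):

```python
# Build one dict mapping DOI -> ok flag instead of two parallel lists;
# each check then costs O(1) instead of scanning a ~2000-item list.
status = {"10.1000/a": True, "10.1000/b": False}

doi = "10.1000/b"
print(doi in status)           # is the article in list_doi at all?
print(status.get(doi, False))  # is it in list_doi AND ok?
```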

So I wonder, because performance matters here: which is faster/better/more economical:

  • build the two lists once, and check whether the article is present in both lists
  • write the function another way, checkArticle(doi_article), which would perform a SQL query for each article

What about speed and RAM usage? Will the results be different with a few items versus a lot of them?

JPFrancoia

2 Answers


Use time.perf_counter() to determine how long this process currently takes.

import time

time_start = time.perf_counter()
# your code here
print(time.perf_counter() - time_start)

Based on that, if it is going too slow, you can try each of your options and time them as well to look for an improvement in performance. As for checking the RAM usage, a simple way is this:

import os

import psutil

process = psutil.Process(os.getpid())
print(process.memory_info().rss / float(2 ** 20))  # resident memory usage in MB

For a more in-depth memory usage check, look here: https://stackoverflow.com/a/110826/3841261. Always have a way to objectively measure when looking to improve speed, RAM usage, etc.
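To make that concrete, here is a toy benchmark (made-up DOIs, not the asker's data) comparing membership tests in a list versus a set at the ~2000-item scale mentioned in the question:

```python
import time

# Made-up DOIs at roughly the scale from the question
items = ["10.1000/%d" % i for i in range(2000)]
as_list = list(items)
as_set = set(items)

t0 = time.perf_counter()
for _ in range(1000):
    "10.1000/1999" in as_list   # linear scan each time
list_time = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(1000):
    "10.1000/1999" in as_set    # hash lookup each time
set_time = time.perf_counter() - t0

print(set_time < list_time)  # set membership wins by a wide margin
```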

Chad Kennedy

I would execute one SQL query that finds the articles that are OK in a single pass (perhaps in a function called find_articles() or something).

Think of it this way: why do something twice (copy all those rows and work with them) when you could do it once?

You want to basically execute this:

SELECT * FROM papers WHERE (PAPERID IN OTHERTABLE AND OTHER_CONSTRAINT = 'WHATEVER')

That's obviously just Pseudocode but I think you can figure it out.
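Here is one way that idea could look with the standard sqlite3 module (the question uses QtSql, but the SQL is the same; the table and column names come from the question, and the rows are made up):

```python
import sqlite3

# Toy table matching the question's schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (doi TEXT, journal TEXT, graphical_abstract TEXT)")
conn.executemany(
    "INSERT INTO papers VALUES (?, ?, ?)",
    [("10.1000/a", "JACS", "some graphical abstract"),
     ("10.1000/b", "JACS", "Empty")],
)

# One query returns each DOI together with a computed ok flag,
# instead of building two parallel lists in Python
status = {
    doi: bool(ok)
    for doi, ok in conn.execute(
        "SELECT doi, graphical_abstract != 'Empty' FROM papers WHERE journal = ?",
        ("JACS",),
    )
}
print(status)  # -> {'10.1000/a': True, '10.1000/b': False}
```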

NineToeNerd
  • Sorry, I forgot to mention that I also need to check whether an article is in list_doi alone. See it this way: article not in list_doi and not in list_ok -> action a; article in list_doi but not in list_ok -> action b; article in list_doi and in list_ok -> action c – JPFrancoia Sep 21 '15 at 14:11
  • And list_doi is just a list of article IDs, right? So you can do the same thing with a function called has_id or in_table by executing the query ("select count(*) from tablename where id=" + ID) > 0 – NineToeNerd Sep 21 '15 at 14:13
  • Yes, I could do that. But that's not the question: which would be more efficient? Performing a SQL query each time I need to check an article, or building the two lists once and searching them each time I need to check an article? – JPFrancoia Sep 21 '15 at 14:16
  • I see what you are asking. One sec, I just found an article that should help you. – NineToeNerd Sep 21 '15 at 14:18