Situation
I am using Python 3.7.2 with its built-in `sqlite3` module (`sqlite3.version == 2.6.0`).
I have an SQLite database with a single table `EVENT_LOG` that looks like:
| user_id | action | timestamp |
| ------- | ------ | ---------- |
| Alice | 0 | 1551683796 |
| Alice | 23 | 1551683797 |
| James | 1 | 1551683798 |
| ....... | ...... | .......... |
where `user_id` is TEXT, `action` is an arbitrary INTEGER, and `timestamp` is an INTEGER representing Unix time. The table has 200M rows, and there are 70K distinct `user_id`s.
Goal
I need to make a Python dictionary that looks like:
{
    "Alice": [(0, 1551683796), (23, 1551683797)],
    "James": [(1, 1551683798)],
    ...
}
which has `user_id`s as keys and the corresponding event logs as values, where each log is a list of `(action, timestamp)` tuples. Ideally each list is sorted by `timestamp` in increasing order, but even if it isn't, that can easily be fixed by sorting each list after the dictionary is built, as sketched below.
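For example, the post-hoc sort would be a one-liner per list (a minimal sketch, assuming `classified_log` is the dictionary described above):

# Sort each user's log in place by timestamp (the second tuple element)
for log in classified_log.values():
    log.sort(key=lambda event: event[1])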
Effort
I have the following code to query the database. It first queries for the list of users (with `user_list_cursor`), and then queries for all rows belonging to each user.
import sqlite3

connection = sqlite3.connect("database.db")

user_list_cursor = connection.cursor()
user_list_cursor.execute("SELECT DISTINCT user_id FROM EVENT_LOG")
user_id = user_list_cursor.fetchone()

classified_log = {}
log_cursor = connection.cursor()
while user_id:
    user_id = user_id[0]  # cursor.fetchone() returns a tuple
    query = (
        "SELECT action, timestamp"
        " FROM EVENT_LOG"
        " WHERE user_id = ?"
        " ORDER BY timestamp ASC"
    )
    parameters = (user_id,)
    log_cursor.execute(query, parameters)  # Here is the bottleneck
    classified_log[user_id] = list()
    for row in log_cursor.fetchall():
        classified_log[user_id].append(row)
    user_id = user_list_cursor.fetchone()
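A quick way to check whether that per-user query scans the whole table (my suspicion, since there is no index on `user_id`) is `EXPLAIN QUERY PLAN`; a minimal sketch reusing the connection above, with "Alice" as a sample user:

# A detail string like "SCAN TABLE EVENT_LOG" (rather than
# "SEARCH ... USING INDEX") means every row is read for every user.
plan_cursor = connection.cursor()
plan_cursor.execute(
    "EXPLAIN QUERY PLAN"
    " SELECT action, timestamp FROM EVENT_LOG"
    " WHERE user_id = ? ORDER BY timestamp ASC",
    ("Alice",),
)
print(plan_cursor.fetchall())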
Problem
The query execution for each user is too slow: the single line marked as the bottleneck takes around 10 seconds per `user_id`, and with 70K users that adds up to roughly 700,000 seconds, i.e. over a week. I think I am taking the wrong approach with the queries. What is the right way to achieve the goal?
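For concreteness, here is the kind of single-pass alternative I have been wondering about: issue one query ordered by `user_id` and `timestamp`, then group consecutive rows in Python (a sketch, untested at this scale):

import sqlite3
from itertools import groupby
from operator import itemgetter

connection = sqlite3.connect("database.db")
cursor = connection.cursor()

# One pass over the table instead of one query per user; rows arrive
# grouped by user_id and sorted by timestamp within each group.
cursor.execute(
    "SELECT user_id, action, timestamp"
    " FROM EVENT_LOG"
    " ORDER BY user_id, timestamp ASC"
)
classified_log = {
    user_id: [(action, timestamp) for _, action, timestamp in rows]
    for user_id, rows in groupby(cursor, key=itemgetter(0))
}

I don't know whether this is actually faster here, or whether an index would be required first.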
I tried searching with the keywords "classify db by a column", "classify sql by a column", and "sql log to dictionary python", but nothing seems to match my situation. I don't think this is a rare need, so maybe I'm missing the right keyword to search with.
Reproducibility
If anyone is willing to reproduce the situation with a 200M-row SQLite database, the following code will create a 5 GB database file. But I hope there is somebody who is familiar with such a situation and knows how to write the right query.
import sqlite3
import random

connection = sqlite3.connect("tmp.db")
cursor = connection.cursor()
cursor.execute(
    "CREATE TABLE IF NOT EXISTS EVENT_LOG (user_id TEXT, action INTEGER, timestamp INTEGER)"
)

query = "INSERT INTO EVENT_LOG VALUES (?, ?, ?)"
BATCH_SIZE = 1_000_000  # flush in batches so the parameter list fits in memory
parameters = []
for timestamp in range(200_000_000):
    user_id = f"user{random.randrange(70_000)}"  # 70K distinct user_ids
    action = random.randint(0, 1_000_000)
    parameters.append((user_id, action, timestamp))
    if len(parameters) == BATCH_SIZE:
        cursor.executemany(query, parameters)
        parameters = []
if parameters:
    cursor.executemany(query, parameters)
connection.commit()
cursor.close()
connection.close()
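If an index turns out to be part of the answer, the reproduction database can be prepared with one as well; a sketch, where the index name `idx_event_log_user_ts` is just something I made up:

import sqlite3

# A composite index on (user_id, timestamp) would let per-user queries
# seek directly instead of scanning all 200M rows; building it on a
# table this size takes a while and grows the file.
connection = sqlite3.connect("tmp.db")
connection.execute(
    "CREATE INDEX IF NOT EXISTS idx_event_log_user_ts"
    " ON EVENT_LOG (user_id, timestamp)"
)
connection.commit()
connection.close()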