12

While reading large relations from a SQL database to a pandas dataframe, it would be nice to have a progress bar, because the number of tuples is known statically and the I/O rate could be estimated. It looks like the tqdm module has a function tqdm_pandas which will report progress on mapping functions over columns, but by default calling it does not have the effect of reporting progress on I/O like this. Is it possible to use tqdm to make a progress bar on a call to pd.read_sql?

Tom Burrows
  • 2,225
  • 2
  • 29
  • 46
seewalker
  • 1,123
  • 10
  • 18
  • 3
    I'm afraid not, because `pandas` just dispatches the query to the database and waits for a response. There's no intermediate feedback until the entire result set arrives. – Michael Griffiths Oct 27 '16 at 12:06

3 Answers

9

Edit: This answer is misleading - chunksize has no effect on the database side of the operation. See the comments below.

You could use the chunksize parameter to do something like this:

import pandas as pd
from tqdm import tqdm

# chunksize makes read_sql return an iterator of DataFrames rather than one frame
chunks = pd.read_sql('SELECT * FROM table', con=conn, chunksize=100)

df = pd.DataFrame()
for chunk in tqdm(chunks):
    df = pd.concat([df, chunk])

I think this would use less memory as well.
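If the repeated pd.concat inside the loop becomes a bottleneck (see the comments below), a common variant is to collect the chunks in a list and concatenate once at the end. A minimal sketch, assuming an existing database connection conn and a placeholder table name:

import pandas as pd
from tqdm import tqdm

# chunksize makes read_sql return an iterator of DataFrames.
chunks = pd.read_sql("SELECT * FROM table", con=conn, chunksize=100)

# Collect the chunks first and concatenate once, avoiding the quadratic
# cost of concatenating inside the loop.
frames = [chunk for chunk in tqdm(chunks, unit="chunk")]
df = pd.concat(frames, ignore_index=True)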

Alex
  • 12,078
  • 6
  • 64
  • 74
  • 12
    This doesn't work because the chunking happens *after* `read_sql` has finished loading the entire set. All `tqdm` would do here is measure the progress of the `pd.concat` operation. – Steven Jul 15 '20 at 01:17
  • 3
    Concatenating the chunks one by one is very inefficient and should generally be avoided. – Wei Qiu Nov 13 '20 at 13:37
  • How, then, is this marked as the accepted answer if it doesn't work..? – jtlz2 Feb 10 '22 at 08:18
  • 1
    I just figured out that if we pass `execution_options={'stream_results': True}` to `create_engine`, it doesn't load the whole result set at once; instead the database streams one chunk at a time, so physical memory is no longer a limitation. – Oshan Mar 30 '23 at 13:51
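Following up on the stream_results comment above: a minimal sketch, assuming a PostgreSQL database reachable through a hypothetical SQLAlchemy URL and a placeholder table name. Passing execution_options={'stream_results': True} to create_engine asks the driver for a server-side cursor, so read_sql with chunksize fetches rows incrementally instead of buffering the whole result set first.

import pandas as pd
from sqlalchemy import create_engine
from tqdm import tqdm

# Hypothetical connection URL; 'stream_results' requests a server-side cursor
# so the result set is fetched incrementally rather than buffered up front.
engine = create_engine(
    "postgresql+psycopg2://user:password@host/dbname",
    execution_options={"stream_results": True},
)

with engine.connect() as conn:
    chunks = pd.read_sql("SELECT * FROM table", con=conn, chunksize=1_000)
    # tqdm wraps the chunk iterator, so the bar advances as chunks arrive.
    df = pd.concat(tqdm(chunks, unit="chunk"), ignore_index=True)
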
0

I wanted to provide an alternative answer, given that the previous solutions really only report the status of the concat operations. There is still a bit of "best guess" work, but I think it's a bit closer to a viable solution.

Summary

The general concept is to check the status of a function execution every second and populate a status bar up to an arbitrarily defined max number of seconds. YMMV on these values. (If the execution takes longer than the total defined, it simply removes the status bar)

  • Create a manual tqdm progress bar and adjust the settings as you see fit.
  • Use asyncio.create_task to spawn a task. Wrap the pandas read method in asyncio.to_thread so it runs in a worker thread as an awaitable coroutine.
  • Check the task status in a while loop, updating the progress bar if the execution is not done, or returning the results if it is complete.
async def sql_handler(query, conn):
    progress = tqdm(desc="Executing SQL Query",total=8)
    task = asyncio.create_task(
        asyncio.to_thread(pd.read_sql, sql=query, con=conn)
    )

    while True:
        status = task.done()
        progress.update()
        if status:
            return task.result()
        await asyncio.sleep(1)

async def main():
    await sql_handler(query=..., conn=...)

if __name__ == "__main__":
    asyncio.run(main())

Full Example

Here's a more complete example using SQLAlchemy to read from a Snowflake database, including some settings to help get the desired output.

import asyncio
import pandas as pd
from pandas import DataFrame
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine
from sqlalchemy.engine import Engine
from tqdm import tqdm

async def sql_handler(query: str, connection: Engine) -> DataFrame:
    progress = tqdm(
        desc="Executing SQL Query",
        ncols=100,
        bar_format="{desc}: {bar}[{elapsed}]",
        total=8,
        )
    task = asyncio.create_task(
        asyncio.to_thread(pd.read_sql, sql=query, con=connection)
    )
    while True:
        status = task.done()
        progress.update()
        if status:
            return task.result()
        # With progress.total = 8 -> Check status every 1 second for 8 seconds
        await asyncio.sleep(1)

async def main() -> None:
    conn = create_engine(
        URL(account=...,
            user=...,
            password=...,
            warehouse=...,
            database=...,
            schema=...,
        )
    )
    query = "SELECT * FROM table"
    results_data_frame = await sql_handler(query, conn)
    print(results_data_frame.to_dict(orient="records"))

if __name__ == "__main__":
    asyncio.run(main())

Output

Executing SQL Query: █████████████████████████████████████████      [00:03]


jmgreg31
  • 11
  • 2
-1

Yes, you can!

Expanding the answer here and Alex's answer to include tqdm, we get:

import pandas as pd
from tqdm import tqdm

# get the total number of rows
q = "SELECT COUNT(*) FROM table"
total_rows = pd.read_sql_query(q, conn).values[0, 0]
# note that COUNT(*) should not download the whole table;
# some engines may prefer SELECT MAX(ROWID) or similar instead

# read the table with a tqdm status bar
q = "SELECT * FROM table"
rows_in_chunk = 1_000
chunks = pd.read_sql_query(q, conn, chunksize=rows_in_chunk)
df = tqdm(chunks, total=total_rows / rows_in_chunk)
df = pd.concat(df)

output example:

39%|███▉      | 99/254.787 [01:40<02:09,  1.20it/s]
lisrael1
  • 348
  • 2
  • 7
  • This doesn't work. `pd.read_sql_query` loads all the chunks in memory. `tqdm` is only displaying progress of concatenation process. – Oshan Mar 29 '23 at 09:33
  • I think it does work, otherwise what would chunksize in pd.read_sql_query be for? From the docs on chunksize: "If specified, return an iterator where chunksize is the number of rows to include in each chunk." In this code, when pd.concat consumes df, Python first exhausts the tqdm iterator (which does the fetching), and only after that applies the concatenation. – lisrael1 Aug 24 '23 at 21:14
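Whether those chunks are actually streamed from the database or buffered client-side (the point debated in the comments above) depends on the driver, for example server-side cursors via SQLAlchemy's stream_results option. Independent of that, the progress bar total can be rounded up to an integer with math.ceil. A minimal sketch, again assuming an existing connection conn and a placeholder table name:

import math

import pandas as pd
from tqdm import tqdm

# COUNT(*) returns the number of rows without transferring the table itself.
total_rows = pd.read_sql_query("SELECT COUNT(*) FROM table", conn).iat[0, 0]

rows_in_chunk = 1_000
chunks = pd.read_sql_query("SELECT * FROM table", conn, chunksize=rows_in_chunk)

# math.ceil keeps the tqdm total an integer (e.g. 255 instead of 254.787).
df = pd.concat(tqdm(chunks, total=math.ceil(total_rows / rows_in_chunk)))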