
I'm trying to use read_sql_table from Dask, but I'm facing some issues related to the index_col parameter. My SQL table doesn't have any numeric column, and I don't know what to give to the index_col parameter.

I read in the documentation that if the index_col is of type "object" I have to provide the "divisions" parameter, but I don't know what the values in my index_col are before reading the table.

I'm really confused. I don't know why I have to give an index_col when using read_sql_table but don't have to when using read_csv.

2 Answers


I've found in certain situations it's easiest to handle this by scattering DataFrame objects out to the cluster by way of pd.read_sql and its chunksize argument:

import pandas as pd
from dask import bag as db
from dask.distributed import Client

client = Client(...)                          # Client connected to your Dask scheduler

sql_text = "SELECT ..."
sql_meta = {"column0": "object", "column1": "uint8"}
sql_conn = connect(...)                       # Your DB-API / SQLAlchemy connection
dfs_futs = map(client.scatter,                # Scatter each chunk to the cluster
               pd.read_sql(sql_text,
                           sql_conn,
                           chunksize=10_000,  # Iterate in chunks of 10,000 rows
                           columns=list(sql_meta.keys())))

# Now join our chunks (remotely) into a single frame.
df = db.from_sequence(list(dfs_futs)).to_dataframe(meta=sql_meta)

This is nice since you don't need to manage database drivers/packages on the distributed nodes, and it also covers situations where it's difficult to partition your data cleanly.

Just a note on performance: for my use case we leverage our database's external table operations to spool data out to a CSV and then read that with pd.read_csv (it's pretty much the same deal as above). Compared to the SELECT ... FROM ... WHERE queries Dask uses to parallelize and chunk up the read, this can be acceptable performance-wise, since there is a cost to performing the chunking inside the database.
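
For reference, a rough sketch of that CSV variant, reusing client and sql_meta from the block above; the spool path is hypothetical:

import pandas as pd
from dask import bag as db

csv_path = "/staging/my_table.csv"            # Hypothetical file the database spooled the table to

csv_futs = map(client.scatter,                # Scatter each chunk, same as the SQL version
               pd.read_csv(csv_path,
                           chunksize=10_000,
                           dtype=sql_meta))   # Reuse the column dtypes as a dtype mapping

# Join the chunks (remotely) into a single frame, as before.
df = db.from_sequence(list(csv_futs)).to_dataframe(meta=sql_meta)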

joebeeson
  • OK, but note that all data is streaming through your client. It should be OK on memory use, at least. – mdurant Jul 09 '20 at 19:33
  • Correct, which is fine with me since my client is co-located inside the same data center as my database appliance. – joebeeson Jul 09 '20 at 22:23
  • Hi @joebeeson, thank you for your answer. I didn't quite understand what the cons of your approach are. Could you try to clarify? – Thiago Dantas Jul 10 '20 at 17:27
  • The downside is that you have to shuttle all the data from the database to the executing process, then to the cluster. This is manageable memory-wise for any size of input since it's done in chunks (_so long as your chunk size < total memory_) – joebeeson Jul 10 '20 at 19:32
  • What value do we need to provide for `client`? – Rajnish kumar Nov 25 '20 at 13:27
  • @Rajnishkumar the `client` should be the `distributed.Client` object for your Dask scheduler. – joebeeson Nov 25 '20 at 16:35

Dask needs a way to be able to read the partitions of your data independently of one another. This means being able to phrase the query for each part with a clause like "WHERE index_col >= val0 AND index_col < val1". If you have nothing numerical, Dask can't guess reasonable values for you; you can still do this if you can determine a way to provide reasonable delimiters, like list(string.ascii_letters), as in the sketch below. You can also provide your own complete WHERE clauses if you must.
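
A minimal sketch of supplying string divisions by hand (the connection URI, table name and index column are hypothetical); each adjacent pair of division values becomes one such WHERE query:

import string
import dask.dataframe as dd

uri = "postgresql://user:pass@host/mydb"      # Hypothetical connection URI

# Divisions are the sorted boundary values of the index column; N boundaries
# give N - 1 partitions. Here we guess alphabetical boundaries for a string index.
divisions = list(string.ascii_lowercase)      # 'a', 'b', ..., 'z' -> 25 partitions

df = dd.read_sql_table(
    "my_table",                               # Hypothetical table name
    uri,
    index_col="name",                         # Hypothetical string column to partition on
    divisions=divisions,
)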

Note that OFFSET/LIMIT does not work for this task, because

  • the result ordering is not in general guaranteed for any given inputs (this behaviour is database-implementation specific)
  • getting to some particular offset is done by paging through the results of a whole query, so the server has to do many times the amount of work necessary
mdurant
  • Thank you for your answer! Still, I don't understand why the read_csv method doesn't require an index_col... Why doesn't dask just create a default index (like pandas)? Also, I'd like to know what you suggest as a solution for me: I want to read a SQL table that doesn't have an obvious index_col. All the columns are strings and I don't know in advance what the values in the columns are, so I can't define the "divisions" parameter. – Thiago Dantas Jul 10 '20 at 17:21
  • Then you are out of luck - dask needs a way to partition your dataframe, so that each worker task can act independently. The DB implementations do not, in general, provide a default numerical index, which is easy to make only once you have all of the data. – mdurant Jul 10 '20 at 21:22