
I'm trying to use read_sql_table from Dask, but I'm facing some issues related to the index_col parameter. My SQL table doesn't have any numeric column, and I don't know what to give to the index_col parameter.

I read in the documentation that if the index_col is of type "object" I have to provide the "divisions" parameter, but I don't know what the values in my index_col are before reading the table.

I'm really confused. I don't know why I have to give an index_col when using read_sql_table but don't have to when using read_csv.

2 Answers


I've found in certain situations it's easiest to handle this by scattering DataFrame objects out to the cluster by way of pd.read_sql and its chunksize argument:

import pandas as pd
from dask import bag as db
from dask.distributed import Client

client = Client(...)                          # Client connected to your Dask scheduler

sql_text = "SELECT ..."
sql_meta = {"column0": "object", "column1": "uint8"}
sql_conn = connect(...)                       # Your DB-API / SQLAlchemy connection
dfs_futs = map(client.scatter,                # Scatter each chunk to the cluster
               pd.read_sql(sql_text,
                           sql_conn,
                           chunksize=10_000,  # Iterate in chunks of 10,000 rows
                           columns=list(sql_meta.keys())))

# Now join our chunks (remotely) into a single frame.
df = db.from_sequence(list(dfs_futs)).to_dataframe(meta=sql_meta)

This is nice since you don't need to manage database drivers/packages on the distributed nodes, and it also covers situations where it's difficult to partition your data cleanly.

Just a note on performance: for my use case we leverage our database's external table operations to spool data out to a CSV and then read that with pd.read_csv (it's pretty much the same deal as above). Compared to the SELECT ... FROM ... WHERE queries Dask uses to parallelize and chunk up the read, this can be acceptable performance-wise, since there is a cost to performing the chunking inside the database.
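
For reference, a rough sketch of that CSV variant, reusing client and sql_meta from the block above; the spool path is hypothetical:

import pandas as pd
from dask import bag as db

csv_path = "/staging/my_table.csv"            # Hypothetical file the database spooled the table to

csv_futs = map(client.scatter,                # Scatter each chunk, same as the SQL version
               pd.read_csv(csv_path,
                           chunksize=10_000,
                           dtype=sql_meta))   # Reuse the column dtypes as a dtype mapping

# Join the chunks (remotely) into a single frame, as before.
df = db.from_sequence(list(csv_futs)).to_dataframe(meta=sql_meta)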

joebeeson
  • OK, but note that all data is streaming through your client. It should be OK on memory use, at least. – mdurant Jul 09 '20 at 19:33
  • Correct, which is fine with me since my client is co-located inside the same data center as my database appliance. – joebeeson Jul 09 '20 at 22:23
  • Hi @joebeeson, thank you for your answer. I didn't quite understand what the cons of your approach are. Could you try to clarify? – Thiago Dantas Jul 10 '20 at 17:27
  • The downside is that you have to shuttle all the data from the database to the executing process, then to the cluster. This is manageable memory-wise for any size of input since it's done in chunks (_so long as your chunk size < total memory_) – joebeeson Jul 10 '20 at 19:32
  • What value do we need to provide for `client`? – Rajnish kumar Nov 25 '20 at 13:27
  • @Rajnishkumar the `client` should be the `distributed.Client` object for your Dask scheduler. – joebeeson Nov 25 '20 at 16:35

Dask needs a way to be able to read the partitions of your data independently of one another. This means being able to phrase the query for each part with a clause like "WHERE index_col >= val0 AND index_col < val1". If you have nothing numerical, Dask can't guess reasonable values for you; you can still do this if you can determine a way to provide reasonable delimiters, like list(string.ascii_letters), as in the sketch below. You can also provide your own complete WHERE clauses if you must.
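
A minimal sketch of supplying string divisions by hand (the connection URI, table name and index column are hypothetical); each adjacent pair of division values becomes one such WHERE query:

import string
import dask.dataframe as dd

uri = "postgresql://user:pass@host/mydb"      # Hypothetical connection URI

# Divisions are the sorted boundary values of the index column; N boundaries
# give N - 1 partitions. Here we guess alphabetical boundaries for a string index.
divisions = list(string.ascii_lowercase)      # 'a', 'b', ..., 'z' -> 25 partitions

df = dd.read_sql_table(
    "my_table",                               # Hypothetical table name
    uri,
    index_col="name",                         # Hypothetical string column to partition on
    divisions=divisions,
)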

Note that OFFSET/LIMIT does not work for this task, because

  • the result ordering is not in general guaranteed for any given inputs (this behaviour is database-implementation specific)
  • getting to some particular offset is done by paging through the results of a whole query, so the server has to do many times the amount of work necessary
mdurant
  • Thank you for your answer! Still, I don't understand why the read_csv method doesn't require an index_col... Why doesn't dask just create a default index (like pandas)? Also, I'd like to know what you suggest as a solution for me: I want to read a SQL table that doesn't have an obvious index_col. All the columns are strings and I don't know in advance what the values in the columns are, so I can't define the "divisions" parameter. – Thiago Dantas Jul 10 '20 at 17:21
  • Then you are out of luck - dask needs a way to partition your dataframe, so that each worker task can act independently. The DB implementations do not, in general, provide a default numerical index, which is easy to make only once you have all of the data. – mdurant Jul 10 '20 at 21:22