
I'm trying to load a Dask dataframe from a SQL connection. Per the read_sql_table documentation, it is necessary to pass in an index_col. What should I do if there's a possibility that there are no good columns to act as index?
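For reference, the documented call looks roughly like the sketch below; the table name, index column and connection URI here are placeholders rather than anything from my actual database:

import dask.dataframe as dd

# Sketch of the documented usage; 'my_table', 'id' and the URI are
# placeholders, not real names from my setup
df = dd.read_sql_table(
    'my_table',
    'postgresql://user:password@host:5432/mydb',
    index_col='id',     # read_sql_table requires an indexable column
    npartitions=8,
)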

Could this be a suitable replacement?

import math

import dask
import dask.dataframe as dd
import pandas as pd

# num_records, chunk_size, table, sql_uri and dfs are defined earlier

# Break SQL query into chunks
chunks = []
num_chunks = math.ceil(num_records / chunk_size)

# Run query for each chunk on Dask workers
for i in range(num_chunks):
    query = f'SELECT * FROM {table} LIMIT {i * chunk_size},{chunk_size}'
    chunk = dask.delayed(pd.read_sql)(query, sql_uri)
    chunks.append(chunk)

# Aggregate chunks into a single Dask dataframe
df = dd.from_delayed(chunks)
dfs[table] = df
jrdzha

1 Answer


Unfortunately, LIMIT/OFFSET is not a reliable way to partition a query in most SQL implementations. In particular, to reach an offset and fetch later rows, the engine often has to scan through all of the earlier rows first, so the total work to generate a set of partitions is greatly multiplied. Worse, because SQL guarantees no row order without an explicit ORDER BY, successive OFFSET windows can overlap or skip rows, leaving you with duplicated or missing records. This was the reasoning behind requiring boundary values in Dask's read_sql_table implementation.

However, there is nothing wrong in principle with the way you are setting up your Dask dataframe. If you can show that your server does not suffer from the problems we were anticipating, then you are welcome to take that approach.
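To illustrate what "boundary values" means here, a minimal sketch of the built-in approach might look like the following (the `id` column and the division values are invented for illustration):

import dask.dataframe as dd

# Illustrative only: 'id' and the division values are assumptions.
# Each adjacent pair of divisions becomes a bounded range condition on
# the index column, so the engine never has to scan past an OFFSET.
df = dd.read_sql_table(
    table,
    sql_uri,
    index_col='id',
    divisions=[0, 100000, 200000, 300000],
)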

mdurant
  • Ah I see, so is there another recommended / more reliable way of doing this without using LIMIT/OFFSET then? Thanks! – jrdzha May 25 '20 at 18:09
  • Use an index :) Seriously, both from Dask's point of view, from the view of reliable query construction and execution, and from the DB engine standpoint, using indexes is the way to go. You could specify an index_col which is not indexed in the DB, of course, but performance will suffer. – mdurant May 25 '20 at 18:16
  • (and I mean, each partition picked by some `WHERE val <= COL < val`, which is what read_sql_table does internally) – mdurant May 25 '20 at 18:19
  • I see, thanks for your help! As a followup question, with the way Dask's read_sql_table sets up separate connections for each partition, wouldn't it be possible for new records to be inserted into the table as it's being queried by the Dask workers, causing an inconsistent view of the table? – jrdzha May 25 '20 at 18:44
  • I don't think there's any way to specify "do this query on the data as it was at time X". Yes, some partitions may indeed contain new data and others not, but at least we guarantee that each row in a partition does indeed satisfy its boundary conditions. – mdurant May 25 '20 at 18:47
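Putting the comments above together, a hand-rolled sketch of that `WHERE val <= COL < val` pattern, as an alternative to the LIMIT/OFFSET loop in the question, could look roughly like this (reusing `table` and `sql_uri` from the question; the `id` column and the boundary list are assumptions for illustration):

import dask
import dask.dataframe as dd
import pandas as pd

# Hypothetical boundaries on an 'id' column; in practice they might come
# from something like SELECT MIN(id), MAX(id) FROM the table
boundaries = [0, 100000, 200000, 300000]

chunks = []
for lo, hi in zip(boundaries[:-1], boundaries[1:]):
    # Each partition is a bounded range query rather than an OFFSET scan;
    # the final partition would need 'id <= {hi}' to keep the maximum row
    query = f'SELECT * FROM {table} WHERE id >= {lo} AND id < {hi}'
    chunks.append(dask.delayed(pd.read_sql)(query, sql_uri))

df = dd.from_delayed(chunks)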