1

Does DuckDB support multi-threaded joins? I've configured DuckDB to run on 48 threads, but when executing a simple join query, only one thread is actively working.

Here is an example using the CLI API:

# setting up database relations
CREATE TABLE R AS SELECT * FROM read_csv_auto ('../datasets/facebook_combined.txt');
CREATE TABLE S AS SELECT * FROM read_csv_auto ('../datasets/facebook_combined.txt');
CREATE TABLE T AS SELECT * FROM read_csv_auto ('../datasets/facebook_combined.txt');
CREATE TABLE U AS SELECT * FROM read_csv_auto ('../datasets/facebook_combined.txt');
CREATE TABLE V AS SELECT * FROM read_csv_auto ('../datasets/facebook_combined.txt');

# set number of available threads
SET threads TO 48;

# check if configuration was successful
SELECT current_setting('threads');

# query that takes a while to compute
SELECT COUNT(*) FROM R JOIN S ON (R.column1 = S.column0) JOIN T ON (S.column1 = T.column0) JOIN U ON (T.column1 = U.column0) JOIN V ON (U.column1 = V.column0);

You can find the dataset that I used here:

SNAP Stanford graph dataset - Facebook

Many thanks!

I used DuckDB to join multiple tables. I expected the database to use multiple threads, as it was configured accordingly. However, only one thread was utilised.

cmq
  • 11
  • 4

1 Answers1

0

DuckDB does use multiple threads when joining in most cases, so there is likely another factor that is limiting that in some way.

Could you also include the explain and the explain/analyze plan? Just prepend your query with Explain and Explain analyze .

One other thing to try would be to output all of the joined data into a table (a temp table will keep it all in RAM if you'd like), and then do the count(*) on that resulting table. I have a hunch that the count(*) is what is limiting the parallelism.

CREATE TEMP TABLE my_table AS
SELECT 
* 
FROM R 
JOIN S ON (R.column1 = S.column0) 
JOIN T ON (S.column1 = T.column0) 
JOIN U ON (T.column1 = U.column0) 
JOIN V ON (U.column1 = V.column0);

SELECT COUNT(*) FROM my_table;