I have partitioned my data set into two partitions of 5M rows each. Each partition is loaded into a table on a machine of its own. I use a central MonetDB instance where I register both tables as remote tables and add them to a merge table.

When I run a query on the merge table, I would expect MonetDB to distribute the query, in parallel, to both partition tables. However, the results produced by tomograph show that the remote tables are queried sequentially, one after the other.

I've compiled MonetDB myself using a recent source tarball. I've disabled geom and made sure embedded Python was available. Other than that, I've not changed any settings or configure flags. The two machines holding the partitions are 1-core VMs with 4GB of memory. The central machine is my laptop, which has 4 cores and 16GB of memory. I have also run this experiment using a central node with the same configuration as the partitions.

I created the tables like this:

-- On each partition (X = {1, 2}):
CREATE TABLE responses_pX (
    r_id int primary key,
    r_date date,
    r_status tinyint,
    age tinyint,
    movie varchar(25),
    score tinyint
);

-- On central node:
CREATE MERGE TABLE responses (
    r_id int primary key,
    r_date date,
    r_status tinyint,
    age tinyint,
    movie varchar(25),
    score tinyint
);

-- On central node, once per partition (X = {1, 2}):
CREATE REMOTE TABLE responses_pX (
    r_id int primary key,
    r_date date,
    r_status tinyint,
    age tinyint,
    movie varchar(25),
    score tinyint
) ON 'mapi:monetdb://partitionX:50000/partitionX';

ALTER TABLE responses ADD TABLE responses_pX;
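
For concreteness, with X = 1 and X = 2 substituted, the central-node statements above would read roughly as follows. This is only a sketch of the expansion: the host and database names `partition1` and `partition2` simply follow the `partitionX` placeholder pattern and are assumptions.

```sql
-- Register the table that lives on the first partition machine.
-- Host/database name "partition1" is assumed from the partitionX pattern.
CREATE REMOTE TABLE responses_p1 (
    r_id int primary key,
    r_date date,
    r_status tinyint,
    age tinyint,
    movie varchar(25),
    score tinyint
) ON 'mapi:monetdb://partition1:50000/partition1';

-- Register the table that lives on the second partition machine.
-- Host/database name "partition2" is assumed from the partitionX pattern.
CREATE REMOTE TABLE responses_p2 (
    r_id int primary key,
    r_date date,
    r_status tinyint,
    age tinyint,
    movie varchar(25),
    score tinyint
) ON 'mapi:monetdb://partition2:50000/partition2';

-- Attach both remote tables to the merge table.
ALTER TABLE responses ADD TABLE responses_p1;
ALTER TABLE responses ADD TABLE responses_p2;
```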

I'm running the following queries on the central node:

SELECT COUNT(*) FROM responses;
SELECT COUNT(*), SUM(score) FROM responses;
SELECT r_date, age, SUM(score)/COUNT(score) as avg_score FROM responses GROUP BY r_date, age;

For all queries the parallelism reported by the tomograph tool is no higher than 2.11%.

vdeurzen

1 Answer

Yes, MonetDB uses parallel processing where possible. See the documentation on distributed query processing: https://www.monetdb.org/Documentation/Cookbooks/SQLrecipes/DistributedQueryProcessing
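
As a hedged sketch that is not taken from the linked page: one way to see what the central node does with a query on the merge table is to prefix it with MonetDB's `PLAN` and `EXPLAIN` statement modifiers. The query below is one of those from the question; what the output looks like depends on the MonetDB version, so treat this as a starting point for inspection rather than a diagnosis.

```sql
-- Relational plan: should show how the merge table "responses"
-- is expanded into its member tables.
PLAN SELECT COUNT(*), SUM(score) FROM responses;

-- MAL plan: lists the low-level instructions generated for the query,
-- including those that touch the remote tables.
EXPLAIN SELECT COUNT(*), SUM(score) FROM responses;
```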

mkersten
That is the page I used to set up my tables. However, benchmarking shows remarkably little parallelism, where I would expect much more. For example, I would expect the two partition tables to be queried in parallel, but according to `tomograph` the two subqueries are run sequentially. Is there any explanation for that? Is there something that I might have misconfigured? – vdeurzen Aug 05 '16 at 15:58