
Every developer working with databases runs into this problem, and you often cannot estimate how many records a table will hold in x years.

Especially when working with O/R mappers, this is extremely uncomfortable!

Why doesn't the RDBMS driver solve this problem? Why transfer a record multiple times instead of transferring it once and then referencing it? For the client application this could be completely transparent, or it could even offer advanced features. Especially with an O/R mapper it might be very useful to create sub-beans analogous to the DB data, only as references.

It would be wonderful if you could join 1:n tables without having to worry about redundant data.

Does anyone know an RDBMS that optimizes like this? Or can this not be done? And if not, why?
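
To make the idea concrete, here is a minimal sketch of what such transparent de-duplication could look like on the client side. The wire format (parent_ref, data) is made up purely for illustration; no real driver works this way as far as I know:

// Hypothetical de-duplicated result: each distinct table_a row arrives once,
// child rows carry only a reference. The driver would expand this back into
// ordinary joined rows before handing them to the application.
function expandRows(array $parents, array $children) {
    $rows = array();
    foreach ($children as $child) {
        // merge the referenced parent's columns back into the child row
        $rows[] = array_merge($parents[$child['parent_ref']], $child['data']);
    }
    return $rows;
}

// example: one parent referenced by two children; the parent is sent only once
$parents  = array(7 => array('id' => 7, 'val1' => 'abc'));
$children = array(
    array('parent_ref' => 7, 'data' => array('b_id' => 1, 'val' => 'x')),
    array('parent_ref' => 7, 'data' => array('b_id' => 2, 'val' => 'y')),
);
print_r(expandRows($parents, $children));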

----- EDIT -----
@Thilo: Thanks for the link. Very interesting.

I've run a test with XAMPP for Windows.
PHP: 5.4.7
MySQL: 5.5.27
The result shows that you have to be careful with JOIN in MySQL.

Every time you do a JOIN you get duplicated data (except in a 1:1 relationship). Why transfer this data multiple times?

Test:

I created two tables: table A with 500 records and nine VARCHAR(32) columns, and table B with 50,000 records (a 1:100 relationship).

-- size parameters: 500 parent rows, 100 child rows per parent (1:100)
SET @numA = 500;
SET @numBperA = 100;

DROP TABLE IF EXISTS `table_b`;
DROP TABLE IF EXISTS `table_a`;

DROP PROCEDURE IF EXISTS fill_table_b;
DROP PROCEDURE IF EXISTS fill_table_a;


CREATE TABLE `table_a` (
  `id`   int(11)     NOT NULL,
  `val1` varchar(32) NOT NULL,
  `val2` varchar(32) NOT NULL,
  `val3` varchar(32) NOT NULL,
  `val4` varchar(32) NOT NULL,
  `val5` varchar(32) NOT NULL,
  `val6` varchar(32) NOT NULL,
  `val7` varchar(32) NOT NULL,
  `val8` varchar(32) NOT NULL,
  `val9` varchar(32) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1;

delimiter $$
CREATE PROCEDURE fill_table_a()
BEGIN
    DECLARE i INT DEFAULT 1;
    SET i = 1;
    WHILE ( i <= @numA) DO
        INSERT INTO table_a (id, val1, val2, val3, val4, val5, val6, val7, val8, val9)
        VALUES (i, md5(rand()), md5(rand()), md5(rand()), md5(rand()), md5(rand()), md5(rand()), md5(rand()), md5(rand()), md5(rand()));
        SET i=i+1;
    END WHILE;
END$$
delimiter ;

call fill_table_a();


CREATE TABLE IF NOT EXISTS `table_b` (
  `id`         int(11)     NOT NULL AUTO_INCREMENT,
  `table_a_id` int(11)     NOT NULL,
  `val`        varchar(32) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `table_a_id` (`table_a_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

ALTER TABLE `table_b` ADD CONSTRAINT `table_b_ibfk_1` FOREIGN KEY (`table_a_id`) REFERENCES `table_a` (`id`);


delimiter $$
CREATE PROCEDURE fill_table_b()
BEGIN
    DECLARE i INT DEFAULT 1;
    DECLARE j INT DEFAULT 1;
    SET i = 1;
    WHILE (i <= @numA) DO
        SET j = 1;
        WHILE (j <= @numBperA) DO
            INSERT INTO table_b (table_a_id, val)
            VALUES (i, md5(rand()));
            SET j=j+1;
        END WHILE;
        SET i=i+1;
    END WHILE;
END$$
delimiter ;

call fill_table_b();

Now I want to select 300 rows from table A and the 30,000 dependent rows from table B. Note that with the JOIN, each table_a row's nine 32-character columns travel once per joined row: 30,000 × 288 bytes ≈ 8.2 MiB, where about 84 KiB (300 × 288 bytes) would suffice.

I did this in three ways:
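
All three PHP snippets below assume an open mysqli connection in $link, e.g. (host and credentials are placeholders):

$link = mysqli_connect('localhost', 'user', 'password', 'test');
if (!$link) {
    die('Connect failed: ' . mysqli_connect_error());
}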

Select A JOIN B with a single request

$time = microtime(true);
for( $i = 0; $i < 50; $i++ ) {
  $resultA = mysqli_query($link, "SELECT * FROM table_a LEFT JOIN table_b ON table_b.table_a_id = table_a.id WHERE table_a.id BETWEEN 100 AND 399");
  $resultArray = array();
  // uncomment the next line for the "with fetch" variant
  //while( $resultArray[] = mysqli_fetch_assoc($resultA) ) {}
  $numRows = mysqli_num_rows($resultA);
}
$time2 = microtime(true);
echo("numSelectedRows: " . $numRows . "<br>time: " . number_format($time2 - $time, 3) . " sec.<br>Memory: " . number_format(memory_get_peak_usage() / 1024 / 1024, 3) . " MiB");
  • with fetch
    numSelectedRows: 30000
    time: 15.539 sec.
    Memory: 55.649 MiB

  • without fetch
    numSelectedRows: 30000
    time: 6.262 sec.
    Memory: 3.431 MiB

Select A with a single request, iterate over the result, and make 300 requests to table B

$time = microtime(true);
for( $i = 0; $i < 50; $i++ ) {
  $numRowsB = 0;
  $resultA = mysqli_query($link, "SELECT * FROM table_a WHERE table_a.id BETWEEN 100 AND 399");
  while( $row = mysqli_fetch_assoc($resultA) ) {
    $resultB = mysqli_query($link, "SELECT * FROM table_b WHERE table_b.table_a_id = " . $row['id']);
    while( mysqli_fetch_assoc($resultB) ) {}
    $numRowsB += mysqli_num_rows($resultB);
  }
}
$numRowsA = mysqli_num_rows($resultA);
$time2 = microtime(true);
echo("numSelectedRows A: " . $numRowsA . "<br>numSelectedRows B: " . $numRowsB . "<br>time: " . number_format($time2 - $time, 3) . " sec.<br>Memory: " . number_format(memory_get_peak_usage() / 1024 / 1024, 3) . " MiB");
  • with fetch
    numSelectedRows A: 300
    numSelectedRows B: 30000
    time: 7.713 sec.
    Memory: 0.364 MiB

Select A with a single request; select B with a single request

$time = microtime(true);
for( $i = 0; $i < 50; $i++ ) {
  $resultA = mysqli_query($link, "SELECT * FROM table_a WHERE table_a.id BETWEEN 100 AND 399");
  $resultB = mysqli_query($link, "SELECT * FROM table_b WHERE table_b.table_a_id BETWEEN 100 AND 399");
  $resultArray = array();
  // uncomment the next two lines for the "with fetch" variant
  //while( $resultArray[] = mysqli_fetch_assoc($resultA) ) {}
  //while( $resultArray[] = mysqli_fetch_assoc($resultB) ) {}
}
$numRowsA = mysqli_num_rows($resultA);
$numRowsB = mysqli_num_rows($resultB);
$time2 = microtime(true);
echo("numSelectedRows A: " . $numRowsA . "<br>numSelectedRows B: " . $numRowsB . "<br>time: " . number_format($time2 - $time, 3) . " sec.<br>Memory: " . number_format(memory_get_peak_usage() / 1024 / 1024, 3) . " MiB");
  • with fetch
    numSelectedRows A: 300
    numSelectedRows B: 30000
    time: 6.020 sec.
    Memory: 15.928 MiB

  • without fetch
    numSelectedRows A: 300
    numSelectedRows B: 30000
    time: 3.018 sec.
    Memory: 1.156 MiB
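
For comparison, here is a fourth variant I did not measure: what most O/R mappers actually do, batching the child lookup with an IN list built from the parent ids and grouping the children client-side. A sketch:

$resultA = mysqli_query($link, "SELECT * FROM table_a WHERE id BETWEEN 100 AND 399");
$ids   = array();
$rowsA = array();
while( $row = mysqli_fetch_assoc($resultA) ) {
  $ids[] = (int)$row['id'];        // collect parent ids (the cast keeps the IN list safe)
  $rowsA[$row['id']] = $row;
}
// one batched child query instead of 300 single-row queries
$resultB = mysqli_query($link, "SELECT * FROM table_b WHERE table_a_id IN (" . implode(',', $ids) . ")");
$rowsB = array();
while( $row = mysqli_fetch_assoc($resultB) ) {
  $rowsB[$row['table_a_id']][] = $row;   // group children under their parent id
}
// $rowsA and $rowsB can now be combined without any duplicated parent data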

Stahlkocher
  • It seems that Oracle does something like this: http://jonathanlewis.wordpress.com/2010/05/07/sqlnet-compression/ – Thilo Jun 15 '13 at 05:19

2 Answers


Why doesn't the RDBMS driver solve this problem?

For the same reason that the query optimizer also sometimes doesn't get it right.

This is hard.

The database (and other software) does its best to optimize query execution, but sometimes you still need to manually "help it".

If anything, I'd prefer that only the database itself even try its hand here, and not have other layers (such as the O/R mapper or the database driver) also fiddle around "automagically". Otherwise the process would become completely unpredictable and hard to steer when necessary.

Thilo
  • It's not about query-execution optimization. It's about transferring redundant data over the socket. Even with no knowledge of the underlying data structure, the redundant data should be easy to compress. Why can the JOIN be slower, then? – Stahlkocher Jun 14 '13 at 23:34
  • I would not be surprised if network protocols use some form of compression already. But is that (overhead of sending repeated data over the network) really a concern compared to other things happening in a database query? – Thilo Jun 15 '13 at 05:16
  • "I would not be surprised if network protocols use some form of compression already": http://jonathanlewis.wordpress.com/2010/05/07/sqlnet-compression/ "SQL*Net has been able to do compression (or rather “de-duplication”) for years, at least since 8.0. If a single session data unit (SDU) contains data with a suitable pattern of duplicates SQL*Net can remove duplicates before transmission (and put them back on receipt, luckily)." – Thilo Jun 15 '13 at 05:18

Because RDBMS drivers do not return structured entities but generic data sets, and they have no way to know how the returned data fields relate to one another: is the count of some related rows to be persisted in the application just because a similarly named field appears in the query?

Were it that easy, someone would already have discovered the mysteries of the universe, and I would gladly sign the form and head off into retirement! ;)

Sadly, which queries you need depends on the schema, the data distribution and, last but not least, on business rules and application features. So in some cases you will need joins, with or without grouping, and in other cases multiple queries will be better, and so be it.
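
To illustrate the "join with grouping" case on the question's schema (a sketch, not something that was benchmarked above; note that MySQL's group_concat_max_len defaults to 1024 bytes and would need raising to hold 100 32-character values per parent):

$result = mysqli_query($link,
  "SELECT a.*, GROUP_CONCAT(b.val) AS b_vals " .
  "FROM table_a a JOIN table_b b ON b.table_a_id = a.id " .
  "WHERE a.id BETWEEN 100 AND 399 " .
  "GROUP BY a.id");
// one row per table_a record; the parent columns are no longer duplicated,
// but the application has to split b_vals itself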

Max
  • Yes, the driver provides unstructured data to the application. But who said that the connection from the driver to the DB has to be all about unstructured data? I'm not deep into DB internals. I'm just wondering why even a simple JOIN of two tables can be slower than a first statement and then further statements for each row. Maybe I'm wrong, but I believed this is a result of the larger amount of data being transferred with the JOIN, because of all the duplicated rows from the second table. Am I wrong? – Stahlkocher Jun 14 '13 at 23:48
  • It depends on many things: how much data is in the tables, how indexes are used in your query, how much data is filtered out and returned in your set, whether the query is cached, etc. Both are different tools that respond to different needs. Finally, it's up to you to decide what is better in each case. I have rarely found a join query that could not be optimized, and when I need to retrieve a lot of data an optimized join usually ends up much more performant in my application layer than multiple queries. With multiple queries you tend to lose time in communication, query compilation and optimization. – Max Jun 15 '13 at 06:09