How to automatically break down a SQL-like query with many joins into discrete, independent steps?

Question

Note: This is a learning exercise to learn how to implement a SQL-like relational database. This is just one thin slice of a question in the overall grand vision.

I have the following query, given a test database with a few hundred records:

select distinct "companies"."name"
from "companies"
inner join "projects" on "projects"."company_id" = "companies"."id"
inner join "posts" on "posts"."project_id" = "projects"."id"
inner join "comments" on "comments"."post_id" = "posts"."id"
inner join "addresses" on "addresses"."company_id" = "companies"."id"
where "addresses"."name" = 'Address Foo'
and "comments"."message" = 'Comment 3/3/2/1';

Here, the query is kind of unrealistic, but it demonstrates the point which I am trying to make. The point is to have a query with a few joins, so that I can figure out how to write this in sequential steps.

The first part of the question is (which I think I've partially figured out), is how do you write these joins as a sequence of independent steps, with the output of one fed into the input of the other? Also, is there more than one way to do it?

// step 1
let companies = select('companies')
// step 2
let projects = join(companies, select('projects'), 'id', 'company_id')
// step 3
let posts = join(projects, select('posts'), 'id', 'project_id')
// step 4
let comments = join(posts, select('comments'), 'id', 'post_id')
// step 5
let finalPosts = posts.filter(post => !!comments.find(comment => comment.post_id === post.id))
// step 6
let finalProjects = projects.filter(project => !!posts.find(post => post.project_id === project.id))

// step 7, could also be run in parallel to step 2 potentially
let addresses = join(companies, select('addresses'), 'id', 'company_id')

// step 8
let finalCompanies = companies.filter(company => {
  return !!posts.find(post => post.company_id === company.id)
    && !!addresses.find(address => address.company_id === company.id)
})

These filters could probably be more optimized using indexes of some sort, but that is beside the point I think. This just demonstrates that there seem to be about 8 steps to find the companies we are looking for.

The main question is, how do you automatically figure out the steps from the SQL query?

I am not asking about how to parse the SQL query into an AST. Assume we have some sort of object structure we are dealing with, like an AST, to start.

How would you have to have the SQL query in structured object form, such that it would lead to these 8 steps? I would like to be able to specify a query (using a custom JSON-like syntax, not SQL), and then have it divide the query into these steps to divide and conquer so to speak and perform the queries in parts (for learning how to implement distributed databases). But I don't see how we go from SQL-like syntax, to 8 steps. Can you show how that might be done?

Here is the full code for the demo, which you can run with psql postgres -f test.sql. The result should be "Company 3".

Basically looking for a high level algorithm (doesn't even need to be code), which describes the key way you would break down some sort of AST-like object representation of a SQL query, into the actual planned steps of the query.

My algorithm looks like this in my head:

represent SQL query in object tree.
convert object tree to steps.

I am not really sure what (1) should be structured like, and even if we had some sort of structure, I'm not sure how to get that to complete (2). Looking for more details on the implementations of these steps, mainly step (2).

My "object structure" for step 1 would be something like this:

const query = {
  select: {
    distinct: true,
    columns: ['companies.name'],
    from: ['companies'],
  },
  joins: [
    {
      type: 'inner join',
      table: 'projects',
      left: 'projects.company_id',
      right: 'companies.id',
    },
    ...
  ],
  conditions: [
    {
      left: 'addresses.name',
      op: '=',
      right: 'Address Foo'
    },
    ...
  ]
}

I am not sure how useful that is, but it doesn't relate to steps at all. At a high level, what kind of code would I have to write to convert that object sort of structure into steps? Seems like one potential avenue is do a topological sort on the joins. But then you need to combine that with the select and conditions somehow, not sure how you would even begin to programmatically know what step should be before what other step, or even what the steps are. Maybe if I somehow could break it into known "chunks", then it would be simple to apply TOP sort to it after that, but then the question is still, how to get into chunks from the object structure / SQL?

Basically, I have been reading about the theory behind "query planning and optimization", but don't know how to apply it in this regard. How did this site do it?

One aspect is breaking at least the where conditions into CNF.

"*Here, the query is kind of unrealistic*" that query is quite realistic. SQL is *relational* so it is normal for a query to have many joins. SQL is optimized around that. A better question might be to show what sort of problem you're having with joins that lead you to want to write your own SQL server? — Schwern, Jan 24 '22 at 01:20
I meant it was unrealistic because of the `where` conditions testing for random strings lol. I want to implement a distributed peer-to-peer database, with support for SQL-like queries across browsers, iPhones, etc. devices, so I need to know how to _implement_ a SQL sort of relational database. This is just one small slice to the puzzle. — Lance, Jan 24 '22 at 01:30

Schwern · Accepted Answer · 2022-01-24T01:50:21.180

1

Implementing joins is a huge topic which is probably out of scope for a StackOverflow answer.

If you're looking for practical information about how joins are implemented, I would suggest...

The Join Operation section of Use The Index, Luke for different types of join implementation.
Section 7 of the The SQLite Query Optimizer Overview covers joins. And reading the SQLite source code. It is about as small a practical SQL implementation will get.
The output of explain in Postgresql gives very detailed information about how it has implemented the query. And they are explained in Operator Optimization Information

edited Jan 24 '22 at 01:50

answered Jan 24 '22 at 01:16

Schwern

153,029
25
195
336

You missed the point of my question. I want to know how to IMPLEMENT a SQL database... – Lance Jan 24 '22 at 01:22
1

@Lance I would recommend looking at the source code of one, perhaps SQLite. – Schwern Jan 24 '22 at 01:22
That is going to take weeks or months of dedicated effort to glean the high level algorithm I am asking for, reading through C or something. Hence this question... I don't have the skill or luxury of reading that much SQLite implementation. Even though I will probably do it over the next few _years_. – Lance Jan 24 '22 at 01:23
@Lance Relational databases are a huge topic. SQL databases have had 50 years to develop their art and do what they do so efficiently. I would recommend looking for a book on database design. I offer the SQLite code because it is small (for a SQL database) and practical. The SQLite documentation contains a lot of practical information about it's implementation. See the [Technical and Design Documentation](https://sqlite.org/docs.html). And pick up a book on database design. – Schwern Jan 24 '22 at 01:26
I have already read a book on database design, and it doesn't go into enough detail. I know it is a huge topic, which is why I broke it down into this extremely narrow slice of a question, focusing only on how to do a SQL -> relational algebra tree basically. That is an extremely thin slice, which someone who knows about database theory _as applied in practice_ probably would quickly know. For me though, it is still a mystery after reading books and papers. – Lance Jan 24 '22 at 01:28
2

@Lance If it was easy, the SQLite code would be smaller. :) I hope you get an answer. I would suggest scaling back your question and making it clear that this is a learning exercise. – Schwern Jan 24 '22 at 01:31
2

@Lance Looking at the code for [SQL.js](https://sql.js.org/) might be useful. It is an in-browser SQL database written in Javascript. It should be simple enough to comprehend. – Schwern Jan 24 '22 at 01:33
Oh cool, good ideal, that might help. – Lance Jan 24 '22 at 01:34
Nope, nevermind, "sql.js uses emscripten to compile SQLite to webassembly". It's just converting the SQLite codebase to webassembly :) – Lance Jan 24 '22 at 01:35
1

@Lance Section 7 of ["The SQLite Query Optimizer Overview"](https://sqlite.org/optoverview.html) covers joins. "*The current implementation of SQLite uses only loop joins. That is to say, joins are implemented as nested loops.*" – Schwern Jan 24 '22 at 01:40
@Lance I've added a bunch of resources to the answer off the top of my head. Understanding Postgres `explain` should provide what you're looking for. – Schwern Jan 24 '22 at 01:52
Okay will take a look. Meanwhile I will leave [this](https://github.com/lancejpollard/rel.js), which is where I'm headed. – Lance Jan 24 '22 at 01:57
@Lance looks cool, and apologies if my resources are too basic – Schwern Jan 24 '22 at 17:34

How to automatically break down a SQL-like query with many joins into discrete, independent steps?

1 Answers1