I'm trying to build a Wikipedia-like application where multiple people can edit the content. People with sufficient privileges can also revert changes. I don't want reversions to be lossy (i.e. actually dropping the edits people made; they should only be hidden), so this seems to require a git-branch-like data structure that stores the edits, plus a pointer to the "current" version.

I tried this design:

CREATE TABLE article (
    id serial PRIMARY KEY,
    content text  NOT NULL,
    author integer NOT NULL REFERENCES "user",
    path text NOT NULL,
    relationship ltree NOT NULL
);

CREATE TABLE current_article (
    article_id integer NOT NULL REFERENCES article
);

Where relationship records whether the row is a new article or an edit of an existing one:

id | content | path | author | relationship
---+---------+------+--------+-------------
1  | foo     | /a1  | 1      | 'root'
2  | bar     | /a1  | 2      | 'root.1'
3  | baz     | /a2  | 3      | 'root'

Here it means author 2 changed article /a1 from foo to bar, and article /a2 is a new one.

current_article records which revision is the "current" one for each path; usually it points to the newest one. After a reversion, it can point to an older one:

article_id
----------
2
3
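
A reversion would then just repoint the row, e.g. (illustrative only, assuming one current_article row per path):

UPDATE current_article SET article_id = 1 WHERE article_id = 2;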

When an edit comes in, I insert it like this:

INSERT INTO article (content, path, author) VALUES ('qux', '/a2', 4);

Then I rely on a before insert trigger to find the current article for that path and fill in relationship, and an after insert trigger to update the current-article pointer.
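
Roughly, I imagine the triggers looking something like this (a sketch only; the function and trigger names are made up, and it assumes relationship is built from article ids as in the table above):

-- BEFORE INSERT: derive relationship from the current revision of the same path
CREATE FUNCTION article_fill_relationship() RETURNS trigger AS $$
DECLARE
    parent article%ROWTYPE;
BEGIN
    SELECT a.* INTO parent
    FROM current_article c
    JOIN article a ON a.id = c.article_id
    WHERE a.path = NEW.path;

    IF FOUND THEN
        NEW.relationship := parent.relationship || parent.id::text;
    ELSE
        NEW.relationship := 'root'::ltree;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER article_before_insert
    BEFORE INSERT ON article
    FOR EACH ROW EXECUTE PROCEDURE article_fill_relationship();

-- AFTER INSERT: repoint current_article at the new revision
CREATE FUNCTION article_update_current() RETURNS trigger AS $$
BEGIN
    DELETE FROM current_article c
    USING article a
    WHERE c.article_id = a.id AND a.path = NEW.path;

    INSERT INTO current_article (article_id) VALUES (NEW.id);
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER article_after_insert
    AFTER INSERT ON article
    FOR EACH ROW EXECUTE PROCEDURE article_update_current();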

What do you think of this design? The problem I'm having with it is the difficulty of dealing with concurrency.

In the before insert trigger, by the time it finds the current article, it may already have changed; and the after insert trigger might incorrectly overwrite a current-article pointer that already points to a different revision.

I have three questions in this regard:

  1. Will serializable isolation solve the problem? (I'm pretty new to the concept of MVCC, still trying to wrap my head around it.) If not, how should I solve it?
  2. Is there a better design that doesn't have to deal with concurrency?
  3. If I do need to deal with concurrency, how can I unit test my design in different race conditions (or are such unit tests even necessary)?

Thank you.

hgl

1 Answer

Concurrency happens at two levels: the application and the database.

At the application level, several users may have edit sessions that overlap. At some point one user will save their version, then later the next one will save too, but there doesn't seem to be a way in your current design to figure out from which version the latest edit is branching: this information appears nowhere in the INSERT you mention.
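
For example, the INSERT could carry the id of the revision the edit branched from (a sketch; the parent_id column is not part of your schema and is purely illustrative):

ALTER TABLE article ADD COLUMN parent_id integer REFERENCES article;

-- a client that was editing revision 3 of /a2 would then save with:
INSERT INTO article (content, path, author, parent_id)
VALUES ('qux', '/a2', 4, 3);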

The concurrency at the database level is a different problem: it concerns transactions that are running at the same time.

If you were trying to solve the application-level concurrency with the database's concurrency primitives, you would have to keep transactions open until a user has finished editing, which means for arbitrarily long times, and that's a non-starter in database design.

First you need to figure out an application-level design strategy to deal with concurrent edits; then you'll need a database strategy to deal with concurrent transactions, that is, when people hit "Save" at the same time and the transactions updating the data run in parallel. These are completely different things.


Concerning concurrent transactions, one generic way to avoid trouble is to lock the article at the beginning of the writing transaction, before doing anything else, so that any other transaction trying to do the same is blocked until the concurrent change is committed (or rolled back). It's the simplest way to serialize updates, but it assumes there is something to lock with enough granularity that updates to other articles are not blocked at the same time.

Ideally there should be an article table with only one row per path (independently of the revisions, which would be stored in a different table). Then locking that row with SELECT ... FOR UPDATE would be sufficient to guarantee that the set of queries doing the branching or whatever complex update can work without being bothered by a concurrent change to the same article.
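
A sketch of what that could look like (table and column names here are illustrative, not the original design):

-- one row per article; revisions live in a separate table
CREATE TABLE article (
    path text PRIMARY KEY,
    current_revision integer
);

CREATE TABLE revision (
    id serial PRIMARY KEY,
    path text NOT NULL REFERENCES article,
    content text NOT NULL,
    author integer NOT NULL,
    relationship ltree NOT NULL
);

-- saving an edit: lock the article row first, then branch and repoint
BEGIN;

SELECT * FROM article WHERE path = '/a2' FOR UPDATE;

WITH new_rev AS (
    INSERT INTO revision (path, content, author, relationship)
    VALUES ('/a2', 'qux', 4, 'root.3')
    RETURNING id
)
UPDATE article SET current_revision = (SELECT id FROM new_rev)
WHERE path = '/a2';

COMMIT;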

Another (rougher) approach is to use the serializable isolation level and retry any transaction that fails with an SQLSTATE indicating a serialization failure.
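
In practice that means running each save in a serializable transaction and having the application retry when it fails with SQLSTATE 40001 (serialization_failure):

BEGIN ISOLATION LEVEL SERIALIZABLE;

-- insert the new revision and move the current pointer here

COMMIT;

-- if any statement (including the COMMIT) fails with SQLSTATE 40001,
-- roll back and rerun the whole transaction from the application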

Daniel Vérité
  • You’re correct, an edit doesn’t know which version it’s based off. Clients should also send the article id they’re based on; I will update that. However, I don’t quite understand how application sessions are relevant. Two users can edit at the same time; if one saves first, the other will be notified about the conflict when she saves. So shouldn’t I only have to deal with database concurrency? Which is what I’m having difficulty with. – hgl Sep 30 '17 at 23:50
  • @hgl: edited to add a bit more about the database concurrency aspect. – Daniel Vérité Oct 02 '17 at 16:00