6

We are working on a system where a person can send a document to another person, a document can have multiple attachments, as given below.

Document {
       Set<Attachment> attachments;
}

If X sends a document (Doc1,Ver1) to Y and Y edits the document then we have to create a new version (Doc1,Ver2) of the document so that X sent box doesn't reflect the changes made by Y.

In our system there will be millions of documents where each document can have hundreds of attachments. Also a document can travel through n number of persons over a period of time.

We should be able to fetch any document version along with the list of attachments it had at that time, so I have to maintain versions of document, so the table structure that came to my mind immediately was the one below.

Document - id primary key, ver_id primary key
Attachment - id, doc_id foreign key, doc_ver_id foreign key

But the problem with this structure is that if X is sending a document to Y with 100 attachment and Y has made some minor modification then I have to create a new version along with copying all the attachments for the new version, most of which are same as the one in previous version, since we will be having millions of documents and each document will move through n number of persons this model will result in a very huge attachment table with lot of redundant data.

So we thought of an alternate structure for attachment, as below.

Document - id primary key, ver_id primary key
Attachment - id, doc_id, attached_ver_id, detached_version_id

But I couldn't create a hibernate entity for Document with this structure, so my question is, is there any other table structure which is better equipped to solve this problem with less redundancy and is it possible to create a hibernate mapping for the above table structure.

Steve Chambers
  • 37,270
  • 24
  • 156
  • 208
msmani
  • 730
  • 1
  • 7
  • 24
  • Why do you have a version_id in the document? And why is it listed as a primary key? Wouldn't it be sufficient to have a Document with an ID-primary-key, and then have an Attachment with an ID-primary-key and a version? This gives you two possibilities... 1. Your revision history is linear. You had as previous revision all Attachments with a smaller doc-id. 2. Your revision history is a tree. Then you can add an Attachment-FK in the Attachment table to the previous version of the Document. – Nathan Sep 14 '17 at 08:35
  • document can have multiple versions, that's why version_id is there as a primary key. – msmani Sep 14 '17 at 08:39
  • But doesn't every version of the document also result in an attachment? – Nathan Sep 14 '17 at 10:01
  • Yes that's the exact problem, I have put `version_id` to convey that there are multiple versions, I could've removed `version_id` for document and created a new document instead then I have to store `previous_document_id`. My question now is, when I create the new version of the document I have to clone all the attachments as well, this will result in a lot of duplicate entries in the attachment table, is there any way to overcome that, my second table structure solves this problem somewhat but I couldn't create a hibernate entity for that structure. – msmani Sep 14 '17 at 10:11

3 Answers3

2

Background

When data history needs to be kept there are usually two possible approaches:

Method #1: Cloning

When a new entry is created, its details are copied from the most recent existing entry.

Method #2: Deltas

The details for the very first entry are stored. Each subsequent entry stores the changes from the previous version.

Pros/Cons:

Method #1 is generally simpler and faster as the details for any record can be looked up directly without needing to build them up. But Method #2 uses less storage. (Probably worth noting that in my experience Method #1 has always been preferable because simplicity and speed of retrieval is usually more of an important consideration than storage).

What is being asked?

My understanding is you started off with Method #1 but now prefer Method #2.

Answer

But I couldn't create a hibernate entity for Document with this structure, so my question is, is there any other table structure which is better equipped to solve this problem with less redundancy and is it possible to create a hibernate mapping for the above table structure.

It should be perfectly possible to create entities for this database structure - each entity is of course just the Hibernate class representation of a database table. Would suggest adding a mapping table between Document Version and Attachment:

Document - id primary key, ver_id primary key, ...
Attachment - id primary key, ...
DocumentAttachmentDelta - doc_id, doc_ver_id, attachment_id, added_or_removed_flag

Here DocumentAttachmentDelta is an explicitly defined mapping table with @ManyToOne relationships linking to the primary keys identifying the document version and attachment. It has an additional boolean flag that specifies whether the attachment is being removed or added for this version of the document. So from above, for the first version of the document all its initial attachments would be added but for subsequent versions only the deltas would be stored, which could be additions or removals.

Entity details (following a request in the comments)

@Entity
class Document {
    /* ...other fields... */

    @OneToMany(cascade = CascadeType.ALL, mappedBy = "document", orphanRemoval = true)
    List<DocumentAttachmentDelta> documentAttachmentDeltas;
}

@Entity
class Attachment {
    /* ...other fields... */

    @OneToMany(cascade = CascadeType.ALL, mappedBy = "attachment", orphanRemoval = true)
    List<DocumentAttachmentDelta> documentAttachmentDeltas;
}

@Entity
class DocumentAttachmentDeltas {
    /* ...other fields... */

    @ManyToOne
    Document document;

    @ManyToOne
    Attachment attachment;
}
Steve Chambers
  • 37,270
  • 24
  • 156
  • 208
  • You are saying the DocAttachment table will contain only the diff from the previous version. So, If Doc (v1) has attachments a1,a2,a3 and in Doc (v2) a2 has been removed and a4 has been attached then there will be only two entries made for v2 in DocAttachment table, in that case if i want to fetch the final list of attachments for v2 then it should be a1,a3,a4. For fetching I can use custom loader but for delete I have to actually make a insert, I can try using @SqlDelete but since version number has to be updated and SqlDelete can't be parameterized, how to achieve it? – msmani Sep 11 '17 at 18:24
  • Are you sure you want to use `@SqlDelete`? Personally I'd keep it simple and just require records to be added with the removed flag set when removing attachments. – Steve Chambers Sep 12 '17 at 06:57
  • The structure of Document is that it contains a set of attachments, new record will be added to the set for attachment and a record will be removed from the set for detachment, that's why I was thinking of using @SqlDelete but if I have to do as you said then the structure of the Document will not contain set of attachments instead it will contain a single set of attachmentsAndDetachments, that makes the manipulation of the set as somewhat unnatural, or am I wrong? – msmani Sep 12 '17 at 07:50
  • The mapping table is describing deltas from the previous version and shouldn't be confused with the actual set of attachments (which has to be built up from multiple entity instances and can't be determined from a single instance). In other words, to find the actual attachments for a particular document version you'd need a service that queries to find the relevant entities and then processes them - it can't be done at the entity level. As mentioned, this complexity is one of the penalties of the "deltas" approach. – Steve Chambers Sep 12 '17 at 08:48
  • In that case the Document won't be a single entity managed by Hibernate instead now I have to persist Document and its Collection separately, right?. That was my concern, I am looking for way to do it as a single entity, i.e if say session.save(document) it should persist Document and all the attachments. If I don't find any solution for this then as you said I have to do it separately. – msmani Sep 12 '17 at 08:56
  • It's possible to persist the Document and its attachment deltas in a single operation: In the Document entity, include a reverse link (`@OneToMany`) to the DocumentAttachments and set `cascade = CascadeType.PERSIST` (or `ALL`) on it. – Steve Chambers Sep 12 '17 at 09:15
  • I am not able to visualize what kind of Document structure you are referring to, are you asking me to create a separate entity for DocumentAttachment (i.e attachment deltas) table or use it as a join table only, no separate entity, for loading actual Attachments and will the Document entity contain `Set` or `Set` – msmani Sep 12 '17 at 09:32
  • Yes - `DocumentAttachment` would need to be a separate entity. Actually possibly a better name is `DocumentAttachmentDelta` (to hopefully avoid the confusion above). I've changed this in the answer and added some brief details of what the entities might look like at the bottom. – Steve Chambers Sep 12 '17 at 10:29
1

If I really wanted to use deltas, I would use the following model.

@Entity
public class Document {

    @Id
    private String id;

    @Lob
    private byte [] firstVersion; 

    @OneToMany(mappedBy = "document")
    private final Set<Attachment> attachments = Sets.newHashSet();
}

Along with an attachment entity as follows

@Entity
@Table(uniqueConstraints = {
        @UniqueConstraint(columnNames = { "document_id", "version" })
})
@Check(constraints = "(delta is null and previous_version_id is null) or (delta is not null and previous_version_id is not null)")
public class Attachment {

    @Id
    private Long id;

    @Column(nullable = false, name = "version")
    private Long version;

    @Lob
    @Column(name = "delta")
    private byte [] delta;

    @JoinColumn(name = "document_id")
    @ManyToOne(optional = false)
    private Document document;

    @JoinColumn(name = "previous_version_id")
    @ManyToOne(optional = true)
    private Attachment previousVersion;

}

This way, you have the original version of the document as created by the user. Then, each attachment references the previous version of the document, reflecting the changes in the delta field. When a user sends a version of the document (which is really just an attachment) to a user, you add many-to-many between your person/user entity and the Attachment. In this way, the sent version of the document can be reconstructed, along with all of its predecessors.

Since there can only be one initial version of the document, I would consider the possibility of a partial unique constraint (partial unique index in Postgres, for example) to enforce that there is only one attachment per document that has no delta and no parent. This cannot be modeled in Hibernate, however.

The check-constraint enforces that the version with no parent also has no delta, since it is exactly the firstVersion of the document contained within the Document entity. In this model, the "version" field is not strictly necessary, but could be useful when you want to have a name for a particular version of the document and force it to be unique per document (see the unique-constraint in my annotation).

However, I would probably solve this problem as follows:

@Entity
public class Document {

    @Id
    private String id;

    @OneToMany(mappedBy = "document")
    private final Set<Attachment> attachments = Sets.newHashSet();

}

@Entity
@Table(uniqueConstraints = {
        @UniqueConstraint(columnNames = { "document_id", "version" })
})
public class Attachment {

    @Id
    private Long id;

    @Column(nullable = false, name = "version")
    private Long version;

    @Lob
    @Column(name = "content")
    private byte [] content;

    @JoinColumn(name = "document_id")
    @ManyToOne(optional = false)
    private Document document;

    @JoinColumn(name = "previous_version_id")
    @ManyToOne(optional = true)
    private Attachment previousVersion;

}

Where I would still want a partial-unique-index for the document when previous_version_id is null, ensuring that there is only one initial version per document.

With both os these solutions, you avoid any cloning of documents. With the first (using deltas) you save a bit on space, because you only ever store the complete document contents of the first version of each document. However, with the second version, you have much easier retrieval of any particular version.

Nathan
  • 1,576
  • 8
  • 18
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/154518/discussion-between-nathan-kummer-and-wolverine). – Nathan Sep 15 '17 at 09:14
0

Or you define the relationship between Document and Attachment as @ManyToMany on both sides and ensure with a callback on the Java side that all the Attachment's documents have the same id and only differ in ver_id.

fhossfel
  • 2,041
  • 16
  • 24