21

I have been managing Subversion as an engineering document storage repository for my company. It is working fairly well; however, I have a question about how MS Office 2007 formats are (or should be) handled by Subversion.

I'm looking at an Excel 2007 spreadsheet (extension .xlsx) in my working copy to which Subversion has applied the svn:mime-type property application/octet-stream. This means that Subversion is treating it as binary, right?

I was hoping that the new MS Office document formats would be stored efficiently by Subversion. My understanding is that a full copy of a binary file will be made on every commit of that file, whereas if the file is text, a small change to the file will result in a small amount of additional data being added to the repository (in a typical situation at least).

I don't understand much of the details of XML, but I thought that an XML file was text, and that it would therefore be efficiently stored by Subversion.

Is it possible to configure Subversion so that MS Office OpenXML documents are stored efficiently?

Follow-up (2009-11-09): I've found that Office documents can be stored as plain text using the Office 2003 XML document formats (Excel: XML Spreadsheet 2003; Word: Word XML Document). There is a warning about loss of formatting, but I have yet to encounter any noticeable loss of formatting.

RjOllos
  • see also: http://stackoverflow.com/questions/1320654/will-subversion-efficiently-store-openxml-office-documents – Dirk Vollmar Jun 15 '11 at 07:52
  • 4
    @0xA3 Are you applying [recursion](http://stackoverflow.com/questions/1320654/will-subversion-efficiently-store-openxml-office-documents#comment17301597_1320654)? – Tobias Kienzler Oct 09 '12 at 08:43
  • Note that "Word XML Document" is not the 2003 XML file format -- that is the 2007 Open XML flat package format -- it would be impossible to lose data when saving as this format as it can do everything a .docx can do. With the Excel 2003 format, on the other hand, you do run the risk of losing data if a feature either doesn't exist in 2003 or doesn't exist in the 2003 XML format. – BrainSlugs83 Jun 13 '13 at 03:55

4 Answers

28

From the OpenXML article on Wikipedia:

An Office Open XML file is a ZIP-compatible OPC package containing XML documents and other resources.

In other words, OpenXML files are actually zip files with XML files in them. Compression or encryption "scrambles" the data, sabotaging Subversion's ability to generate small deltas between revisions. This is not related to the svn:mime-type property. Subversion considers all files to be binary when generating deltas.
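The "zip file with XML files in it" structure can be checked with a few lines of Python. This is a minimal sketch using only the standard `zipfile` module. The helper names (`build_sample_package`, `list_parts`, `read_part`) are hypothetical, and the sample archive only imitates the layout of a package; it is not a valid .docx.

```python
import io
import zipfile


def build_sample_package() -> bytes:
    """Build a tiny ZIP archive laid out like an OPC package.

    Hypothetical sample data for illustration only; Word would not open it.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<document><p>Hello</p></document>")
    return buffer.getvalue()


def list_parts(package_bytes: bytes) -> list[str]:
    """Return the names of the parts (ZIP members) inside the package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.namelist()


def read_part(package_bytes: bytes, name: str) -> str:
    """Decompress one part and return it as text (assumes UTF-8 XML)."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.read(name).decode("utf-8")
```

To try this on a real document, read the .docx or .xlsx file as bytes and pass it to `list_parts`. The individual parts are readable XML once extracted, even though the container itself looks binary in a text editor.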

In Dutch we have a saying: "measuring is knowing". The graph below shows the results of an experiment where I imported a 500K OpenXML document into an SVN 1.6 repository (revision 1). I then added a paragraph from another document, saved and committed. This was repeated 5 times (revisions 2 to 6).

As you can see, committing a new docx revision that just adds a paragraph will cost you about 150K disk space. This is still much more efficient than just storing a copy of each revision without the help of a version control system.

I also repeated the experiment with a separate test repository by uncompressing each revision of the docx. As you can see, the storage of the document revisions would be much more efficient if the document weren't compressed. It's also interesting to see that Subversion's own data compression is about as efficient as zip. Storing the first revision of an uncompressed docx in Subversion takes about the same space as the original docx.
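How each revision was uncompressed is not stated above. One way to approximate it, sketched here as an assumption rather than as the method used in the experiment, is to rewrite the package so that every member is stored without compression. The function name `repack_stored` is hypothetical, and whether Office accepts a package repacked this way has not been tested here.

```python
import io
import zipfile


def repack_stored(package_bytes: bytes) -> bytes:
    """Rewrite a ZIP-based package with every member stored uncompressed.

    Only the member names and contents are carried over; timestamps and
    other per-member metadata are not preserved in this sketch.
    """
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as source:
        with zipfile.ZipFile(output, "w", compression=zipfile.ZIP_STORED) as target:
            for info in source.infolist():
                # Decompress the member and write it back verbatim.
                target.writestr(
                    info.filename,
                    source.read(info.filename),
                    compress_type=zipfile.ZIP_STORED,
                )
    return output.getvalue()
```

With stored members, the XML bytes sit verbatim inside the archive, so a small edit to the document leaves most of the file unchanged and a binary delta has something to match against. Extracting the package into a directory and committing the individual XML files would be another option.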

YMMV.

Wim Coenen
  • 1
    Nice experiment! In Word 2007, if I choose Save As.. -> Other Formats, one of the options is Word XML Document (*.xml). This option saves the file as an XML document that can be viewed in WordPad. Word XML format appears to be different than uncompressing the DOCX OPC package. Anyone have input on the pros / cons of using Word XML format? I will repeat wcoenen's experiment with documents in the Word XML format, just to be sure. – RjOllos Aug 24 '09 at 18:47
  • 1
    From my experience in the past few weeks of working with OpenXML packages, the key difference is that .docx can store arbitrary (read: OLE) or OpenXML Package (read: other .docx & .xlsx) data within the container. You will not have this ability with WordprocessingML alone. – technomalogical Dec 14 '09 at 21:31
  • @technomalogical That is not correct. The Open XML Flat Package Format can store binary parts perfectly fine -- they are serialized as base64. Take a look for yourself. Everything a .docx package can do, an OPC can do as well. It may be the Word 2003 WordprocessingML format that you're thinking of (the two are not the same). – BrainSlugs83 Jun 13 '13 at 03:58
  • Maybe it's not phrased well, but that's exactly what I was saying. DOCX can store binary data, WordprocessingML alone cannot. – technomalogical Sep 03 '13 at 17:36
9

Subversion handles binary files quite well. It does not store a full copy for every commit but only an efficient binary diff.

See the FAQ about this.

Stefan
  • After also reading the response from wcoenen, it makes me wonder if Office 2003 documents would be stored more efficiently by Subversion. Wcoenen's data shows that a duplicate (or nearly a duplicate) of the data file is being made (hypothesized to be because of the data compression). Since Subversion handles an ordinary binary file fairly well, it would be interesting to repeat wcoenen's experiment with Office 2003 format documents, which I will try to do. – RjOllos Aug 24 '09 at 19:23
3

Sadly, you can't currently do this with Subversion, but there has been some discussion around this:

http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=651443

Jesse Weigert
  • +1 for the helpful link. Note that the discussion explains that binary diffs are used, but indicates the deltas may potentially be quite large. The main thing you lose is the ability to easily track changes between versions. – ire_and_curses Aug 24 '09 at 06:41
  • TortoiseSVN does diffs on Word and Excel files quite well. As of Office 2007, PowerPoint diffs are no longer supported, however. – RjOllos Jan 07 '10 at 18:28
-2

Have you ever tried to open an OpenXML file in a text editor?

To make it short: it is not text, it is still binary. So no, you can’t make Subversion handle it any differently.

Bombe
  • This answer is actually not very helpful because it does not clarify RjOllos' confusion why a document called "XML" should be binary... – chiccodoro Apr 16 '10 at 08:46
  • 1
    This is misleading: the XML files ARE text - the problem is that a .docx file is actually a zip archive of the XML files (and other stuff). – André Chalella Aug 28 '12 at 08:56