
I have a client/server application where data is exchanged in XML format. The size of the data comes to around 50MB, most of which consists of the XML tags themselves. Is there a way to take the generated XML and index the node names as follows:

<User><Assessments><Assessment ID="1" Name="some name" /></Assessments></User>

to:

<A><B><C ID="1" Name="some name" /></B></A>

This would save an incredible amount of bloat.

EDIT
This data is serialized from Entity Framework objects. The reason for choosing XML as the protocol was its intrinsic support in .NET and the smart code generation of FromXml and ToXml methods for entities to circumvent circular references.

Raheel Khan
  • Doing that would make no sense unless you have a way to undo the compression again. The whole file would just be a complete mess with random letters instead of tags. – OptimusCrime Jun 14 '12 at 11:51
  • GZipStreaming the content looks much simpler. Changing XML node names changes the meaning of the XML content. – Steve B Jun 14 '12 at 11:53
  • Where is the bloat that you are worried about? Is it when the data is persisted to disk (zip it), when it is transmitted (zip it), or when it is being processed in memory? – paul Jun 14 '12 at 11:53
  • How do you communicate with the server? HTTP allows compression while WCF can use the NetTcpBinding to pass data in a binary format instead of XML. You can also use Json to pass data as a much smaller text. – Panagiotis Kanavos Jun 14 '12 at 11:54
  • Since you don't seem to care too much about the format, you could look at JSON or binary serialization instead? http://msdn.microsoft.com/en-us/library/bb738528(v=vs.90).aspx – StuartLC Jun 14 '12 at 11:55
  • You can always look at all nodes in the XML document and map those strings to IDs (like the letters suggested by yourself), in such a way that duplicate strings get the same ID. In the XML document, you can then replace the strings with the IDs. You'd have to transfer the mapping of IDs and strings along with the reduced XML document. However, using a more widespread and less custom-built compression method is probably much more effective, and using another representation of the data structures (e.g. JSON) reduces size while retaining the exact original document structure. – O. R. Mapper Jun 14 '12 at 11:56
  • Googling for [Binary XML](http://en.wikipedia.org/wiki/Binary_XML) gives some results, including [Fast Infoset](http://www.noemax.com/products/fastinfoset/index.html). This might or might not be what you are looking for. – Eugene Ryabtsev Jun 14 '12 at 12:01
  • Why exactly are you worried about trying to make a 50MB file smaller? You seem to be trying to solve the wrong problem which honestly isn't actually a problem. – Security Hound Jun 14 '12 at 12:33

5 Answers


What about just compressing/decompressing your data stream between the client and the server? This will be easier to implement and much less error-prone than doing some custom transformation on the XML data.
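
For example, something along these lines on both ends of the connection (a minimal sketch assuming .NET's built-in GZipStream; the helper names are just placeholders):

using System.IO;
using System.IO.Compression;
using System.Text;

static class XmlCompression
{
    // Compress an XML string to a byte array before sending it.
    public static byte[] Compress(string xml)
    {
        byte[] raw = Encoding.UTF8.GetBytes(xml);
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            return output.ToArray();
        }
    }

    // Decompress the received bytes back into the original XML string.
    public static string Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}

Since most of the 50MB is repeated tag names, gzip should shrink it dramatically without touching the document structure at all.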

mathieu

You could look at using Attributes for your data rather than Elements. For example, if you have "gender" as an attribute you will get:

<person gender="female">
  <firstname>Anna</firstname>
  <lastname>Smith</lastname>
</person>

whereas if it is an Element you will get:

<person>
  <gender>female</gender>
  <firstname>Anna</firstname>
  <lastname>Smith</lastname>
</person>

It's not strictly correct XML design, but it will achieve the size reduction you are after.
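
With XmlSerializer you can control this from the type itself; a rough sketch, using a hypothetical Person class rather than your actual entities:

using System.Xml.Serialization;

[XmlRoot("person")]
public class Person
{
    // Emitted as <person gender="female"> rather than a <gender> child element.
    [XmlAttribute("gender")]
    public string Gender { get; set; }

    [XmlElement("firstname")]
    public string FirstName { get; set; }

    [XmlElement("lastname")]
    public string LastName { get; set; }
}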

ChrisF

The point of XML is that you shouldn't need to compress/minimise the data. If you need to minimise what's going down the wire, then there's a good chance you're using the wrong protocol.

Obviously you can pass this through a gzip stream, which will give you a massive gain, but if you want to squeeze even more out of it than that, it may be worth looking at JSON or even a binary format.

XML was designed to be readable by humans, and by removing that readability you're essentially removing one of the major reasons to use XML in the first place.

John Mitchell

Alternatively, you can also consider JSON instead of XML, which takes up less space.
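
For example, a minimal sketch using the DataContractJsonSerializer that ships with .NET (Json.NET would work just as well); the helper names are just placeholders:

using System.IO;
using System.Runtime.Serialization.Json;
using System.Text;

static class JsonHelper
{
    // Serialize a serializable object graph to a JSON string.
    public static string ToJson<T>(T value)
    {
        var serializer = new DataContractJsonSerializer(typeof(T));
        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, value);
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }

    // Deserialize the JSON string back into the object graph.
    public static T FromJson<T>(string json)
    {
        var serializer = new DataContractJsonSerializer(typeof(T));
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(json)))
        {
            return (T)serializer.ReadObject(stream);
        }
    }
}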

Asif Mushtaq

I ended up writing a small class that renames the node names and creates a mapping element so the process can be reversed as well. That alone took the file size down from 50MB to 10MB.

Compressing the file would be the next step, but I wonder how much space I could save using binary serialization. I have not tried that before.
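
The idea, roughly, looks like this with LINQ to XML (a simplified sketch of the approach, not the exact class; the alias scheme and element names here are illustrative):

using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class XmlNameShrinker
{
    // Replace every element name with a short alias and append a <Map>
    // element so the original names can be restored on the other side.
    public static XDocument Shrink(XDocument source)
    {
        var map = new Dictionary<string, string>();

        foreach (var element in source.Root.DescendantsAndSelf().ToList())
        {
            string original = element.Name.LocalName;
            string alias;
            if (!map.TryGetValue(original, out alias))
            {
                alias = "N" + map.Count;   // N0, N1, N2, ...
                map[original] = alias;
            }
            element.Name = alias;
        }

        // Ship the alias -> original-name pairs along with the data.
        source.Root.Add(new XElement("Map",
            map.Select(p => new XElement("E",
                new XAttribute("a", p.Value),
                new XAttribute("n", p.Key)))));

        return source;
    }

    // Reverse the process using the embedded <Map> element.
    public static XDocument Expand(XDocument shrunk)
    {
        var mapElement = shrunk.Root.Element("Map");
        var map = mapElement.Elements("E")
            .ToDictionary(e => (string)e.Attribute("a"),
                          e => (string)e.Attribute("n"));
        mapElement.Remove();

        foreach (var element in shrunk.Root.DescendantsAndSelf().ToList())
        {
            string original;
            if (map.TryGetValue(element.Name.LocalName, out original))
                element.Name = original;
        }

        return shrunk;
    }
}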

Raheel Khan