I have to parse an XML file with a large number of string values. For example:
<value>Foo</value>
<value>Bar</value>
<value>Baz</value>
<value>Foo</value>
Some of the values are identical, and there are several distinct recurring strings, not just one as in the example above. I would like to detect these duplicates and link them with XLink: give one instance of each recurring string an id (it doesn't have to be the first instance, and I can use UUIDs) and make the remaining instances refer to it, like this:
<value id="D5494447-A010-4F81-9DDA-E5DFFBD616FF">Foo</value>
<value>Bar</value>
<value>Baz</value>
<value href="#D5494447-A010-4F81-9DDA-E5DFFBD616FF"/>
I am new to XLink, so perhaps the above doesn't make sense. If it is not possible, an alternative would be to build a dictionary of the recurring values:
{'D5494447-A010-4F81-9DDA-E5DFFBD616FF' : 'Foo'}
And then somehow put them into the XML. What is the simplest way to achieve either of these? I don't need the most efficient method, as long as it is correct and simple to implement: I am a Python beginner, not a computer scientist, and computational complexity is not a concern. Parsing and writing XML is not the problem either (I figured that out with lxml), so the question is only about detecting the recurring strings and linking them.
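To make the goal concrete, here is a rough sketch of the behaviour I have in mind; it is only an illustration, not a proposed solution. It makes several assumptions: the values sit under a wrapping `<root>` element, the function name `link_recurring_values` is made up, plain `id` and `href` attributes are used as in my example (not namespaced `xlink:href`), and it uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra packages.

```python
import uuid
import xml.etree.ElementTree as ET
from collections import Counter


def link_recurring_values(root, tag="value"):
    """Give the first instance of each recurring string an id and turn
    later instances into empty elements whose href points at that id.

    Hypothetical helper for illustration; returns {id: string}.
    """
    elements = list(root.iter(tag))
    # Count how often each non-empty string occurs.
    counts = Counter(el.text for el in elements if el.text)
    ids = {}  # recurring string -> generated id
    for el in elements:
        text = el.text
        if not text or counts[text] < 2:
            continue  # unique or empty value: leave untouched
        if text not in ids:
            # First occurrence: keep the text and attach a fresh UUID.
            ids[text] = str(uuid.uuid4()).upper()
            el.set("id", ids[text])
        else:
            # Later occurrence: drop the text and reference the first one.
            el.text = None
            el.set("href", "#" + ids[text])
    return {identifier: text for text, identifier in ids.items()}


# Assumed input: the example values wrapped in a <root> element.
xml = (
    "<root><value>Foo</value><value>Bar</value>"
    "<value>Baz</value><value>Foo</value></root>"
)
root = ET.fromstring(xml)
mapping = link_recurring_values(root)
print(ET.tostring(root, encoding="unicode"))
print(mapping)
```

The returned dictionary has the same shape as the one shown above (id mapped to string), so it would also cover the second approach. I don't know whether this is a reasonable way to do it, which is really what I am asking.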