
I need to design a Windows application that will reside within an organization's intranet. The application will be deployed on a user's machine, and the user will generate output as an XML file with a predefined schema. This XML is written out to a networked folder that is accessible to other users. The files are named userid_output.xml, where the "userid" is pulled from the application environment. While using the application, a user should be able to search all the XMLs generated by the universe of users up to that point; the information retrieved will drive how the user shapes his/her application input. A very firm requirement is not to use any RDBMS (Oracle/SQL Server/MySQL, et al.) to store the XML. The shared network folder is "THE REPOSITORY" and is used only for storing the XMLs. The machine hosting the shared folder may not run any services that could assist with indexing the XMLs or optimizing the data for search purposes.

Given these limitations, does anybody know of any design techniques/tools/mechanisms to perform fast information retrieval from this "dataset"?
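For reference, the brute-force baseline implied by these constraints (parse every file on every search) might look like the following Python sketch. The repository path and the `<entry status="...">` schema are invented for illustration; the real schema is whatever the application's predefined schema is:

```python
# Brute-force baseline: scan every userid_output.xml in the shared folder.
# The REPOSITORY path and the <entry status="..."> layout are illustrative
# assumptions, not part of the original question.
import xml.etree.ElementTree as ET
from pathlib import Path

REPOSITORY = Path(r"\\srv1\repository")  # hypothetical UNC path

def search(repository, xpath):
    """Parse each *_output.xml and yield (userid, element) matches."""
    for f in Path(repository).glob("*_output.xml"):
        userid = f.name[:-len("_output.xml")]
        try:
            root = ET.parse(f).getroot()
        except ET.ParseError:
            continue  # skip files that are mid-write or corrupt
        for elem in root.iterfind(xpath):
            yield userid, elem
```

Every answer below effectively has to beat (or accept) this: with no services on the file server, all parsing happens client-side over the network.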

Thanks

sc_ray
  • Those sound like some pretty awful requirements. Are we to understand that the point is to parse your way through untold XML files on a disk to perform a search in the fastest possible way, without any indexing at all? – StriplingWarrior Jul 26 '10 at 15:10
  • If I were to be hobbled to such an extent, I would say goodbye, explaining that I can't do my job without the tools of the trade. – Oded Jul 26 '10 at 15:12
  • Another approach I considered was to selectively extract the XML on the client side and perform the search in memory, but if >2000 users are outputting GBytes worth of data every day, this approach will fail pretty fast. – sc_ray Jul 26 '10 at 15:20
  • So you can't even store indexing files in the repository? – Kevin Gale Jul 26 '10 at 15:24
  • Nope. The requirement states nothing besides the XMLs produced by the application. How would you generate separate indexing files for the XMLs though? – sc_ray Jul 26 '10 at 15:27
  • You can index XML docs using XPaths. They have to be parsed and an index file built. Still, for the volume you are talking about, even if you could build index files it doesn't sound workable. I think you are doomed by the requirements if you are really talking about 2000 users and GBytes of data. – Kevin Gale Jul 26 '10 at 15:30
  • Also, even XPaths are not a huge speed-up, since using an XPath still requires parsing the XML yet again. They can speed things up and help find which file to parse, but they are not like a database index. – Kevin Gale Jul 26 '10 at 15:33
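The index-file idea discussed in the comments doesn't have to violate the "XMLs only in the repository" rule if the index lives on each client machine. A rough sketch, with the `id` attribute chosen as an arbitrary example key (the real keys would come from the predefined schema):

```python
# Client-side index sketch: the shared folder still holds only XMLs;
# each client keeps its own index file locally and refreshes it
# incrementally by file mtime. Schema and key choice are assumptions.
import json, os
import xml.etree.ElementTree as ET
from pathlib import Path

def refresh_index(repository, index_path):
    """Map each id attribute value -> list of files containing it.
    Only re-parses files whose mtime changed since the last run."""
    try:
        index = json.loads(Path(index_path).read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        index = {"mtimes": {}, "ids": {}}
    for f in Path(repository).glob("*_output.xml"):
        mtime = os.path.getmtime(f)
        if index["mtimes"].get(f.name) == mtime:
            continue  # unchanged since the last index run
        # drop stale entries for this file, then re-add
        for ids in index["ids"].values():
            if f.name in ids:
                ids.remove(f.name)
        for elem in ET.parse(f).getroot().iter():
            id_ = elem.get("id")
            if id_ is not None:
                files = index["ids"].setdefault(id_, [])
                if f.name not in files:
                    files.append(f.name)
        index["mtimes"][f.name] = mtime
    Path(index_path).write_text(json.dumps(index))
    return index
```

This only amortizes the parsing cost per client; as Kevin Gale notes, at 2000 users and GBytes/day even the incremental refresh may not keep up.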

1 Answer


You could use XQuery. The collection() function allows you to query a directory of XML files.

Here's an example using Saxon (I'm not sure whether other implementations use the same URI syntax):

collection("file:///C:/sample_xml?select=*.xml;")

This would select all of the *.xml files in the C:\sample_xml directory.

You could also narrow down the results by using XPath:

collection("file:///file://///srv1/dir1/sample_xml?select=*.xml;")/doc/sample1[@id='someID']

This would return only the sample1 elements that had an attribute id that was equal to someID.
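For comparison (and for the Linq2Xml question in the comments), the same filter can be run from the client without an XQuery engine. This Python/ElementTree sketch mirrors the answer's `/doc/sample1[@id='someID']` example; the directory layout is assumed to match:

```python
# Rough non-XQuery equivalent of the answer's second example:
# collection(...)/doc/sample1[@id='someID'], evaluated file by file.
import xml.etree.ElementTree as ET
from pathlib import Path

def sample1_with_id(directory, some_id):
    """Return all <sample1> children of <doc> whose id attribute matches."""
    matches = []
    for f in Path(directory).glob("*.xml"):
        root = ET.parse(f).getroot()  # assumes the document element is <doc>
        matches.extend(root.iterfind(f"sample1[@id='{some_id}']"))
    return matches
```

Note that either way, every file still has to be fetched and parsed; the predicate only trims what is returned, not what is read.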

Daniel Haley
  • Thanks. I have no prior experience using XQuery but in your snippet above are you using collection() to form an in-memory representation of the xml files in the C:\sample_xml directory which is stored on the client's machine? What happens if we have 7000 sample.xml files and are only interested in the value of the tag where the attribute id is equal to "someId"? How does XQuery help with returning that subset in an optimized manner without imposing a tremendous amount of overhead? – sc_ray Jul 26 '10 at 18:21
  • How does XQuery differ from something like Linq2Xml? – sc_ray Jul 26 '10 at 18:22
  • @sc_ray - Sorry, I have no experience with Linq2Xml. I will add another example to my answer to show what I would do to narrow down the results. – Daniel Haley Jul 26 '10 at 19:36
  • I also used a UNC path in the second example to show how I would access the network directory. – Daniel Haley Jul 26 '10 at 19:46
  • Thanks. But is XQuery doing the heavy lifting on the network folder itself, or does it "select, transfer, and then process a massive amount of data"? I was reading something along these lines in the following post: http://stackoverflow.com/questions/214060/using-xquery-in-linq-to-sql – sc_ray Jul 27 '10 at 13:01
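On that last question: since the file server runs no services, there is nothing on the folder side to do the heavy lifting; every byte of every selected XML crosses the network to the client that runs the query. The most one can do locally is stream-parse instead of building full in-memory trees. A sketch with `iterparse` (tag/attribute names are invented for illustration):

```python
# All parsing is client-side; iterparse at least streams each file
# instead of holding whole documents in memory at once.
import xml.etree.ElementTree as ET
from pathlib import Path

def stream_matches(directory, tag, attr, value):
    """Yield (filename, attributes) for matching elements, constant-ish memory."""
    for f in Path(directory).glob("*.xml"):
        for event, elem in ET.iterparse(f):  # default: "end" events only
            if elem.tag == tag and elem.get(attr) == value:
                yield f.name, elem.attrib.copy()  # copy before clearing
            elem.clear()  # free the subtree as soon as it's processed
```

This bounds memory, not I/O: the network transfer cost the linked post describes remains in full.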