Does string exist check 20k times

Question

I have an ITunes library XML file backup file - about 15 MB.

I have 20K music files on my C drive and about 25K files on E drive under exactly similar folder structures.

I am traversing the first location and going file by file and checking if the file exiss in the second location. That part works for me.

Now, for all such duplicate files, if the file path from E drive exists in the XML, but the C drive path does not exist in the XML, then I want to delete the file from the C drive.

What is my best way of checking if a string exists in the XML file (I have to do this atleast 20K times)?

Do you need to only verify that each string exists once, or do you need to count how many times each occurs? — Chris Laplante, Sep 10 '10 at 20:04
How many times must you do it? Once? Regularly? Should it be fast? 15MB isn't that much these days. — Kobi, Sep 10 '10 at 20:05
When you say "best way", what does "best" mean? Have you tried loading them into a `HashSet`, and if so what was wrong with doing that? — ChrisW, Sep 10 '10 at 20:07
Are you checking full words or the complete contents of nodes only? Or do you need to be able to find substrings? In the first case you might want to consider parsing the XML file into a hash set or a trie data structure which both allow efficient checking whether a specific entry exists. — Dirk Vollmar, Sep 10 '10 at 20:10
You're all itching to provide a solution once he answers these questions, aren't you? :) — Cᴏʀʏ, Sep 10 '10 at 20:13
@SimpleCoder just checking if string exists - count not required. — Raj More, Sep 10 '10 at 20:17

Chris Laplante · Answer 1 · 2010-09-10T20:23:01.810

Depending on if you want to count how many times a string occurs, or if you are merely checking for the existence of the strings, your approach will be slightly different. But, these are the two ways I would consider doing it:

If you want to do it with minimal memory:

Load the file line by line (or, if your XML is not formatted like this, node by node using an XML parser... I believe there are XML parsers that can do this). Do a search operation on the line for each string. No more than one line/node will be in memory at a time, if you correctly overwrite the last line. The downside to this is that it will take longer and the file will be open longer.

If you want to do it fast:

Load the entire file into memory, don't bother parsing it, and just search for each string.

EDIT

Based on your clarifications, I would first collect all duplicate file names in an array, and then proceed to scan each line of the XML file using my first method (above). If you are already storing 20K file names in memory, I would hesitate to load the entire 15MB XML at the same time.

-1 for claiming that searching the whole of memory, repeatedly, for each string will be "fast". — ChrisW, Sep 10 '10 at 20:28
@ChrisW: -1 to you for poor reading comprehension. I said, in my Edit, to load each node/line one by one and scan for each string in the line. — Chris Laplante, Sep 10 '10 at 20:30

score 2 · Answer 2 · answered Sep 10 '10 at 20:14

2

A suggestion: load as text, use a regular expression to extract the desired strings (I suppose they are enclosed with a specific tag) and build a hash list with them. You can use the list to check the existence.

answered Sep 10 '10 at 20:14

Eduardo Mauro

1,515
1
26
38

1

-1 for using regular expression instead of XML API to extract strings. – ChrisW Sep 10 '10 at 20:27
1

@ChrisW: The question does not impose the use of XML API. Also, the question was edited after my answer. In the original question I was said that was not necessary to read as XML. So I don't agree with yours -1 for my answer. – Eduardo Mauro Sep 10 '10 at 20:37
It's his having an XML file as input that suggests using one of the built-in XML APIs. – ChrisW Sep 10 '10 at 20:43
1

@ChrisW: As I said, in the original question he said it was not necessary to read as a XML file. – Eduardo Mauro Sep 10 '10 at 20:48

score 2 · Answer 3 · answered Sep 10 '10 at 21:43

Here is a simple solution using Linq. Runs sufficiently fast enough for one-time use:

using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;

class ITunesChecker
{
    static void Main(string[] args)
    {
        // retrieve file names
        string baseFolder = @"E:\My Music\";
        string[] filesM4a = Directory.GetFiles(baseFolder, "*.m4a", SearchOption.AllDirectories);
        string[] filesMp3 = Directory.GetFiles(baseFolder, "*.mp3", SearchOption.AllDirectories);
        string[] files = new string[filesM4a.Length + filesMp3.Length];
        Array.Copy(filesM4a, 0, files, 0, filesM4a.Length);
        Array.Copy(filesMp3, 0, files, filesM4a.Length, filesMp3.Length);

        // convert to the format used by iTunes
        for (int i = 0; i < files.Length; i++)
        {
            Uri uri = null;
            if (Uri.TryCreate(files[i], UriKind.Absolute, out uri))
            {
                files[i] = uri.AbsoluteUri.Replace("file:///", "file://localhost/");
            }
        }

        // read the files from iTunes library.xml
        XDocument library = XDocument.Load(@"E:\My Music\iTunes\iTunes Music Library.xml");
        var q = from node in library.Document.Descendants("string")
                where node.ElementsBeforeSelf("key").Where(n => n.Parent == node.Parent).Last().Value == "Location"
                select node.Value;

        // do the set operations you are interested in
        var missingInLibrary = files.Except(q, StringComparer.InvariantCultureIgnoreCase);
        var missingInFileSystem = q.Except(files, StringComparer.InvariantCultureIgnoreCase);
        var presentInBoth = files.Intersect(q, StringComparer.InvariantCultureIgnoreCase);
    }
}

Jimmy Hoffa · Accepted Answer · 2010-09-10T20:30:57.830

Alphabetically sort your list of strings that you're matching on, then build an index array which tells you where the start of your list is for each character that is a starting character for one of the strings, maybe indexing to the second character depending on the breadth of variety and if your match is case sensitive or not.

Read the file character by character with a stream to minimize memory footprint, checking into the index array to see where that character starts and ends in the list of strings so you can pull out that characters page, if there's anything starting with those character combinations. Then continue filtering inside of the page until you have one match left and the next character makes matches 0.

Remove that string from the list of strings to match, put it in another list if you want. Then start checking your index on the next character and continue doing so each time you run into no matches.

The index gives you a more efficient aggregate to minimize number of items iterated against.

This could give you a two character depth index:

Dictionary<string,int> stringIndex = new Dictionary<char,int>();
for(int i = 0; i < sortedSearchStrings.Length; i++;)
{
    if (!stringIndex.Keys.Contains(sortedSearchStrings[i][0])) stringIndex[sortedSearchStrings[i][0]] = i;
    if (!stringIndex.Keys.Contains(sortedSearchStrings[i][0] + sortedSearchStrings[i][1])) stringIndex[sortedSearchStrings[i][0] + sortedSearchStrings[i][1]] = i;
}

Then to find the starting index in your list you just access:

int startOfCurrentCharPage = stringIndex[string.Format("{0}{1}", lastChar, currentChar)];

-1 for taking the time to write/use a non-standard container without proof that that's needed. — ChrisW, Sep 10 '10 at 20:27
@ChrisW: Seriously? -1 for giving a creative answer? Sorry I didn't just say to load it all into memory.. Nice job giving -1 to every answer that isn't yours.. — Jimmy Hoffa, Sep 10 '10 at 20:36
It's creative but I disagree that it's good. Good includes simplest, least maintenance, least effort/expense, least debugging, good-enough performance, understandable to the nest guy, etc. — ChrisW, Sep 10 '10 at 20:38
@ChrisW: I don't think you downvoted the other answers because they were wrong. They may not represent the best solutions in your mind, but it's the asker's opinion that counts. I think you downvoted them because they are not your answer. — Chris Laplante, Sep 10 '10 at 20:43
FYI the accepted answer to [The answer to tactical downvoting problem?](http://meta.stackexchange.com/questions/22771/the-answer-to-tactical-downvoting-problem/22776#22776), from a moderator, says that you can flag this "instance of possible tactical downvoting" for moderator attention (if you want to). — ChrisW, Sep 10 '10 at 21:02

score 1 · Answer 5 · answered Sep 10 '10 at 20:57

Would it be possible to work directly out of the xml document and skip the first step?

If so, you could just make use of Xml.XmlDocument, and from there, Xml.XmlNode.SelectNodes( string ), using xpath to navigate the document. I don't know what kind of information is present in the document, but the way you have the second stage worded gives the idea that at times both the path on C:\ and the path on E:\ are present? If so, it would be as simple as two IO.File.Exists checks and then a IO.File.Delete( ).

What I mean to say is that rather than searching your XML document N times for a string, do your search through the document and delete duplicates as you go so that you only run through the document once.

I don't use iTunes or have one of its library backups on hand to say if it could work or not, though.

ChrisW · Answer 6 · 2010-09-10T20:29:44.917

0

Read every string from the XML and write them into a HashSet<string>. When you want to lookup a string, look it up in the HashSet. The cost will be O(n) to read the XML, and O(n) to do the n lookups fro the HashSet. Don't try to search repeatedly in the XML (instead do your 20,000 searches in the HashSet), because the XML isn't indexed/optimized for searching.

edited Sep 10 '10 at 20:29

answered Sep 10 '10 at 20:24

ChrisW

54,973
13
116
224

Does string exist check 20k times

6 Answers6