0

I would like to remove duplicate entries from a queue in an efficient way. The queue holds instances of a custom class with a DateTime, a FullPath, and a few other fields.

private Queue<MyCustomClass> SharedQueue;

The DateTime in the class is the timestamp of when the item was inserted into the queue. The logic I would like to use is as follows: remove duplicates from the queue if the FullPath is identical within a 4-second window (i.e. if an item is added to the queue within 4 seconds of another item with the same FullPath). I have the events that I want to watch, but a few duplicates will still arrive, and that is OK.

I am using C# 2.0 with the FileSystemWatcher class and a worker queue.

There are a bunch of ways to do this: trim the queue each time an item is added to it, or skip the current item while working through the queue if it is a duplicate.

Or should I use a 'global private' variable of type Dictionary<string, DateTime> so I can quickly search it? Or a local copy of the queue? Perhaps it is best to limit the local queue to 100 items in case of many file events? In my case there 'should be' only relatively few files to monitor in a folder... but things always change...

Thanks for any help.

:Edit: Feb 10 8:54 EST: So I decided to implement a good simple solution as far as I can tell. I don't think I am holding on to the Dict keys too long...

:Edit: Feb 10 9:53 EST: Updated as my Dictionary cannot contain duplicate values.

public void QueueInput(HotSynchUnit.RcdFSWFile rcd)
// start the worker thread when the program starts.
// call Terminate.Set() in the program's exit routine or close handler etc.
{
  // lock shared queue
  lock (SharedQueue)
  {
    if (!IsDuplicateQueueInput(rcd))  // only add unique values to queue
    {
      SharedQueue.Enqueue(rcd);
      SomethingToDo.Set();
    }
  }
} // public void QueueInput

private bool IsDuplicateQueueInput(HotSynchUnit.RcdFSWFile rcd)
/* Return true if the object is a duplicate object.
 * Pseudo Code:
 * 
 * isDuplicate = false
 * Lock Dictionary
 * -If lastTimeStamp > 4 seconds ago then       // Optimization: save lastTimeStamp
 *    if Dict.Count > 0 then clear Dictionary
 *    return isDuplicate
 * -If not Dict.TryGetValue(sPath, dtTimeStamp) then
 *    Dict.AddKey()
 * -Else
 *    Compare key timestamp to Currenttime
 *    if key timestamp is <= 4 seconds ago then
 *       IsDuplicate = True
 *
 *    Dict.RemoveKey()
 *    Dict.AddKey()
 * 
 * return isDuplicate
*/
{
  // put real code here
}
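
One way the pseudocode above might be turned into real code is sketched below. It is only a sketch under stated assumptions: the class name `DuplicateFilter`, the `now` parameter (passed in so the logic can be exercised without waiting), and the case-insensitive path comparison are all hypothetical and not part of the original post. It also deviates from the pseudocode in one place: after clearing stale entries it still records the current path, so that the next event for the same path is recognised.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: remembers when each path was last seen.
public class DuplicateFilter
{
    private readonly TimeSpan window;
    private readonly object sync = new object();
    private DateTime lastTimeStamp = DateTime.MinValue;

    // Assumption: paths are compared case-insensitively (typical on Windows).
    private readonly Dictionary<string, DateTime> lastSeen =
        new Dictionary<string, DateTime>(StringComparer.OrdinalIgnoreCase);

    public DuplicateFilter(TimeSpan window)
    {
        this.window = window;
    }

    // Returns true when fullPath was already seen within the window.
    public bool IsDuplicate(string fullPath, DateTime now)
    {
        lock (sync)
        {
            // Optimization from the pseudocode: if nothing at all arrived
            // within the window, every stored entry is stale.
            if (now - lastTimeStamp > window)
            {
                lastSeen.Clear();
            }
            lastTimeStamp = now;

            DateTime previous;
            bool isDuplicate = lastSeen.TryGetValue(fullPath, out previous)
                               && (now - previous) <= window;

            // The indexer adds or overwrites, so the remove/add pair
            // from the pseudocode collapses into one statement.
            lastSeen[fullPath] = now;
            return isDuplicate;
        }
    }
}
```

Inside QueueInput this could be called as `filter.IsDuplicate(rcd.FullPath, DateTime.Now)`, assuming the record exposes its path under that name. Because the timestamp is refreshed on every hit, the window slides: a path that keeps firing every 3 seconds would stay suppressed until it goes quiet for more than 4 seconds.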
user610064

4 Answers

1

I just thought about using any collection similar to a generic hashtable... Something like this:

Dictionary<string, YourClass> dict = new Dictionary<string, YourClass>();

/// let's just assume you want to add/check for "c:\demo.txt"

if (!dict.ContainsKey(@"c:\demo.txt"))
{
   /// add items to dict by passing fullPath as key and your object as value
   dict.Add(@"c:\demo.txt", obj1);
}
else if (dict[@"c:\demo.txt"].CheckForInterval(obj1))
{
   /// replace the current object in the dictionary with the new object - in case you want to..
   /// or just do whatever you want to
}

edit - your custom class may have some functionality like this:

class YOURCUSTOMCLASS
{
    private DateTime creationTime;

    public DateTime CreationTime
    { get { return creationTime; } }

    public YOURCUSTOMCLASS(parametersGoesHere xyz)
    {
          creationTime = DateTime.Now;
    }

    /// in this case this method will return true
    /// if the time span between this object and otherObject
    /// is greater than 4 seconds
    public bool CheckForInterval(YOURCUSTOMCLASS otherObject)
    {
         TimeSpan diff = otherObject.CreationTime.Subtract(creationTime);

         /// you may replace 4 with any other value, or even better use
         /// a const/global var/static ...
         return diff.TotalSeconds > 4;
    }

    /// all the other stuff you need ...
}

Of course you will lose the functionality of a queue, but lookups will be much faster if your queue contains many elements.

hth

Pilgerstorfer Franz
  • you don't need to check for the existence of the key. You can just remove that check and make dict.add be dict[key] = "demo.txt" and you will result in the same thing but just a little cleaner. – phillip Feb 09 '11 at 17:58
  • This is what I am basically doing with the dictionary already but locking the dictionary and limiting it to n rows. I am interested in the CheckForInterval() and how you would do that. – user610064 Feb 09 '11 at 18:17
  • @philip - of course you are right - in the original question there was an additional task: an existing object may only be added/replaced once the given interval has timed out. @user610064 I'll have a further look into it and respond in a few minutes – Pilgerstorfer Franz Feb 09 '11 at 18:57
0

I would make a subclass:

class MyDeduplicatedQueue : Queue<MyCustomObject> {
    /// etc
}

Then you can put all the appropriate filtering logic into the Enqueue method.
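
A minimal sketch of such a subclass follows, with one caveat that is worth knowing: `Queue<T>.Enqueue` is not virtual, so a subclass can only hide it with `new`, and callers that hold the object as a plain `Queue<T>` bypass the filter. The `MyCustomObject` fields, the 4-second constant, and the case-insensitive comparison are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the asker's custom class.
public class MyCustomObject
{
    public readonly string FullPath;
    public readonly DateTime TimeStamp;

    public MyCustomObject(string fullPath, DateTime timeStamp)
    {
        FullPath = fullPath;
        TimeStamp = timeStamp;
    }
}

public class MyDeduplicatedQueue : Queue<MyCustomObject>
{
    private static readonly TimeSpan Window = TimeSpan.FromSeconds(4);

    // Hides (does not override) Queue<T>.Enqueue, which is non-virtual.
    public new void Enqueue(MyCustomObject item)
    {
        // Queue<T> enumerates from head to tail, so this scans every element.
        foreach (MyCustomObject queued in this)
        {
            bool samePath = string.Equals(queued.FullPath, item.FullPath,
                                          StringComparison.OrdinalIgnoreCase);
            if (samePath && item.TimeStamp - queued.TimeStamp <= Window)
            {
                return; // duplicate within the window: drop it
            }
        }
        base.Enqueue(item);
    }
}
```

Only items still sitting in the queue are compared, so a duplicate that arrives after its twin has already been dequeued is let through.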

recursive
0

I would make a wrapper class and not extend from queue, as users of the base type Queue expect different behavior. (Data Contracts in .NET 4.0 might even complain when you do so.)

Internally you can have an actual queue to which you redirect the required calls. On every Enqueue() call you could add the new element to a Dictionary when it is not already contained. Before doing so, you could remove all elements that are older than x seconds from this dictionary and add them to the inner queue in order.

When dequeuing, you will have to check whether the inner queue contains elements, and otherwise pick the earliest element from the dictionary.

This of course is just one possible implementation. When a lot of different elements might get queued quickly, the dictionary will fill up quickly and additional logic might have to be added to resolve that.
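
The staging scheme described above could be sketched roughly as follows. All names here (`StagedItem`, `DeduplicatingQueue`, `pendingOrder`) are hypothetical, and the sketch adds a second small queue next to the dictionary because `Dictionary<TKey, TValue>` does not guarantee enumeration order. It is not thread-safe; the caller is assumed to lock around it as QueueInput already does.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the asker's custom class.
public class StagedItem
{
    public readonly string FullPath;
    public readonly DateTime TimeStamp;

    public StagedItem(string fullPath, DateTime timeStamp)
    {
        FullPath = fullPath;
        TimeStamp = timeStamp;
    }
}

// Wraps a Queue<T> instead of deriving from it.
public class DeduplicatingQueue
{
    private readonly TimeSpan window;
    private readonly Queue<StagedItem> inner = new Queue<StagedItem>();
    // Recent items: the dictionary gives fast lookup, the queue keeps arrival order.
    private readonly Queue<StagedItem> pendingOrder = new Queue<StagedItem>();
    private readonly Dictionary<string, StagedItem> pending =
        new Dictionary<string, StagedItem>(StringComparer.OrdinalIgnoreCase);

    public DeduplicatingQueue(TimeSpan window)
    {
        this.window = window;
    }

    public int Count
    {
        get { return inner.Count + pendingOrder.Count; }
    }

    // Returns false when the item was dropped as a duplicate.
    public bool Enqueue(StagedItem item)
    {
        // Move everything older than the window into the inner queue, in order.
        while (pendingOrder.Count > 0 &&
               item.TimeStamp - pendingOrder.Peek().TimeStamp > window)
        {
            StagedItem old = pendingOrder.Dequeue();
            pending.Remove(old.FullPath);
            inner.Enqueue(old);
        }

        if (pending.ContainsKey(item.FullPath))
        {
            return false;
        }
        pending.Add(item.FullPath, item);
        pendingOrder.Enqueue(item);
        return true;
    }

    // Throws InvalidOperationException when both stages are empty.
    public StagedItem Dequeue()
    {
        if (inner.Count > 0)
        {
            return inner.Dequeue();
        }
        StagedItem oldest = pendingOrder.Dequeue();
        pending.Remove(oldest.FullPath);
        return oldest;
    }
}
```

One consequence of this design: when the worker drains the queue quickly, an item leaves the pending stage before its window has elapsed, so a duplicate arriving shortly afterwards is accepted. Whether that matters depends on how many stray duplicates are tolerable.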

Steven Jeuris
  • public void QueueInput(HotSynchUnit.RcdFSWFile rcd) // start the worker thread when program starts. // call Terminate.Set() in the programs exit routine or close handler etc. { // lock shared queue lock (SharedQueue) { SharedQueue.Enqueue(rcd); SomethingToDo.Set(); } } // public void QueueInput – user610064 Feb 09 '11 at 17:43
  • Yes, seems so. Like I said, different implementations are possible. When using a local queue instead of a Dictionary you trade the speed of the lookup upon adding for the speed of cleaning up added elements older than x seconds. You could also use both, to trade memory usage for speed in both cases. :) My advice is: first write it the easy way, and when it's not performant enough, update the solution accordingly. – Steven Jeuris Feb 09 '11 at 17:48
  • OK thanks. For now I'll use the Dictionary in the QueueInput routine, just before calling Enqueue. I think I prefer using a Dictionary for simplicity instead of synching queues. I'll just limit the Dictionary count to 1000 rows or something? – user610064 Feb 09 '11 at 17:52
  • Processors nowadays are very performant. ;p 1000 elements shouldn't pose too much of a problem. – Steven Jeuris Feb 09 '11 at 17:54
0

Why not just reject inserts if they have duplicate paths? All you have to do is a linear search starting from the tail of the queue, stopping when you either find a duplicate (and reject the insert) or reach a timestamp outside your time limit (and insert the record). Seems a lot simpler than keeping another data structure around and all the associated logic.
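
A rough sketch of that tail-first scan is shown below. `Queue<T>` only enumerates from head to tail, so this sketch swaps in a `LinkedList<T>`, which can be walked backwards from `Last`; that substitution, along with the names `FileEvent` and `TailScanQueue` and the case-insensitive comparison, is an assumption for illustration rather than something stated in the answer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the asker's custom class.
public class FileEvent
{
    public readonly string FullPath;
    public readonly DateTime TimeStamp;

    public FileEvent(string fullPath, DateTime timeStamp)
    {
        FullPath = fullPath;
        TimeStamp = timeStamp;
    }
}

public class TailScanQueue
{
    private readonly LinkedList<FileEvent> items = new LinkedList<FileEvent>();
    private readonly TimeSpan window;

    public TailScanQueue(TimeSpan window)
    {
        this.window = window;
    }

    public int Count
    {
        get { return items.Count; }
    }

    // Returns false when a duplicate was found within the window.
    public bool TryEnqueue(FileEvent item)
    {
        // Walk from the newest entry towards the oldest.
        for (LinkedListNode<FileEvent> node = items.Last; node != null; node = node.Previous)
        {
            // Entries are in arrival order, so once one is too old, all earlier ones are too.
            if (item.TimeStamp - node.Value.TimeStamp > window)
            {
                break;
            }
            if (string.Equals(node.Value.FullPath, item.FullPath,
                              StringComparison.OrdinalIgnoreCase))
            {
                return false;
            }
        }
        items.AddLast(item);
        return true;
    }

    public FileEvent Dequeue()
    {
        if (items.Count == 0)
        {
            throw new InvalidOperationException("Queue is empty.");
        }
        FileEvent first = items.First.Value;
        items.RemoveFirst();
        return first;
    }
}
```

The scan only touches entries from the last 4 seconds, so its cost depends on the burst rate rather than on the total queue length. As with any approach that inspects only the queue itself, items that have already been dequeued are no longer visible to the check.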

TMN