In my Azure role startup code I instantiate a DCOM object to ensure that it can be instantiated and then immediately release it since I don't really need it at that moment.

I do that in a separate thread that creates an instance of the corresponding C# RCW class with new, and the main thread calls Thread.Join() on that thread with a 30-second timeout. If the thread is still running after Thread.Join() returns, the DCOM object is taking suspiciously long to create, so Thread.Abort() is called and the role restarts. 30 seconds should be enough: the object is lightweight and doesn't do anything time-consuming on instantiation.
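For reference, the probe described above could look roughly like the sketch below. This is only an illustration of the pattern, not the actual role code: ComProbe, TryCreate and the createObject delegate are hypothetical names, and the delegate stands in for whatever news the real RCW class. The sketch reports a timeout to the caller instead of calling Thread.Abort(), on the assumption that the caller then decides whether to throw and let the role restart.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

static class ComProbe
{
    // Runs createObject on a worker thread and waits up to `timeout` for it.
    // Returns true only if creation finished in time and did not throw.
    public static bool TryCreate(Func<object> createObject, TimeSpan timeout)
    {
        bool succeeded = false;

        var worker = new Thread(() =>
        {
            try
            {
                object instance = createObject();

                // The probe only checks that creation works, so release the RCW right away.
                // IsComObject guards against the delegate returning a plain managed object.
                if (instance != null && Marshal.IsComObject(instance))
                {
                    Marshal.ReleaseComObject(instance);
                }

                succeeded = true;
            }
            catch (Exception)
            {
                // A failed creation is reported as "false" rather than crashing the process.
            }
        });

        // Background thread: a stuck creation call will not keep the process alive.
        worker.IsBackground = true;
        worker.Start();

        // Join returns false if the worker is still running when the timeout expires.
        bool finished = worker.Join(timeout);
        return finished && succeeded;
    }
}
```

A call site might then be `if (!ComProbe.TryCreate(() => new MyDcomClass(), TimeSpan.FromSeconds(30))) throw new TimeoutException("DCOM object creation timed out");`, where MyDcomClass is a placeholder for the real RCW type.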

That code worked just fine until I tried to scale up my service dramatically. I asked support to lift the compute cores quota and tried to scale to 100 (one hundred) instances.

Most of the instances started fine, but some of them hit exactly the situation described above: the DCOM object creation took too long, so the code threw an exception, which caused the role to restart.

I repeated the test several times. Whenever I scale up by several dozen instances, the problem reproduces in some of the newly started instances. Since all the instances are uniform, I have no idea what might be causing this behavior.

What might be the reason for the DCOM object to take so long in some instances only?

sharptooth
  • You might have to share more, but this certainly sounds like a race condition (or some other kind of timing bug). The more instances you run, the greater chance you have of hitting the race condition in your code. I wouldn't expect that it's about "some instances" and rather is about "some times I execute this code." – user94559 Jun 19 '12 at 17:53
  • @smarx: Looks like it's a general "everything is sluggish" condition; I've added an answer. – sharptooth Jun 26 '12 at 09:34

1 Answer

My research so far shows that when I scale up by a large number of instances, some of them are rather sluggish right after startup, especially for IO-bound operations. I assume this is because the host (an 8-core hardware server) where the VM is running is doing something heavy, so there's serious competition for IO. Under these conditions, instantiating a DCOM object that typically takes about 1 second can take up to 40 seconds, so my timeout simply needs to be increased.
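One way to pick the new timeout from data rather than by guessing is to time the creation on every instance start and log the result. The sketch below is just that idea under assumptions: StartupTiming and Measure are hypothetical names, and the operation delegate stands in for the code that creates and releases the DCOM object.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

static class StartupTiming
{
    // Runs `operation` on a thread-pool thread and waits up to `timeout`.
    // Returns the elapsed time if it finished in time, or null if it was still running.
    // If the operation throws, Task.Wait rethrows it wrapped in an AggregateException,
    // so a genuine failure is not mistaken for slowness.
    public static TimeSpan? Measure(Action operation, TimeSpan timeout)
    {
        var stopwatch = Stopwatch.StartNew();
        Task task = Task.Run(operation);

        if (!task.Wait(timeout))
        {
            return null;
        }

        return stopwatch.Elapsed;
    }
}
```

Logging the returned value (for example with Trace.TraceInformation) across a large scale-up would show how far the slow instances really stray from the usual 1 second, and the timeout can then be set with some margin above the worst observed case.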

sharptooth