Saturday 10 March 2012

FEEE

It was like Christmas again: new blades arrived to add to our UAT and Production environments. And to start with, everything seemed rosy: twice the number of cores, to start with, and the installation of our server code went pretty smoothly.

Then I spun everything up, and watched all our processes die. No meaningful error messages, just some flurries of forlorn my-child-went-away pleas for help in our workflow manager logs.

Stumped, I wondered if I should be looking at the “socket forcibly closed” exceptions I could see that indicated my-child-went-away; but no. The child processes were just dying, and these errors were just an artefact. Later, the workflow manager eventually timed out the children, noting that they’d already finished with error code –254667849. Or something similar; the set varied.

I fired up eventvwr and there it was – a veritable storm of .NET framework errors in the application logs, with two characteristically repeated, one of which was:

.NET Runtime version 2.0.50727.3607 - Fatal Execution Engine Error (7A09795E) (80131506)

What followed turned out to be several hours of searching and getting rapidly downhearted. The whole of the internet seemed to have seen this very exception, and it seemed critically linked to either trying to run a process as a user that had no associated user profile on the box, or to vague “problems in the .NET 3.5 SP1 on Win 2k3 64bit boxes” that no-one – except one particularly determined individual and his team – seemed to get to the bottom of.

I checked: the user had an associated profile. I went to get some coffee.

I had to get a clearer picture of where our workers were dying. I installed the Debugging Tools for Windows and SysInternals suite on the box and then stopped.

The child processes were dying, but they were being spun up by a workflow manager. How was I going to get windbg to attach if I couldn’t spin up the process myself? They died immediately – there’s no way I’d be able to attach in time.

Luckily, Chris Oldwood knew the way. The gflags app in the Debugging Tools set let’s you configure a lot about the debug environment – and lets you even specify that for a given app (under the ImageMap tab) you should spin up a specific debugger – like c:\program files (x86)\debugging tools for windows (x86)\windbg.exe, for example.

It even worked! I pushed some work through the system and watched windbg attach, and let it run through to the exception. !analyze –v took an age – over 10 minutes – without actually giving me any information, so I used procdump to make a full dump, and analyzed that on my own machine.

procdump -ma TaskEngine.exe d:\some\temp\directory\TaskEngine.dmp

There, !analyze -v brought up two things. One, that it was dying while trying to make a SQL Server connection, and another that it might be having trouble associating a user context with the login

Trying with independent tools from that box also fail to connect to the database. The database was definitely up – I had to talk to it to push work through the system. The native client installed on the box is the same version as that installed on the existing servers.

Something’s clearly missing, and as yet, I don’t know what.

No comments: