I recently had a deployment issue where a push of new code caused a worker role to continually restart – everything worked locally, but the thing just wouldn’t stay up in the cloud.
The ability to remote into an instance is invaluable for diagnosing this sort of thing, especially when the role is falling over before it even reaches your start-up code. In my case, the Application event log was filling with error entries at a rate of four per second, all tied back to the Windows Azure Caching Client installer. That didn’t make any sense – that component hadn’t changed in months, and with so many log entries it was hard to tell what was actually happening.
However, the Windows Azure event log under Applications and Services Logs was much more helpful. It showed that the role was restarting due to a version conflict in the New Relic monitoring agent – nothing to do with the Caching Client installer. My guess is that the installer was kicked off each time the role started, so when the role was terminated its child installer process was killed along with it, producing that flood of Application log entries.
Regardless, it set me down the right path: fixing the dodgy reference and redeploying, which brought the instance back to a stable state.