Some months ago I posted a method to overcome some quirks in the MDW database in a clustered environment.
Today I tried to upgrade that clustered instance (in a test environment, fortunately) and I got some really annoying errors.
Actually, I got what I deserved for messing with the system databases and I wouldn’t even dare posting my experience if it wasn’t cause by something I suggested on this blog.
However, every cloud has a silver lining: many things can go wrong when upgrading a cluster and the resolution I will describe here can fit many failed upgrade situations.
So, what’s wrong with the solution I proposed back in march?
The offending line in that code is the following:
EXEC sp_rename 'core.source_info_internal', 'source_info_internal_ms'
What happened here? Basically, I renamed a table in the MDW and I created a view in its place.
One of the steps of the setup process tries to upgrade the MDW database with a script, that is executed at the first startup on an upgraded cluster node.
The script fails and the following message is found in the ERRORLOG:
Creating table [core].[source_info_internal]... 2013-10-24 09:21:02.99 spid8s Error: 2714, Severity: 16, State: 6. 2013-10-24 09:21:02.99 spid8s There is already an object named 'source_info_internal' in the database. 2013-10-24 09:21:02.99 spid8s Error: 912, Severity: 21, State: 2. 2013-10-24 09:21:02.99 spid8s Script level upgrade for database 'master' failed because upgrade step 'upgrade_ucp_cmdw.sql' encountered error 3602, state 51, severity 25. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the 'master' database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion. 2013-10-24 09:21:02.99 spid8s Error: 3417, Severity: 21, State: 3. 2013-10-24 09:21:02.99 spid8s Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.
Every attempt to bring the SQL Server resource online results in a similar error message.
The only way to fix the error is to restore the initial state of the MDW database, renaming the table core.source_info_internal_ms to its original name.
But how can it be done, since the instance refuses to start?
Microsoft added and documented trace flag 902, that can be used to bypass the upgrade scripts at instance startup.
Remember that the startup parameters of a clustered instance cannot be modified while the resource is offline, because the registry checkpointing mechanism will restore the registry values stored in the quorum disk while bringing the resource online.
There are three ways to start the instance in this situation:
- modify the startup parameters by disabling checkpointing
- modify the startup parameters in the quorum registry hives
- start the instance manually at the command prompt
Method N.3 is the simplest one in this situation and is what I ended up doing.
Once the instance started, I renamed the view core.source_info_internal and renamed core.source_info_internal_ms to its original name.
The instance could then be stopped (CTRL+C at the command prompt or SHUTDOWN WITH NOWAIT in sqlcmd) and restarted removing the trace flag.
With the MDW in its correct state, the upgrade scripts completed without errors and the clustered instance could be upgraded to SQL Server 2012 without issues.
- Never, ever mess with the system databases. The MDW is not technically a system database, but it’s shipped by Microsoft and should be kept untouched. If you decide you absolutely need to modify something there, remember to undo your changes before upgrading and applying service packs.
- Always test your environment before upgrading. It took me 1 hour to fix the issue and not every upgrade scenario tolerates 1 hour of downtime. Think about it.
- Test your upgrade.
- Did I mention you need to test?
Jokes aside, I caught my error in a test environment and I’m happy it was not in production.
As the saying goes, better safe than sorry.
Yesterday I ran into this error message while installing a new SQL Server 2005 instance on a Windows 2003 cluster:
Setup failed to start on the remote machine. Check the Task scheduler event log on the remote machine.
SQL Server 2005 setup, in a clustered environment, relies on a remote setup process started on the passive nodes through a scheduled task:
For some weird reason, the remote scheduled task refuses to start if there is an active RDP session on the passive nodes. This KB article describes symptoms and resolution in detail: http://support.microsoft.com/kb/910851/en-us
If you look into the Scheduled Tasks log you will find something similar to this:
sql server Task did not appear to start on machine <machine_name> 267015
Needless to say that one of my co-workers was logged on the desktop of one of the passive nodes. So, if you need another good reason to lock everyone out of your servers’ desktop, here it is.
After resetting the offending session, remember to log on to all the passive nodes and kill the zombie setup processes you may have left. Also, delete the scheduled tasks from each node.