Some months ago I posted a method to overcome some quirks in the MDW database in a clustered environment.
Today I tried to upgrade that clustered instance (in a test environment, fortunately) and I got some really annoying errors.
Actually, I got what I deserved for messing with the system databases and I wouldn’t even dare posting my experience if it wasn’t cause by something I suggested on this blog.
However, every cloud has a silver lining: many things can go wrong when upgrading a cluster and the resolution I will describe here can fit many failed upgrade situations.
So, what’s wrong with the solution I proposed back in march?
The offending line in that code is the following:
EXEC sp_rename 'core.source_info_internal', 'source_info_internal_ms'
What happened here? Basically, I renamed a table in the MDW and I created a view in its place.
One of the steps of the setup process tries to upgrade the MDW database with a script, that is executed at the first startup on an upgraded cluster node.
The script fails and the following message is found in the ERRORLOG:
Creating table [core].[source_info_internal]... 2013-10-24 09:21:02.99 spid8s Error: 2714, Severity: 16, State: 6. 2013-10-24 09:21:02.99 spid8s There is already an object named 'source_info_internal' in the database. 2013-10-24 09:21:02.99 spid8s Error: 912, Severity: 21, State: 2. 2013-10-24 09:21:02.99 spid8s Script level upgrade for database 'master' failed because upgrade step 'upgrade_ucp_cmdw.sql' encountered error 3602, state 51, severity 25. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the 'master' database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion. 2013-10-24 09:21:02.99 spid8s Error: 3417, Severity: 21, State: 3. 2013-10-24 09:21:02.99 spid8s Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.
Every attempt to bring the SQL Server resource online results in a similar error message.
The only way to fix the error is to restore the initial state of the MDW database, renaming the table core.source_info_internal_ms to its original name.
But how can it be done, since the instance refuses to start?
Microsoft added and documented trace flag 902, that can be used to bypass the upgrade scripts at instance startup.
Remember that the startup parameters of a clustered instance cannot be modified while the resource is offline, because the registry checkpointing mechanism will restore the registry values stored in the quorum disk while bringing the resource online.
There are three ways to start the instance in this situation:
- modify the startup parameters by disabling checkpointing
- modify the startup parameters in the quorum registry hives
- start the instance manually at the command prompt
Method N.3 is the simplest one in this situation and is what I ended up doing.
Once the instance started, I renamed the view core.source_info_internal and renamed core.source_info_internal_ms to its original name.
The instance could then be stopped (CTRL+C at the command prompt or SHUTDOWN WITH NOWAIT in sqlcmd) and restarted removing the trace flag.
With the MDW in its correct state, the upgrade scripts completed without errors and the clustered instance could be upgraded to SQL Server 2012 without issues.
- Never, ever mess with the system databases. The MDW is not technically a system database, but it’s shipped by Microsoft and should be kept untouched. If you decide you absolutely need to modify something there, remember to undo your changes before upgrading and applying service packs.
- Always test your environment before upgrading. It took me 1 hour to fix the issue and not every upgrade scenario tolerates 1 hour of downtime. Think about it.
- Test your upgrade.
- Did I mention you need to test?
Jokes aside, I caught my error in a test environment and I’m happy it was not in production.
As the saying goes, better safe than sorry.
Recently I had to assess and tune quite a lot of SQL Server instances and one the things that are often overlooked is the location of the system databases.
I often see instance where the system databases are located in the system drives under the SQL Server default installation path, which is bad for many reasons, especially for tempdb.
I had to move the system databases so many times that I ended up coding a script to automate the process.
The script finds all system databases that are not sitting in the default data and log paths and issues the ALTER DATABASE statements needed to move the files to the default paths.
Obviously, to let the script work, the default data and log paths must have been set in the instance properties:
You may also point out that moving all system databases to the default data and log paths is not always a good idea. And you would be right: for instance, if possible, the tempdb database should be working on a fast dedicated disk. However, very often I find myself dealing with low-end servers where separate data and log disks are a luxury, not to mention a dedicated tempdb disk. If you are concerned about moving tempd to the default data and log paths, you can modify the script accordingly.
-- ============================================= -- Author: Gianluca Sartori - spaghettidba -- Create date: 2013-03-22 -- Description: Moves the system databases to the -- default data and log paths and -- updates SQL Server startup params -- accordingly. -- ============================================= SET NOCOUNT ON; USE master; -- Find default data and log paths -- reading from the registry DECLARE @defaultDataLocation nvarchar(4000) DECLARE @defaultLogLocation nvarchar(4000) EXEC master.dbo.xp_instance_regread N'HKEY_LOCAL_MACHINE', N'Software\Microsoft\MSSQLServer\MSSQLServer', N'DefaultData', @defaultDataLocation OUTPUT EXEC master.dbo.xp_instance_regread N'HKEY_LOCAL_MACHINE', N'Software\Microsoft\MSSQLServer\MSSQLServer', N'DefaultLog', @defaultLogLocation OUTPUT -- Loop through all system databases -- and move to the default data and log paths DECLARE @sql nvarchar(max) DECLARE stmts CURSOR STATIC LOCAL FORWARD_ONLY FOR SELECT ' ALTER DATABASE '+ DB_NAME(database_id) + ' MODIFY FILE ( ' + ' NAME = '''+ name +''', ' + ' FILENAME = '''+ CASE type_desc WHEN 'ROWS' THEN @defaultDataLocation ELSE @defaultLogLocation END + '\'+ RIGHT(physical_name,CHARINDEX('\',REVERSE(physical_name),1)-1) +'''' + ' )' FROM sys.master_files WHERE DB_NAME(database_id) IN ('master','model','msdb','tempdb') AND ( physical_name NOT LIKE @defaultDataLocation + '%' OR physical_name NOT LIKE @defaultLogLocation + '%' ) OPEN stmts FETCH NEXT FROM stmts INTO @sql WHILE @@FETCH_STATUS = 0 BEGIN PRINT @sql EXEC(@sql) FETCH NEXT FROM stmts INTO @sql END CLOSE stmts DEALLOCATE stmts -- Update SQL Server startup parameters -- to reflect the new master data and log -- files locations DECLARE @val nvarchar(500) DECLARE @key nvarchar(100) DECLARE @regvalues TABLE ( parameter nvarchar(100), value nvarchar(500) ) INSERT @regvalues EXEC master.dbo.xp_instance_regenumvalues N'HKEY_LOCAL_MACHINE', N'SOFTWARE\Microsoft\MSSQLServer\MSSQLServer\Parameters' DECLARE reg CURSOR STATIC LOCAL FORWARD_ONLY FOR SELECT * FROM @regvalues WHERE value LIKE '-d%' OR value LIKE '-l%' OPEN reg FETCH NEXT FROM reg INTO @key, @val WHILE @@FETCH_STATUS = 0 BEGIN IF @val LIKE '-d%' SET @val = '-d' + ( SELECT physical_name FROM sys.master_files WHERE DB_NAME(database_id) = 'master' AND type_desc = 'ROWS' ) IF @val LIKE '-l%' SET @val = '-l' + ( SELECT physical_name FROM sys.master_files WHERE DB_NAME(database_id) = 'master' AND type_desc = 'LOG' ) EXEC master.dbo.xp_instance_regwrite N'HKEY_LOCAL_MACHINE', N'SOFTWARE\Microsoft\MSSQLServer\MSSQLServer\Parameters', @key, N'REG_SZ', @val FETCH NEXT FROM reg INTO @key, @val END CLOSE reg DEALLOCATE reg
After running this script, you can shut down the SQL Server service and move the data and log files to the appropriate locations.
When the files are ready, you can bring SQL Server back online.
BE CAREFUL! Before running this script against a clustered instance, check what the xp_instance_regread commands return: I have seen cases with SQL Server not reading from the appropriate keys.
Yesterday I ran into this error message while installing a new SQL Server 2005 instance on a Windows 2003 cluster:
Setup failed to start on the remote machine. Check the Task scheduler event log on the remote machine.
SQL Server 2005 setup, in a clustered environment, relies on a remote setup process started on the passive nodes through a scheduled task:
For some weird reason, the remote scheduled task refuses to start if there is an active RDP session on the passive nodes. This KB article describes symptoms and resolution in detail: http://support.microsoft.com/kb/910851/en-us
If you look into the Scheduled Tasks log you will find something similar to this:
sql server Task did not appear to start on machine <machine_name> 267015
Needless to say that one of my co-workers was logged on the desktop of one of the passive nodes. So, if you need another good reason to lock everyone out of your servers’ desktop, here it is.
After resetting the offending session, remember to log on to all the passive nodes and kill the zombie setup processes you may have left. Also, delete the scheduled tasks from each node.
In order to avoid collation conflict issues with TempDB, all user databases on a SQL Server instance should be set to the same collation.
Temporary tables and table variables are stored in TempDB, that means that, unless explicitly defined, all character-based columns are created using the database collation, which is the same of the master database. If user databases have a different collation, joining physical tables to temporary tables may cause a collation conflict.
For instance you could have a user database with collation SQL_Latin1_General_CP1_CI_AS on a server with collation Latin1_General_CI_AS:
SELECT name, collation_name, is_system = CASE WHEN name IN ('master', 'model', 'tempdb', 'msdb') THEN 1 ELSE 0 END FROM sys.databases ORDER BY is_system DESC, name ASC
USE dblocal GO CREATE TABLE TestTable ( testColumn varchar(10) ) INSERT INTO TestTable VALUES ('conflict') CREATE TABLE #TempTable ( tempColumn varchar(10) ) INSERT INTO #TempTable VALUES ('conflict') SELECT * FROM TestTable INNER JOIN #TempTable ON testColumn = tempColumn
The only ways to avoid the conflict are:
- Define explicitly the collation on the temporary table with a collation name or database_default
- Add the COLLATE clause to the join predicate, using a collation name or database_default
-- OPTION 1: define the collation when creating the table CREATE TABLE #TempTable ( tempColumn varchar(10) COLLATE database_default ) -- OPTION 2: force a specific collation when querying the table SELECT * FROM TestTable INNER JOIN #TempTable ON testColumn = tempColumn COLLATE database_default
The first option must be preferred when possible, because it doesn’t need performing scalar calculations that end up making any index referencing the column virtually useless.
However, both methods require editing the code, which is not always possible. In those cases, changing the server collation is the only possible solution.
In the SQL Server 2000 days, there was a “nice” tool called rebuildm, that recreated the master database and could achieve this task simply. From SQL Server 2005 on, the only tool to rebuild system databases is the setup utility.
This is what Microsoft recommends to rebuild system databases: http://msdn.microsoft.com/en-us/library/ms179254.aspx
Setup /QUIET /ACTION=REBUILDDATABASE /INSTANCENAME=InstanceName /SQLSYSADMINACCOUNTS=accounts /[ SAPWD= StrongPassword ] /SQLCOLLATION=CollationName
Be warned that blogs and forums are filled with bad advice on this topic. The most frequent ones are these:
- Change the collation of the model database
TempDB is created from model when the service starts, which makes this method look smart. It’s not! Changing the database collation for model doesn’t change the collation of the objects inside the database: you would end up with a messy template for every new database.
- Copy the model database from another instance with the desired collation.
It could work, given that you copy the database from an instance with the exact same @@version. Even if you managed to copy the database, tempdb would have a collation other than those of master and msdb, which would mean moving the original problem among the system databases. Believe me, this is not what you want.
The recommended method requires running the setup and usually is not a big deal. This blog on MSDN (CSS Support Engineers) describes in depth how the setup works and how to troubleshoot some common issues: http://blogs.msdn.com/b/psssql/archive/2008/08/29/how-to-rebuild-system-databases-in-sql-server-2008.aspx
An undocumented method
There seems to be another undocumented method, mentioned (and advertised as safe) by Paul Randal (blog | twitter) in one of his presentations: http://devconnections.com/updates/LasVegas_Fall10/SQL/Randal-SQL-SDB407-Undocumented.pdf (page 17). The method relies on a startup parameter that allows changing the server collation when combined with a couple of trace flags:
sqlservr -m -T4022 -T3659 -q"new collation"
Trace flag 3659 allows logging all errors to sql server logs
Trace flag 4022 forces SQL Server to skip startup stored procedures (if you have any).
Startup option “-m” forces single user mode.
Startup option “-q” rebuilds all databases and contained objects into the specified collation, without reinstalling the instance.
Besides being undocumented, this method seems to have some pitfalls. Gail Shaw (blog | twitter) raises a couple of warnings on this thread at SqlServerCentral: http://www.sqlservercentral.com/Forums/Topic269173-146-1.aspx
So, why recommend a method that could fail or cause harm to your instance? There is a scenario where this method comes extremely handy, mainly because the documented and supported method fails, which is a big SQL Server 2005 cluster.
One of the biggest innovations in SQL Server 2008 setup concerned the cluster installation: in SQL Server 2005 the setup process was run against all passive nodes and then against the active node for the instance, while SQL Server 2008 setup runs only against the local machine and has to be repeated for each node in the cluster.
Running the setup on all the passive nodes, as SQL Server 2005 setup does, can be a pain in the rear. The more nodes you have, more things can go wrong.
Despite all my efforts, I was unable to run the setup successfully with rebuild database action on a SQL Server 2005 cluster. The log files did not report anything useful for troubleshooting: they were only saying that one of the passive nodes could not complete the setup. I was ready to uninstall the instance and install from scratch when I came across the undocumented –q parameter. I tried (I had nothing to lose, after all) and it worked.
I hope it can work for you too.