Blog Archives

A bug in merge replication with FILESTREAM data


I wish I could say that every DBA has a love/hate relationship with Replication, but, let’s face it, it’s only hate. But it could get worse: it could be Merge Replication. Or even worse: Merge Replication with FILESTREAM.

What could possibly top all this hatred and despair if not a bug? Well, I happened to find one, which I will describe here.

The scenario

I have published tables with FILESTREAM data before, but it seems there is a particular planetary alignment that triggers an error during the execution of the snapshot agent.

This unlikely combination consists of a merge article with a FILESTREAM column and two UNIQUE indexes on the ROWGUIDCOL column. Yes, I know that generally it does not make sense to have two indexes on the same column, but this happened to be one of the cases where it did: we had a CLUSTERED PRIMARY KEY on the uniqueidentifier column decorated with the ROWGUIDCOL attribute and, on top of it, one more NONCLUSTERED UNIQUE index on the same column, backed by a UNIQUE constraint.

Setting up the publication does not throw any error, but generating the initial snapshot for the publication does:

Cannot create, drop, enable, or disable more than one constraint,
column, index, or trigger named 'ncMSmerge_conflict_TestMergeRep_DataStream'
in this context. Duplicate names are not allowed.

Basically, the snapshot agent is complaining that the name of one of the indexes it is trying to create on the conflict table is not unique. The interesting thing about this bug is that it doesn’t appear when the table has no FILESTREAM column, and it doesn’t appear when the table doesn’t have the redundant UNIQUE constraint on the ROWGUIDCOL column: both conditions need to be met.

The script

Here is the full script to reproduce the bug.

Before you run it, make sure that:

  1. FILESTREAM is enabled
  2. Distribution is configured
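
If you want to check both prerequisites from T-SQL before running the script, something along these lines should do (a quick sketch based on the standard catalog views and system procedures):

-- Check that FILESTREAM is enabled at the instance level:
-- 0 means disabled, higher values mean some level of FILESTREAM access
SELECT SERVERPROPERTY('FilestreamEffectiveLevel') AS filestream_level;

-- Check whether distribution is configured on this instance
-- (returns the distributor server and distribution database, if any)
EXEC sp_get_distributor;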


USE master;
GO
-- CLEANUP
IF DB_ID('TestMergeRep') IS NOT NULL
BEGIN
BEGIN TRY
exec TestMergeRep.sys.sp_dropmergepublication @publication=N'TestMergeRep';
END TRY
BEGIN CATCH
PRINT 'sp_dropmergepublication failed'
END CATCH
BEGIN TRY
exec TestMergeRep.sys.sp_removedbreplication 'TestMergeRep', 'merge';
END TRY
BEGIN CATCH
PRINT 'sp_removedbreplication failed'
END CATCH
BEGIN TRY
exec TestMergeRep.sys.sp_replicationdboption @dbname = N'TestMergeRep', @optname = N'merge publish', @value = N'false';
END TRY
BEGIN CATCH
PRINT 'sp_replicationdboption failed'
END CATCH
ALTER DATABASE TestMergeRep SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DROP DATABASE TestMergeRep;
END
GO
-- CREATE DATABASE
CREATE DATABASE TestMergeRep;
GO
-- WE NEED A FILESTREAM FILEGROUP
DECLARE @path nvarchar(128), @sql nvarchar(max);
SELECT @path = LEFT(physical_name, LEN(physical_name) - CHARINDEX('\', REVERSE(physical_name),1) + 1)
FROM sys.database_files
WHERE type = 0
ALTER DATABASE TestMergeRep
ADD
FILEGROUP [TestmergeRep_FileStream01] CONTAINS FILESTREAM;
SET @sql = '
ALTER DATABASE TestMergeRep
ADD
FILE
( NAME = N''TestmergeRep_FS01'', FILENAME = ''' + @path + 'TestMergeRep_FS01'' , MAXSIZE = UNLIMITED)
TO FILEGROUP [TestmergeRep_FileStream01];
'
EXEC(@sql)
-- CREATE TABLE
USE TestMergeRep;
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[DataStream](
[DataStreamGUID] [uniqueidentifier] ROWGUIDCOL NOT NULL,
[ValueData] [varbinary](max) FILESTREAM NOT NULL,
CONSTRAINT [DataStream_DataStream_PK] PRIMARY KEY CLUSTERED
(
[DataStreamGUID] ASC
)
)
GO
-- I know it doesn't make sense, but the bug only shows
-- when the table has a second UNIQUE constraint on the PK column
ALTER TABLE [DataStream] ADD CONSTRAINT UQ_MESL_DataStreamPK UNIQUE ([DataStreamGUID]);
-- WORKAROUND: create the UNIQUE index without creating the UNIQUE constraint:
--CREATE UNIQUE NONCLUSTERED INDEX UQ_MESL_DataStreamPK ON [DataStream] ([DataStreamGUID]);
-- SET UP REPLICATION
USE master
EXEC sp_replicationdboption
@dbname = N'TestMergeRep',
@optname = N'merge publish',
@value = N'true';
use [TestMergeRep]
exec sp_addmergepublication
@publication = N'TestMergeRep',
@description = N'Merge publication of database TestMergeRep.',
@retention = 30,
@sync_mode = N'native',
@allow_push = N'true',
@allow_pull = N'true',
@allow_anonymous = N'false',
@enabled_for_internet = N'false',
@conflict_logging = N'publisher',
@dynamic_filters = N'false',
@snapshot_in_defaultfolder = N'true',
@compress_snapshot = N'false',
@ftp_port = 21,
@ftp_login = N'anonymous',
@conflict_retention = 14,
@keep_partition_changes = N'false',
@allow_subscription_copy = N'false',
@allow_synctoalternate = N'false',
@add_to_active_directory = N'false',
@max_concurrent_merge = 0,
@max_concurrent_dynamic_snapshots = 0,
@publication_compatibility_level = N'100RTM',
@use_partition_groups = N'false';
exec sp_addpublication_snapshot
@publication = N'TestMergeRep',
@frequency_type = 1,
@frequency_interval = 1,
@frequency_relative_interval = 1,
@frequency_recurrence_factor = 1,
@frequency_subday = 1,
@frequency_subday_interval = 5,
@active_start_date = 0,
@active_end_date = 0,
@active_start_time_of_day = 10000,
@active_end_time_of_day = 235959;
exec sp_addmergearticle
@publication = N'TestMergeRep',
@article = N'DataStream',
@source_owner = N'dbo',
@source_object = N'DataStream',
@type = N'table',
@description = null,
@column_tracking = N'true',
@pre_creation_cmd = N'drop',
@creation_script = null,
@schema_option = 0x000000010C034FD1,
@article_resolver = null,
@subset_filterclause = N'',
@vertical_partition = N'false',
@destination_owner = N'dbo',
@verify_resolver_signature = 0,
@allow_interactive_resolver = N'false',
@fast_multicol_updateproc = N'true',
@check_permissions = 0,
@identityrangemanagementoption = 'manual',
@delete_tracking = N'true',
@stream_blob_columns = N'false',
@force_invalidate_snapshot = 1;
-- Set the owner of all merge jobs to sa
DECLARE @job_id uniqueidentifier
DECLARE c CURSOR STATIC LOCAL FORWARD_ONLY READ_ONLY
FOR
SELECT job_id
FROM msdb.dbo.sysjobs AS sj
INNER JOIN msdb.dbo.syscategories AS sc
ON sj.category_id = sc.category_id
WHERE sc.name = 'REPL-Merge';
OPEN c
FETCH NEXT FROM c INTO @job_id
WHILE @@FETCH_STATUS = 0
BEGIN
EXEC msdb.dbo.sp_update_job @job_id=@job_id , @owner_login_name=N'sa'
FETCH NEXT FROM c INTO @job_id
END
CLOSE c
DEALLOCATE c

After running the script, start the snapshot agent and you’ll see the error appearing:

[Screenshot: the snapshot agent failing with the duplicate index name error]
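
By the way, SSMS is not required to start the agent: the snapshot can also be kicked off from T-SQL in the published database, using the standard replication procedure:

-- Start the snapshot agent for the publication
USE TestMergeRep;
EXEC sp_startpublication_snapshot @publication = N'TestMergeRep';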

Workaround

One way to get rid of the bug is to enforce the uniqueness of the data by using a UNIQUE index instead of a UNIQUE constraint:

CREATE UNIQUE NONCLUSTERED INDEX UQ_MESL_DataStreamPK
ON [DataStream] ([DataStreamGUID]);

With this index, the snapshot agent completes correctly. Please note that the index would have been UNIQUE anyway, because its key is a superset of the primary key.
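
If you are curious, the two variants are easy to tell apart in the metadata: the constraint-backed index has is_unique_constraint set to 1 in sys.indexes, while the plain UNIQUE index does not:

-- Compare the unique indexes on the published table
SELECT name, is_unique, is_unique_constraint, is_primary_key
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.DataStream');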

Hope this helps!

Please Vote!

This bug has been filed on UserVoice and can be found here: https://feedback.azure.com/forums/908035-sql-server/suggestions/34735489-bug-in-merge-replication-snapshot-agent-with-files

Please upvote it!

An annoying Bug in the Restore Dialog


Today, thanks to a customer, I discovered one of those annoying little things that can really drive you nuts.

Basically, they were trying to restore a backup using the SSMS Restore Database window and they kept getting “No backupset selected to be restored” whenever a backup file was selected.

You just had to select a file for restore and click OK…

[Screenshot: selecting the backup file in the Restore Database dialog]

… to be met with an error message in the Restore Database window:

[Screenshot: the “No backupset selected to be restored” error]

The weird thing about it is that the backup file restored perfectly fine from a T-SQL script:

[Screenshot: the backup restoring successfully from a T-SQL script]

So it had to be something wrong with SSMS, but what?

Looking closer at the restore script, one thing stands out. Look at the file name:

[Screenshot: the restore script with the file name highlighted]

Yep, there’s a leading whitespace in the file name. Could that be the source of the problem?
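
For reference, the statement generated for the restore looked something like this (database name and path are made up here); note the stray space right after the last backslash:

RESTORE DATABASE [MyDatabase]
FROM DISK = N'C:\Backups\ MyDatabase.bak'
WITH RECOVERY;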

Let’s try again with the GUI in a slightly different way. This time I will copy the folder path from the “Backup File Location” textbox…

[Screenshot: copying the folder path from the “Backup File Location” textbox]

… and paste it directly in the “File name” textbox, right before the file name:

[Screenshot: pasting the path into the “File name” textbox]

This time everything works as expected.

Bottom line:

  1. This is a bug in SSMS: go and upvote this Connect item to have it fixed in a future version.
  2. Don’t use the GUI to restore a database: plain T-SQL works just as well (see the sketch below).
  3. Don’t use the GUI at all.
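
Staying away from the GUI is not hard: inspecting a backup before restoring it takes two statements in a query window (the path is hypothetical):

-- Look at what's inside a backup file without the GUI
RESTORE HEADERONLY FROM DISK = N'C:\Backups\MyDatabase.bak';
RESTORE FILELISTONLY FROM DISK = N'C:\Backups\MyDatabase.bak';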

Data Collector Clustering Woes


During the last few days I’ve been struggling to work around something that seems to be a bug in SQL Server 2008 R2 Data Collector in a clustered environment. It’s been quite a struggle, so I decided to post my findings and my resolution, hoping it wasn’t all in vain.

SYMPTOMS:

After setting up a Utility Control Point, I started to enroll my instances to the UCP and everything was looking fine.

[Screenshot: managed instances enrolled in the Utility Control Point]

When an instance is enrolled to a UCP, the sysutility_mdw database can be used as a target Management Datawarehouse for non-utility collection sets, such as the built-in system collection sets. Actually, the utility database is the only possible target for any collection set, since the data collection configuration is shared between utility and non-utility collection sets.

That said, I enabled and started the system collection sets and eagerly waited for some data to show up on the built-in reports. As nothing turned up, I checked the data collector jobs and they were executing successfully, so the data had to be there, hidden somewhere.
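
For the record, the system collection sets can be enabled and started from T-SQL as well, using the data collector procedures in msdb (the collection_set_id values vary, so look them up first):

USE msdb;

-- Find the system collection sets and their ids
SELECT collection_set_id, name, is_running
FROM dbo.syscollector_collection_sets;

-- Start one of them (the id here is just an example)
EXEC dbo.sp_syscollector_start_collection_set @collection_set_id = 1;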

In fact, the data had been collected and uploaded successfully, but it didn’t show up in the reports because of the way the data source had been registered in the Management Datawarehouse internal tables.

A quick look at [core].[source_info_internal] unveiled the root cause of the issue: the clustered instances had been registered with the physical name of the cluster node and not with the virtual server name of the instance.

[Screenshot: core.source_info_internal listing instances registered with physical node names]

The built-in Data Collector reports filter data in this table using the server name connected in SSMS, which is obviously very different from what is found in the data sources table. For instance, when connected to the clustered instance VIRTUAL1\INST01, I didn’t see any performance data in the report, because the data had been collected using the physical network name of the owner node (e.g. PHYSICAL1\INST01).
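
The mismatch is easy to show by comparing the name SSMS connects with to what the data sources table contains (run the first query from the clustered instance):

-- What the report filters on: the virtual server name
SELECT @@SERVERNAME;                               -- VIRTUAL1\INST01

-- What the upload registered: the physical node name
SELECT instance_name
FROM sysutility_mdw.core.source_info_internal;     -- PHYSICAL1\INST01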

I know it may sound confusing at this point, so keep the following picture in mind while looking at servers and instances in the example.

[Diagram: cluster nodes, virtual server names and instances used in the example]

So, what was wrong with my setup? How could it be fixed?

I tried everything that came to my mind to no avail. In order, I tried:

  • Wiping out the datawarehouse database (sysutility_mdw)
  • Reinstalling the instance hosting the datawarehouse
  • Changing the upload jobs (this one looked promising, because some job steps contained the $(MACH)\$(INST) tokens instead of the $(SRVR) token I would have expected, but it didn’t work either)
  • Updating data in the tables directly

Nothing I tried solved the issue: every time the upload jobs ran at the enrolled instances, the wrong instance names turned up in the data sources table.

I suspect something was wrong in the Management Datawarehouse instance, since the same issue affected all the enrolled instances, no matter where they were installed. The problem is that I was unable to find a way to make it work.

The only thing that worked for me was forcing SQL Server to understand what my cluster setup looks like and preventing it from using cluster node names instead of virtual instance names.

As ugly as it is, the only fix that worked was a view + INSTEAD OF TRIGGER combination.

First of all, we need some tables to store the cluster layout: nodes and instances.

-- Create a couple of tables in msdb to
-- describe the cluster topology
USE msdb;
GO

CREATE TABLE sysutility_ucp_managed_cluster_nodes (
    cluster_id int NOT NULL,
    node_name sysname NOT NULL PRIMARY KEY CLUSTERED
)
GO

CREATE TABLE sysutility_ucp_managed_cluster_servers (
    cluster_id int NOT NULL,
    virtual_server_name sysname NOT NULL PRIMARY KEY CLUSTERED,
    instance_name sysname NULL,
    server_name AS virtual_server_name + ISNULL('\' + NULLIF(instance_name,'MSSQLSERVER'),'')
)
GO

INSERT INTO sysutility_ucp_managed_cluster_nodes (cluster_id, node_name)
VALUES	(1,'PHYSICAL1'),
        (1,'PHYSICAL2'),
        (2,'PHYSICAL3'),
        (2,'PHYSICAL4')
GO

INSERT INTO sysutility_ucp_managed_cluster_servers (cluster_id, virtual_server_name, instance_name)
VALUES	(1,'VIRTUAL1','INST01'),
        (1,'VIRTUAL2','INST02'),
        (2,'VIRTUAL3','INST03')

GO

GRANT SELECT ON object::sysutility_ucp_managed_cluster_nodes TO [dc_proxy];
GRANT SELECT ON object::sysutility_ucp_managed_cluster_servers TO [dc_proxy];
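
A quick sanity check: the computed server_name column should render exactly the names SSMS connects with:

SELECT virtual_server_name, instance_name, server_name
FROM msdb.dbo.sysutility_ucp_managed_cluster_servers;
-- server_name: VIRTUAL1\INST01, VIRTUAL2\INST02, VIRTUAL3\INST03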

Then we need to update the data sources collected using the cluster node name instead of the virtual name:

USE [sysutility_mdw]
GO

UPDATE trg
SET trg.instance_name = REPLACE(trg.instance_name COLLATE database_default, nodes.node_name + '\', srv.virtual_server_name + '\')
FROM [core].[source_info_internal] AS trg
INNER JOIN msdb.dbo.sysutility_ucp_managed_cluster_nodes AS nodes
    ON nodes.node_name = SUBSTRING(trg.instance_name, 1, LEN(nodes.node_name)) COLLATE database_default
INNER JOIN msdb.dbo.sysutility_ucp_managed_cluster_servers AS srv
    ON nodes.cluster_id = srv.cluster_id
    -- match the full physical name, so that clusters hosting more than
    -- one instance update each row with the right virtual server
    AND trg.instance_name COLLATE database_default = nodes.node_name + ISNULL('\' + NULLIF(srv.instance_name,'MSSQLSERVER'),'')

-- Now server names should be ok
SELECT *
FROM sysutility_mdw.[core].[source_info_internal]
GO

Now we will replace the data sources table with a view that multiplies virtual server names for each possible owner node. This is required because the collection sets keep trying to upload data using the cluster node name, and they fail miserably when the data source is not found in the table (“the specified collection set is not valid in this data warehouse”).

USE [sysutility_mdw]
GO

-- Rename the data sources table
EXEC sp_rename 'core.source_info_internal', 'source_info_internal_ms'

USE [sysutility_mdw]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE VIEW [core].[source_info_internal]
AS
SELECT sii.source_id
    ,sii.collection_set_uid
    ,instance_name = (nds.node_name + ISNULL('\' + NULLIF(srv.instance_name,'MSSQLSERVER'),'')) COLLATE Latin1_General_CI_AI
    ,sii.days_until_expiration
    ,sii.operator
FROM core.source_info_internal_ms AS sii
INNER JOIN msdb.dbo.sysutility_ucp_managed_cluster_servers AS srv
    ON sii.instance_name COLLATE database_default = srv.server_name
INNER JOIN msdb.dbo.sysutility_ucp_managed_cluster_nodes AS nds
    ON nds.cluster_id = srv.cluster_id
UNION ALL
SELECT *
FROM core.source_info_internal_ms

GO

And now the last thing we need to create is a trigger on the view, in order to control what gets written to the original table.
With this in place, we should have only “good” server names showing up in the instance_name column.

CREATE TRIGGER [core].[TR_source_info_internal_IU]
ON [core].[source_info_internal]
INSTEAD OF INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Update rows using the source_id
    UPDATE trg
    SET days_until_expiration = i.days_until_expiration
    FROM core.source_info_internal_ms AS trg
    INNER JOIN inserted AS i
        ON trg.source_id = i.source_id
    WHERE EXISTS (
        SELECT 1
        FROM deleted
        WHERE source_id = i.source_id
    )

    -- Turn INSERTs into UPDATEs using the
    -- cluster physical / virtual conversion
    UPDATE trg
    SET days_until_expiration = i.days_until_expiration
    FROM core.source_info_internal_ms AS trg
    INNER JOIN msdb.dbo.sysutility_ucp_managed_cluster_servers AS srv
        ON srv.server_name = trg.instance_name COLLATE database_default
    INNER JOIN msdb.dbo.sysutility_ucp_managed_cluster_nodes AS nodes
        ON nodes.cluster_id = srv.cluster_id
    INNER JOIN inserted AS i
        ON  trg.collection_set_uid = i.collection_set_uid
        AND trg.operator = i.operator
        AND nodes.node_name + ISNULL('\' + NULLIF(srv.instance_name,'MSSQLSERVER'),'') = i.instance_name COLLATE database_default
    WHERE NOT EXISTS (
        SELECT 1
        FROM deleted
        WHERE source_id = i.source_id
    )

    -- Handle proper INSERTs
    ;WITH newrows AS (
        SELECT collection_set_uid, v_server.instance_name, days_until_expiration, operator
        FROM inserted AS i
        CROSS APPLY (
            SELECT instance_name = COALESCE((
                    SELECT DISTINCT srv.server_name -- DISTINCT: the join to nodes returns one row per cluster node
                    FROM msdb.dbo.sysutility_ucp_managed_cluster_servers AS srv
                    INNER JOIN msdb.dbo.sysutility_ucp_managed_cluster_nodes AS nodes
                        ON nodes.cluster_id = srv.cluster_id
                    WHERE srv.server_name = i.instance_name COLLATE database_default
                ),(
                    SELECT srv.server_name
                    FROM msdb.dbo.sysutility_ucp_managed_cluster_servers AS srv
                    INNER JOIN msdb.dbo.sysutility_ucp_managed_cluster_nodes AS nodes
                        ON nodes.cluster_id = srv.cluster_id
                    WHERE nodes.node_name + ISNULL('\' + NULLIF(srv.instance_name,'MSSQLSERVER'),'') = i.instance_name COLLATE database_default
                ), (
                    SELECT i.instance_name COLLATE database_default
                )
            ) -- end coalesce
        ) AS v_server
        WHERE NOT EXISTS (
                SELECT 1
                FROM deleted
                WHERE source_id = i.source_id
            )
    )
    INSERT INTO core.source_info_internal_ms (collection_set_uid, instance_name, days_until_expiration, operator)
    SELECT collection_set_uid, instance_name, days_until_expiration, operator
    FROM newrows
    WHERE NOT EXISTS (
        SELECT 1
        FROM core.source_info_internal_ms
        WHERE collection_set_uid = newrows.collection_set_uid
            AND instance_name = newrows.instance_name
            AND operator = newrows.operator
    )

    DELETE trg
    FROM core.source_info_internal_ms AS trg
    WHERE EXISTS (
            SELECT 1
            FROM deleted
            WHERE source_id = trg.source_id
        )
        AND NOT EXISTS (
            SELECT 1
            FROM inserted
            WHERE source_id = trg.source_id
        )

END

Obviously I don’t have access to the source code of the Data Collector, but I suspect it uses a component which is not cluster-aware (dcexec.exe) and for some reason ends up using the wrong IP address to communicate with the management datawarehouse. I have nothing to support this supposition: it just seems reasonable to me, and it resembles something I experienced in the past with non-cluster-aware services, such as the SQL Browser.

Is this a bug? Definitely!

Should I report it on Connect? Maybe: the results with past items are so discouraging that I don’t think I’ll bother taking the time to file it. It must be said that reproducing this error is not easy: none of the other failover cluster instances I have lying around is affected by this issue, so I guess it’s an edge case. Nevertheless, it’s worth fixing.


UPDATE 15 Sept 2016:

Turns out that the whole problem arises ONLY when a proxy account runs the data collection job steps. If the job steps run under the SQL Server Agent service account instead, the error doesn’t turn up. I suggest this solution rather than the complicated view/trigger combination proposed here.
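
To check whether your collection job steps are running under a proxy, a query like this should do (a sketch against the standard msdb job tables; the category name may differ slightly across versions):

-- List data collector job steps that run under a proxy account
SELECT j.name AS job_name, s.step_id, s.step_name, p.name AS proxy_name
FROM msdb.dbo.sysjobs AS j
INNER JOIN msdb.dbo.sysjobsteps AS s
    ON s.job_id = j.job_id
INNER JOIN msdb.dbo.sysproxies AS p
    ON p.proxy_id = s.proxy_id
INNER JOIN msdb.dbo.syscategories AS c
    ON c.category_id = j.category_id
WHERE c.name LIKE 'Data Collector%';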

An annoying bug in Database Mail Configuration Wizard


Looks like a sneaky bug made its way from SQL Server 2005 CTP to SQL Server 2008 R2 SP1 almost unnoticed, or, at least, ignored by Microsoft.

Imagine that you installed a new SQL Server instance (let’s call it “TEST”) and you want Database Mail configured in the same way as your other instances. No problem: you navigate the Object Explorer to Database Mail, start the wizard and then realize that you don’t remember the parameters to enter.

Not a big deal: you can copy those parameters from the server “PROD” that you configured last year.

You start the wizard on “PROD” and keep this window open to copy the parameter values into the “TEST” dialog.

OK, done? You just have to click “Finish” and… whoops!

This is the error you get when you try to apply the settings: a complaint about the missing “dba_notify” mail account.

Wait: you don’t have a “dba_notify” account on server “TEST” yet. This error message was generated on PROD instead.

Looks like the MS developers coded this dialog assuming that only one of these would be open at a time, and probably used an application-scoped global variable to store the Database Mail settings. Not only that: the Database Mail Wizard loses its database context and points to a different instance.

I found a Connect item reporting the issue, dating back to July 2005:

http://connect.microsoft.com/SQLServer/feedback/details/207602/max-of-1-database-mail-wizard-open-at-a-time

Here is another one from 2006:

http://connect.microsoft.com/SQLServer/feedback/details/124958/database-mail-configuration-gui-does-not-appear-to-maintain-database-context

I haven’t tried on Denali CTP3 yet, but I would not be surprised if I found it to be still broken.

Until Microsoft decides to fix it, if you want to copy the Database Mail Settings from another server, start the Database Mail Wizard from a separate SSMS instance, or your settings can get totally screwed up.

Oracle BUG: UNPIVOT returns wrong data for non-unpivoted columns


Some bugs in Oracle’s code are really surprising. Whenever I run into this kind of issue, I can’t help but wonder how nobody else noticed it before.

Some days ago I was querying AWR data from DBA_HIST_SYSMETRIC_SUMMARY and I wanted to turn the columns AVERAGE, MAXVAL and MINVAL into rows, in order to fit this result set into a performance graphing application that expects input data formatted as {TimeStamp, SeriesName, Value}.

Columns to rows? A good match for UNPIVOT.

Oracle 11g introduced PIVOT and UNPIVOT operators to allow rows-to-columns and columns-to-rows transformations. Prior to 11g, this kind of transformation had to be coded with bulky CASE expressions (for PIVOT) or pesky UNION queries (for UNPIVOT). PIVOT and UNPIVOT allow developers to write more concise and readable statements, but I guess that not so many people have been using these features since their release, or they would have found this bug very soon.
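
Just to give an idea of what UNPIVOT replaces, this is roughly how the same columns-to-rows transformation would have looked before 11g, using the Metrics subquery from the repro below:

SELECT TS, 'AVERAGE' AS aggregate, AVERAGE AS value FROM Metrics
UNION ALL
SELECT TS, 'MAXVAL', MAXVAL FROM Metrics
UNION ALL
SELECT TS, 'MINVAL', MINVAL FROM Metrics
ORDER BY 1;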

Here is the statement I was trying to run:

WITH Metrics AS (
    SELECT to_date(to_char(BEGIN_TIME,'YYYY-MM-DD HH24'),'YYYY-MM-DD HH24') AS TS,
        AVG(AVERAGE) AS AVERAGE,
        MAX(MAXVAL) AS MAXVAL,
        MIN(MINVAL) AS MINVAL
    FROM DBA_HIST_SYSMETRIC_SUMMARY
    WHERE METRIC_NAME = 'Host CPU Utilization (%)'
    GROUP BY to_date(to_char(BEGIN_TIME,'YYYY-MM-DD HH24'),'YYYY-MM-DD HH24')
)
SELECT TS, aggregate, value
FROM Metrics
UNPIVOT (value FOR aggregate IN (AVERAGE, MAXVAL, MINVAL))
ORDER BY 1

The idea behind it was to convert the date column into a string without the minute part, in order to convert it back to a date and group by hour.

Surprisingly enough, this was the result:

[Screenshot: wrong results returned by UNPIVOT]

The date column was returned with wrong data. Why?

The issue seems to be related to the date datatype, because converting back to date after the UNPIVOT works just fine:

WITH Metrics AS (
    SELECT to_char(BEGIN_TIME,'YYYY-MM-DD HH24') AS TS,
        AVG(AVERAGE) AS AVERAGE,
        MAX(MAXVAL) AS MAXVAL,
        MIN(MINVAL) AS MINVAL
    FROM DBA_HIST_SYSMETRIC_SUMMARY
    WHERE METRIC_NAME = 'Host CPU Utilization (%)'
    GROUP BY to_char(BEGIN_TIME,'YYYY-MM-DD HH24')
)
SELECT to_date(TS,'YYYY-MM-DD HH24') AS TS, aggregate, value
FROM Metrics
UNPIVOT (value FOR aggregate IN (AVERAGE, MAXVAL, MINVAL))
ORDER BY 1

This query, instead, produces the expected results.

[Screenshot: correct results with the char column]

I raised this issue with Oracle Support, who filed it under bug ID 9900850.8. Both 11.2.0.1 and 11.2.0.2 seem to be affected by this problem, but it’s quite unlikely that we’ll see it fixed before 12.1.

Time will tell.