Fragmentation on Replicated Tables

I’ve heard more than one person claim that they don’t need to defrag their replicated tables, because the source tables are defragged regularly. Let’s set this myth straight.

Repeat after me: defragmentation processes are not replicated.

Good. Thank you.

For those non-believers, let me prove it:

SET NOCOUNT ON;
 
/* Create publication db */
CREATE DATABASE sandbox_publisher;
GO
 
/* Create subscription db */
CREATE DATABASE sandbox_subscriber;
GO
 
/* Navigate to publisher db */
USE sandbox_publisher;
GO
 
/* Create a table to play with */
CREATE TABLE myTable
(
      myID UNIQUEIDENTIFIER
    , myColumn VARCHAR(2000)
 
    CONSTRAINT PK_myTable PRIMARY KEY CLUSTERED (myID)
);
/* NOTE: please never, ever use this
   type of clustered index in real-life.
   The whole purpose is to generate
   substantial fragmentation for 
   illustrative purposes. */
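 
/* Aside (a sketch, not part of this demo): if you
   really do need a GUID key, a sequential default
   avoids most of this fragmentation:
 
   CREATE TABLE myBetterTable
   (
         myID UNIQUEIDENTIFIER
             DEFAULT NEWSEQUENTIALID()
       , myColumn VARCHAR(2000)
 
       CONSTRAINT PK_myBetterTable
           PRIMARY KEY CLUSTERED (myID)
   );
*/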
 
/* Populate it with some non-sequential
   records to generate fragmentation */
DECLARE @iteration INT;
SET @iteration = 0;
 
WHILE @iteration < 1000
BEGIN
    INSERT INTO myTable
    SELECT NewID(), REPLICATE('A', 2000);
 
    SET @iteration = @iteration + 1;
END
 
/* Now let's verify that we really did
   make a mess of our clustered index */
SELECT avg_fragmentation_in_percent
    , page_count
FROM sys.dm_db_index_physical_stats 
(DB_ID(), OBJECT_ID(N'MyTable'), NULL , NULL, N'Limited') 
OPTION (MaxDop 1);
 
/* Results:
 
avg_fragmentation_in_percent page_count
---------------------------- --------------------
99.0157480314961             508
 
99% fragmented, not too shabby!  Your results 
may vary slightly, based on the GUID values
generated by NewID().
*/
 
/* Now, let's set up replication */
USE sandbox_publisher;
GO
 
/* Enable publication */
EXECUTE SP_REPLICATIONDBOPTION
      @dbname = N'sandbox_publisher'
    , @optname = N'publish'
    , @VALUE = N'true';
GO
 
/* Create our publication */
EXECUTE SP_ADDPUBLICATION
      @publication = N'myTestPublication' 
    , @sync_method = N'concurrent'
    , @description = N'Test Publication'
    , @status = N'active'
    , @independent_agent = N'true'
    , @immediate_sync = N'true'
    , @retention = 0
GO
 
/* Create our snapshot agent */
EXECUTE SP_ADDPUBLICATION_SNAPSHOT
      @publication = N'myTestPublication'
    , @frequency_type = 1;
 
/* Add our table as an article */
EXECUTE SP_ADDARTICLE
      @publication = N'myTestPublication'
    , @article = N'myTable'
    , @source_owner = N'dbo'
    , @source_object = N'myTable'
    , @type = N'logbased'
    , @destination_table = N'myTable'
    , @destination_owner = N'dbo'
    , @ins_cmd = N'CALL sp_MSins_dbomyTable'
    , @del_cmd = N'CALL sp_MSdel_dbomyTable'
    , @upd_cmd = N'MCALL sp_MSupd_dbomyTable';
GO
 
/* Generate an initial snapshot */
EXECUTE sp_startpublication_snapshot
      @publication = N'myTestPublication';
 
/* Create our subscription */
EXECUTE SP_ADDSUBSCRIPTION 
      @publication = N'myTestPublication'
    , @subscriber = N'YourServerName'
    , @destination_db = N'sandbox_subscriber';
 
/* At this point, basic transactional replication 
   should be configured.  You should now have a 
   copy of myTable in your subscription database. */
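 
/* A quick sanity check (a sketch, not part of the
   original demo): sp_helpsubscription shows the
   subscription status, and matching row counts
   confirm the snapshot has been applied. */
EXECUTE sp_helpsubscription
      @publication = N'myTestPublication';
 
SELECT COUNT(*) AS publisher_rows
FROM sandbox_publisher.dbo.myTable;
 
SELECT COUNT(*) AS subscriber_rows
FROM sandbox_subscriber.dbo.myTable;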
 
USE sandbox_subscriber;
GO
 
/* Let's take a look at fragmentation on the 
   subscription database. */
SELECT avg_fragmentation_in_percent
    , page_count
FROM sys.dm_db_index_physical_stats 
(DB_ID(), OBJECT_ID(N'MyTable'), NULL , NULL, N'Limited') 
OPTION (MaxDop 1);
 
/* Results:
 
avg_fragmentation_in_percent page_count
---------------------------- --------------------
0.29940119760479             334
 
Less than 1% fragmented!  Why is this?  During
initialization, a copy of the schema and data
is sent to the subscriber.  The data is sorted
before insertion into the destination table.
*/
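 
/* Optional (a sketch): a DETAILED scan also returns
   page density, which makes the effect of the sorted
   initialization easier to see -- the subscriber's
   pages should be both in order and well filled. */
SELECT avg_fragmentation_in_percent
    , avg_page_space_used_in_percent
    , page_count
FROM sys.dm_db_index_physical_stats 
(DB_ID(), OBJECT_ID(N'MyTable'), NULL , NULL, N'Detailed') 
OPTION (MaxDop 1);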
 
USE sandbox_publisher;
GO
 
/* Let's insert another 1000 records and see
   what happens... */
DECLARE @iteration INT;
SET @iteration = 0;
 
WHILE @iteration < 1000
BEGIN
    INSERT INTO sandbox_publisher.dbo.myTable
    SELECT NewID(), REPLICATE('A', 2000);
 
    SET @iteration = @iteration + 1;
END
 
SELECT avg_fragmentation_in_percent
    , page_count
FROM sys.dm_db_index_physical_stats 
(DB_ID(), OBJECT_ID(N'MyTable'), NULL , NULL, N'Limited') 
OPTION (MaxDop 1);
 
/* Results:
 
avg_fragmentation_in_percent page_count
---------------------------- --------------------
98.6193293885602             1014
 
The publisher is still very fragmented 
(as expected) */
 
USE sandbox_subscriber;
GO
 
SELECT avg_fragmentation_in_percent
    , page_count
FROM sys.dm_db_index_physical_stats 
(DB_ID(), OBJECT_ID(N'MyTable'), NULL , NULL, N'Limited') 
OPTION (MaxDop 1);
 
/* Results:
 
avg_fragmentation_in_percent page_count
---------------------------- --------------------
95.4034729315628             979
 
But now the subscriber is fragmented too!  This
is because each individual transaction is 
replicated, so out-of-order inserts affect
both the publishing and the subscribing 
databases equally.
*/
 
/* Finally, let's rebuild our index on the 
   publisher and see what happens */
USE sandbox_publisher;
GO
 
ALTER INDEX PK_myTable ON myTable REBUILD;
 
SELECT avg_fragmentation_in_percent
    , page_count
FROM sys.dm_db_index_physical_stats 
(DB_ID(), OBJECT_ID(N'MyTable'), NULL , NULL, N'Limited') 
OPTION (MaxDop 1);
 
/* Results:
 
avg_fragmentation_in_percent page_count
---------------------------- --------------------
0                            667
 
0% fragmentation in the publishing table!  
What about our subscribing table?
*/
 
USE sandbox_subscriber;
GO
 
SELECT avg_fragmentation_in_percent
    , page_count
FROM sys.dm_db_index_physical_stats 
(DB_ID(), OBJECT_ID(N'MyTable'), NULL , NULL, N'Limited') 
OPTION (MaxDop 1);
 
/* Results:
 
avg_fragmentation_in_percent page_count
---------------------------- --------------------
95.4034729315628             979
 
As you can see, the subscribing table was
completely unaffected by the defrag
activities on the publisher.
*/

So there you have it, proof that defrag operations are not replicated! Now go defrag your replicated tables. :)
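
And if you want a quick way to act on that advice, here’s a minimal sketch that checks the subscriber’s table and reorganizes or rebuilds the index accordingly (the 5% and 30% thresholds are common rules of thumb, not magic numbers):

USE sandbox_subscriber;
GO
 
DECLARE @fragPercent FLOAT;
 
SELECT @fragPercent = avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats 
(DB_ID(), OBJECT_ID(N'myTable'), NULL, NULL, N'Limited');
 
IF @fragPercent >= 30
    ALTER INDEX PK_myTable ON myTable REBUILD;
ELSE IF @fragPercent >= 5
    ALTER INDEX PK_myTable ON myTable REORGANIZE;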

Oh, and if you need help defragging your indexes, check out my index defrag script (2005, 2008).

Creating a 60 GB Index

Recently, I needed to create an index on a 1.5 billion row table. I’ve created some large indexes before, but this was the largest, so I thought I’d share my experience in case anyone was interested.

The plan was to create the following index:

Create NonClustered Index IX_indexName_unpartitioned
    On databaseName.dbo.tableName(columnList)
    Include (includedColumnList)
    With (MaxDop = 1, Online = On, Sort_In_TempDB = On)
    On [Primary];

This is an unpartitioned, non-clustered index being built on a partitioned table. Normally, when you build an aligned partitioned index, the index creation process requires less memory and has a less noticeable impact on the system, because each partition is handled individually, one at a time. But because this is an unpartitioned (unaligned) index, every partition is processed concurrently, requiring more memory and causing a greater impact on performance. Because of this, I needed to restrict the process to MaxDop 1; otherwise, the server would suffer under too much memory pressure.
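
For contrast, here’s a sketch of what the aligned version of the same index would look like; it would simply be created on the partition scheme instead of on [Primary]. The scheme and column names here are hypothetical:

Create NonClustered Index IX_indexName_partitioned
    On databaseName.dbo.tableName(columnList)
    Include (includedColumnList)
    With (MaxDop = 1, Online = On, Sort_In_TempDB = On)
    On ps_partitionScheme(partitioningColumn);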

I chose Sort_In_TempDB = On because:

  • I’m building this index online on a very busy table and cannot afford to impact normal operations. By using Sort_In_TempDB = On, index transactions are separated from user transactions, allowing the user transaction log to be truncated. *
  • TempDB is on a different volume, which should reduce the duration of the operation.
  • The recovery for the user database is full, and the recovery for the TempDB is simple. Sorting in TempDB would minimize logging.

* Note: the transaction log for the user database still grew at a much faster rate than normal and had to be closely monitored during this operation to ensure enough free space remained.
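
Since free space in TempDB is the limiting factor for a Sort_In_TempDB build, it’s worth checking how much you have before kicking things off. Here’s a quick sketch using sys.dm_db_file_space_usage (2005+):

Use tempdb;
Go
 
/* Unallocated extent pages are 8 KB each;
   131,072 pages = 1 GB */
Select Sum(unallocated_extent_page_count) / 131072.0 
    As free_space_gb
From sys.dm_db_file_space_usage;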

The size of the indexed columns is 25 bytes, so I ran my calculations and came up with a 36 GB space requirement. We increased TempDB to 50 GB and gave it a go. An hour later… ERROR. The process terminated because there was not enough free space in TempDB to complete the operation. Obviously, my calculations were incorrect. After speaking with Chris Leonard, a man who is way too smart for his own good, I realized I had not included the clustered index in my size calculations. Doh.

Re-running my estimates, here’s what I came up with:

Index Size:             25 bytes
Clustered Index Size:   16 bytes
Records per Page:       197
Est. Rows:              1,575,000,000
Est. Number of Pages:   7,995,000
Space Requirement:      59 GB

Obviously, 50 GB of free space just wasn’t going to cut it. I decided to give TempDB a little wiggle room and bumped it up to 70 GB (not as easy as you’d imagine; I had to requisition more space on the SAN), then re-ran the operation. Success! The process completed in 3 hours and 24 minutes. There was a mild increase in CPU, but no applications or users experienced any issues.
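
For reference, here’s that estimate expressed as a quick T-SQL sketch. It ignores row overhead, so the rounding differs a little from the table above (this version comes out closer to 61 GB), but it’s plenty close for sizing:

Declare @rowBytes int
    , @rowsPerPage int
    , @estRows bigint
    , @estPages bigint;
 
/* 25 bytes of index key + 16 bytes of clustered key */
Select @rowBytes = 25 + 16
    , @estRows = 1575000000;
 
/* 8,096 bytes of usable space per 8 KB page */
Select @rowsPerPage = 8096 / @rowBytes; -- 197
Select @estPages = @estRows / @rowsPerPage; -- ~7,995,000
 
/* 131,072 pages = 1 GB */
Select @estPages As est_pages
    , @estPages / 131072.0 As est_size_gb;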

For those interested in the particulars: this index is used for single-record look-ups and could not be filtered by the partitioning key. The non-partitioned version of this index requires 80% fewer reads and 11% less CPU than its partitioned counterpart.


Index Clean-Up Scripts

I’ve been spending a lot of time lately looking at indexing in my production environments… dropping unused indexes, adding missing ones, and fine-tuning the ones I already have. I thought I’d share some of the scripts I’ve been using to accomplish this.

Here’s the script I use to find unused indexes. It relies heavily on the sys.dm_db_index_usage_stats DMV (2005+) and, for convenience, also returns the SQL statements needed to drop the indexes. That does NOT mean you should necessarily drop an index just because it shows up here; this is only a guide and a starting point. Only you know how your application is used and whether SQL Server’s recommendations make sense.

Unused Indexes Script

/* Find indexes with little or no read activity.
   Note: the generated drop statements assume 
   the dbo schema. */
Declare @dbid int
    , @dbName varchar(100);
 
Select @dbid = DB_ID()
    , @dbName = DB_Name();
 
With partitionCTE (object_id, index_id, row_count, partition_count) 
As
(
    Select [object_id]
        , index_id
        , Sum([rows]) As 'row_count'
        , Count(partition_id) As 'partition_count'
    From sys.partitions
    Group By [object_id]
        , index_id
) 
 
Select Object_Name(i.[object_id]) as objectName
        , i.name
        , Case 
            When i.is_unique = 1 
                Then 'UNIQUE ' 
            Else '' 
          End + i.type_desc As 'indexType'
        , ddius.user_seeks
        , ddius.user_scans
        , ddius.user_lookups
        , ddius.user_updates
        , cte.row_count
        , Case When partition_count > 1 Then 'yes' 
            Else 'no' End As 'partitioned?'
        , Case 
            When i.type = 2 And i.is_unique_constraint = 0
                Then 'Drop Index ' + i.name 
                    + ' On ' + @dbName 
                    + '.dbo.' + Object_Name(ddius.[object_id]) + ';'
            When i.type = 2 And i.is_unique_constraint = 1
                Then 'Alter Table ' + @dbName 
                    + '.dbo.' + Object_Name(ddius.[object_ID]) 
                    + ' Drop Constraint ' + i.name + ';'
            Else '' 
          End As 'SQL_DropStatement'
From sys.indexes As i
Inner Join sys.dm_db_index_usage_stats ddius
    On i.object_id = ddius.object_id
        And i.index_id = ddius.index_id
Inner Join partitionCTE As cte
    On i.object_id = cte.object_id
        And i.index_id = cte.index_id
Where ddius.database_id = @dbid
Order By 
    (ddius.user_seeks + ddius.user_scans + ddius.user_lookups) Asc
    , user_updates Desc;

This next script relies on several DMVs (2005+) that identify missing indexes. While this is good information, the index recommendations do not always make sense and sometimes overlap. Also, these DMVs only hold data accumulated since SQL Server was last restarted, so if it’s been a while since your server was rebooted, this data may be out of date. This script also provides a SQL statement in case you do decide to create the index, but it doesn’t take advanced options into consideration (e.g. Sort_In_TempDB, Online, MaxDop) and only provides a basic create statement. Nonetheless, it’s another good starting point.

Missing Index Script

Select t.name As 'affected_table'
    , 'Create NonClustered Index IX_' + t.name + '_missing_' 
        + Cast(ddmid.index_handle As varchar(10))
        + ' On ' + ddmid.statement 
        + ' (' + IsNull(ddmid.equality_columns,'') 
        + Case When ddmid.equality_columns Is Not Null 
            And ddmid.inequality_columns Is Not Null Then ',' 
                Else '' End 
        + IsNull(ddmid.inequality_columns, '')
        + ')' 
        + IsNull(' Include (' + ddmid.included_columns + ');', ';'
        ) As sql_statement
    , ddmigs.user_seeks
    , ddmigs.user_scans
    , Cast((ddmigs.user_seeks + ddmigs.user_scans) 
        * ddmigs.avg_user_impact As int) As 'est_impact'
    , ddmigs.last_user_seek
From sys.dm_db_missing_index_groups As ddmig
Inner Join sys.dm_db_missing_index_group_stats As ddmigs
    On ddmigs.group_handle = ddmig.index_group_handle
Inner Join sys.dm_db_missing_index_details As ddmid 
    On ddmig.index_handle = ddmid.index_handle
Inner Join sys.tables As t
    On ddmid.object_id = t.object_id
Where ddmid.database_id = DB_ID()
    And Cast((ddmigs.user_seeks + ddmigs.user_scans) 
        * ddmigs.avg_user_impact As int) > 100
Order By Cast((ddmigs.user_seeks + ddmigs.user_scans) 
    * ddmigs.avg_user_impact As int) Desc;
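
If you do decide to act on one of these recommendations, you’ll probably want to dress up the generated statement with the appropriate options before running it, much like the 60 GB index example earlier. A hypothetical example (the index name and columns are placeholders):

Create NonClustered Index IX_myTable_missing_42
    On dbo.myTable (someColumn)
    Include (someOtherColumn)
    With (MaxDop = 1, Online = On, Sort_In_TempDB = On);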

Back from Vacation!

I’m back from Europe! We had a fantastic time and saw some incredible sights, but I must say I am glad to be back home. :)

A couple of quick items to mention. First, we now have an official PASS Chapter for the Cedar Rapids area, called I380 Corridor PASS! A special thanks to the folks at PASS for all of their help. If you’re in the Eastern Iowa region and interested in attending or volunteering, please e-mail me at michelle @ sqlfool dot com.

Also, I’ve received a couple of comments about my latest codebox plugin. It appears that in an attempt to fix one issue, I’ve caused another. I plan to work on a solution this weekend; in the meantime, if you try to copy code from the codeboxes, you’ll need to do a find/replace on the single quotation marks. I apologize for the hassle and hope to have this resolved soon.

Finally, the very knowledgeable and prolific Mr. Denny wrote a post regarding Wordle and challenged me to post the results of my blog. I’m just now getting caught up from vacation, so I’m a little late in responding, but here it is:

[Wordle word cloud for this blog, generated at Wordle.com]

Pretty neat, huh? Thanks, Mr. Denny, for sharing! And to keep the fun going, I’m tagging Brent Ozar and Jonathan Kehayias (The Rambling DBA).

Update:
The codebox plugin issue should now be resolved. Please let me know if you have any issues with it.