Accelerating ETL Data Retrieval

As I’ve mentioned previously, SQL Server 2012 Integration Services Design Patterns is now available on Amazon. PASS has invited the ENTIRE author team to give a pre-conference training session at Summit 2012. Precons are all-day training events and are held on the days before Summit. Our precon will be held on Monday, November 5th. I’ve attended several precons at Summit, and in my opinion, they provide unparalleled depth and access to the presenters. I expect our precon will be no different.

So what will I be presenting on? Good question. I’m going to talk about Accelerating ETL Data Retrieval. Specifically, I’ll discuss strategies for retrieving data for full loads vs. incremental loads, and I’ll explore some strategies for retrieving data from large tables. Since everyone has a different definition of “large,” I’ll clarify: I’ve successfully employed these strategies on tables with 10 billion rows.

Now you might be reading this and thinking, “Hey, that doesn’t sound like SSIS!” Well… yes and no. While I will be discussing some strategies within SSIS, most of what I discuss will take place at the database layer. In fact, most of my content could be useful for any ETL tool. I still think it’s a good topic for this precon for three reasons. First, most ETL developers I talk with — that is, people who specialize in and are primarily tasked with ETL — overlook the data retrieval step. I frequently hear, “I’m pulling all of the data, so it doesn’t matter, I can’t get the data any faster than SELECT * FROM.” That’s not always true, and I’ll explain why in the precon. Second, having a thorough understanding of data retrieval techniques is important. Data retrieval is one of the most — if not the single most — expensive components of ETL. Lastly, I think there’s a huge knowledge gap in this area. While there is a lot of discussion about query optimization, it rarely addresses the query patterns that ETL requires.

So that’s what I’ll be talking about and why. What do you think? Worthwhile content?

Also, this is just one of the things that will be covered in our SSIS Design Patterns precon. The rest of the author team — Andy Leonard, Matt Masson, Tim Mitchell, and Jessica Moss — is putting together great content that includes package execution, parent/child relationships, expressions and scripting, dynamic configurations, error handling, data flow internals, scalability and parallelism, automation, dynamic package generation, data warehousing patterns, and more!

As an aside, I am honored to announce that I have been re-awarded as Microsoft MVP for SQL Server. My husband took this picture of my trophy for me, which I think turned out really cool. :)

BCP Basics

In this blog post, I’m going to walk through the basics of BCP (bulk copy program). BCP is a utility that installs with SQL Server and can assist with large data transfers.

Let’s see what parameter options are available to use. From the command line on a machine with SQL Server installed, type “bcp” and press Enter.

[Screenshot: BCP Parameters]

You can find out more information on BCP parameters on Books Online: http://msdn.microsoft.com/en-us/library/ms162802.aspx

For now, we’re going to examine just the basics. The simplest syntax of a BCP command is:

bcp
databaseName.Schema.TableName *or* “Query”
in, out, *or* queryout
-S ServerName\instanceName
-U userName -P password *or* -T
-c *or* -n *or* specify storage information for each column

Let’s look at these options in a little more detail:

databaseName.Schema.TableName *or* Query
You can specify either an entire table to copy or a query. The query must be enclosed in quotation marks and must use the fully qualified table name.

in, out, *or* queryout
in = import, out = full table export, queryout = query to select data for export

-U userName -P password *or* -T
You can either specify a specific account to access SQL Server, or use -T to indicate Trusted Connection (i.e. Windows Authentication)

-c *or* -n *or* specify storage information for each column
-c indicates character data type, -n indicates native data type; if neither one is specified, by default you will be prompted for the data type for each column.
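
If you don’t want to be prompted for each column, another option is to generate a format file once and reuse it. Here’s a minimal sketch (the file name is illustrative); the “format nul” argument tells BCP to create only the format file:

bcp AdventureWorks.Sales.SalesOrderDetail format nul
-f C:\SalesOrderDetail.fmt -SYourServerName -T -c

You can then pass the format file to subsequent imports and exports with the -f parameter.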

Now let’s put this together and run some BCP commands. All of these examples will use the AdventureWorks 2008 sample database.

First, let’s export an entire table. To do this, we’ll use the “out” parameter.

bcp AdventureWorks.Sales.SalesOrderDetail out
C:\bcp_outputTable.txt -SYourServerName -T -c

[Screenshot: Export Table with BCP]

[Screenshot: Export Table with BCP - Results]

I don’t normally export an entire table… or at least, not in one process. So let’s walk through what it would look like to export the same table using a query. This will use the “queryout” parameter.


bcp "Select SalesOrderID, SalesOrderDetailID, OrderQty, ProductID
From AdventureWorks.Sales.SalesOrderDetail" queryout
C:\bcp_outputQuery.txt -SYourServerName -T -c

[Screenshot: Export Query with BCP]

[Screenshot: Export Query with BCP - Results]

You’ll notice that the total duration for the query was shorter than for the full-table export. This is because we’re only exporting a few of the columns. This is important to keep in mind when bcp’ing data: you’ll get better performance if you only export the data elements that you actually need.

Now that we’ve exported some data, let’s walk through the process of importing this data. First, let’s create a table with a constraint that will result in some errors.

Create Table dbo.testBCPLoad
(
      SalesOrderID          int      Not Null
    , SalesOrderDetailID    int      Not Null
    , OrderQty              smallint Null
    , ProductID             int      Null
 
    Constraint PK_testBCPLoad
        Primary Key Clustered
        (SalesOrderID)
);

Now execute the BCP import command:

bcp sandbox.dbo.testBCPLoad in
C:\bcp_outputQuery.txt -SYourServername -T -c

[Screenshot: Load Data with BCP]

[Screenshot: Load Data with BCP - Error]

You should receive a primary key violation error, and when you check the destination table in SQL Server, you should find that no rows were loaded. This is BCP’s default behavior: unless you specify otherwise, the entire import runs as a single batch, so an error rolls back every row.

[Screenshot: Check Destination Table]
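
By the way, if you need an import to survive bad rows rather than fail outright, BCP has switches for that, too. A quick sketch (the paths are illustrative):

bcp sandbox.dbo.testBCPLoad in
C:\bcp_outputQuery.txt -SYourServerName -T -c
-b 10000 -m 10 -e C:\bcp_errors.txt

-b commits the import in batches of 10,000 rows, so an error only rolls back the current batch; -m allows up to 10 data errors before the operation is canceled; and -e writes the rows BCP couldn’t transfer to an error file. Keep in mind that -m and -e deal with data conversion errors; a constraint violation like the PK error above will still fail the current batch.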

Let’s change our constraint and try the same BCP command again:

Alter Table dbo.testBCPLoad
    Drop Constraint PK_testBCPLoad;
 
Alter Table dbo.testBCPLoad
    Add Constraint PK_testBCPLoad
    Primary Key Clustered
        (SalesOrderID, SalesOrderDetailID);


bcp sandbox.dbo.testBCPLoad in
C:\bcp_outputQuery.txt -SYourServername -T -c

You should now have the data loaded into your SQL Server destination table:

[Screenshot: Import Data with BCP - Results]

[Screenshot: Destination Table]
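
To confirm the load succeeded, a quick row count does the trick:

Select Count(*) As rowsLoaded
From sandbox.dbo.testBCPLoad;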

So there you have it, the basics of BCP! :)

A few BCP tips:

  • BCP switches are case-sensitive! For example, -c (character data) and -C (code page) are entirely different options.
  • If you’re accessing the data across a WAN, perhaps via a VPN connection, try to remote desktop (mstsc) to the actual SQL Server to perform the BCP. If possible, keep the operation on the same local drive or even local network as the server; the less distance data needs to travel across a network, the faster BCP will perform.
  • If you need to copy large amounts of data (e.g. >100mm rows), try breaking the data into smaller chunks. This will help if you have an error during BCP (e.g. a PK error can roll back the entire import operation by default, although there are options that can change this behavior). When working with partitioned tables, I find it very efficient to segregate the data imported/exported by partition.
  • If you’re BCP’ing data into a new table, you can minimize impact on the server by waiting to create your indexes until after all the data is loaded.
  • I like to construct my queries in SSMS, then copy them to BCP. Since the command-line window doesn’t lend itself to editing long commands, I put my BCP command in a text file in Notepad, then save it as a .cmd file. To execute, just call the .cmd file (see the sketch below).
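
For example, a saved command file might look something like this (a hypothetical sketch; adjust the query, server name, and paths to your environment):

REM exportOrderQty.cmd
bcp "Select SalesOrderID, OrderQty From AdventureWorks.Sales.SalesOrderDetail" queryout C:\bcp_orderQty.txt -SYourServerName -T -c

Double-clicking the .cmd file, or calling it from another script, will execute the BCP command.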

Bulk Inserts with XML

Last week, I blogged about how to perform one-to-many inserts with table-valued parameters, a feature new to SQL Server 2008. For those who do not yet have 2008, or will not have it in the near future, it may still be beneficial to use XML for bulk inserts.

Here’s a pretty simple example of how to accomplish this:

/* Create some tables to work with */
CREATE TABLE dbo.orders
(
      order_id      INT IDENTITY(1,1)   Not Null
    , orderDate     DATETIME            Not Null
    , customer_id   INT                 Not Null
 
    CONSTRAINT PK_orders
        PRIMARY KEY CLUSTERED(order_id)
);
 
CREATE TABLE dbo.orderDetails
(
      orderDetail_id    INT IDENTITY(1,1)   Not Null
    , order_id          INT                 Not Null
    , lineItem          INT                 Not Null
    , product_id        INT                 Not Null
 
    CONSTRAINT PK_orderDetails
        PRIMARY KEY CLUSTERED(orderDetail_id)
 
    CONSTRAINT FK_orderDetails_orderID
        FOREIGN KEY(order_id)
        REFERENCES dbo.orders(order_id)
);
Go
 
/* Create a new procedure using an XML parameter */
CREATE PROCEDURE dbo.insert_orderXML_sp
      @orderDate        DATETIME
    , @customer_id      INT
    , @orderDetailsXML  XML
AS
BEGIN
 
    SET NOCOUNT ON;
 
    DECLARE @myOrderID INT;
 
    INSERT INTO dbo.orders
    (
          orderDate
        , customer_id    
    )
    VALUES
    (
          @orderDate
        , @customer_id
    );
 
    SET @myOrderID = SCOPE_IDENTITY();
 
    INSERT INTO dbo.orderDetails
    (
          order_id
        , lineItem
        , product_id
    )
    SELECT @myOrderID
         , myXML.value('./@lineItem', 'int')
         , myXML.value('./@product_id', 'int')
    FROM @orderDetailsXML.nodes('/orderDetail') As nodes(myXML);
 
    SET NOCOUNT OFF;
 
END
GO
 
 
/* Call our stored procedure */
EXECUTE dbo.insert_orderXML_sp
      @orderDate = '2008-01-01'
    , @customer_id = 101
    , @orderDetailsXML = 
        '<orderDetail lineItem="1" product_id="123" />
         <orderDetail lineItem="2" product_id="456" />
         <orderDetail lineItem="3" product_id="789" />
         <orderDetail lineItem="4" product_id="246" />
         <orderDetail lineItem="5" product_id="135" />';
 
 
/* Check our data */
SELECT * FROM dbo.orders;
SELECT * FROM dbo.orderDetails;
 
 
/* Clean up our mess */
DROP PROCEDURE insert_orderXML_sp;
DROP TABLE dbo.orderDetails;
DROP TABLE dbo.orders;

I’ve found that this is more efficient when performing large parent/child inserts, e.g. 1 parent record to 100 child records. Keep in mind that there’s a point where doing an insert with XML is more expensive than using a traditional INSERT stored procedure. I haven’t run any tests yet to help define what that threshold is… more to come on this in the near future.

More on the Nodes() Method can be found here in Books Online: http://msdn.microsoft.com/en-us/library/ms188282(SQL.90).aspx

Update: I’ve just learned that the “value” keyword is case-sensitive. Apparently my code box plug-in was defaulting “value” to “VALUE.” :)

Here’s the error message you’ll get if you don’t have “value” in lower-case:
Cannot find either column “myXML” or the user-defined function or aggregate “myXML.VALUE”, or the name is ambiguous.

Performance Comparison of Singleton, XML, and TVP Inserts

As promised, today I took a look at the performance of bulk inserts using XML and Table-Valued Parameters. I also compared it against singleton inserts to show the value in the bulk-insert approach.

My tests were pretty simple: insert 100 records using each method. Each test was executed 10 times to ensure consistency. The duration was recorded in microseconds.

The goal was to compare the performance of the inserts. Because I was executing this entire test within SQL Server, I had to isolate only the actual insert transactions and ignore everything else, such as the loading of the data; that work would normally be performed by the calling application.
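
For context, the singleton baseline looked conceptually like the following. This is a simplified sketch, not the actual test harness, and dbo.insert_orderDetail_sp stands in for a hypothetical one-row insert procedure:

/* Hypothetical singleton method: one insert call per row */
DECLARE @i INT = 1;

WHILE @i <= 100
BEGIN
    EXECUTE dbo.insert_orderDetail_sp  -- hypothetical one-row insert proc
          @order_id   = 1
        , @lineItem   = @i
        , @product_id = @i;

    SET @i = @i + 1;
END;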

So without further ado… screenshots of the Profiler traces:

[Screenshot: Profiler trace - Singleton Insert Method]

[Screenshot: Profiler trace - XML Method]

[Screenshot: Profiler trace - Table-Valued Parameter Method]

Summary

Method             Avg CPU   Avg Reads   Avg Writes   Avg Duration (microseconds)
Singleton Method      3         202          0           13378
XML Method            0         222          0            3124
TVP Method            1         207          0             780


As expected, both the XML and the TVP methods performed significantly better than the singleton method. As hoped, the table-valued parameter method arguably performed the best of the three.

Large Updates on Replicated Tables

Late last night, I executed an update affecting 70mm records on a replicated table. This was a somewhat low-priority update, so the primary goal (aside from the data change) was to avoid causing any issues on the publishing and subscribing servers, including replication latency errors. I have performed many large updates in the past, but this was my first on a replicated table.

To minimize impact, one of our system DBAs suggested the use of a replicated stored procedure. Basically, instead of replicating each of the 70mm updates as a singleton transaction, replication delivers a single call to the stored procedure, and the subscribing server then executes the bulk update locally. This was my first time using this technique, and it worked beautifully.
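
To make that a little more concrete, here’s roughly the shape such a procedure might take. The table and column names below are assumptions for illustration; the actual procedure would be published as a procedure-execution article so that only the EXEC call is replicated:

/* Hypothetical replicated update proc; published as a proc-execution
   article, only the EXEC is replicated, and the subscriber performs
   the same set-based update locally */
Create Procedure dbo.myReplicatedUpdateProc
      @minRecordID int
    , @maxRecordID int
As
Begin

    Set NoCount On;

    Update dbo.myLargeReplicatedTable      /* assumed table name */
    Set someColumn = 'new value'           /* assumed data change */
    Where record_id Between @minRecordID And @maxRecordID;

End;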

Another of my colleagues, Jeff M., suggested the use of a control table. Normally, I would just output the last affected ID and update the script manually. However, this also worked so well that it will be adopted in all future update scripts.

Using the following pseudo-script and a replicated stored procedure, I was able to execute the update on 70mm records in 11.5 hours with *zero* impact on the servers and replication. I could’ve probably increased the batch size and reduced the execution time even further, but as I mentioned, this was a low priority update, so there was no need to push it.

This process should work equally well on non-replicated updates; merely replace the Execute statement with the actual update.

/************ Chunked Update Script with Control Table ************/
 
/* --------------------- Preparation Script --------------------- */
 
/* Note: Execute this section in a separate query window */
 
Use yourDatabase;
Go
 
Set NoCount On;
 
/* Create a control table to facilitate tweaking of parameters */
Create Table dbo.scratch_largeUpdate_control
(
      sizeOfBatch       int
    , waitForDelay      char(8)
    , minRecordToUpdate int 
    , maxRecordToUpdate int
);
 
/* Create your control data; you only want 1 row in this table */
Insert Into dbo.scratch_largeUpdate_control 
(sizeOfBatch, waitForDelay, minRecordToUpdate, maxRecordToUpdate)
Select 10000, '00:00:05', 40297132, 107459380;
 
/* To tweak the batch size or delay mid-run, update the control table: */
Update dbo.scratch_largeUpdate_control
Set sizeOfBatch  = 100000
  , waitForDelay = '00:00:30';
 
 
/* ------------------------ Update Script ------------------------ */
Use yourDatabase;
Go
 
Set NoCount On;
 
Declare 
      @batchSize        int
    , @minID            int
    , @maxID            int
    , @procMinID        int
    , @procMaxID        int
    , @delay            char(8)
    , @statusMsg        varchar(1000);
 
Begin Try
 
    /* Safety check: only allow execution on the intended server */
    If @@ServerName Not In ('PRODSERVER')
        RaisError('Sorry, this cannot be executed here!', 16, 1);
 
    If Not Exists(
            Select object_id 
            From sys.objects 
            Where [name] = 'scratch_largeUpdate_control' 
            And type = 'U' )
        RaisError ('ERROR: Control table does not exist!', 16, 1)
            With NoWait;
    Else
        Select 
              @minID        = minRecordToUpdate 
            , @maxID        = maxRecordToUpdate 
            , @batchSize    = sizeOfBatch
            , @delay        = waitForDelay
        From dbo.scratch_largeUpdate_control With (NoLock);
 
    Set @statusMsg = 'Beginning update; batch size of ' 
        + Cast(@batchSize As varchar(10)) + ', delay of ' 
        + @delay + ' defined.  Estimate ' 
        + Cast((((@maxID - @minID) / @batchSize) + 1) As varchar(10)) 
        + ' iterations to be performed.'
 
    RaisError (@statusMsg, 10, 1) With NoWait;
 
    While @minID < @maxID
    Begin
 
        Select @procMinID = @minID
            , @procMaxID = (@minID + (@batchSize - 1));
 
        /* Execute actual update code here 
           OR 
           Call a replicated stored procedure, i.e. */
        Execute dbo.myReplicatedUpdateProc 
              @minRecordID = @procMinID
            , @maxRecordID = @procMaxID;
 
        Set @statusMsg = 
            'Updating records ' + Cast(@minID As varchar(10)) 
            + ' through ' + Cast((@minID + (@batchSize - 1)) 
            As varchar(10)) + '...';
 
        RaisError (@statusMsg, 10, 1) With NoWait;
 
        /* Update our control table with the last successfully
           updated record ID.  In the event of an error,
           we can start from here. */
        Update dbo.scratch_largeUpdate_control 
        Set minRecordToUpdate = @minID + @batchSize;
 
        Select @minID = @minID + @batchSize; 
        WaitFor Delay @delay; -- breather for the server
 
        /* Check to see if our control values have changed */
        If Not Exists(
            Select * 
            From dbo.scratch_largeUpdate_control With (NoLock) 
            Where @batchSize = sizeOfBatch And @delay = waitForDelay)
        Begin
 
            /* There was a change, so grab our new values */
            Select @batchSize = sizeOfBatch
                 , @delay = waitForDelay
            From dbo.scratch_largeUpdate_control With (NoLock)
 
            /* Print a status message with the new values */
            Set @statusMsg = 'Parameters changed:  batch size = ' 
                + Cast(@batchSize As varchar(10)) 
                + ', delay = ' + @delay;
 
            RaisError (@statusMsg, 10, 1) With NoWait;
 
        End
    End
 
    RaisError ('Success!', 10, 1) With NoWait;
 
End Try
/* Handle your errors */
Begin Catch
 
        Set @statusMsg = 'An error has occurred and the last '
                         + 'transaction has been rolled back. '
                         + 'Last record successfully updated was '
                         + 'record_id = ' 
                         + Cast((@minID - 1)
                            As varchar(10));
 
        RaisError (@statusMsg, 16, 1) With NoWait;
 
        /* Return the error message */
        Select Error_Number()
             , Error_Procedure()
             , DB_Name()
             , Error_Line()
             , Error_Message()
             , Error_Severity()
             , Error_State();
 
End Catch;
 
/* -------------------------- Clean-Up ----------------------------
Drop Table dbo.scratch_largeUpdate_control;
----------------------------------------------------------------- */