Accelerating ETL Data Retrieval

As I’ve mentioned previously, SQL Server 2012 Integration Services Design Patterns is now available on Amazon. PASS has invited the ENTIRE author team to give a pre-conference training session at Summit 2012. Precons are all-day training events and are held on the days before Summit. Our precon will be held on Monday, November 5th. I’ve attended several precons at Summit, and in my opinion, they provide unparalleled depth and access to the presenters. I expect our precon will be no different.

So what will I be presenting on? Good question. I’m going to talk about Accelerating ETL Data Retrieval. Specifically, I’ll discuss strategies for retrieving data for full loads vs. incremental loads, and I’ll explore some strategies for retrieving data from large tables. To clarify “large,” since everyone has a different definition, I’ve successfully employed these strategies on tables with 10 billion rows.

Now you might be reading this and thinking, “Hey, that doesn’t sound like SSIS!” Well… yes and no. While I will be discussing some strategies within SSIS, most of what I discuss will take place at the database layer. In fact, most of my content could be useful for any ETL tool. I still think it’s a good topic for this precon for three reasons. First, most ETL developers I talk with — that is, people who specialize in and are primarily tasked with ETL — overlook the data retrieval step. I frequently hear, “I’m pulling all of the data, so it doesn’t matter, I can’t get the data any faster than SELECT * FROM.” That’s not always true, and I’ll explain why in the precon. Second, having a thorough understanding of data retrieval techniques is important. Data retrieval is one of the most — if not the single most — expensive components of ETL. Lastly, I think there’s a huge knowledge gap in this area. While there is a lot of discussion about query optimization, it rarely has to do with the kinds of query patterns that ETL requires.
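
To give a taste of what I mean, here’s a minimal sketch of one incremental-load pattern: a watermark-driven extract. Every object name in it (dbo.SalesDetail, lastModifiedDate, etl.LoadControl) is a hypothetical placeholder, not something from an actual system; the point is simply that a bounded range retrieval can beat SELECT * FROM on every run.

/* A minimal sketch of a watermark-driven incremental extract.
   All object names are hypothetical placeholders. */
DECLARE @lastLoad DATETIME
    , @currentLoad DATETIME = GETDATE();

/* Retrieve the watermark recorded by the last successful load */
SELECT @lastLoad = lastLoadDate
FROM etl.LoadControl
WHERE tableName = 'dbo.SalesDetail';

/* Only pull the rows that have changed since then */
SELECT salesOrderID
    , productID
    , orderQty
    , lineTotal
FROM dbo.SalesDetail
WHERE lastModifiedDate >= @lastLoad
    AND lastModifiedDate < @currentLoad;

/* After a successful load, advance the watermark */
UPDATE etl.LoadControl
SET lastLoadDate = @currentLoad
WHERE tableName = 'dbo.SalesDetail';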

So that’s what I’ll be talking about and why. What do you think? Worthwhile content?

Also, this is just one of the things that will be covered in our SSIS Design Patterns precon. The rest of the author team — Andy Leonard, Matt Masson, Tim Mitchell, and Jessica Moss — is putting together great content that includes package execution, parent/child relationships, expressions and scripting, dynamic configurations, error handling, data flow internals, scalability and parallelism, automation, dynamic package generation, data warehousing patterns, and more!

As an aside, I am honored to announce that I have been re-awarded as Microsoft MVP for SQL Server. My husband took this picture of my trophy for me, which I think turned out really cool. :)

How To Estimate Data Utilization

Recently, on a conference call presenting data growth rates and database capacity projections, I had a top-line executive ask, “But how much of that data are we actually using today?” The question was met with silence; unless you have rigorous auditing in place — and kudos to you if you do — it’s a difficult question to answer. But it raises the question: is there some way to glean this information from SQL Server? I think the answer is “yes,” if you make some assumptions and understand what you’re looking at.

SQL Server collects stats about every time an index is used and how it is used (i.e. whether a user performed a seek or a scan against the index, etc.). It also provides a DMV to view these stats: sys.dm_db_index_usage_stats.
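
For a quick peek at the raw numbers, a query as simple as this (run in the context of the database you care about) will show how often each index has been touched; this is just a first look, and we’ll refine it below:

SELECT OBJECT_NAME(us.[object_id]) AS tableName
    , us.index_id
    , us.user_seeks
    , us.user_scans
    , us.user_lookups
    , us.user_updates
FROM sys.dm_db_index_usage_stats AS us
WHERE us.database_id = DB_ID()
ORDER BY us.user_seeks + us.user_scans + us.user_lookups DESC;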

This DMV provides a wealth of great information, but to answer our question of “What data is actually being used?”, we have to refine our criteria. Are we talking in terms of table counts or data size? I’d argue that data size is more important than table counts; one unqueried million-row table is more wasteful than a hundred ten-row tables.

Also, are we looking at indexes or content? From a database perspective, I’m more interested in indexes: how much space are we wasting on unused indexes? To identify this, I need to look at the activity on each individual index.

From a business perspective, I would be more interested in content (i.e. tables): how much business information is being stored that no one is even looking at? To answer this question, I need to roll up all index usage to see if *any* of the indexes on a table were used. Since both were of interest to me, I decided to write queries to answer both questions.

Lastly, we need to understand the flaws with this data. Chiefly, I cannot tell whether a user requested one row from a million-row table, or if [s]he needed all of the data in the table. This is a pretty important issue, especially with large historical data stores, and it’s where I have to make the biggest assumption: if even one person looked at one row in the table, I count all pages in the table as having been accessed.

Now, you may make different decisions than I did above, and that’s fine… each environment and project has different needs. But these assumptions are very important to understanding the output of the query below:

USE master;
GO
 
/* 
    This will give you an approximation of how much data is being utilized on a server.
    Since the data is only valid as of the last server reboot, we should start off with
    an idea of how much data we've accrued.  
*/
 
/* Find out when the server was last rebooted */
-- SQL Server 2008 and later
SELECT sqlserver_start_time FROM sys.dm_os_sys_info;
-- SQL Server 2005 (tempdb is recreated at startup, so its create_date approximates the last restart)
SELECT create_date FROM sys.databases WHERE name = 'tempdb';
 
 
/* Create a temporary table to hold our data, since we're going to iterate through databases */
IF OBJECT_ID('tempdb..#Results') IS NOT NULL
    DROP TABLE #Results;
 
CREATE TABLE #Results
(
      databaseName  NVARCHAR(128)
    , tableName     NVARCHAR(128)
    , indexID       INT
    , records       BIGINT
    , activity      BIGINT
    , totalPages    BIGINT
);
 
/*  
    sp_foreachdb was written by SQL MVP Aaron Bertrand and can be downloaded 
    at http://www.mssqltips.com/sqlservertip/2201/making-a-more-reliable-and-flexible-spmsforeachdb/
    Alternatively, you can also use sys.sp_MSforeachdb
*/
--EXECUTE master.dbo.sp_foreachdb
EXECUTE sys.sp_MSforeachdb
'   USE ?; 
 
    -- You can glean a lot of information about historical data usage from partitions
    -- but for now, we will just roll up any partitions we may have
    WITH myCTE AS
    (
        SELECT p.[object_id] AS objectID
            , p.index_id
            , SUM(p.[rows]) AS records
            , SUM(au.total_pages) AS totalPages
        FROM sys.partitions AS p WITH (NOLOCK)
        JOIN sys.allocation_units AS au WITH (NOLOCK)
            ON p.hobt_id = au.container_id
        GROUP BY p.[object_id] 
            , p.index_id
    )
 
    -- Grab all tables and join to our usage stats DMV
    INSERT INTO #Results
    SELECT DB_NAME() AS databaseName
        , t.name
        , x.index_id
        , MAX(x.records) AS records
        , ISNULL(SUM(us.user_lookups + us.user_scans + us.user_seeks), 0) AS activity
        , SUM(x.totalPages) AS totalPages
    FROM sys.tables AS t WITH (NOLOCK)
    JOIN myCTE AS x
        ON t.[object_id] = x.objectID
    LEFT JOIN sys.dm_db_index_usage_stats AS us WITH (NOLOCK)
        ON us.[object_id] = x.objectID
        AND us.index_id = x.index_id
        AND us.database_id = DB_ID()
    GROUP BY t.name
    , x.index_id;'
 
/* Because we're looping through databases, make sure we're not missing any major ones */
SELECT * FROM sys.databases WHERE name NOT IN (SELECT databaseName FROM #Results);
 
/* Retrieve actual % data utilization, which is performed at the index level */
SELECT databaseName
    , SUM(queriedPages) AS TotalQueriedPages
    , SUM(totalPages) AS TotalPages
    , CAST(SUM(queriedPages) AS FLOAT) / NULLIF(SUM(totalPages), 0) AS '%DataUtil' -- NULLIF avoids dividing by zero
FROM (
    SELECT databaseName
        , tableName
        , indexID
        , CASE -- If we have any activity at all on an index, count it as activity
            WHEN activity = 0 THEN 0.0
            ELSE totalPages
          END AS queriedPages
        , totalPages
    FROM #Results
    WHERE databaseName NOT IN ('master', 'tempdb', 'msdb', 'model')
) x
GROUP BY databaseName
ORDER BY databaseName;
 
/* Retrieve % content utilization, which is performed at the table level */
SELECT databaseName
    , SUM(queriedPages) AS TotalQueriedPages
    , SUM(totalPages) AS TotalPages
    , CAST(SUM(queriedPages) AS FLOAT) / NULLIF(SUM(totalPages), 0) AS '%ContentUtil' -- NULLIF avoids dividing by zero
FROM (
    SELECT databaseName
        , tableName
        , MAX(records) AS records
        , CASE WHEN SUM(activity) > 0 THEN SUM(totalPages) ELSE 0 END AS queriedPages
        , SUM(totalPages) AS totalPages
    FROM #Results
    WHERE databaseName NOT IN ('master', 'tempdb', 'msdb', 'model')
    GROUP BY databaseName
        , tableName
) x
GROUP BY databaseName
ORDER BY databaseName;

Results:

databaseName               TotalQueriedPages   TotalPages           %DataUtil
-------------------------- ------------------- -------------------- ----------------------
Database1                  127618701           130607247            0.969619893356378
Database2                  567188              1614958              0.351209133612143
Database3                  34269036            34579469             0.991022620966216
Database4                  137970594           170733391            0.803399928206158
Database5                  74632930            101543575            0.66909214627557
Database6                  55809933            72884205             0.765734157938039
Database7                  560810026           620609815            0.902175272517656
 
databaseName               TotalQueriedPages   TotalPages           %ContentUtil
-------------------------- ------------------- -------------------- ----------------------
Database1                  127763715           130607247            0.970721679051682
Database2                  571125              1614958              0.353646967908763
Database3                  34269036            34579469             0.991022620966216
Database4                  137970921           170733391            0.803399928206158
Database5                  96144726            101543575            0.861947682777784
Database6                  72269666            72884205             0.991568146820268
Database7                  620525938           620609815            0.998240279711804

The first result set examines the utilization of indexes, and the second result set examines the utilization of data at the content (table) level. For example, if we look at Database6, we’ll see that we are only utilizing 77% of our indexes, but we’re looking at 99% of our table data. So this is a good indicator that we have some unused indexes to clean up in that database.
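
If you want to chase down those cleanup candidates, the #Results table we just populated can list the nonclustered indexes that saw zero activity. Just remember the caveat from earlier: the stats only cover the period since the last restart, so make sure the server has been up long enough to capture your full workload, including any monthly or quarterly processes.

/* Nonclustered indexes with no recorded activity since the last restart */
SELECT databaseName
    , tableName
    , indexID
    , totalPages
    , totalPages * 8 / 1024 AS totalMB -- pages are 8 KB
FROM #Results
WHERE activity = 0
    AND indexID > 1 -- 0 = heap, 1 = clustered index
    AND databaseName NOT IN ('master', 'tempdb', 'msdb', 'model')
ORDER BY totalPages DESC;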

Know a better way to answer this question using SQL Server DMVs? Please leave me a comment so I can learn from your experience. :)

In unrelated news, this may be my last blog post for a little while. I’m due with my second child a week from today and expect all of my free time to be consumed by him for a little while. That and, quite frankly, I do not trust myself near a computer, especially a database, in such a sleep-deprived state. :)

Yet Another PASS Summit Recap & Thoughts on PDW

The SQL blogosphere has been lit up with PASS Summit recaps.

I debated whether to write my own post, until I remembered that this blog serves as a mini-journal for me too. I have a notoriously poor memory–my husband likes to say that my CPU and memory are good, but I must have an unusual clustering strategy–so maybe this blog post will be a good pointer for me when I start prepping for next year’s Summit. ;)

This was definitely the best PASS Summit conference ever. While there will always be opportunities to do things better–improvement is a never-ending process–it was clear that the organizers of this event listened to the feedback they had received the previous year. One of the best changes? Backpacks. These were very useful, as evidenced by their presence everywhere. Nice job, organizers!

My absolute favorite thing about Summit is the chance to meet and reconnect with so many amazing SQL folks. There were entirely too many people to list out, but some highlights include meeting Crys Manson, Jorge Segarra, and Karen Lopez for the first time. I also had a chance encounter with Ola Hallengren in the Sheraton elevator. Apparently we were only staying a few rooms apart this year. We ended up having a couple of really great discussions about index fragmentation, the differences between our scripts, and things we’d like to see changed in future releases of SQL Server.

I had the opportunity to sit on the panel at the WIT luncheon. All of the women on the panel were amazing, and I was honored just to be sitting at the same table as them. I was especially pleased to meet Nora Denzel, a Senior Vice President at Intuit. Intelligent, confident, and witty, she is a great role model for young technical women, myself included. I can only hope that some of her gumption rubbed off on me due to our close proximity. :) After the event, I was pleasantly surprised by how many folks–men and women both–came up to me to tell me how much they enjoyed it. Thanks to the WIT VC for organizing another great event!

The lightning talk sessions were a new feature this year, and I think I like them. The format of a lightning session is 7 speakers, each presenting on a topic for 5 quick minutes. Watching these sessions is kind of like skipping right to the center of a Tootsie Pop: all content and no fluff. The standout lightning talk for me was Adam Machanic’s. It was beautifully rehearsed and choreographed. Nice job, Adam!

Another of the many highlights of the week was meeting the Microsoft execs. In addition to meeting Ted Kummert, Mark Souza, and Donald Farmer–all very nice gentlemen–I had the opportunity to speak at length with Jose Blakely about Parallel Data Warehouse (PDW). PDW, formerly codenamed Madison, was officially launched at Summit. Jose was kind enough to explain the PDW architecture, both where it came from and the vision for where it’s going. I’d attempt to regurgitate it here, but I think the probability of me misquoting would be high.

Suffice it to say, this technology has me excited. Why? Quite frankly, I think PDW will do for data warehousing what SQL Server did for databases, and what Analysis Services did for BI: make it affordable. With a compelling cost-per-terabyte, an attractive scale-out approach, and an entry point at under $1 million, we’ll see more small-to-midsized companies implementing data warehousing and business intelligence. This is good news for those of us looking for an affordable data warehouse solution and for those of us who make our living with SQL Server. And for those of you who might suggest that few companies need a data warehouse that can support multiple terabytes of data, I’d like to point out that just 3 or 4 years ago, 100 GB was considered a lot of data.

I spent most of my week digging into the PDW architecture. It’s not all roses–it’s a first release and, as such, is immature compared to the much older and more established data warehouse systems–but again, it has a lot going for it, not least of all its easy integration within a SQL Server environment and the relatively low cost. We’re currently investigating this as a possible data warehouse solution for our business intelligence environment, so expect to see more from me about PDW as I learn more about it.

Why I’m Blogging Less

I’ve received a few questions asking why I’ve been blogging less frequently, and even one inquiry after my health. Rest assured, I’m completely fine. But there are 2 perfectly good reasons why I’ve been blogging less these days.

East Iowa SQL Saturday:

I’m the event organizer for East Iowa SQL Saturday, which is eating up a lot of my free time. If you haven’t yet heard about our SQL Saturday event, let me give you a brief overview. It’s a FREE, one-day training event geared toward SQL Server professionals and anyone who wants to learn more about SQL Server. We have 22 sessions planned covering a variety of topics, from Business Intelligence to Disaster Recovery to new SQL Server 2008 features. And if you’re a .NET developer, we also have some .NET-related presentations, including PowerShell and MVC.

We’re very fortunate to have snagged an excellent set of speakers. Jessica Moss, Louis Davidson, Timothy Ford, Jason Strate, and Alex Kuznetsov are just a few of the great speakers we have lined up.

There’s only a handful of spots left, so if you’re interested in attending, you should register soon. To find out more details about the speakers and sessions, or to register, be sure to check out our website at http://sqlsaturday.380pass.org.

The Other Reason:

Yes, that’s right, I’m with child. Expecting. Eating for two. Bun in the oven. In the family way. You get the idea.

So when I’m not at work, planning SQL Saturday, or playing Civilization Revolution, I’m sleeping. For those who remotely care, I’m due around Super Bowl time in February 2010.

2010: The Year I Make Contact

Rest assured, this blog isn’t going away. And hopefully once I get through SQL Saturday and then PASS Summit, I’ll have more free time again. :)

Getting Started with Variables in SSIS

I recently had to create an SSIS package that used variables to pass data between procs, and I thought it would make a good topic for a blog post. There are many scenarios as to the how and why to use variables in SSIS, but we’re going to keep it pretty simple. Let’s assume we have some need to retrieve data from Proc A, pass it to Proc B, and store the results in Table C. First, let’s set up our environment:

Use AdventureWorks;
Go
 
Create Procedure dbo.LastOrderGet_sp
As
Set NoCount On;
Begin
 
    Select Max(SalesOrderID) As 'LastSalesOrderID'
    From AdventureWorks.Sales.SalesOrderHeader With (NoLock)
 
    Set NoCount Off;
    Return 0;
End
Go
 
Create Procedure dbo.ProcessLastOrder_sp
    @LastOrderID int
As
Set NoCount On;
Begin
 
    Select SalesOrderDetailID
        , ProductID
        , OrderQty
        , LineTotal
    From AdventureWorks.Sales.SalesOrderDetail With (NoLock)
    Where SalesOrderID = @LastOrderID;
 
    Set NoCount Off;
    Return 0;
End
Go
 
Create Table dbo.testStage
(
      SalesOrderDetailID    int
    , ProductId             int
    , OrderQty              smallint
    , LineTotal             numeric
);
Go
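
Before we move into SSIS, it doesn’t hurt to sanity-check the procedures from Management Studio; the first call should return a single LastSalesOrderID, and the second should return the detail rows for that order:

Declare @lastOrder int;

Execute dbo.LastOrderGet_sp;

Select @lastOrder = Max(SalesOrderID)
From AdventureWorks.Sales.SalesOrderHeader With (NoLock);

Execute dbo.ProcessLastOrder_sp @lastOrder;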

Now for the fun stuff!

(Please note, I’m assuming some basic understanding of SSIS, so I’m skipping the “how to create a project”, etc. stuff and just going to the pertinent parts).

Inside BIDS (Business Intelligence Development Studio), create a new SSIS project and call it what you will. If your Variable window is not already open, open it now by going to View –> Other Windows –> Variables.

Open Variables Window

Now let’s create a variable. To do this, click on the little icon in the upper left-hand corner of the Variables window. Name the variable LastSalesOrderID.

Create a variable

After you create the variable, you should now see it in the Variables window. Make sure the scope of the variable is the name of your package, which is “Blog” in my case (for obvious reasons); this means the variable is defined at the package scope. Once you’ve confirmed that the variable exists, create an Execute SQL task.

(Variables in SSIS, like in other programming languages, can have different scopes. For instance, a package scope means the variable can be accessed anywhere within the package, but a variable with a Data Flow scope can only be accessed within the specified Data Flow task.)

Create Execute SQL Task

Double-click on your Execute SQL Task and configure with the following values:

  • Set “Result Set” to Single Row.
  • Set your Connection to your appropriate data source.
  • Set your SQL Statement to: Execute AdventureWorks.dbo.LastOrderGet_sp;
Set up your Execute SQL Task

Now click on “Result Set” and click on “Add.” You’ll want to put the name of the column that’s returned by the proc in the “Result Name” column; in our case, that’ll be LastSalesOrderID. Click on the Variable Name column and scroll down until you find the appropriate one (User::LastSalesOrderID).

Mapping the results to a variable

Go ahead and add a Data Flow task to the designer surface. We don’t need to use a Data Flow task here — for example, we could use another Execute SQL task instead — but this will help demonstrate one way to use variables.

Add Data Flow Task

Double-click on the Data Flow task and add an OLE DB Source, then double-click on it to open up the properties. Enter the following text in the “SQL Command text” window:

Execute AdventureWorks.dbo.ProcessLastOrder_sp ?

The question mark (?) tells SSIS to expect a parameter.

Edit OLE DB Source

Now click on the Parameters button on the left. This is where we map our variable to our parameter. For the “Parameters” value, enter @LastOrderID (the parameter the stored procedure is expecting). In the “Variables” column, click on the drop-down and navigate to the User::LastSalesOrderID variable.

Map Variables to Parameters

Finally, set up an OLE DB Destination, and configure the OLE DB Source to load into the testStage table.

Configure OLE DB Destination

At this point, you should be able to successfully execute your package. Upon successful execution, querying the testStage table will return the following results:

Select * From testStage;
 
SalesOrderDetailID ProductId   OrderQty LineTotal
------------------ ----------- -------- ------------------
121315             878         1        21
121316             879         1        159
121317             712         1        8

That’s all for now. Hopefully this gives you an idea of how easy and useful it is to work with variables in SSIS.

Max Parallelism on Cube Processing

The default behavior for processing cubes is to let the server determine how much parallelism to use. Let’s face it, the server must not have learned to play nice with others as a kid; as such, it doesn’t always make the best decisions. But fear not, you can help it along.

When processing a cube, click on “Change Settings…”

SSAS Cube Processing

Now change the “Maximum parallel tasks” to 1, 2, or whatever is appropriate for your environment:

SSAS Cube Processing Settings
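
If you’d rather not click through the dialog every time, you can use the Script button in that same processing dialog to capture the command as XMLA, and this setting surfaces as the MaxParallel attribute on the Parallel element. Here’s roughly what the generated script looks like; the database and cube IDs below are made-up placeholders:

<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Parallel MaxParallel="2">
    <Process>
      <Object>
        <DatabaseID>MyOlapDatabase</DatabaseID>
        <CubeID>MyCube</CubeID>
      </Object>
      <Type>ProcessFull</Type>
    </Process>
  </Parallel>
</Batch>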

That’s all there is to it. Happy cube processing!

Hello (BI) World!

For those who are not already aware, I’ve recently switched to the Business Intelligence team as an OLAP developer. I’m pretty excited about the opportunity to learn more about cubes and data mining, and I’ve decided it’s time to start sharing some of what I’m learning. My initial BI-related blog posts will probably be more entry-level topics as I learn my way around BIDS (Business Intelligence Development Studio).

To get started, I’d like to share my current favorite BI resource website: http://www.learnmicrosoftbi.com. LearnMicrosoftBI is an excellent, FREE website run by Craig Utley. Yes, you heard me right… you have to register, but the site is FREE. There are currently 38 videos on SSAS and BI-related topics, ranging from 7 minutes to 58 minutes long. I haven’t watched them all, but the ones I have watched have been helpful. If you’re trying to learn SSAS, be sure to check this site out.

Here’s a couple of other items of note: