Accelerating ETL Data Retrieval

As I’ve mentioned previously, SQL Server 2012 Integration Services Design Patterns is now available on Amazon. PASS has invited the ENTIRE author team to give a pre-conference training session at Summit 2012. Precons are all-day training events and are held on the days before Summit. Our precon will be held on Monday, November 5th. I’ve attended several precons at Summit, and in my opinion, they provide unparalleled depth and access to the presenters. I expect our precon will be no different.

So what will I be presenting on? Good question. I’m going to talk about Accelerating ETL Data Retrieval. Specifically, I’ll discuss strategies for retrieving data for full loads vs. incremental loads, and I’ll explore some strategies for retrieving data from large tables. To clarify “large,” since everyone has a different definition, I’ve successfully employed these strategies on tables with 10 billion rows.

Now you might be reading this and thinking, “Hey, that doesn’t sound like SSIS!” Well… yes and no. While I will be discussing some strategies within SSIS, most of what I discuss will take place at the database layer. In fact, most of my content could be useful for any ETL tool. I still think it’s a good topic for this precon for three reasons. First, most ETL developers I talk with — that is, people who specialize in and are primarily tasked with ETL — overlook the data retrieval step. I frequently hear, “I’m pulling all of the data, so it doesn’t matter, I can’t get the data any faster than SELECT * FROM.” That’s not always true, and I’ll explain why in the precon. Secondly, having a thorough understanding of data retrieval techniques is important. Data retrieval is one of the most — if not the single most — expensive components of ETL. Lastly, I think there’s a huge knowledge gap in this area. While there is a lot of discussion about query optimization, it rarely has to do with the type of query patterns that ETL requires.

So that’s what I’ll be talking about and why. What do you think? Worthwhile content?

Also, this is just one of the things that will be covered in our SSIS Design Patterns precon. The rest of the author team — Andy Leonard, Matt Masson, Tim Mitchell, and Jessica Moss — is putting together great content that includes package execution, parent/child relationships, expressions and scripting, dynamic configurations, error handling, data flow internals, scalability and parallelism, automation, dynamic package generation, data warehousing patterns, and more!

As an aside, I am honored to announce that I have been re-awarded as Microsoft MVP for SQL Server. My husband took this picture of my trophy for me, which I think turned out really cool. :)

SQL Agent Job Script

I’ve been working on a script for a demo in my upcoming 24 Hours of PASS presentation, which I thought I’d share on my blog. In the presentation, I use this script to demonstrate how to execute one or more queries dynamically against a list of SQL Server instances.

The script itself explores SQL Agent Job metadata to get job statuses — when the job last ran, when it will run again, an aggregate count of the number of successful and failed executions in the queried time period, T-SQL code to disable the job, etc. I’ve only tested this in a handful of DEV and PROD environments, so please tell me if you experience any issues in your environment when running the script. As with most of my scripts, this will only work on SQL Server 2005 and newer.

I was planning to send the SQL Server development team a case of #bacon had they finally fixed the sysjob% date and time columns in SQL Server 2012, but alas, it seems they shall sadly remain pork-free. Credit for the logic that handles the time conversion goes to some awesome community member who posted it on the MSDN forum. I wish I had the URL, but it was a long time ago.

DECLARE @jobHistory TABLE
(
      job_id                UNIQUEIDENTIFIER
    , success               INT
    , cancel                INT
    , fail                  INT
    , retry                 INT
    , last_execution_id     INT
    , last_duration         CHAR(8)
    , last_execution_start  DATETIME
);
 
WITH lastExecution
AS
(
    SELECT job_id
    , MAX(instance_id) AS last_instance_id
FROM msdb.dbo.sysjobhistory
WHERE step_id = 0
GROUP BY job_id
)
 
INSERT INTO @jobHistory
SELECT sjh.job_id
    , SUM(CASE WHEN sjh.run_status = 1 AND step_id = 0 THEN 1 ELSE 0 END) AS success
    , SUM(CASE WHEN sjh.run_status = 3 AND step_id = 0 THEN 1 ELSE 0 END) AS cancel
    , SUM(CASE WHEN sjh.run_status = 0 AND step_id = 0 THEN 1 ELSE 0 END) AS fail
    , SUM(CASE WHEN sjh.run_status = 2 THEN 1 ELSE 0 END) AS retry
    , MAX(CASE WHEN sjh.step_id = 0 THEN instance_id ELSE NULL END) last_execution_id
    , SUBSTRING(CAST(MAX(CASE WHEN le.job_id IS NOT NULL THEN sjh.run_duration ELSE NULL END) + 1000000 AS VARCHAR(7)),2,2) + ':' 
            + SUBSTRING(CAST(MAX(CASE WHEN le.job_id IS NOT NULL THEN sjh.run_duration ELSE NULL END) + 1000000 AS VARCHAR(7)),4,2) + ':' 
            + SUBSTRING(CAST(MAX(CASE WHEN le.job_id IS NOT NULL THEN sjh.run_duration ELSE NULL END) + 1000000 AS VARCHAR(7)),6,2)
            AS last_duration
    , MAX(CASE WHEN le.last_instance_id IS NOT NULL THEN 
        CONVERT(datetime, RTRIM(run_date)) 
        + ((run_time / 10000 *  3600) 
        + ((run_time % 10000) / 100 * 60) 
        + (run_time  % 10000) % 100) / (86399.9964)
      ELSE '1900-01-01' END) AS last_execution_start 
FROM msdb.dbo.sysjobhistory AS sjh
LEFT JOIN lastExecution     AS le
    ON sjh.job_id = le.job_id
   AND sjh.instance_id = le.last_instance_id
GROUP BY sjh.job_id;
 
/* We need to parse the schedule into something we can understand */
DECLARE @weekDay TABLE (
      mask          INT
    , maskValue     VARCHAR(32)
);
 
INSERT INTO @weekDay
SELECT 1, 'Sunday'      UNION ALL
SELECT 2, 'Monday'      UNION ALL
SELECT 4, 'Tuesday'     UNION ALL
SELECT 8, 'Wednesday'   UNION ALL
SELECT 16, 'Thursday'   UNION ALL
SELECT 32, 'Friday'     UNION ALL
SELECT 64, 'Saturday';
 
 
/* Now let's get our schedule information */
WITH myCTE
AS(
    SELECT sched.name AS 'scheduleName'
        , sched.schedule_id
        , jobsched.job_id
        , CASE 
            WHEN sched.freq_type = 1 
                THEN 'Once' 
            WHEN sched.freq_type = 4 
                AND sched.freq_interval = 1 
                    THEN 'Daily'
            WHEN sched.freq_type = 4 
                THEN 'Every ' + CAST(sched.freq_interval AS VARCHAR(5)) + ' days'
            WHEN sched.freq_type = 8 THEN 
                REPLACE( REPLACE( REPLACE(( 
                    SELECT maskValue 
                    FROM @weekDay AS x 
                    WHERE sched.freq_interval & x.mask <> 0 
                    ORDER BY mask FOR XML RAW)
                , '"/><row maskValue="', ', '), '<row maskValue="', ''), '"/>', '') 
                + CASE 
                    WHEN sched.freq_recurrence_factor <> 0 
                        AND sched.freq_recurrence_factor = 1 
                            THEN '; weekly' 
                    WHEN sched.freq_recurrence_factor <> 0 
                        THEN '; every ' 
                + CAST(sched.freq_recurrence_factor AS VARCHAR(10)) + ' weeks' END
            WHEN sched.freq_type = 16 THEN 'On day ' 
                + CAST(sched.freq_interval AS VARCHAR(10)) + ' of every '
                + CAST(sched.freq_recurrence_factor AS VARCHAR(10)) + ' months' 
            WHEN sched.freq_type = 32 THEN 
                CASE 
                    WHEN sched.freq_relative_interval = 1 THEN 'First'
                    WHEN sched.freq_relative_interval = 2 THEN 'Second'
                    WHEN sched.freq_relative_interval = 4 THEN 'Third'
                    WHEN sched.freq_relative_interval = 8 THEN 'Fourth'
                    WHEN sched.freq_relative_interval = 16 THEN 'Last'
                END + 
                CASE 
                    WHEN sched.freq_interval = 1 THEN ' Sunday'
                    WHEN sched.freq_interval = 2 THEN ' Monday'
                    WHEN sched.freq_interval = 3 THEN ' Tuesday'
                    WHEN sched.freq_interval = 4 THEN ' Wednesday'
                    WHEN sched.freq_interval = 5 THEN ' Thursday'
                    WHEN sched.freq_interval = 6 THEN ' Friday'
                    WHEN sched.freq_interval = 7 THEN ' Saturday'
                    WHEN sched.freq_interval = 8 THEN ' Day'
                    WHEN sched.freq_interval = 9 THEN ' Weekday'
                    WHEN sched.freq_interval = 10 THEN ' Weekend'
                END
                + CASE 
                    WHEN sched.freq_recurrence_factor <> 0 
                        AND sched.freq_recurrence_factor = 1 
                            THEN '; monthly'
                    WHEN sched.freq_recurrence_factor <> 0 
                        THEN '; every ' 
                + CAST(sched.freq_recurrence_factor AS VARCHAR(10)) + ' months' 
                  END
            WHEN sched.freq_type = 64   THEN 'StartUp'
            WHEN sched.freq_type = 128  THEN 'Idle'
          END AS 'frequency'
        , ISNULL('Every ' + CAST(sched.freq_subday_interval AS VARCHAR(10)) + 
            CASE 
                WHEN sched.freq_subday_type = 2 THEN ' seconds'
                WHEN sched.freq_subday_type = 4 THEN ' minutes'
                WHEN sched.freq_subday_type = 8 THEN ' hours'
            END, 'Once') AS 'subFrequency'
        , REPLICATE('0', 6 - LEN(sched.active_start_time)) 
            + CAST(sched.active_start_time AS VARCHAR(6)) AS 'startTime'
        , REPLICATE('0', 6 - LEN(sched.active_end_time)) 
            + CAST(sched.active_end_time AS VARCHAR(6)) AS 'endTime'
        , REPLICATE('0', 6 - LEN(jobsched.next_run_time)) 
            + CAST(jobsched.next_run_time AS VARCHAR(6)) AS 'nextRunTime'
        , CAST(jobsched.next_run_date AS CHAR(8)) AS 'nextRunDate'
    FROM msdb.dbo.sysschedules      AS sched
    JOIN msdb.dbo.sysjobschedules   AS jobsched
        ON sched.schedule_id = jobsched.schedule_id
    WHERE sched.enabled = 1
)
 
/* Finally, let's look at our actual jobs and tie it all together */
SELECT CONVERT(NVARCHAR(128), SERVERPROPERTY('Servername'))             AS [serverName]
    , job.job_id                                                        AS [jobID]
    , job.name                                                          AS [jobName]
    , CASE WHEN job.enabled = 1 THEN 'Enabled' ELSE 'Disabled' END      AS [jobStatus]
    , COALESCE(sched.scheduleName, '(unscheduled)')                     AS [scheduleName]
    , COALESCE(sched.frequency, '')                                     AS [frequency]
    , COALESCE(sched.subFrequency, '')                                  AS [subFrequency]
    , COALESCE(SUBSTRING(sched.startTime, 1, 2) + ':' 
        + SUBSTRING(sched.startTime, 3, 2) + ' - ' 
        + SUBSTRING(sched.endTime, 1, 2) + ':' 
        + SUBSTRING(sched.endTime, 3, 2), '')                           AS [scheduleTime] -- HH:MM
    , COALESCE(SUBSTRING(sched.nextRunDate, 1, 4) + '/' 
        + SUBSTRING(sched.nextRunDate, 5, 2) + '/' 
        + SUBSTRING(sched.nextRunDate, 7, 2) + ' ' 
        + SUBSTRING(sched.nextRunTime, 1, 2) + ':' 
        + SUBSTRING(sched.nextRunTime, 3, 2), '')                       AS [nextRunDate]
      /* Note: the sysjobschedules table refreshes every 20 min, so nextRunDate may be out of date */
    , COALESCE(jh.success, 0)                                           AS [success]
    , COALESCE(jh.cancel, 0)                                            AS [cancel]
    , COALESCE(jh.fail, 0)                                              AS [fail]
    , COALESCE(jh.retry, 0)                                             AS [retry]
    , COALESCE(jh.last_execution_id, 0)                                 AS [lastExecutionID]
    , jh.last_execution_start                                           AS [lastExecutionStart]
    , COALESCE(jh.last_duration, '00:00:01')                            AS [lastDuration]
    , 'EXECUTE msdb.dbo.sp_update_job @job_id = ''' 
        + CAST(job.job_id AS CHAR(36)) + ''', @enabled = 0;'            AS [disableSQLScript]
FROM msdb.dbo.sysjobs               AS job
LEFT JOIN myCTE                     AS sched
    ON job.job_id = sched.job_id
LEFT JOIN @jobHistory               AS jh
    ON job.job_id = jh.job_id
WHERE job.enabled = 1 -- do not display disabled jobs
    --AND jh.last_execution_start >= DATEADD(day, -1, GETDATE()) /* Pull just the last 24 hours */
ORDER BY nextRunDate;

Example Output:

serverName       jobID                                 jobName                      jobStatus scheduleName                      frequency  subFrequency   scheduleTime   nextRunDate       success  cancel  fail  retry  lastExecutionID  lastExecutionStart       lastDuration  disableSQLScript
---------------- ------------------------------------- ---------------------------- --------- --------------------------------- ---------- -------------- -------------- ----------------- -------- ------- ----- ------ ---------------- ------------------------ ------------- ----------------------------------------------------------------------------------------------
SQLFOOL\SQL2012  180BFD8E-AE0C-44F8-992E-27522611992A  DW FactOrders Update         Enabled   Every 4 Hours                     Daily      Every 4 hours  00:00 - 23:59  2012/09/19 20:00  12       0       0     0      84               2012-09-17 14:00:01.000  00:00:04      EXECUTE msdb.dbo.sp_update_job @job_id = '180BFD8E-AE0C-44F8-992E-27522611992A', @enabled = 0;
SQLFOOL\SQL2012  3470C9E5-A2CD-454A-89A1-DEF55FF186D3  SSIS Server Maintenance Job  Enabled   SSISDB Scheduler                  Daily      Once           00:00 - 12:00  2012/09/20 00:00  3        0       0     0      68               2012-09-17 11:50:51.000  00:00:01      EXECUTE msdb.dbo.sp_update_job @job_id = '3470C9E5-A2CD-454A-89A1-DEF55FF186D3', @enabled = 0;
SQLFOOL\SQL2012  F965B24D-60EB-4B95-91B6-C7D66057A883  syspolicy_purge_history      Enabled   syspolicy_purge_history_schedule  Daily      Once           02:00 - 23:59  2012/09/20 02:00  3        0       1     1      70               2012-09-17 11:50:51.000  00:01:24      EXECUTE msdb.dbo.sp_update_job @job_id = 'F965B24D-60EB-4B95-91B6-C7D66057A883', @enabled = 0;

See you in Seattle!

My Summit abstract was accepted! I’m still a little surprised, but I’m also excited (okay, and a little nervous) to once more be presenting at the PASS Summit. If you’ll be at Summit this year — and I really hope you are, as it’s well worth the time and cost — then please make sure to say “hi” if you see me wandering around. Aside from the *excellent* content, my favorite thing about Summit is getting to meet so many great people. :)

In other news, I’ve once more switched roles within GoDaddy. For the half dozen folks who’ve been following my blog from the beginning, you may remember that I originally started out on the traffic team working with tuning and VLDB’s, then took an opportunity to switch to the BI team to learn more about OLAP. Recently, a new team has been formed under the BI branch that’s tasked with developing a massive hybrid data warehouse (by hybrid, I mean half OLTP and half OLAP). “How massive is it?” Well, it’s SO massive, we’re expecting to be store petabytes of data when everything is said and done. I’m happy to say I’ll be on this new team. So yes, that means we have an opening for an OLAP developer. We’re also hiring SQL Server DBA’s. We have offices in Cedar Rapids, Denver, and the Phoenix area. Send me an e-mail at michelle at sqlfool dot com if you’re interested in learning more about this great job opportunity and company.

Lastly, I want to announce that SQL Saturday 50 is now open for registration! SQL Saturday 50 will be held in Iowa City, IA on Saturday, September 18th. We’re almost at 50% of our attendance capacity, so if you’re interested in attending, please register soon.

That’s it for now. I promise that my next blog post will be uber technical. :)

SQL Saturday #50 – Call for Speakers

The Call for Speakers is now open for SQL Saturday #50, the East Iowa SQL Saturday event! This is our second time hosting a SQL Saturday, and we’re hoping to build upon the success of last year’s event. We’re looking for a wide variety of topics on SQL Server and related technologies (i.e. PowerShell, R2, LINQ, etc.). We also have had several requests for intro-level topics, such as beginning disaster recovery and basic performance tuning. If you’re even remotely thinking about speaking, please submit an abstract!

Last year we had about 100 folks attend from surrounding areas. This year, we’re shooting for 125 attendees, which would max out our facility’s capacity. Not sure how far away Iowa City is? It may be closer than you think. Allow me to rehash my travel times from last year’s plea for speakers:

  • Chicago – 3.5 hours
  • Omaha – 3.5 hours
  • Milwaukee – 4 hours
  • Kansas City – 4.5 hours
  • Minneapolis – 5 hours
  • St. Louis – 5 hours
  • Indianapolis – 6 hours

The event will be held on September 18th at the University of Iowa in Iowa City. You can find more information, including an abstract submission form, on our event website at http://sqlsaturday.com/50/eventhome.aspx.

Oh, and if you do make it to our SQL Saturday event, please make sure to stop me and say “hi!” :)

Summit 2010 Abstract Submission

Jeremiah Peschka just tweeted about the looming deadline to submit an abstract for Summit 2010. I’ve been trying to think of a good topic to present on, and this finally got my butt in gear to submit one! For those interested, here’s what I submitted:

Heaps of Trouble, Clusters of Glory – A Look At Index Internals

Indexes are a crucial component of SQL Server, especially in performance tuning, yet many DBA’s don’t fully understand how indexes work. In this in-depth session, Michelle will examine the anatomy of indexes, from how they’re stored to how they get fragmented. And to make our inner geeks happy, she’ll look at page data to show you what’s happening behind the scenes. Topics covered will include index structure, fragmentation and defragmentation, index partitioning, and index filtering. If you’ve ever wondered just *what* goes on in an index, don’t miss this session!

Special thanks to Chris for help with the session title. :)

Why I’m Blogging Less

I’ve received a few questions asking why I’ve been blogging less frequently, and even one inquiry after my health. Rest assured, I’m completely fine. But there are 2 perfectly good reasons why I’ve been blogging less these days.

East Iowa SQL Saturday:

I’m the event organizer for East Iowa SQL Saturday, which is eating up a lot of my free time. If you haven’t yet heard about our SQL Saturday event, let me give you a brief overview. It’s a FREE, one-day training event geared toward SQL Server professionals and anyone who wants to learn more about SQL Server. We have 22 sessions planned covering a variety of topics, from Business Intelligence to Disaster Recovery to SQL Server 2008 topics. And if you’re a .NET developer, we also have some .NET-related presentations, including PowerShell and MVC.

We’re very fortunate to have snagged an excellent set of speakers. Jessica Moss, Louis Davidson, Timothy Ford, Jason Strate, and Alex Kuznetsov are just a few of the great speakers we have lined up.

There’s only a handful of spots left, so if you’re interested in attending, you should register soon. To find out more details about the speakers and sessions, or to register, be sure to check out our website at http://sqlsaturday.380pass.org.

The Other Reason:

baby_uff

Yes, that’s right, I’m with child. Expecting. Eating for two. Bun in the oven. In the family way. You get the idea.

So when I’m not at work, planning SQL Saturday, or playing Civilization Revolution, I’m sleeping. For those who remotely care, I’m due around Super Bowl time in February 2010.

2010: The Year I Make Contact

2010: The Year I Make Contact

Rest assured, this blog isn’t going away. And hopefully once I get through SQL Saturday and then PASS Summit, I’ll have more free time again. :)

PASS Summit 2009

My abstract for PASS Summit 2009 was accepted! Woot! You may not be able to see it from where you’re sitting, but I’m doing the happy dance. :)

In case you missed my original post on my abstract submission, here’s what I’ll be presenting on:

Super Bowl, Super Load – A Look at Performance Tuning for VLDB’s

Few DBA’s have the opportunity to experience a real-life load test in their production environment. Michelle Ufford works for GoDaddy.com, a company that has experienced phenomenal success with its Super Bowl ads. These ads are designed to drive traffic to the company’s websites, which puts the database servers under high load. In her presentation, Michelle will explore the performance tuning techniques that have resulted in an 80% reduction in server response times and allowed her VLDB’s to reach rates of 27k transactions per second. Topics will include vertical and horizontal partitioning, bulk operations, table design, and indexing.

Do you read my blog? Do I read yours? Do we exchange weird messages on Twitter? Do you have free cookies? If you’re going to to the PASS Summit and answered “yes” to any of these questions, then I want to meet you! Make sure to say “hi” to me in Seattle. :)

East Iowa SQL Saturday – Call for Speakers!

The Call for Speakers is now open for the East Iowa SQL Saturday! This is our first time hosting a SQL Saturday, and there’s a lot of excitement and interest from our local SQL Server folks. There’s some interest from local speakers, but we’ll probably also need to pull in speakers from outside of Cedar Rapids and Iowa City to support the number of tracks and attendees we plan to have.

The event will be held on October 10th, 2009 at the University of Iowa in Iowa City. Special thanks to Russ Allen for his help with securing a location for us!

For those who are not in the immediate area but would be willing to travel, here’s some general travel times from major cities in the area:

  • Chicago – 3.5 hours
  • Omaha – 3.5 hours
  • Milwaukee – 4 hours
  • Kansas City – 4.5 hours
  • Minneapolis – 5 hours
  • St. Louis – 5 hours
  • Indianapolis – 6 hours
  • Columbus, OH – 9 hours (hint, hint, Jeremiah!)

So if you’re in the general area, please, PLEASE consider speaking at our SQL Saturday. Pretty please with sugar on top. :)

You can find out a little more information and submit sessions at our SQL Saturday website.