Disposable Indexes

Today I had to run an ad hoc query on a 8.5 billion row table. The table had a dozen columns of a variety of data types and was clustered on a bigint identity. There were no other indexes on the table. My query involved a join to a smaller table with a date range restriction. Without an adequate index to use, SQL Server was going to be forced to scan this 8.5 billion row table. Now, I don’t have much patience for waiting for long running queries. I want to run the ad hoc, e-mail the results, and forget about it. But short of adding a nonclustered index, which would take a very long time to build and probably require additional space requisitioned from the SAN team, what could I do? Enter disposable indexes. Now, you might be asking yourself, “What the frilly heck does she mean by a disposable index? Is that new in Denali?” No, dear reader. I am actually referring to filtered indexes, which is available in SQL Server 2008 and 2008 R2. I call them “disposable” because I create them to significantly speed up ad hoc queries, then I drop them when I’m done.

Here, allow me to demonstrate using the AdventureWorks2008R2 database. Although the tables are smaller, this query is very similar in structure to what I needed to run today.

Select Count(Distinct sod.SalesOrderID) As 'distinctCount'
From AdventureWorks2008R2.Sales.SalesOrderDetail As sod
Join AdventureWorks2008R2.Production.Product As p
    On sod.ProductID = p.ProductID
Where sod.ModifiedDate Between '2008-01-01' And '2008-07-31'
    And p.MakeFlag = 0;

Now, let’s take a look at the type of indexes we currently have available:

Select name, has_filter, filter_definition
From sys.indexes
Where object_id = object_id('Sales.SalesOrderDetail');
name                                                    has_filter filter_definition
-----------------------------------------------------------------------------------------
PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID     0          NULL
AK_SalesOrderDetail_rowguid                             0          NULL
IX_SalesOrderDetail_ProductID                           0          NULL
 
(3 row(s) affected)

We need an index on ModifiedDate and ProductID, which it doesn’t look like we have currently. Without this, we’re going to end up scanning on the clustered index. That means SQL Server will have to evaluate each and every single row in the table to see if the row matches the criteria of our query. Not pretty, and certainly not fast. So instead, let’s create a filtered index on date. But we can greatly speed up the time it takes to create our filtered index by doing a little investigating upfront and finding a range of clustering key values that will cover the query. Doing this allows SQL Server to seek on the clustered index, greatly reducing the amount of reads necessary to create our filtered index. So let’s see this in action. First, let’s find out the current max value of the table:

Select Max(SalesOrderDetailID) As 'maxID'
From AdventureWorks2008R2.Sales.SalesOrderDetail;
maxID
-----------
121317

Now we get to do a little guessing. Let’s go back and see what date we get if we look at half of the records:

Select SalesOrderDetailID, ModifiedDate
From AdventureWorks2008R2.Sales.SalesOrderDetail
Where SalesOrderDetailID = (121317/2);
SalesOrderDetailID ModifiedDate
------------------ -----------------------
60658              2007-11-01 00:00:00.000

Okay, SalesOrderDetailID 60658 gets us back to 11/1/2007. That’s a little too far. Let’s see how a SalesOrderDetailID value of 75000 does…

Select SalesOrderDetailID, ModifiedDate
From AdventureWorks2008R2.Sales.SalesOrderDetail
Where SalesOrderDetailID = 75000;
SalesOrderDetailID ModifiedDate
------------------ -----------------------
75000              2007-12-27 00:00:00.000

Okay, SalesOrderDetailID 75000 takes us back to 12/27/2007. That’s close enough to 1/1/2008 for my purposes. Of course, depending on the size of the table, in real life it may make sense to try to get closer to the value you’re looking for. But for now, this will do. And because we’re looking for data through the “current date” (7/31/2008 in the AdventureWorks2008R2 database), we already know our outer limit is 121317.

Now let’s take these ranges and create a filtered index that will cover our query:

Create Nonclustered Index IX_SalesOrderDetail_filtered
    On Sales.SalesOrderDetail(ModifiedDate, ProductID)
    Include (SalesOrderID)
    Where SalesOrderDetailID >= 75000
      And SalesOrderDetailID <  121317;

By having this range identified, SQL Server can perform a seek on the clustered index to create the nonclustered index on just the subset of records that you need for your query. Remember that 8.5 billion row table I mentioned earlier? I was able to create a filtered index that covered my query in 10 seconds. Yes, that’s right… 10 SECONDS.

The last thing we need to do is include our filtered index definition in our ad hoc query to ensure that the filtered index is used. It also doesn’t hurt to explicitly tell SQL Server to use the filtered index if you’re absolutely sure it’s the best index for the job.

Select Count(Distinct sod.SalesOrderID) As 'distinctCount'
From AdventureWorks2008R2.Sales.SalesOrderDetail As sod With (Index(IX_SalesOrderDetail_filtered))
Join AdventureWorks2008R2.Production.Product As p
    On sod.ProductID = p.ProductID
Where sod.ModifiedDate Between '2008-01-01' And '2008-07-31'
    And p.MakeFlag = 0
    And sod.SalesOrderDetailID >= 75000
    And sod.SalesOrderDetailID <  121317;

That’s all there is to it. Using this method, I was able to complete my ad hoc request in 40 seconds: 10 seconds to create the filtered index and 30 seconds to actually execute the ad hoc. Of course, it also took a couple of minutes to write the query, look at existing indexes, and search for the correct identity values. All in all, from the time I received the request to the time I send the e-mail was about 5 minutes. All because of disposable filtered indexes. How’s that for some SQL #awesomesauce? :)

0saves
If you enjoyed this post, please consider leaving a comment or subscribing to the RSS feed to have future articles delivered to your feed reader.
Tagged , , , . Bookmark the permalink.

11 Responses to Disposable Indexes

  1. Alex Kuznetsov says:

    Hey Michelle,

    I also encountered a similar problem, and here is my take at it:

    http://sqlblog.com/blogs/alexander_kuznetsov/archive/2009/07/13/optimizing-yet-another-query-that-involves-highly-correlated-columns.aspx

    What do you think?

  2. TheSQLGuru says:

    Seems like a really neat solution when you have a guarantee that column values outside the identity range chosen are guaranteed to not have values you are restricting against. Need to add this to my bag of tricks…

  3. Jason Campbell says:

    Nice!

    Will really come in useful.

    Thanks

  4. S.E. says:

    I’m just impressed. I think I have to upgrade my SQL2k to 2k8+… (hehe)
    I also have a similar problem, but the index creation fails…
    Let’s see this in work after I’ve upgraded…

    Thanks.

  5. SQL Guy says:

    Awesome! If you don’t mind me asking, how many drives support this db with 8.5 billion rows and how many partitions does it have?

  6. The data sits on a SAN. I’m not sure how the SAN is architected. We do have a weekly partitioning scheme with an annual filegroup strategy. Hope that helps. :)

  7. SQL Guy says:

    Thanks Michelle! You have a great blog with excellent info!

  8. LoveNHateSQL says:

    Great entry, thanks!

  9. Colin Nicholls says:

    Michelle, great blog, but I have a question: Why not just add a term to the original query’s WHERE clause, such as: AND SalesOrderID > 75000 AND SalesOrderID < 121317 ?

    It seems to me you can take advantage of your knowledge about the relationship between SalesOrderID and ModifiedDate directly in the original query, and skip the index altogether.

    Am I missing something?

  10. Hi Colin,

    Good question! It really depends on the type of ad hocs, and whether or not you can hit the table just once (i.e. do you KNOW what you need, or are you doing more investigative work?). But yes, you’re exactly right… in some cases you can just skip the filtered index altogether.

    Thanks for reading. :)

    Michelle

  11. I have been working with SQL Server since 2002 and have come across various professional bloggers dedicated to SQL server, this is on such blog which has very practical scenrios discussed and explained in detail, not many organisation ask questions such as Data Utilisation etc, which are quite critical now and this blog does it quite effortlessly. Kudos! and please add me to your fan base. Look forward to more such real world tips and tricks.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>