Today I had to run an ad hoc query on a 8.5 billion row table. The table had a dozen columns of a variety of data types and was clustered on a bigint identity. There were no other indexes on the table. My query involved a join to a smaller table with a date range restriction. Without an adequate index to use, SQL Server was going to be forced to scan this 8.5 billion row table. Now, I don’t have much patience for waiting for long running queries. I want to run the ad hoc, e-mail the results, and forget about it. But short of adding a nonclustered index, which would take a very long time to build and probably require additional space requisitioned from the SAN team, what could I do? Enter disposable indexes. Now, you might be asking yourself, “What the frilly heck does she mean by a disposable index? Is that new in Denali?” No, dear reader. I am actually referring to filtered indexes, which is available in SQL Server 2008 and 2008 R2. I call them “disposable” because I create them to significantly speed up ad hoc queries, then I drop them when I’m done.
Here, allow me to demonstrate using the AdventureWorks2008R2 database. Although the tables are smaller, this query is very similar in structure to what I needed to run today.
Select Count(Distinct sod.SalesOrderID) As 'distinctCount' From AdventureWorks2008R2.Sales.SalesOrderDetail As sod Join AdventureWorks2008R2.Production.Product As p On sod.ProductID = p.ProductID Where sod.ModifiedDate Between '2008-01-01' And '2008-07-31' And p.MakeFlag = 0; |
Now, let’s take a look at the type of indexes we currently have available:
Select name, has_filter, filter_definition From sys.indexes Where object_id = object_id('Sales.SalesOrderDetail'); |
name has_filter filter_definition ----------------------------------------------------------------------------------------- PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID 0 NULL AK_SalesOrderDetail_rowguid 0 NULL IX_SalesOrderDetail_ProductID 0 NULL (3 row(s) affected) |
We need an index on ModifiedDate and ProductID, which it doesn’t look like we have currently. Without this, we’re going to end up scanning on the clustered index. That means SQL Server will have to evaluate each and every single row in the table to see if the row matches the criteria of our query. Not pretty, and certainly not fast. So instead, let’s create a filtered index on date. But we can greatly speed up the time it takes to create our filtered index by doing a little investigating upfront and finding a range of clustering key values that will cover the query. Doing this allows SQL Server to seek on the clustered index, greatly reducing the amount of reads necessary to create our filtered index. So let’s see this in action. First, let’s find out the current max value of the table:
Select Max(SalesOrderDetailID) As 'maxID' From AdventureWorks2008R2.Sales.SalesOrderDetail; |
maxID ----------- 121317 |
Now we get to do a little guessing. Let’s go back and see what date we get if we look at half of the records:
Select SalesOrderDetailID, ModifiedDate From AdventureWorks2008R2.Sales.SalesOrderDetail Where SalesOrderDetailID = (121317/2); |
SalesOrderDetailID ModifiedDate ------------------ ----------------------- 60658 2007-11-01 00:00:00.000 |
Okay, SalesOrderDetailID 60658 gets us back to 11/1/2007. That’s a little too far. Let’s see how a SalesOrderDetailID value of 75000 does…
Select SalesOrderDetailID, ModifiedDate From AdventureWorks2008R2.Sales.SalesOrderDetail Where SalesOrderDetailID = 75000; |
SalesOrderDetailID ModifiedDate ------------------ ----------------------- 75000 2007-12-27 00:00:00.000 |
Okay, SalesOrderDetailID 75000 takes us back to 12/27/2007. That’s close enough to 1/1/2008 for my purposes. Of course, depending on the size of the table, in real life it may make sense to try to get closer to the value you’re looking for. But for now, this will do. And because we’re looking for data through the “current date” (7/31/2008 in the AdventureWorks2008R2 database), we already know our outer limit is 121317.
Now let’s take these ranges and create a filtered index that will cover our query:
Create Nonclustered Index IX_SalesOrderDetail_filtered On Sales.SalesOrderDetail(ModifiedDate, ProductID) Include (SalesOrderID) Where SalesOrderDetailID >= 75000 And SalesOrderDetailID < 121317; |
By having this range identified, SQL Server can perform a seek on the clustered index to create the nonclustered index on just the subset of records that you need for your query. Remember that 8.5 billion row table I mentioned earlier? I was able to create a filtered index that covered my query in 10 seconds. Yes, that’s right… 10 SECONDS.
The last thing we need to do is include our filtered index definition in our ad hoc query to ensure that the filtered index is used. It also doesn’t hurt to explicitly tell SQL Server to use the filtered index if you’re absolutely sure it’s the best index for the job.
Select Count(Distinct sod.SalesOrderID) As 'distinctCount' From AdventureWorks2008R2.Sales.SalesOrderDetail As sod With (Index(IX_SalesOrderDetail_filtered)) Join AdventureWorks2008R2.Production.Product As p On sod.ProductID = p.ProductID Where sod.ModifiedDate Between '2008-01-01' And '2008-07-31' And p.MakeFlag = 0 And sod.SalesOrderDetailID >= 75000 And sod.SalesOrderDetailID < 121317; |
That’s all there is to it. Using this method, I was able to complete my ad hoc request in 40 seconds: 10 seconds to create the filtered index and 30 seconds to actually execute the ad hoc. Of course, it also took a couple of minutes to write the query, look at existing indexes, and search for the correct identity values. All in all, from the time I received the request to the time I send the e-mail was about 5 minutes. All because of disposable filtered indexes. How’s that for some SQL #awesomesauce? 🙂
Hey Michelle,
I also encountered a similar problem, and here is my take at it:
http://sqlblog.com/blogs/alexander_kuznetsov/archive/2009/07/13/optimizing-yet-another-query-that-involves-highly-correlated-columns.aspx
What do you think?
Seems like a really neat solution when you have a guarantee that column values outside the identity range chosen are guaranteed to not have values you are restricting against. Need to add this to my bag of tricks…
Nice!
Will really come in useful.
Thanks
I’m just impressed. I think I have to upgrade my SQL2k to 2k8+… (hehe)
I also have a similar problem, but the index creation fails…
Let’s see this in work after I’ve upgraded…
Thanks.
Awesome! If you don’t mind me asking, how many drives support this db with 8.5 billion rows and how many partitions does it have?
The data sits on a SAN. I’m not sure how the SAN is architected. We do have a weekly partitioning scheme with an annual filegroup strategy. Hope that helps. 🙂
Thanks Michelle! You have a great blog with excellent info!
Great entry, thanks!
Michelle, great blog, but I have a question: Why not just add a term to the original query’s WHERE clause, such as: AND SalesOrderID > 75000 AND SalesOrderID < 121317 ?
It seems to me you can take advantage of your knowledge about the relationship between SalesOrderID and ModifiedDate directly in the original query, and skip the index altogether.
Am I missing something?
Hi Colin,
Good question! It really depends on the type of ad hocs, and whether or not you can hit the table just once (i.e. do you KNOW what you need, or are you doing more investigative work?). But yes, you’re exactly right… in some cases you can just skip the filtered index altogether.
Thanks for reading. 🙂
Michelle
I have been working with SQL Server since 2002 and have come across various professional bloggers dedicated to SQL server, this is on such blog which has very practical scenrios discussed and explained in detail, not many organisation ask questions such as Data Utilisation etc, which are quite critical now and this blog does it quite effortlessly. Kudos! and please add me to your fan base. Look forward to more such real world tips and tricks.