Today I had to run an ad hoc query on a 8.5 billion row table. The table had a dozen columns of a variety of data types and was clustered on a bigint identity. There were no other indexes on the table. My query involved a join to a smaller table with a date range restriction. Without an adequate index to use, SQL Server was going to be forced to scan this 8.5 billion row table. Now, I don’t have much patience for waiting for long running queries. I want to run the ad hoc, e-mail the results, and forget about it. But short of adding a nonclustered index, which would take a very long time to build and probably require additional space requisitioned from the SAN team, what could I do? Enter disposable indexes. Now, you might be asking yourself, “What the frilly heck does she mean by a disposable index? Is that new in Denali?” No, dear reader. I am actually referring to filtered indexes, which is available in SQL Server 2008 and 2008 R2. I call them “disposable” because I create them to significantly speed up ad hoc queries, then I drop them when I’m done.
Here, allow me to demonstrate using the AdventureWorks2008R2 database. Although the tables are smaller, this query is very similar in structure to what I needed to run today.
Select Count(Distinct sod.SalesOrderID) As 'distinctCount' From AdventureWorks2008R2.Sales.SalesOrderDetail As sod Join AdventureWorks2008R2.Production.Product As p On sod.ProductID = p.ProductID Where sod.ModifiedDate Between '2008-01-01' And '2008-07-31' And p.MakeFlag = 0;
Now, let’s take a look at the type of indexes we currently have available:
Select name, has_filter, filter_definition From sys.indexes Where object_id = object_id('Sales.SalesOrderDetail');
name has_filter filter_definition ----------------------------------------------------------------------------------------- PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID 0 NULL AK_SalesOrderDetail_rowguid 0 NULL IX_SalesOrderDetail_ProductID 0 NULL (3 row(s) affected)
We need an index on ModifiedDate and ProductID, which it doesn’t look like we have currently. Without this, we’re going to end up scanning on the clustered index. That means SQL Server will have to evaluate each and every single row in the table to see if the row matches the criteria of our query. Not pretty, and certainly not fast. So instead, let’s create a filtered index on date. But we can greatly speed up the time it takes to create our filtered index by doing a little investigating upfront and finding a range of clustering key values that will cover the query. Doing this allows SQL Server to seek on the clustered index, greatly reducing the amount of reads necessary to create our filtered index. So let’s see this in action. First, let’s find out the current max value of the table:
Select Max(SalesOrderDetailID) As 'maxID' From AdventureWorks2008R2.Sales.SalesOrderDetail;
maxID ----------- 121317
Now we get to do a little guessing. Let’s go back and see what date we get if we look at half of the records:
Select SalesOrderDetailID, ModifiedDate From AdventureWorks2008R2.Sales.SalesOrderDetail Where SalesOrderDetailID = (121317/2);
SalesOrderDetailID ModifiedDate ------------------ ----------------------- 60658 2007-11-01 00:00:00.000
Okay, SalesOrderDetailID 60658 gets us back to 11/1/2007. That’s a little too far. Let’s see how a SalesOrderDetailID value of 75000 does…
Select SalesOrderDetailID, ModifiedDate From AdventureWorks2008R2.Sales.SalesOrderDetail Where SalesOrderDetailID = 75000;
SalesOrderDetailID ModifiedDate ------------------ ----------------------- 75000 2007-12-27 00:00:00.000
Okay, SalesOrderDetailID 75000 takes us back to 12/27/2007. That’s close enough to 1/1/2008 for my purposes. Of course, depending on the size of the table, in real life it may make sense to try to get closer to the value you’re looking for. But for now, this will do. And because we’re looking for data through the “current date” (7/31/2008 in the AdventureWorks2008R2 database), we already know our outer limit is 121317.
Now let’s take these ranges and create a filtered index that will cover our query:
Create Nonclustered Index IX_SalesOrderDetail_filtered On Sales.SalesOrderDetail(ModifiedDate, ProductID) Include (SalesOrderID) Where SalesOrderDetailID >= 75000 And SalesOrderDetailID < 121317;
By having this range identified, SQL Server can perform a seek on the clustered index to create the nonclustered index on just the subset of records that you need for your query. Remember that 8.5 billion row table I mentioned earlier? I was able to create a filtered index that covered my query in 10 seconds. Yes, that’s right… 10 SECONDS.
The last thing we need to do is include our filtered index definition in our ad hoc query to ensure that the filtered index is used. It also doesn’t hurt to explicitly tell SQL Server to use the filtered index if you’re absolutely sure it’s the best index for the job.
Select Count(Distinct sod.SalesOrderID) As 'distinctCount' From AdventureWorks2008R2.Sales.SalesOrderDetail As sod With (Index(IX_SalesOrderDetail_filtered)) Join AdventureWorks2008R2.Production.Product As p On sod.ProductID = p.ProductID Where sod.ModifiedDate Between '2008-01-01' And '2008-07-31' And p.MakeFlag = 0 And sod.SalesOrderDetailID >= 75000 And sod.SalesOrderDetailID < 121317;
That’s all there is to it. Using this method, I was able to complete my ad hoc request in 40 seconds: 10 seconds to create the filtered index and 30 seconds to actually execute the ad hoc. Of course, it also took a couple of minutes to write the query, look at existing indexes, and search for the correct identity values. All in all, from the time I received the request to the time I send the e-mail was about 5 minutes. All because of disposable filtered indexes. How’s that for some SQL #awesomesauce?