Recently, on a conference call presenting data growth rates and database capacity projections, I had a top-line executive ask, “But how much of that data are we actually using today?” The question was met with silence; unless you have rigorous auditing in place — and kudos to you if you do — it’s a difficult question to answer. But it begs the question, is there some way to gleam this information from SQL Server? I think the answer is “yes,” if you make some assumptions and understand what you’re looking at.
SQL Server collects stats about every time an index is used and how it is used (i.e. whether a user seeked or scanned the index, etc.). It also provides a DMV to view these stats: sys.dm_db_index_usage_stats.
This DMV provides a wealth of great information, but to answer our question of “What data is actually being used?”, we have to refine our criteria. Are we talking in terms of table counts or data size? I’d argue that data size is more important than table counts; one unqueried millow-row table is more wasteful than a hundred ten-row tables.
Also, are we looking at indexes or content? From a database perspective, I’m more interested in indexes: how much space are we wasting on unused indexes? To identify this, I need to look at the activity on each individual index.
From a business perspective, I would be more interested in content (i.e. tables): how much business information is being stored that no one is even looking at? To answer this question, I need to roll up all index usage to see if *any* of the indexes on a table were used. Since both were of interest to me, I decided to write queries to answer both questions.
Lastly, we need to understand the flaws with this data. Chiefly, I cannot tell whether a user requested one row from a million-row table, or if [s]he needed all of the data in the table. This is a pretty important issue, especially with large historical data stores, and it’s where I have to make the biggest assumption: if even one person looked at one row in the table, I count all pages in the table as having been accessed.
Now, you may make different decisions than I did above, and that’s fine… each environment and project has different needs. But these assumptions are very important to understanding the output of the query below:
USE master; GO /* This will give you an approximation of how much data is being utilized on a server. Since the data is only valid as of the last server reboot, we should start off with an idea of how much data we've accrued. */ /* Find out when the server was last rebooted */ -- 2008 SELECT sqlserver_start_time FROM sys.dm_os_sys_info; -- 2005 SELECT create_date FROM sys.databases WHERE name = 'tempdb'; /* Create a temporary table to hold our data, since we're going to iterate through databases */ IF OBJECT_ID('tempdb..#Results') IS NOT NULL DROP TABLE #Results; CREATE TABLE #Results ( databaseName NVARCHAR(128) , tableName NVARCHAR(128) , indexID INT , records BIGINT , activity BIGINT , totalPages BIGINT ); /* sp_foreachdb was written by SQL MVP Aaron Bertrand and can be downloaded at http://www.mssqltips.com/sqlservertip/2201/making-a-more-reliable-and-flexible-spmsforeachdb/ Alternatively, you can also use sys.sp_MSforeachdb */ --EXECUTE master.dbo.sp_foreachdb EXECUTE sys.sp_MSforeachdb ' USE ?; -- You can gleam a lot of information about historical data usage from partitions -- but for now, we will just roll up any partitions we may have WITH myCTE AS ( SELECT p.[object_id] AS objectID , p.index_id , SUM(p.[rows]) AS records , SUM(au.total_pages) AS totalPages FROM sys.partitions AS p WITH (NOLOCK) JOIN sys.allocation_units AS au WITH (NOLOCK) ON p.hobt_id = au.container_id GROUP BY p.[object_id] , p.index_id ) -- Grab all tables and join to our usage stats DMV INSERT INTO #Results SELECT DB_NAME() AS databaseName , t.name , x.index_id , MAX(x.records) AS records , ISNULL(SUM(us.user_lookups + us.user_scans + us.user_seeks), 0) AS activity , SUM(x.totalPages) AS totalPages FROM sys.tables AS t WITH (NOLOCK) JOIN myCTE AS x ON t.[object_id] = x.objectID LEFT JOIN sys.dm_db_index_usage_stats AS us WITH (NOLOCK) ON us.[object_id] = x.objectID AND us.index_id = x.index_id AND us.database_id = DB_ID() GROUP BY t.name , x.index_id;' /* Because we're looping through databases, make sure we're not missing any major ones */ SELECT * FROM sys.databases WHERE name NOT IN (SELECT databaseName FROM #Results); /* Retrieve actual % data utilization, which is performed at the index level */ SELECT databaseName , SUM(queriedPages) AS TotalQueriedPages , SUM(totalPages) AS TotalPages , CAST(SUM(queriedPages) AS FLOAT) / REPLACE(SUM(totalPages), 0, 1) AS '%DataUtil' FROM ( SELECT databaseName , tableName , indexID , CASE -- If we have any activity at all on an index, count it as activity WHEN activity = 0 THEN 0.0 ELSE totalPages END AS queriedPages , totalPages FROM #Results WHERE databaseName NOT IN ('master', 'tempdb', 'msdb', 'model') ) x GROUP BY databaseName ORDER BY databaseName; /* Retrieve % content utilization, which is performed at the table level */ SELECT databaseName , SUM(queriedPages) AS TotalQueriedPages , SUM(totalPages) AS TotalPages , CAST(SUM(queriedPages) AS FLOAT) / REPLACE(SUM(totalPages), 0, 1) AS '%ContentUtil' FROM ( SELECT databaseName , tableName , MAX(records) AS records , CASE WHEN SUM(activity) > 0 THEN SUM(totalPages) ELSE 0 END AS queriedPages , SUM(totalPages) AS totalPages FROM #Results WHERE databaseName NOT IN ('master', 'tempdb', 'msdb', 'model') GROUP BY databaseName , tableName ) x GROUP BY databaseName ORDER BY databaseName;
databaseName TotalQueriedPages TotalPages %DataUtil -------------------------- ------------------- -------------------- ---------------------- Database1 127618701 130607247 0.969619893356378 Database2 567188 1614958 0.351209133612143 Database3 34269036 34579469 0.991022620966216 Database4 137970594 170733391 0.803399928206158 Database5 74632930 101543575 0.66909214627557 Database6 55809933 72884205 0.765734157938039 Database7 560810026 620609815 0.902175272517656 databaseName TotalQueriedPages TotalPages %ContentUtil -------------------------- ------------------- -------------------- ---------------------- Database1 127763715 130607247 0.970721679051682 Database2 571125 1614958 0.353646967908763 Database3 34269036 34579469 0.991022620966216 Database4 137970921 170733391 0.803399928206158 Database5 96144726 101543575 0.861947682777784 Database6 72269666 72884205 0.991568146820268 Database7 620525938 620609815 0.998240279711804
The first result set examines the utilization of indexes, and the second result set examines the utilization of data at the content (table) level. For example, if we look at Database6, we’ll see that we are only utilizing 77% of our indexes, but we’re looking at 99% of our table data. So this is a good indicator that we have some unused indexes to clean up in that database.
Know a better way to answer this question using SQL Server DMV’s? Please leave me a comment so I can learn from your experience.
In unrelated news, this may be my last blog post for a little while. I’m due with my second child a week from today and expect all of my free time to be consumed by him for a little while. That and, quite frankly, I do not trust myself near a computer, especially a database, in such a sleep-deprived state.