Tips for Large Data Stores

I’m currently working on projects that involve the collection of massive amounts of data (i.e. terabyte-class, billion-row tables). One of the challenges of collecting so much data is the ability to report on the data quickly.  To this end, I plan to spend the next several blog posts discussing tips for designing and managing large relational data stores.  These are primarily design practices that have performed well for me within the various environments in which I’ve worked.


• Partitioning is the segregation of a single logical table into separate, identical physical structures. Basically, every table is stored in at least one partition; “partitioned tables” are merely comprised of many partitions.

• Implemented correctly, partitioning can have dramatic improvement on read performance, index maintenance cost reduction, storage of large fact tables on separate disks, and data archiving.

• If you’re not familiar with partitioning, check out Kimberly Tripp’s excellent white paper, Partitioned Tables and Indexes in SQL Server 2005.

• I cannot rave enough about the impact partitioning has had within our environment. To take one of the most extreme examples of improvement, the execution time of one BI report decreased from 2.5+ hours to 20 minutes after we implemented partitioning. While not all stored procedures experienced such phenomenal improvement, improvement in the range of 30% – 60% was very common.

• While partitioning does have many benefits, there are some negatives. Namely, more expensive writes (I’ve noticed around 10%), increased duration of queries that span many partitions (i.e. queries on long time spans, such as a year), and increased maintenance needs.

• My rule of thumb is to partition any table with growth rates > 10mm records per week or with regular archiving needs.

• Much more to come on partitioning in the near future. :)


• Any experienced DBA can tell you that effective index management is critical.

• Ensure your stored procedures are using index seeks instead of scans. If scans cannot be avoided, consider using partitioning with aligned indexes to limit the amount of pages scanned.

• Periodically look for un-used indexes and remove them to improve write performance.

• Look for missing indexes, create one or two, and evaluate. Try composite indexes and included columns as a way to limit the amount of indexes you need to create.

• Defrag your indexes! Nightly, if you can get away with it.

Stored Procedure Tuning

• When querying on large amounts of data, try breaking your queries up into individual components and storing in temporary tables, then performing the joins.

• When querying on large date ranges, try looping through the days and inserting the data into a temp table. This can be especially beneficial when querying partitioned tables.

• Create indexes on your temporary tables!

• Utilize index seeks whenever possible.

• If permitted to use dirty reads and your environment’s isolation level is not defaulted to read uncommitted, use With (NoLock) hints.

• Temporary tables can be partitioned! This is usually not necessary but can sometimes be beneficial.

• Try replacing table variables with temporary tables. My rule of thumb is this: if I’m performing joins on the temporary data or if I have more than 100 records, use a temporary table. Table variables can decrease stored procedure recompiles but can sometimes have a negative impact on performance. When in doubt, test each method and evaluate which works better for your needs.

• *Always* include the partitioning key when querying partitioned tables.