Partitioning 101

A few posts ago, I briefly discussed partitioning. This article will continue that discussion and focus on horizontal partitioning within SQL Server 2005.

What is partitioning?

As I’ve mentioned before, horizontal partitioning is the physical segregation of a single logical table name into multiple, *identical* physical structures. In SQL 2005, all table and index pages actually exist on a partition. However, if you’ve never created a partitioning scheme, then by default each of your tables and indexes contain just a single partition.

I’ll also be talking about aligned indexes and tables, so let’s go ahead and define that here. Two tables sharing the same partitioning function and the same partitioning key are aligned. Similarly, a partitioned index on a partitioned table sharing the same partitioning key is called an aligned index. Having tables and indexes in alignment can reduce CPU and improve parallel plan execution.

Why partitioning?

The two most common reasons to implement partitioning are performance improvement and ease of archiving. Let’s talk about performance improvement first. We made the decision to implement partitioning in one of our data warehouses because of the size of the tables. We had a couple of tables that exceeded a billion rows, and several tables that were fast approaching that line. Upon deploying partitioning, CPU utilization dropped by ~25%, and our regular CPU and disk spikes caused by reporting processes were all but eliminated. Many of our stored procedures experienced reduced durations and reads in the range of 30-60%, and some had even greater improvements.

Why were we seeing this improvement? Many of the stored procedures were accessing data for recent dates. Knowing this helped us to decide on a weekly partitioning scheme. Stored procedures joining on aligned tables were now only accessing a small portion of the data, resulting in the aforementioned improvements. In addition, ad hocs that performed index scans now completed in a fraction of the time because the amount of data being scanned was significantly reduced.

Table maintenance also saw drastic improvement. It was previously unheard of to attempt an online index defrag on some of the largest tables without first scheduling down-time. Now, these same tables were being defragged nightly in one minute or less, without issue. This is because index defrag operations can be performed by partition.  This is a great resource save; obviously, once a partition is no longer written to and is fully defragmented, it will not become re-fragmented, so there is no need to waste resources on it. Refer to my Index Defrag Script for an example on how to automate this process.

Another great feature of partitioned tables is the ability to swap partitions in and out. I took advantage of this feature with great success when backfilling data. My backfill strategy primarily consisted of bcp’ing data in weekly chunks to a staging database. The table I was backfilling had a text column where one record happened to contain the column deliminator for the bcp file. This of course wasn’t caught until auditing data counts prior to deployment, when it was noticed that one week was short about ~2mm records. Now, this was ~2mm records out of ~800mm total records. The solution? Re-run the bcp process, using a different deliminator, dump the data in a staging table, and swap out the partition. The actual swap out took 2 seconds and there was no impact to the server. It was beautiful.

Swapping out partitions is particularly beneficial for archiving. Instead of writing a batch process to loop through and delete expired data, which is both time consuming and resource intensive, a partition can be swapped out and dropped in a matter of seconds. Keep in mind that, in SQL Server 2005, partitions can only be swapped out on non-replicated tables. SQL 2008 introduces functionality to allow partition swapping on a replicated table.  Check out Replicating Partitioned Tables and Indexes.

That sounds great! Where do I sign up?

There are a few things to consider prior to implementing partitioning in your environment. One is the performance hit to writes. I’ve noticed around a 10% decrease in write performance in my environment. This is partly due to the overhead required to look up which partition the record should be written to. For this reason, I only partition tables that I expect to grow quickly (my rule of thumb is >10mm records per week), or tables that have regular archiving needs.

There’s also a performance hit when querying over many partitions. Basically, think of a giant UNION ALL statement that occurs every time more than one partition is accessed. With our production hardware, we noticed that duration was still less until you hit around 30 partitions. After that, the duration was greater for partitioned tables versus non-partitioned, but reads were still less overall. This will probably vary depending on your environment, so run some tests and make sure you know what kind of impact to expect. This will also help you to determine your partitioning scheme. For example, you wouldn’t want to partition on a date column using a weekly scheme if every report needs to access data for the previous 12 months.

Also make sure you have a good candidate for partitioning. That is, do not partition on a column that is frequently changed or is written non-sequentially. Both can cause overhead to your environment and may negate the benefits of partitioning. Keep in mind that when the value of the partitioning key is changed, the record is not relocated to the correct partition. Some ideal keys include surrogate identity and sequential datetime columns.

Other things to know…

• The partitioning function requires the declaration of a parameter data type. To partition a table, the partitioning key is defined during new table creation. The data types of the partitioning keys must match. For example, if you create a partitioning function using a datetime data type and try to apply it to a table with a smalldatetime column as the partitioning key, the CREATE TABLE statement will fail.

• Ensure that the partitioning key is *always* used in a query on a partitioned table. If the partitioning key is not included in the query search criteria, then the partitioned table will perform worse than a non-partitioned table.

• By default, any index created on a partitioned table is also partitioned. This is useful for creating aligned indexes, which are typically desired. Sometimes, however, you may want to create a non-aligned or non-partitioned index. In these cases, make sure to explicitly declare the desired filegroup during the index creation process.

• Temporary tables can also be partitioned; however, this would probably not make sense in most circumstances.

• Partitioning functions need to be monitored to ensure you don’t run out of defined ranges; if that were to happen, you’d be storing all new data in one really large partition, which wouldn’t be pleasant.

Thanks for wading through this long post. I hope the information contained proves informative and helpful. I’ll be following this post with some actual code to help illustrate some of what I’ve covered here.

Regards,

Michelle

0saves
If you enjoyed this post, please consider leaving a comment or subscribing to the RSS feed to have future articles delivered to your feed reader.
Tagged . Bookmark the permalink.

6 Responses to Partitioning 101

  1. Pingback: SQL Fool

  2. Pingback: Partitioning Tricks : SQL Fool

  3. Venky Subramaniam says:

    Hi Michelle,
    Interesting article…though I am not sure I understood certain words and abbreviations (for example, I assume that the word you used as ‘deliminator’ is delimiter??..And ‘mm’ against the number of records is basically meaning millions?). Sorry to sound obtuse, but maybe the definitions and abbreviations have changed and I am not aware :-) . However, with regard to partitioning I have a few questions and i hope that you can shed some light…
    ——————————————————————————————–
    I have partitioned 5 large tables across 10 partitions (with 10 filegroups) and with a set of 5 – 6 users the performance seems to be adequate, with minimal conflicts (like deadlocks etc), but I want to implement this scheme that can scale to say 300+ users. With regard to this I have the following questions:

    – What is the maximum number of partitions that works optimally? In other words, even though SQL Server allows 1000 partitions per table, it would become an administrative nightmare were we to deploy such a scheme.

    – **Degradation Concerns**: What would be the maximum (but practical) number of partitions that one can deploy before one has to worry about degradation in performance and concurrency?

    – If the number of users access the same partition, what would be the negative impact on the overall scheme (in terms, again of performance)?

    – Does a multi-CPU environment enhance the overall performace? Of course, we can suggest to the client to use RAID based storage with data striped across multiple drives?

    – Any other aspects related to partitioning that can help me for implementation?

    MORE DETAILS that might help you hone on the answers if any :

    1. The largest table holds about 50 million rows as of now (but this is only a test db..) In other words it can grow to a billion.

    2. As of now I have about 10 partitions defined out of which only the first 6 are occupied. The number 10 is not a holy number or restrictive in any sense. I have just chosen this number to be at a manageable level.

    3. The data density in other 4 tables can vary from 1 – 2 million.

    4. Since the tables involved participate in JOIN operations, I have currently used the same partition function for all and the data distribution is based on a ‘derived’ partition_column called partition number. This way the associated tables are in the same partition. Each partition is associated with one filegroup that comprises of many files.

    5. I have also aligned the indexes for the tables across the same partitions.

    6. As for the available hardware, I need to contact the client as to what they have and whether they can upgrade if needed. I needed a ballpark infrastructure that I can use to recommend. In other words, at this juncture, I cannot state with confidence how powerful their servers are, but given a recommendation about the requirements, I am sure they would comply. I DO know that their servers are multi-CPU.

    7. As for the RAID configuration to be suggested, would RAID 1 and RAID 10 serve the purpose? Both configurations conform to redundancy

    Does this information suffice or do I need to provide more?

    Please come back with your suggestions or questions and I shall try to provide more information if needed. While I can find several articles of partitioning (even from Microsoft MVPs), I would like answers in particular, to address my concerns expressed above.

    With best regards,
    Venky

  4. PREM says:

    Excellent

  5. Pingback: Partitioning 101 | Sladescross's Blog

  6. Pingback: Good Compression and Bad Partitioning | SQLCowbell

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>