SQL Server Columnstore Indexes
This discussion is a starting point for providing the inital platform for knowledge sharing & discussion of various aspects regarding newly introduced MS SQL Server 2012/2014 Columnstore Indexes .
Overview:
A Columnstore index can be defined as a technology for storing, retrieving and managing data by using a columnar data format, called a Columnstore. SQL Server supports both clustered and nonclustered Columnstore indexes. Both use the same in-memory Columnstore technology, but they do have differences in purpose and in features they support.
The SQL Server in-memory Columnstore index (formerly called xVelocity) stores data by columns instead of by rows, similar to a column-oriented DBMS. The Columnstore index speeds up data warehouse query processing in SQL Server 2012 and SQL Server 2014, in many cases by a factor of 10 to 100. In the tests conducted, Columnstore indexes shows up to 57 times faster query response time improvement and up to 95% times lesser space requirements over traditional row based indexes. Columnstore indexes work well for data warehousing workloads that primarily perform bulk loads and read-only queries. Since the clustered Columnstore index is updateable, the workload can perform a large number of insert, update, and delete operations.In Clustered Columnstore indexes, when data is inserted into a table using trickle INSERT(ordinary INSERT statements that do not use the BULK INSERT API),UPDATE & MERGE statements and undersized BULK INSERT operations (less than 1,048,576 rows) , the data goes into deltastore.Deltastores store data in the traditional row-mode (they are B-Trees) and as such are significantly more expensive to query than the compressed columnar segments – columnstore. They are equivalent to storing the data in an uncompressed heap and, due to small size (max 1,048,576 rows per deltastore row set), they get little traction from parallelism and from read aheads. The Clustered Columnstore indexes rely on a background task called the Tuple Mover to periodically compress deltastores into the more efficient columnar format. Thus, it is important to achieve a healthy columnstore index, meaning no deltastores or only a few deltastores.
Benefits of Columnstore indexes:
– Columns often have similar data, which results in high compression rates.
– High compression rates improve query performance by using a smaller in-memory footprint. In turn,query performance can improve because SQL Server can perform more query and data operations in-memory.
– A new query execution mechanism called batch-mode execution has been added to SQL Server that reduces CPU usage by a large amount. Batch-mode execution is closely integrated with, and optimized around, the columnstore storage format. Batch-mode execution is sometimes known as vector-based or vectorized execution.
– Queries often select only a few columns from a table, which reduces total I/O from the physical media.
Clustered Columnstore indexes result into high compression of data thus resulting into saving of lot of disk space. For example, in the tests conducted a table with size around 107 GB got compressed into 6 GB using Clustered Columnstore index hence saving around 100 GB of disk space in this case (95% compression).Based on the Rate Card for Asia region (Enterprise SAN – silver tier storage), this translates into saving of 115 USD per month and 1380 USD annually.High level of compression achieved using Clustered Columnstore indexes reduces the overall size of the database and backup also. The database and backup size is reduced by total space saved using Clustered Columnstore indexes on individual tables. With smaller backup, the database restore also become faster.
Columnstore indexes work well for mostly read-only queries that perform analysis on large data sets.Columnstore indexes give high performance gains for queries that use full table scans, and are not well-suited for queries that seek into the data, searching for a particular value.
Columnstore Index has two versions: Clustered and Nonclustered
The initial implementation columnstore indexes in SQL Server 2012 had several limitations that are remedied in the SQL Server 2014 release: column store indexes are updatable, they can be used as the primary storage of a table, they can be further compressed to save disk space, and batch-mode processing has been significantly extended and enhanced. SQL Server 2014 has the features of SQL Server 2012 plus updateable clustered Columnstore index.A clustered Columnstore index is the physical storage for the entire table and is the only index for the table. The clustered index is updateable. You can perform insert, delete, and update operations on the index and you can bulk load data into the index. To reduce fragmentation of the column segments and improve performance, the Columnstore index might store some data temporarily into a rowstore table, called a deltastore, plus a B-Tree of IDs for deleted rows. The deltastore operations are handled behind the scenes.
To return the correct query results, the clustered Columnstore index combines query results from both the Columnstore and the deltastore.
In SQL Server, a clustered Columnstore index (applies to SQL Server 2014 through SQL Server 2014):
– Is available in Enterprise, Developer, and Evaluation editions.
– Is updateable.
– Is the primary storage method for the entire table.
– Have no key columns. All columns are included columns.
– Is the only index on the table. It cannot be combined with any other indexes.
– Can be configured to use Columnstore or Columnstore archival compression. Does not physically store columns in a sorted order. Instead, it stores data to improve compression and performance.
A nonclustered Columnstore index is a read-only index created on an existing clustered index or heap table. It contains a copy of a subset of columns, up to and including all of the columns in the table. The table is read-only while it contains a non clustered Columnstore index. A nonclustered Columnstore index provides a way to have a Columnstore index for running analysis queries while at the same time performing read-only operations on the original table.
In SQL Server, a nonclustered Columnstore index (applies to SQL Server 2012 through SQL Server 2014):
– Can index a subset of columns in the clustered index or heap. For example, it can index the frequently used columns.
– Requires extra storage to store a copy of the columns in the index.
– Is updated by rebuilding the index or switching partitions in and out. It is not updateable by using the DML operations such as insert, update, and delete.
– Can be combined with other indexes on the table.
– Can be configured to use Columnstore or Columnstore archival compression.
– Does not physically store columns in a sorted order. Instead, it stores data to improve compression and performance. Pre-sorting the data before creating the Columnstore index is not required, but can improve Columnstore compression.
Operational Considerations :
– Memory and Disk Space Considerations :
Columnstore index is part of Microsoft’s family of in-memory technologies. The creation a columnstore index operation occurs in memory. The amount of memory required for creating a columnstore index depends on the number of columns, the number of string columns, the degree of parallelism (DOP), and the characteristics of the data. SQL Server will request a memory grant before trying to create the index. If not enough memory is available to create the index in parallel with the current max DOP, SQL Server will reduce the DOP as needed to get an adequate memory grant. If SQL Server cannot get a memory grant to build the index with DOP = 1, the index creation will fail. A rule of thumb for estimating the memory grant that will be requested for creating a columnstore index is:
Memory grant request in MB = [(4.2 *Number of columns in the CS index) + 68]*DOP + (Number of string cols * 34 )
The creation of columnstore index will fail either at the very beginning of execution if it can’t get the necessary initial memory grant, or later during execution if supplemental grants can’t be obtained. Once the create operation finishes, the compressed table is stored on disk. The Rebuild/Reorganize operation also happens in memory .The drop columnstore index operation happens on disk and thus there should be enough free disk space available for the process to get completed successfully. The space available should be roughly equal to as much as the size of raw table.Hence, it is advisable to consider memory and space requirements carefully while dealing with columnstore indexes.
-Rebuilding Clustered Columnstore Index
Clustered Columnstore index requires to be rebuilt after large number of update and delete operations.Nonclustered Columnstore index can’t be updated, deleted or inserted data into. Hence, no such condition applies to clustered Columnstore index.
Rebuilding a clustered Columnstore index is useful for removing fragmentation, and for moving all rows into the columnstore from deltastore or rowstore. Rebuilding the entire table takes a long time if the index is large, and it requires enough disk space to store an additional copy of the index during the rebuild. Usually it is only necessary to rebuild the most recently used partition. Rebuilding will delete all rows from the Columnstore that are marked for deletion, and it will move all rows from the deltastore into the Columnstore.Rebuilding will move all deltastore rows into the Columnstore.Reorganizing a clustered Columnstore index moves all CLOSED rowgroups into the Columnstore. To perform a reorganize, use ALTER INDEX (Transact-SQL) with the REORGANIZE option. Reorganizing is not required in order to move CLOSED rowgroups into the Columnstore. The tuple-mover process will eventually find all CLOSED rowgroups and move them. However, the tuple-mover is single-threaded and
might not move rowgroups fast enough for the workload.Please note, it is better to rebuild the clustered Columnstore index than to recreate it.Rebuilding clustered Columnstore index is faster than recreating.
Conclusion :
Columnstore indexes are best suited for data warehouse kind of workloads consisting primarily of read-only queries that analyze large sets of aggregated data. These types of queries usually require full table or index scans to retrieve the necessary information. However, Columnstore indexes provide little advantage to queries that rely on seek operations to locate specific information.Frequent updates and deletes can also diminish the advantages of a Columnstore indexes. Nonclustered
Columnstore indexes need to be dropped and re-created and clustered Columnstore index updates/deletes result in poorer query performance hence may need to be rebuilt. Tables primarily limited to bulk load operations are usually better candidates. Hence, Columnstore indexes in SQL Server 2014 can be very effectively used by applications having data warehouse type scenarios, but if there are many routinely update and delete operations on lot of rows, then the track of deleted_rows value in DMV sys.column_store_row_groups would be required otherwise it can lead to performance degradation.