Tuesday, March 3, 2009

Tuning SQL Queries Using Indexes for Beginners.

We all start off using databases and SQL with a similar goal of "just let me see the data". Once we feel comfortable with getting data and all the meaningful information we can get using join, group by and the aggregation functions, we need to start looking at getting those query results as efficiently as possible. With the advancements in modern database optimizers, TSQL tuning on the actual SQL statement is limited. The increase in performance one can gain simply by changing query syntax is minimal once a level of competency is reached. However, one can affect SQL query performance much more drastically with some basic knowledge about how indexes work.




Start an Index On the Table


An index can be thought of just like an index in the back of a book. You look up a key word in the back and it tells you what page to turn to. Next, you read that page until you find the section with the information needed. Without an index, you would have to read the book from the beginning to get to the section you want. Maybe it is in the beginning and you find it quickly. Maybe it is towards the end and it takes a while. The same is true with databases. An index can tell you, or rather the query engine, what page to get your data and therefore speed things up dramatically.


So, how do we make sure that happens? Well first of all, there is a main index we can use on each table. This index is called a clustered index. A table can only have one clustered index and the index must make each row unique. For example, you could not have a clustered index on a last name field in a table if more than one entry in that table could have the same last name as another entry. A table with a clustered index requires a unique field or a unique combination of fields. Most often, a clustered index will use the primary key but that is not a requirement. The beauty of a clustered index is that all the data from the rest of the table, or rather the other columns in the table, are stored in the same location with the value that is indexed. For example, let’s say we have a simple table for comic books called Comics. This table has the following structure.


Create table Comics (
ComicID int ,
Title Varchar(30),
Author Varchar(30),
DateOfFirstPrint datetime )

We can build a clustered index using the following syntax
Create clustered index ix_comics 
on Comics (ComicID )

If we have a clustered index on ComicID, we can get any other attribute from using ComicID by using the Clustered index. The database optimizer knows this and only requires one pass through the index to find that specific row and retrieve all your data. However, let’s say that you only know the Title and need to find the Author of a particular comic that you have. In this case, your query is something like


Select Author 
from Comics
where Title= 'Pitt'
.

With only the clustered index we described above on the table, the database engine would still need to perform a full scan of the database table and would not use the index. This could take a very long time and is not efficient.


Better Stronger Faster Indexing


The next step we could take towards making queries that use our database table more efficient is to build a second index on the table. This index would be a non clustered index. Remember that we already created a clustered index and a table can have only one clustered index. You can choose another field in the table to act as your index and the non-clustered index will store the clustered index key, in our case ComicID, along with the field you choose to use. Let’s say we build a non-clustered index on Title and re-run the previous query.


Create nonclustered index ix_comics_title
on Comics (Title)

The optimizer would use the new index to find the title 'Pitt'. Once it found the row, it would find the clustered index key, which is the ComicID, for the record 'Pitt'. At this point. The optimizer sends the query to the clustered index using the ComicID to get the remaining data ( Author ). This can be much faster than a full table scan. We just made a really big advancement to our database table's speed and efficiency but there is a more effecient way to get the data we need.


An index can contain more than the clustered index key. It can include any other data from the table (not including certain special data types such as images and XML). For example, we could create a non clustered index using two fields from the table, such as the Title and Author fields. Now our same query would only have to traverse this single index to get all the information needed. This query runs faster now since it uses only one index instead of two indexes. When an index contains all the data that a query needs and no other data access is required to retreive the data, it is called a covering index. The index covered all the data the user required. However, it is imporant to note that we can not create an index for every possible query. We need to determine which queries are going to occur most often and then we can create a few indexes to help those specific queries.


Another important option that was new to SQL Server 2005 is the ability to "include" columns in the index. Included columns are not used in the search to find or filter out the record but are stored with the data once the row is found. So, we could create an index for Title and include Author in it. Since our query is only asking for Title and Author and our index includes those fields, the query optimizer will still only require one index to retrieve the data we need to fulfill this query. In the query, we are using as an example, including the columns would work just as well as a non clustered index using. Additionally, using "includes" results in a smaller index. We could also add more columns to the "include" at a minimal cost to our index size. As your table grows so do its indexes. Keeping an index small will increase efficiency.


Create nonclustered index ix_comics_title_inc_author 
on COMICS ( TITLE )
Include ( Author )

Summary


In review, indexes help the speed of a query. There are two types of queries (clustered and non clustered). Clustered indexes store the actual data and are very efficient if you know the clustering key. If you don't know the clustering key, a non clustered index should be built to access your data using the data you do have. You can also inlcude data in a non-clustered index. We use includes when we don't necessarily want to search or filter the query based on certain data but we want to return that data in the query. If an non-clustered index has all the required data available to it and can return a full result set without accessing a second query, it is called a covering index.



sql server index