SQL Optimization

SQL optimization is a transformation of SQL statements, which
produces exactly the same result, as an original query. This process
requires a lot of creativity and intuition; it is more an Art than a Science.
It is impossible to teach it in a small article, so I don't even try to
do it. Instead, I want to limit myself by classifying types of
optimization. So, when you are asked to optimize a database, you need
to know, what to ask and what to expect.
The result, produced by the optimized code must be exactly the same
as it was before: the second worst nightmare of a DBA (the first one is
losing data) is someone running to him crying 'you know, after you last
optimization all customer balances became negative!' So I will focus on
Potential danger on every level.
For that reason all types of optimization are sorted in order of
increasing intrusiveness, and I begin from the non-intrusive
optimization.
0. Non-intrusive optimization
It is upgrading SQL server hardware and changing of SQL server
parameters. Adding more memory or processor power wont harm, you don't
need to test your product again.
Upgrading hardware you can spend a lot without any effect, because
it is possible that your server is suffering from Locks, while you will
be adding more memory - for vain. There is an easy way to determine,
what (in general) is the most serious problem on the server: CPU power,
IO, or Locks. To do it, use Lakeside Upgrade Advisor.
You can accelerate your IO by putting LOG and DATA on different
devices, separating indexes from data, but unfortunately, it is more
and more difficult to apply these old good recommendations in the
modern world. In many companies 'a server' is built based on the same
specification no matter if it is a file server or a database server.
One big RAID-5 drive is almost standard for the small servers. If you
ask them for separate drives for LOG and DATA, they will give you 2
separate LUNs on the same RAID-5 drive and on the same physical
channel, which is absolutely useless.
For the bigger servers more and more companies use big disk arrays
like SAN EMC - these arrays are very good for the OLAP systems, but are
bad for the OLTP.
Changing SQL server parameters also relatively safe, because you
don't change the code. Unfortunately, there is no parameter
'TURBO=TRUE', so you can play with these parameters, but in most cases
SQL server self-tuning is quite good, and effects of tuning can be
beyond measurement.
For the given hardware, applying all the recommendations you can
find in the Internet, you can gain 20-30%, even less if there is
nothing really wrong with you server. So, if your queries are running
several times slower then expected - then you don't have other choice
but change your TSQL code.
Potential danger: None
1. Indexes
On this level of optimization, we change the database by creating and changing indexes, but we don't touch the code.
Of course, any DBA is very conservative about changing the code,
especially on the production. In some cases it is possible to make some
modifications which are not really considered as 'changes' by the
conventional programming.
At first, all you can do with indexes, can affect the execution
plan, but the result is guaranteed to be the same (unless there are
bugs in the SQL server, but Microsoft spent much more time on testing
it then it is spent on any query you work with, so below we can think
about SQL server as a 'bug-free' product, which is not true, but is
almost true in comparison with the quality of our code)
You can create, drop indexes, change index types
(clustered/non-clustered), include or exclude columns from indexes. On
SQL 2005, check a new index option 'INCLUDING'. Don't try to index
everything - you will slow down updates. If SQL server does not use an
index, there may be a very good reason behind that decision.
Potential danger: Is it absolutely safe?
For most of the databases, it is safe. SQL server will automatically
adjust plans based on new indexes. However, if your database uses not
only DML but also a DDL, and it is generating tables and indexes on the
fly, then it is possible that some
exec('drop index '+@indname)
would fail. So be careful.
2. Hints and options
To use a hint you should modify an existing SQL code. Hence, it is a
modification, and theoretically, you need to retest all the system. But
with an assumption that SQL server is almost bug-free, a query with any
hint produces exactly the same result as without a hint.
So check the execution plan and check the following:
-
(index=...) - but be careful, usually, when SQL server ignores an index, there is a very good reason why.
OPTION (HASH JOIN) - try also other types of joins.
OPTION (FORCE ORDER)
OPTION (FAST 1) - these 3 options above are good candidates to try even without thinking, sometimes it helps.
(NOEXPAND) for indexed views
- Option
WITH RECOMPILE for a stored procedure
and some others...
But if it does not help, there is no other choice then change the code
Potential danger:
- Some modifications can have a catastrophic impact on the performance.
- Some options can result in error type
'Query processor could not produce a query plan because of the hints defined in this query.'
- Some queries can fail on some data sets, check
OPTION (ROBUST PLAN)
for details
3. Syntax-level optimization
The most primitive type of the optimization is based only on the
text of a query. We don't know even the underlying database structure,
so we are limited by the subset of syntax transformations, which do not
change the result in any case. For example:
select 'Low',A,B from T
union
select 'High',A1,B1 from T1
In this case we merge both datasets together and get rid of the
duplicates. However, there are no duplicates because the first column
is different for 2 sub-queries. Hence we can rewrite it as:
select 'Low',A,B from T
union all
select 'High',A1,B1 from T1
Another example.
update T set Flag='Y'
update T set Ref=0
In both queries we affect all the records, so we can change both columns at the same time:
update T set Flag='Y',Ref=0
We can merge together updates in even more complicated cases:
update T set Flag='Y'
update T set Ref=0 where Comment like '%copied%'
Both updates use a full table scan, so lets do both updates during the same pass:
update T set Flag='Y',
Ref = case when Comment like '%copied%'
then 0
else Ref -- keep the same
end
A Trap:
Such type of optimization can be dangerous. For example,
update T set Ref=0
update T set Amount=StoredValue
It is almost the same case as we had before. So we can rewrite it as
update T set Ref=0,Amount=StoredValue
Right? Yes, this transformation is correct... in 99% of cases, but not
always. The possible problem is that StoredValue can be a calculated
column, which depends on Ref. For example, lets say it is defined as
StoredValue=Ref+1
The original statement sets Ref to 0, and Amount to 1. Our optimized
query works differently, as UPDATE at first calculates all right parts
and only then assigns them to the columns. Therefore, it will set Ref
to 0, but Amount will be set to the old value of Ref plus 1.
Potential danger: As it was demonstrated
above, we should test the code after that modification. All levels
above are intrusive and require verification.
4. When we know database schema
When we know database scheme and we have a list of indexes we can make more serious changes. For example, when we see:
update T set Flag='A' where Opt=1
update T set Flag='B' where Opt=2
We can check if there is an index on the column Opt. Assuming, that
an index is created on a column with a high selectivity, we can leave
there statements as is. However, if there is no index on Opt, we can
rewrite it as:
update T
set Flag=case when Opt=1 then 'A' else 'B' end
where Opt in (1,2)
to make an update during the same scan.
Potential danger: Testing is required
5. When we have an access to the actual data and know the meaning of the data
We can do it only when we have an access to the actual data. In the
example above, we could determine if Opt is a column with a high
selectivity. To do it we should compare the results of 2 queries:
select count(*) from T
select count(distinct Opt) from T
What is more important, we can find values with irregular
selectivity and handle them properly. Check this article to read more
about irregular selectivity: http://www.sqlsolutions.com/articles/articles/How_Values_with_Irregular_Selectivity_Impact_SQL_Server_Database_Performance.htm
But sometimes table and column names help, so we understand the
situation even without looking at the actual data. In what case, do you
think, SQL server uses an index and where it uses a full table scan?
update Patients set Flag='Y' where Gender='F'
update Sales set Flag='N' where Id=13848
In the first case SQL server uses a table scan, because Gender
column (M/F) has extremely low selectivity (about 50%). We can even
say, that Flag is a low selectivity value in both cases with values
'Y','N' and probably few others. But syntaxically, both queries are
identical.
Another example. Assuming that Gender is NOT NULL, does this statement update all the records?
update Patients set Flag=0 where Gender in ('M','F')
Yes. Now let's change the table name:
update Nouns set Flag=0 where Gender in ('M','F')
We update a noun, which have a grammar category of Gender, which may
be 'M', 'F' or Neutral! As you see, the only difference is a table
name, so we actually guessed the meaning.
We can base our optimization based on many real world constraints, like
- age is less then 255 and non-negative,
- mass is positive,
- InternalId int identity correlates with a date of the document
- SSN is unique.
Potential danger: Requires not only testing
but also verification, that constraints are not violated now (such type
of verification is possible by querying the existing data) and in the
future (not all data may be inserted). You can find that:
- Special value of -1 is used for age as some weird flag
- mass can be negative because sometimes it is delta from the actual and expected value
- InternalId int identity correlates with a date of the document except
for the documents, imported from another database a year ago
- SSN is unique, except for the value 'N/A', used for all illegal immigrant workers.
5. Actual or expected workload.
Some solutions have 2 sides: they accelerate data retrieval, for
example, but slow down the modification. Very good example is a
technique of 'indexed' or 'materialized'
views. Indexed views can
accelerate selects, but they slow down any updates significantly,
especially massive updates.
So before you make any modifications, you should know if data is
relatively static or not, how man by updates you expect per one select
etc.
On the actual data load the best way to make such analysis is to use SQL Trace Analyzer.
Conclusion
You need to ask right questions what you are asked to 'look, what is
wrong with this query' or 'optimize a stored procedure'. Before you
start an optimization, you need to know, what level of intrusiveness is
allowed and what information do you have as input.
|