Recently I encountered a table of astronomical size: 27.7 GB and ~32.8 million rows. The table had only a single index, and the query (below) fetches 4 columns using 3 aggregates, filtering to the last 6 months from GETDATE(). Below I will explain my methodology and how I was able to cut execution time by 36.996% and subtree cost by 84.703%.
```sql
SELECT id1, id2, MAX(desc1), COUNT(id1), MAX(col1)
FROM table1
WHERE date1 >= DATEADD(mm, -6, GETDATE())
GROUP BY id1, id2
```
Initially a full table scan was performed because there was no useful index, followed by a sort. The table scan accounted for 38% of the total cost and the sort for 59%; the remaining 3% went to parallelism and stream aggregation. Below are the execution statistics for the table scan.
Table Scan Performance
| | Trial 1 | Trial 2 | Trial 3 | Average |
|---|---|---|---|---|
| Rows selected by SELECT statements | 46479 | 46479 | 46479 | 46479 |
| Bytes received from server | 7322718 | 7307102 | 7309784 | 7313202 |
| Client processing time (ms) | 480353 | 477377 | 477780 | 478503.3 |
| Total execution time (ms) | 996390 | 976525 | 998812 | 990575.7 |
| Wait time on server replies (ms) | 516037 | 499148 | 521032 | 512072.3 |
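As an aside, if you want to reproduce this kind of measurement yourself, SQL Server's session statistics are a lightweight option. A minimal sketch using the query above (the timings in my tables came from repeated trials of the full round trip):

```sql
-- A minimal sketch: per-statement CPU time, elapsed time, and logical
-- reads are printed to the Messages tab in SSMS.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

SELECT id1, id2, MAX(desc1), COUNT(id1), MAX(col1)
FROM table1
WHERE date1 >= DATEADD(mm, -6, GETDATE())
GROUP BY id1, id2;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```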
My initial reaction was to create an indexed view, as this would materialize a physical result set in the database and save overhead. The subset covered by the view was about 19.5 GB, roughly 70% of the base table (again, only the last 6 months are of interest). This was promising; however, when it came time to index the view I ran into problems. I attempted to create a non-clustered index on the view based on the resulting columns, but a view's first index must be a unique clustered index, and I was unable to create one because the resulting subset contained no unique column combination. This approach failed. In addition, an indexed view would degrade the performance of INSERT, UPDATE, and DELETE statements against the base table, since the materialized result set must be maintained alongside it.
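For reference, a minimal sketch of what this attempt looks like; the view name and the hard-coded boundary date are assumptions (GETDATE() itself cannot appear here because an indexed view's definition must be deterministic):

```sql
-- Hypothetical reconstruction of the attempted view (name and boundary
-- date are assumptions). SCHEMABINDING is required before a view can
-- be indexed.
CREATE VIEW dbo.view1 WITH SCHEMABINDING AS
SELECT id1, id2, date1, desc1, col1
FROM dbo.table1
WHERE date1 >= '20150101';
GO

-- This is the step that failed: the first index on a view must be a
-- UNIQUE clustered index, and no column combination in the subset
-- was unique.
CREATE UNIQUE CLUSTERED INDEX cix_view1 ON dbo.view1 (id1, id2, date1);
```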
I then opted to create a new non-clustered index on the base table. This allowed me to add a filter predicate specifying the last 6 months of data, but without the non-deterministic function (GETDATE()) the query uses, since a filtered index only accepts constants. The 6-month boundary would be hard-coded into the index, and I would force the index with a query hint (`...FROM table1 WITH (INDEX(idx)) WHERE date1 > '6 months ago'`). The index was keyed on the date and included the 4 resulting columns. If you try this approach, make sure the filter in the index matches the filter in your query.
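A sketch of that combination, with an assumed index name and a hard-coded date standing in for "6 months ago":

```sql
-- Hypothetical filtered covering index (name and date are assumptions).
-- Filtered indexes only accept constants, so the boundary is hard-coded.
CREATE NONCLUSTERED INDEX idx_filtered
ON table1 (date1)
INCLUDE (id1, id2, desc1, col1)
WHERE date1 >= '20150101';

-- Forcing the index with a hint; the query's predicate must be at
-- least as restrictive as the index filter, or the forced plan fails.
SELECT id1, id2, MAX(desc1), COUNT(id1), MAX(col1)
FROM table1 WITH (INDEX(idx_filtered))
WHERE date1 >= '20150101'
GROUP BY id1, id2;
```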
While this was the optimal solution, it would require rebuilding the index every month to keep the filter aligned with the current date, and the rebuilds would not save any time overall, so the filter predicate was removed from the index. In addition, I used an index with included columns, `CREATE INDEX index1 ON table1 (col1) INCLUDE (col2, col3)`, rather than `CREATE INDEX index1 ON table1 (col1, col2, col3)`, because included columns are stored only at the leaf level (rather than throughout the tree), making the index smaller. As this MSDN article explains, only columns that are used for searching should be used as keys.
```sql
CREATE INDEX index1 ON table1 (date1) INCLUDE (id1, id2, desc1, col1)
```
Index Scan Performance
| | Trial 1 | Trial 2 | Trial 3 | Average |
|---|---|---|---|---|
| Rows selected by SELECT statements | 46479 | 46479 | 46479 | 46479 |
| Bytes received from server | 7267378 | 7274764 | 7260844 | 7267662 |
| Client processing time (ms) | 491357 | 492806 | 490452 | 491538.3 |
| Total execution time (ms) | 628255 | 622815 | 621250 | 624106.7 |
| Wait time on server replies (ms) | 136898 | 130009 | 130798 | 132568.3 |
With this approach, an index scan replaces the table scan at 9% of the total cost, the sort costs 85%, and parallelism and stream aggregation make up the remaining 6%.
Table vs. Index Scan
| | Table Scan | Index Scan |
|---|---|---|
| Actual Number of Rows | 23107704 | 23107704 |
| Estimated I/O Cost | 2712.63 | 405.045 (▼ 2307.585) |
| Estimated CPU Cost | 18.2019 | 12.6975 (▼ 5.5044) |
| Estimated Operator/Subtree Cost | 2730.84 (38%) | 417.742 (9%) (▼ 84.703%) |
| Estimated Number of Rows | 23011900 | 23086200 |
| Estimated Row Size | 205 B | 197 B (▼ 8 B) |
In summary, the index scan cut total execution time by an average of 366469 ms (▼ 36.996%) and cut wait time on server replies by an average of 379504 ms (▼ 74.1%).
Definitely the most sensible approach in my opinion.
I have tested materialized views in the past, and although the performance is great for SELECTs, as you mentioned, other DML statements tend to suffer massively.
I am concerned that you are getting an index SCAN when you have an index that could qualify for the faster SEEK (on date) operation. How many rows are being processed as part of that SELECT statement?
You could also get around rebuilding the filtered index by creating the filter on a new (bit?) column on that table indicating the last six months, which would need to be updated regularly… it saves the hassle of rebuilding the whole index, but requires schema changes to the table itself.
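Something like this sketch (the column and index names are made up):

```sql
-- Hypothetical flag column driving the filter (names are made up).
ALTER TABLE table1 ADD is_recent BIT NOT NULL DEFAULT 1;

-- The filtered index references the flag instead of a hard-coded date.
CREATE NONCLUSTERED INDEX idx_recent
ON table1 (date1)
INCLUDE (id1, id2, desc1, col1)
WHERE is_recent = 1;

-- Scheduled monthly: age out rows that fell outside the window.
-- Flipping the flag removes those rows from the filtered index
-- without rebuilding it.
UPDATE table1
SET is_recent = 0
WHERE is_recent = 1
  AND date1 < DATEADD(mm, -6, GETDATE());
```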
Thanks for your reply.
I am not concerned with the index scan itself, as long as an index is being used in the execution plan. I'm quite satisfied with the results of the index; the only thing I wish to improve is the sort performance. There are 23 million rows being processed for the SELECT statement. I'm curious what the execution time would be with your suggestion of a bit column indicating the last 6 months. I imagine the time needed to update the column each month may be negligible.
Lastly, it seems this index allows for more flexibility: if, for example, the filter changes to the last 5 months, the same index still applies and multiple update queries can be avoided.
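For instance, the unfiltered covering index serves a narrower window with no schema change; a sketch using the same query shape:

```sql
-- The covering index on (date1) serves any date window, e.g. 5 months:
SELECT id1, id2, MAX(desc1), COUNT(id1), MAX(col1)
FROM table1
WHERE date1 >= DATEADD(mm, -5, GETDATE())
GROUP BY id1, id2;
```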