2

Improving SQL Server Large Table Performance

Recently I encountered a table of astronomic size. The table is 27.7 GB with ~32.8 million rows. The table only had a single index and the query (below) fetches 4 columns using 3 aggregates filtering the last 6 months from GETDATE(). Below I will explain my methodology and how I was able to cut execution time by 36.996% and subtree cost by 84.703%.

SELECT id1, id2, MAX(desc1), COUNT(id1), MAX(col1) FROM table1
WHERE date1 >= DATEADD(mm, -6, GETDATE())
GROUP BY id1, id2

Initially a full table scan was performed because there was no useful index, then a sort. The full table scan cost 38% of the entire execution and the sort cost 59%. The remaining 3% was for parallelism and stream aggregation. Below are the results of execution stats for using a table scan.

Table Scan Performance
  Trial 1 Trial 2 Trial 3 Average
Rows selected by SELECT statements 46479 46479 46479 46479
Bytes received from server 7322718 7307102 7309784 7313202
Client processing time (ms) 480353 477377 477780 478503.3
Total execution time (ms) 996390 976525 998812 990575.7
Wait time on server replies (ms) 516037 499148 521032 512072.3

My initial reaction was to create an indexed view as this would create a physical result set in the database saving overhead. The subset created by the view was about 19.5 GB which is 70% of the base table (again, only concerned with the last 6 months). This was promising, however, when it came time to create an index on this view I ran into some problems. I attempted to create a non-clustered index on the view that was based on the resulting columns, however it required a clustered index first. I was unable to create a clustered index because the resulting subset did not contain any unique columns. This approach failed. In addition, there would be performance issues with this approach when performing CREATE, INSERT, or UPDATE requests.

I then opted to create a new non-clustered index on the base table. This allowed me to create a filter predicate to specify the last 6 months of data, however, without the non-deterministic function (GETDATE()) that was being used in the query. The last 6 months of data would now be hard coded into the index and I would force an index query hint in the query (...FROM table1 WITH (INDEX(idx)) WHERE date1 > 6 months ago). The index was based on the date, and included the resulting 4 columns. If you try this approach, make sure that any filter being used in the index matches your query.

While this was the optimal solution, it would require that the index be rebuilt every month to filter the last 6 months from the current date. The rebuilding wouldn’t save anytime overall, therefore the filter predicate was removed from the index. In addition, I used an index with included columns CREATE INDEX index1 ON table1 (col1) INCLUDE (col2, col3) versus CREATE INDEX index1 ON table1 (col1, col2, col3) because the included columns are at the leaf level (rather than in the tree) making the index smaller. As this MSDN article explains, only columns that are used for searching should be used as keys.

CREATE INDEX index1 ON table1 (date1)
INCLUDE (id1, id2, desc1, col1)
Index Scan Performance
  Trial 1 Trial 2 Trial 3 Average
Rows selected by SELECT statements 46479 46479 46479 46479
Bytes received from server 7267378 7274764 7260844 7267662
Client processing time (ms) 491357 492806 490452 491538.3
Total execution time (ms) 628255 622815 621250 624106.7
Wait time on server replies (ms) 136898 130009 130798 132568.3

With this approach, an index scan replaces the table scan and costs 9%, the sort 85% and 6% for parallelism and aggregation.

Table vs. Index Scan
  Table Scan Index Scan
Actual Number of Rows 23107704 23107704
Estimates I/O Cost 2712.63 405.045 (▼ 2307.585)
Estimates CPU Cost 18.2019 12.6975 (▼ 5.5044)
Estimated Operator/Subtree Cost 2730.84 (38%) 417.742 (9%) (▼ 84.703%)
Estimated Number of Rows 23011900 23086200
Estimated Row Size 205 B 197 B (▼ 8)

In summary, the index scan cut total execution time by an average 366469 ms (▼ 36.996%) and cut wait time on server replies by an average of 379504 ms (▼ 74.1%).

Norbert Krupa

Technical Consultant

2 Comments

  1. Definitely the most sensible approach in my opinion.

    I have tested materialized views in the past and although the performance is great for SELECTs, as you mentioned other DML statements tends to suffer massively.

    Am concerned that you are getting an Index SCAN when you have an index that can qualify for the faster SEEK (on Date) operation. How many rows are being processed as part of that SELECT statement?

    You could also get around the rebuilding of the Filtered Index issue by creating the filter on a new (bit?) column created on that table indicating the last six month, which needs to be updated regularly… it saves the hassle of rebuilding the whole index, but requires schema changes to the table itself.

    • Thanks for your reply.

      I am not concerned with the index scan itself while using an index in the execution plan. I’m quite satisfied with the results of the index, the only thing I wish to have improved is the sort performance. There are 23 million rows getting processed for the SELECT statement.

      I’m curious what the execution time would be on your suggestion to have a bit column indicating the last 6 months. I am imagining that the time needed to update the column each month may negligible.

      Lastly, it seems as though this index allows for more flexibility if for example, the filter changes to the last 5 months. This way multiple update queries could be avoided.

Leave a Reply

Your email address will not be published.