Recently I encountered a table of astronomical size: 27.7 GB and ~32.8 million rows. The table had only a single index, and the query (below) fetches 4 columns using 3 aggregates, filtering to the last 6 months from GETDATE(). Below I will explain my methodology and how I was able to cut execution time by 36.996% and subtree cost by 84.703%.
```sql
SELECT id1, id2, MAX(desc1), COUNT(id1), MAX(col1)
FROM table1
WHERE date1 >= DATEADD(mm, -6, GETDATE())
GROUP BY id1, id2
```
Initially a full table scan was performed because there was no useful index, followed by a sort. The table scan accounted for 38% of the total cost and the sort for 59%; the remaining 3% went to parallelism and stream aggregation. Below are the execution statistics for the table scan.
Table Scan Performance
| | Trial 1 | Trial 2 | Trial 3 | Average |
|---|---|---|---|---|
| Rows selected by SELECT statements | 46479 | 46479 | 46479 | 46479 |
| Bytes received from server | 7322718 | 7307102 | 7309784 | 7313202 |
| Client processing time (ms) | 480353 | 477377 | 477780 | 478503.3 |
| Total execution time (ms) | 996390 | 976525 | 998812 | 990575.7 |
| Wait time on server replies (ms) | 516037 | 499148 | 521032 | 512072.3 |
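As an aside, if you want to reproduce this kind of measurement yourself, SQL Server's session statistics are a lightweight option. A minimal sketch using the query above (the timings in my tables came from repeated trials of the full round trip):

```sql
-- A minimal sketch: per-statement CPU time, elapsed time, and logical
-- reads are printed to the Messages tab in SSMS.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

SELECT id1, id2, MAX(desc1), COUNT(id1), MAX(col1)
FROM table1
WHERE date1 >= DATEADD(mm, -6, GETDATE())
GROUP BY id1, id2;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```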
My initial reaction was to create an indexed view, as this would materialize a physical result set in the database and save overhead. The subset covered by the view was about 19.5 GB, roughly 70% of the base table (again, only the last 6 months are of interest). This was promising; however, when it came time to index the view I ran into problems. I attempted to create a non-clustered index on the view based on the resulting columns, but a view's first index must be a unique clustered index, and I was unable to create one because the resulting subset contained no unique column combination. This approach failed. In addition, an indexed view would degrade the performance of INSERT, UPDATE, and DELETE statements against the base table, since the materialized result set must be maintained alongside it.
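For reference, a minimal sketch of what this attempt looks like; the view name and the hard-coded boundary date are assumptions (GETDATE() itself cannot appear here because an indexed view's definition must be deterministic):

```sql
-- Hypothetical reconstruction of the attempted view (name and boundary
-- date are assumptions). SCHEMABINDING is required before a view can
-- be indexed.
CREATE VIEW dbo.view1 WITH SCHEMABINDING AS
SELECT id1, id2, date1, desc1, col1
FROM dbo.table1
WHERE date1 >= '20150101';
GO

-- This is the step that failed: the first index on a view must be a
-- UNIQUE clustered index, and no column combination in the subset
-- was unique.
CREATE UNIQUE CLUSTERED INDEX cix_view1 ON dbo.view1 (id1, id2, date1);
```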
I then opted to create a new non-clustered index on the base table. This allowed me to add a filter predicate specifying the last 6 months of data, but without the non-deterministic function (GETDATE()) the query uses, since a filtered index only accepts constants. The 6-month boundary would be hard-coded into the index, and I would force the index with a query hint (`...FROM table1 WITH (INDEX(idx)) WHERE date1 > '6 months ago'`). The index was keyed on the date and included the 4 resulting columns. If you try this approach, make sure the filter in the index matches the filter in your query.
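A sketch of that combination, with an assumed index name and a hard-coded date standing in for "6 months ago":

```sql
-- Hypothetical filtered covering index (name and date are assumptions).
-- Filtered indexes only accept constants, so the boundary is hard-coded.
CREATE NONCLUSTERED INDEX idx_filtered
ON table1 (date1)
INCLUDE (id1, id2, desc1, col1)
WHERE date1 >= '20150101';

-- Forcing the index with a hint; the query's predicate must be at
-- least as restrictive as the index filter, or the forced plan fails.
SELECT id1, id2, MAX(desc1), COUNT(id1), MAX(col1)
FROM table1 WITH (INDEX(idx_filtered))
WHERE date1 >= '20150101'
GROUP BY id1, id2;
```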
While this was the optimal solution, it would require rebuilding the index every month to keep the filter aligned with the current date, and the rebuilds would not save any time overall, so the filter predicate was removed from the index. In addition, I used an index with included columns, `CREATE INDEX index1 ON table1 (col1) INCLUDE (col2, col3)`, rather than `CREATE INDEX index1 ON table1 (col1, col2, col3)`, because included columns are stored only at the leaf level (rather than throughout the tree), making the index smaller. As this MSDN article explains, only columns that are used for searching should be used as keys.
```sql
CREATE INDEX index1 ON table1 (date1) INCLUDE (id1, id2, desc1, col1)
```
Index Scan Performance
| | Trial 1 | Trial 2 | Trial 3 | Average |
|---|---|---|---|---|
| Rows selected by SELECT statements | 46479 | 46479 | 46479 | 46479 |
| Bytes received from server | 7267378 | 7274764 | 7260844 | 7267662 |
| Client processing time (ms) | 491357 | 492806 | 490452 | 491538.3 |
| Total execution time (ms) | 628255 | 622815 | 621250 | 624106.7 |
| Wait time on server replies (ms) | 136898 | 130009 | 130798 | 132568.3 |
With this approach, an index scan replaces the table scan at 9% of the total cost, the sort costs 85%, and parallelism and stream aggregation make up the remaining 6%.
Table vs. Index Scan
| | Table Scan | Index Scan |
|---|---|---|
| Actual Number of Rows | 23107704 | 23107704 |
| Estimated I/O Cost | 2712.63 | 405.045 (▼ 2307.585) |
| Estimated CPU Cost | 18.2019 | 12.6975 (▼ 5.5044) |
| Estimated Operator/Subtree Cost | 2730.84 (38%) | 417.742 (9%) (▼ 84.703%) |
| Estimated Number of Rows | 23011900 | 23086200 |
| Estimated Row Size | 205 B | 197 B (▼ 8 B) |
In summary, the index scan cut total execution time by an average of 366469 ms (▼ 36.996%) and cut wait time on server replies by an average of 379504 ms (▼ 74.1%).
Definitely the most sensible approach in my opinion.
I have tested materialized views in the past, and although the performance is great for SELECTs, as you mentioned, other DML statements tend to suffer massively.
I am concerned that you are getting an index SCAN when you have an index that could qualify for the faster SEEK (on date) operation. How many rows are being processed as part of that SELECT statement?
You could also get around rebuilding the filtered index by creating the filter on a new (bit?) column on that table indicating the last six months, which would need to be updated regularly… it saves the hassle of rebuilding the whole index, but requires schema changes to the table itself.
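Something like this sketch (the column and index names are made up):

```sql
-- Hypothetical flag column driving the filter (names are made up).
ALTER TABLE table1 ADD is_recent BIT NOT NULL DEFAULT 1;

-- The filtered index references the flag instead of a hard-coded date.
CREATE NONCLUSTERED INDEX idx_recent
ON table1 (date1)
INCLUDE (id1, id2, desc1, col1)
WHERE is_recent = 1;

-- Scheduled monthly: age out rows that fell outside the window.
-- Flipping the flag removes those rows from the filtered index
-- without rebuilding it.
UPDATE table1
SET is_recent = 0
WHERE is_recent = 1
  AND date1 < DATEADD(mm, -6, GETDATE());
```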
Thanks for your reply.
I am not concerned with the index scan itself, as long as an index is being used in the execution plan. I'm quite satisfied with the results of the index; the only thing I wish to improve is the sort performance. There are 23 million rows being processed for the SELECT statement. I'm curious what the execution time would be with your suggestion of a bit column indicating the last 6 months. I imagine the time needed to update the column each month may be negligible.
Lastly, it seems this index allows for more flexibility: if, for example, the filter changes to the last 5 months, the same index still applies and multiple update queries can be avoided.
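For instance, the unfiltered covering index serves a narrower window with no schema change; a sketch using the same query shape:

```sql
-- The covering index on (date1) serves any date window, e.g. 5 months:
SELECT id1, id2, MAX(desc1), COUNT(id1), MAX(col1)
FROM table1
WHERE date1 >= DATEADD(mm, -5, GETDATE())
GROUP BY id1, id2;
```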