In this post, we continue to look at the cardinality estimation changes in SQL Server 2016. This time we will talk about scalar UDF estimation. Scalar UDFs (sUDF) in SQL Server perform quite badly, and I encourage you to avoid them in general; however, a lot of systems still use them.
Scalar UDF Estimation Change
I'll use the Microsoft sample database AdventureworksDW2016CTP3 and write the following simple scalar function, which always returns 1 regardless of the input parameter. I run my queries against Microsoft SQL Server 2016 (SP1) (KB3182545) - 13.0.4001.0 (X64).
use [AdventureworksDW2016CTP3];
go
drop function if exists dbo.uf_simple;
go
create function dbo.uf_simple(@a int)
returns int
with schemabinding
as
begin
return 1;
end
go
Now let's turn on actual execution plans and run two queries, the first under the compatibility level (CL) of SQL Server 2014 and the second under CL 2016:
alter database [AdventureworksDW2016CTP3] set compatibility_level = 120;
go
select count_big(*) from dbo.DimDate d where dbo.uf_simple(d.DateKey) = 1;
go
alter database [AdventureworksDW2016CTP3] set compatibility_level = 130;
go
select count_big(*) from dbo.DimDate d where dbo.uf_simple(d.DateKey) = 1;
go
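When switching levels back and forth like this, it can help to confirm which compatibility level is actually in effect before reading a plan. A minimal check, assuming only the database name used above (120 corresponds to SQL Server 2014, 130 to SQL Server 2016):

```sql
-- Show the compatibility level the sample database is currently running under.
select name, compatibility_level
from sys.databases
where name = N'AdventureworksDW2016CTP3';
```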
We get two plans. They have the same shape; however, if we look at the estimates, we find a difference in the Filter operator.
You may notice that in the first case the estimate is 1 row, while in the second case it is 365 rows. Why are they different? The point is that Microsoft has changed the estimation algorithm, i.e. the calculator used for sUDF estimation.
In 2014 it was CSelCalcPointPredsFreqBased, the calculator for point predicates based on a frequency (Cardinality * Density). DateKey is a primary key, so it is unique and the frequency is 1. If you multiply the number of rows, 3652, by the all_density value 0.0002738226 (from the dbcc show_statistics(DimDate, PK_DimDate_DateKey) command), you get one row.
In 2016 the calculator is CSelCalcFixedFilter (0.1), which is a 10% guess of the table cardinality. Our table has 3652 rows, and 10% of that is 365, which is what we observe in the plan.
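The two estimates can be reproduced by hand. The sketch below reads the density vector for the primary key statistics and then repeats the arithmetic with the numbers quoted above; the literals are copied from the text rather than read from the server, so treat them as illustrative.

```sql
-- Density vector for the primary key statistics (object names taken from the text above).
dbcc show_statistics ('dbo.DimDate', PK_DimDate_DateKey) with density_vector;

-- The two estimates, recomputed from the quoted numbers.
select
    3652 * 0.0002738226 as freq_based_estimate,   -- cardinality * density, roughly 1 row
    3652 * 0.1          as fixed_filter_estimate; -- 10% guess, 365.2 rows
```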
This change is described in the article "FIX: Number of rows is underestimated for a query predicate that involves a scalar user-defined function in SQL Server 2014".
You may wonder why 2014, if we are talking about 2016. The new Cardinality Estimator (CE) was introduced in 2014 and evolved in 2016, but all the later optimizer fixes in 2014 (delivered by Cumulative Updates (CU) or Service Packs (SP)) are protected by TF 4199. In 2016 all these fixes are included, so you don't need TF 4199 for them, but if you are on 2014 you should enable TF 4199 to see the described behavior.
Estimation Puzzle
Now, let's modify our queries and replace the count with "select *". Then turn on statistics IO and actual plans, and run them again:
alter database [AdventureworksDW2016CTP3] set compatibility_level = 120;
go
set statistics io on;
select * from dbo.DimDate d where dbo.uf_simple(d.DateKey) = 1;
set statistics io off;
go
alter database [AdventureworksDW2016CTP3] set compatibility_level = 130;
go
set statistics io on;
select * from dbo.DimDate d where dbo.uf_simple(d.DateKey) = 1;
set statistics io off;
go
The results are:
Table ‘DimDate’. Scan count 1, logical reads 7312
Table ‘DimDate’. Scan count 1, logical reads 59
That is a big difference in logical reads, and we can see why if we look at the query plans.
As we remember, under CE 120 the estimate was one row, so the server decided that it was cheaper to use a non-clustered index and then do a key lookup into the clustered index. That is not very efficient, given that our predicate returns all rows.
Under CE 130 the estimate was 365 rows, which makes key lookups too expensive, so the server chose a clustered index scan.
But wait: what we see is that in the second plan the estimate is also 1 row!
That fact seemed very curious to me, and that is why I'm writing this post. To find the answer, let's look in more detail at how the optimization process goes.
Explanation
The optimization process is split into phases. Before the actual search for plan alternatives starts, there are a couple of preparation phases, and one of them is Project Normalization. During that phase, the optimizer matches computed columns with their definitions or otherwise deals with relational projections. For example, it may move a projection around the operator tree if necessary.
We can see the trees before and after normalization, along with their cardinality information, by applying a couple of undocumented TFs, 8606 and 8612, for instance with a QUERYTRACEON hint.
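A minimal sketch of such a query follows. TF 8606 and TF 8612 are the flags named above; TF 3604 (which redirects the trace output to the client so it shows up in the Messages tab) and the recompile hint are additions assumed here so that the trees are actually printed on every run. QUERYTRACEON normally requires sysadmin rights, so try this only on a test instance.

```sql
-- Print the optimizer's logical trees, with cardinalities, to the Messages tab.
-- 3604 and recompile are assumptions added for this sketch; 8606/8612 come from the text.
select * from dbo.DimDate d
where dbo.uf_simple(d.DateKey) = 1
option (recompile, querytraceon 3604, querytraceon 8606, querytraceon 8612);
```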
For the 2014 CL, the cardinality of the relational Select (LogOp_Select, which becomes the Filter operator in the query plan) before project normalization is 1 row:
*** Tree Before Project Normalization ***
    LogOp_Select [ Card=1 ]
        LogOp_Get TBL: dbo.DimDate … [ Card=3652 ]
        ScaOp_Comp x_cmpEq
            ScaOp_Udf dbo.uf_simple IsDet
                ScaOp_Identifier QCOL: [d].DateKey
            ScaOp_Const TI(int,ML=4) XVAR(int,Not Owned,Value=1)
For the 2016 CL, before project normalization, it is 365.2 rows:
*** Tree Before Project Normalization ***
    LogOp_Select [ Card=365.2 ]
        LogOp_Get TBL: dbo.DimDate … [ Card=3652 ]
        ScaOp_Comp x_cmpEq
            ScaOp_Udf dbo.uf_simple IsDet
                ScaOp_Identifier QCOL: [d].DateKey
            ScaOp_Const TI(int,ML=4) XVAR(int,Not Owned,Value=1)
This is entirely understandable, because we remember that the 2016 CL changes sUDF estimation to a 10% guess. However, this is not the estimate we see in the final query plan for the second query, where we see 1 row. Project Normalization is the place where that 1-row estimate is introduced.
If we examine the trees after project normalization, we see the following picture (the tree is the same for both queries):
*** Tree After Project Normalization ***
    LogOp_Select
        LogOp_Project
            LogOp_Get TBL: dbo.DimDate … [ Card=3652 ]
            AncOp_PrjList
                AncOp_PrjEl COL: Expr1001
                    ScaOp_Udf dbo.uf_simple IsDet
                        ScaOp_Identifier QCOL: [d].DateKey
        ScaOp_Comp x_cmpEq
            ScaOp_Identifier COL: Expr1001
            ScaOp_Const TI(int,ML=4) XVAR(int,Not Owned,Value=1)
You may notice a new Project operator that turns our sUDF uf_simple into an expression, Expr1001, and projects it up to the parent node of the tree. The relational Select then filters rows over this projection, i.e. we are now filtering on the expression, not on the sUDF directly.
The optimizer doesn't know the cardinality of that new Select operator, so the estimation process starts. The thing is that filtering over such an expression is unchanged between the 2014 CL and the 2016 CL: it still uses the CSelCalcPointPredsFreqBased calculator, and the result is the same, 1 row. We can see the result of this cardinality estimation of the tree after Project Normalization with TF 2363. The statistics trees for both queries have the same shape and estimate:
CStCollFilter(ID=4, CARD=1)
    CStCollProject(ID=3, CARD=3652)
        CStCollBaseTable(ID=1, CARD=3652 TBL: dbo.DimDate AS TBL: d)
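The statistics tree above can be requested in much the same way as the logical trees. In this sketch TF 2363 is the flag named in the text, while TF 3604 (send trace output to the client) and the recompile hint are assumptions added so that the output appears in the Messages tab on every run:

```sql
-- Print the cardinality estimation trace for the query.
-- 3604 and recompile are assumptions added for this sketch; 2363 comes from the text.
select * from dbo.DimDate d
where dbo.uf_simple(d.DateKey) = 1
option (recompile, querytraceon 3604, querytraceon 2363);
```

Running it once under compatibility level 120 and once under 130 is one way to compare the two traces side by side.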
Then the optimization process starts to search for different alternatives and stores them in the Memo, an internal structure for plan alternatives (I described it a couple of years ago in my Russian blog). For CL 2016, the sUDF estimation change to a 10% guess plays its role during that search: the predicate is estimated at 365 rows and the plan shape with a Clustered Index Scan is selected. However, this plan alternative goes into a Memo group whose cardinality was already estimated at 1 row right after the initial Project Normalization phase. For CL 2014 there is no surprise: the estimate is 1 row both for the sUDF and for the predicate over the expression, so the plan with the lookup is selected.
You may also observe different predicates in the query plans. For the 2014 CL, the sUDF predicate is inside the Filter.
For the 2016 CL, the sUDF is computed in a separate Compute Scalar and the Filter predicate is on Expr1001.
There is an undocumented TF 9259 that disables the project normalization phase. Let's re-run our query with this TF.
alter database [AdventureworksDW2016CTP3] set compatibility_level = 130;
go
select * from dbo.DimDate d where dbo.uf_simple(d.DateKey) = 1 option(querytraceon 9259);
go
The estimate is now 365.2, which explains much more clearly why the server chose a Clustered Index Scan instead of Index Scan + Lookup.
Microsoft is aware of this situation and considers it normal. I would agree with them, but the one-row estimate combined with a Clustered Index Scan puzzled me, so I decided to write about it.
Conclusion
In 2016 (as well as in 2014 with TF 4199 and the latest SPs or CUs) there is a cardinality estimation change for sUDFs: the old version uses the density from the base statistics and the new version uses a 10% guess. The estimation for expression predicates over sUDFs is not changed. Sometimes you may see small artifacts of project normalization in a query plan, but that shouldn't be a problem.
Both estimates, in 2014 and in 2016, are guesses, because an sUDF is a black box for the optimizer (and is also bad in many other ways), so avoid using sUDFs in general, and especially in predicates.
Note
Please don't use TF 9259, which disables the Project Normalization step, in a real production system. Besides being undocumented and unsupported, it may hurt your performance. Consider the following example with a computed column.
use AdventureworksDW2016CTP3;
go
alter table dbo.DimDate add NewDateKey as DateKey*1;
create nonclustered index ix_NewDateKey on dbo.DimDate(NewDateKey);
go
set statistics xml on;
select count_big(*) from dbo.DimDate where NewDateKey = 1;
select count_big(*) from dbo.DimDate where NewDateKey = 1 option(querytraceon 9259);
set statistics xml off;
go
drop index ix_NewDateKey on dbo.DimDate;
alter table dbo.DimDate drop column NewDateKey;
The query plans use an Index Seek in the first case and an Index Scan in the second one.
Thank you for reading!
Translated from: https://www.sqlshack.com/scalar-udf-estimation-and-project-normalization/