在本文中,您将了解.
1. exists and in 子查询的在mysql中的区别。
2. exists and in 子查询在sparksql中的实现。
3. not exists and not in 子查询在sparksql中的实现。
4. 什麽是Nested loop join 和它的适用范围。
5. 一个例子显示两个子查询在生产环境中spark sql 上的性能差异。
1. difference between exists and in subquery in mysql
1. 当使用in子查询的时候。可以对外表和内表做索引。
2. 当使用exists子查询的时候,只对内表做索引。
3. 在这里区别就是是否对外表做索引, 做索引花的时间是否比内存中查询要快。如果外表数据很小。显然做索引是不值得的。如果外表很大,做索引是值得的。这也解释了爲什麽外表大,推荐用in子查询,外表小,推荐用exists.
in subquery
select A.key from A.key in (select B.key from B)
exists subquery
select A.key where exists (select * from B where A.key = B.key)
2. in subquery and exists subquery in Spark SQL implementation
比较下两者的physical plan
in subquery
explain select a.key1 from testdata1 as a where a.key1 in (select key3 from testdata3 as b);
== Physical Plan ==
*(1) Project [key1#52]
+- *(1) BroadcastHashJoin [key1#52], [key3#54], LeftSemi, BuildRight
:- HiveTableScan [key1#52], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#52, value1#53]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
+- HiveTableScan [key3#54], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#54, value3#55]
Time taken: 0.062 seconds, Fetched 1 row(s)
exists query
explain select a.key1 from testdata1 as a where exists (select * from testdata3 as b where a.key1=b.key3);
== Physical Plan ==
*(1) Project [key1#15]
+- *(1) BroadcastHashJoin [key1#15], [key3#17], LeftSemi, BuildRight
:- HiveTableScan [key1#15], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#15, value1#16]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
+- HiveTableScan [key3#17], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#17, value3#18]
Time taken: 0.347 seconds, Fetched 1 row(s)
兩個的spark plan 是一樣的。
3. not in subquery and not exists subquery in Spark SQL implementation
not in subquery
explain select a.key1 from testdata1 as a where a.key1 not in (select key3 from testdata3 as b);
== Physical Plan ==
*(1) Project [key1#66]
+- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#66 = key3#68) || isnull((key1#66 = key3#68)))
:- HiveTableScan [key1#66], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#66, value1#67]
+- BroadcastExchange IdentityBroadcastMode
+- HiveTableScan [key3#68], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#68, value3#69]
```
not exists subquery
explain select a.key1 from testdata1 as a where not exists (select * from testdata3 as b where a.key1=b.key3);
== Physical Plan ==
*(1) Project [key1#73]
+- *(1) BroadcastHashJoin [key1#73], [key3#75], LeftAnti, BuildRight
:- HiveTableScan [key1#73], HiveTableRelation `default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#73, value1#74]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
+- HiveTableScan [key3#75], HiveTableRelation `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#75, value3#76]
```
因爲spark使用broadcastnestedloopJoin去实现not in 子查询,而使用broadcasthashjoin实现not exists子查询。后面会解释broadcastnestedloopJoin爲什麽是个性能不太友好的join算法,因爲它总共要要在内存里遍历count(A)*count(B)遍, 而在实际生产中, count(A)*count(B)是个很大的数字。 并且not in 子查询我们是没有办法通过参数优化改变它的spark plan.
4. Nested loop join and which case is suitable
嵌套循环连接通常被定义为sql中最基本的连接算法。在伪代码中,它可以实现为:
O_SET # outer query set
I_SET # inner query set
PREDICATE # join predicate
foreach o_row in O_SET:
foreach i_row in I_SET:
if i_row matches PREDICATE:
return (o_row, i_row)
broadcastnestedloopJoin性能并不友好,因为它需要将外表和内表加载到内存中,并将外表中的记录与内表中的记录进行比较。它适用于数据量很小的表。
5. An example to show the significant difference between (not in subquery) and (not exists subquery)
显示查询挂起时间长达39min,实际挂起时间长达15h。
select a.* from A as a where a.tracking_number not in (select b.tracking_number from B as b)
在我们修改sql之后,它用了1.1分钟来完成。原因是使用Broadcasthash join而不是broadcastnestedloopJoin
select a.*from A as a where a.tracking_number not exists (select b.tracking_number from B as b);
Conclusion
1.何时在spark 使用in子查詢,exists子查詢?
都可以
2.何时在spark 使用not in子查詢,not exists子查詢?
建议使用not exists 子查询因爲 not in 子查询使用BroadcastNestedLoopJoin 这种性能不友好的算法实现,只适用数据量很小的情况,对生产环境不适合。