fact_table_optimized = select concat(key,'_', floor(rand(123456)*19)) as salted_key, val from (select * from ${model_tmp_db}.ip_snapshot_filter where dt='')t
dimension_optimized= select dim_key,val, explode(array(0,1,23,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19)) as salted_key from (select cust_id, from ${model_tmp_db}.ip_increment_filter
where dt=xxx) t
fact_table 生成随机后缀(floor(rand(123456)*19)))key_随机后缀
dimension table 单独生成随机后缀。扩大N 倍
将Join key 变成加盐后的Key 这样skew 的key 也不skew 了.
Problem
select t.asset_value, t.cnt from (select asset_value,count(asset_value) as cnt from bsl2019prod_tmp.ip_snapshot_filter where dt='2020-10-28' group by asset_value having cnt>1000)t order by t.cnt desc limit 500;
最大的t.cnt 是最小的30倍· skew 的一塌糊涂
t.asset_value t.cnt
xxxxx 31522
xxxx 1209
Code
fact_table_optimized = select concat(t.asset_value,'_', floor(rand(123456)*19)) as salted_key, t.cust_id,t.last_date from (select * from bsl2019prod_tmp.ip_snapshot_filter where dt='2020-10-28')t;
dimension_optimized= select t.cust_id,t.asset_value as key,t.last_date, explode(array(0,1,23,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19)) as salted_key from (select cust_id, asset_value,last_date from bsl2019prod_tmp.ip_increment_filter where dt='2020-10-28') t;
joinret =select
a.cust_id as cust_id_origin,
b.cust_id as cust_id_target,
split(a.salted_key,'_')[0] as asset_value,
a.last_date AS last_date_origin,
b.last_date AS last_date_target,
a.last_date AS link_crtd_ts from fact_table_optimized a, dimension_optimized b where a.salted_key=concat(b.key,'_',b.salted_key);
insert joinret into table xxx.joinret location '/tmp/joinret';
Let's see whether skew or not , 一点也不skew , :) x N