1. HQL subquery alias problem
Error: hive query failed cannot recognize input near '(' 'select' '*' in joinSource
Note: the subquery must be given an alias, otherwise this error is thrown.
Original HQL: select * from (select * from table);
Fixed: select * from (select * from table) a, which runs successfully.
2. left join: if the right table is a further query built on top of the left table, it cannot reference the left table's temporary alias directly; use with table1 as (select xxxx) to reuse the same query result. Otherwise, since a result only exists once its statement has actually run, the right-table query would have to repeat the whole left-table query.
Error: table name "a" must be "db.table" format
Essence: you cannot write (select xx from xx where xx) a left join (select yy from a where yy) b
By the time execution reaches the second a, no table a actually exists!!
Solutions:
[Option 1] Write the subquery out again:
(select xx from xx where xx) a left join (select yy from (select xx from xx where xx) a where yy) b
[Option 2] Use with a as (...) to reuse the same query result multiple times (a fuller sketch follows below):
with a as (select xx from xx where xx)
select ... from a left join (select yy from a where yy) b on ...
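A fuller sketch of option 2; the table and column names (dw.orders, order_id, brand, amount, dt) are hypothetical placeholders:
with a as (
  select order_id, brand, amount
  from dw.orders
  where dt = '2020-04-29'
)
select a.order_id, a.amount, b.brand_total
from a
left join (
  -- the right table can now reference the CTE a instead of repeating its subquery
  select brand, sum(amount) as brand_total
  from a
  group by brand
) b
on a.brand = b.brand;
Note that by default Hive inlines the CTE into the plan, so this mainly fixes the name-resolution problem and improves readability; it does not guarantee the subquery is physically executed only once.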
3. Error when using a regex column reference in Hive
Error: Hive query failed, Error while compiling statement: FAILED: SemanticException [Error 10002]: Line 28:18 Invalid column reference 'median.*'
Query: select a.*, b.`median.*` from a
left join
(
select brand, region
, percentile(rscore, 0.5) as median_r
, percentile(fscore, 0.5) as median_f
, percentile(mscore, 0.5) as median_m
from a
group by brand, region
) b
on a.brand=b.brand and a.region=b.region
b.`median.*` refers to the columns of table b whose names start with "median".
Solution: add the following at the top of the script:
set hive.support.quoted.identifiers=none;
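A minimal sketch of how the regex-column syntax behaves once the setting is on (table t and its columns are hypothetical): with hive.support.quoted.identifiers=none, a backtick-quoted string is treated as a regular expression over column names.
set hive.support.quoted.identifiers=none;
select `median.*` from t;             -- every column whose name starts with "median"
select `(brand|region)?+.+` from t;   -- the usual exclusion trick: every column except brand and region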
4. insert overwrite error
Error: Caused by: MetaException(message:Exception thrown when executing query : SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MPartition' AS NUCLEUS_TYPE, A0.CREATE_TIME, A0.LAST_ACCESS_TIME, A0.PART_NAME, A0.PART_ID FROM PARTITIONS A0 LEFT OUTER JOIN TBLS B0 ON A0.TBL_ID = B0.TBL_ID LEFT OUTER JOIN DBS C0 ON B0.DB_ID = C0.DB_ID WHERE B0.TBL_NAME = ? AND C0.NAME = ? AND A0.PART_NAME = ?)
Cause: an invalid character in a partition column value; partition columns do not support Chinese characters.
Root cause: source column values are matched to the output partition columns by position, not by name!!!
Example:
insert overwrite table employees
partition (country, state)
select ..., se.cnty, se.st
from staged_employees se;
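A hedged sketch of the positional mapping, keeping the employees / staged_employees example above (the non-partition columns and the defensive filter are illustrative assumptions): the last columns of the select fill country and state purely by position, so a malformed value in se.cnty or se.st becomes a partition directory name, which is what the metastore chokes on.
-- fully dynamic partitions typically require nonstrict mode
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table employees
partition (country, state)
select name, salary, se.cnty, se.st    -- se.cnty -> country, se.st -> state, matched by position only
from staged_employees se
where se.cnty rlike '^[A-Za-z ]+$'     -- defensively drop non-ASCII / malformed partition values
  and se.st rlike '^[A-Za-z ]+$';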
5. cannot recognize input near error
报错:Error: Error while compiling statement: FAILED: ParseException line 10:8 cannot recognize input near '(' 'select' 'statis_id' in joinSource (state=42000,code=40000)
Cause: in most cases, a subquery or joined sub-table was not given an alias.
Actual cause this time: a misspelled variable name. I kept agonizing over the alias checks; who would have guessed it was a typo in the name... T^T
6. Error: org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
Solution 1: replace set mapred.reduce.tasks=10; with set mapreduce.job.reduces=10; (the parameter name used on Hive 2).
7. Query failed (#20200429_072245_00729_nh2wp): Error opening Hive split s3://bi-oor-com/warehouse/analysis/inter/union_inter_dw.db/dim_mkt_app_info/dayno=29991231/000003_0 (offset=0, length=25082323): Malformed ORC file. Can not read SQL type integer from ORC stream .system_id of type STRING [s3://bi-oor-com/warehouse/analysis/inter/union_inter_dw.db/dim_mkt_app_info/dayno=29991231/000003_0]
Solution: too many columns were being selected; do not use select *, list only the specific columns you need.
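A minimal sketch of the fix, using the table and partition from the error message (the column names listed are hypothetical): name the columns explicitly instead of select *, which also lets you leave out a column whose ORC type does not match the table schema, such as system_id here.
select app_id, app_name, dayno
from union_inter_dw.dim_mkt_app_info
where dayno = '29991231';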
8. SemanticException Column app_id Found in more than One Tables/Subqueries (state=42000,code=40000)
This usually happens when two joined tables share a column name and no aliases were given.
But I hit a strange case:
A union all B
B = aa left join bb left join cc (all with aliases specified)
The problem turned out to be a not in (select xx_id from A) condition inside aa; once that condition was removed, the error went away.
Solution: no idea why, but wrapping the B-side logic in with tmp as (...) made the error stop.
Hive's support for in is a bit flaky: the SQL kept failing, failed even when run in separate pieces, sometimes complaining column ... more than one tables, sometimes reporting a syntax error, and sometimes it just ran...
So from now on, whenever a query needs in / not in, rewrite it as a join instead, as follows (a concrete sketch comes after the list):
- in:
  select A.x_id, A.xxx, ... from (table) A left join (table) B on A.x_id = B.x_id where B.x_id is not null;
- not in:
  select A.x_id, A.xxx, ... from (table) A left join (table) B on A.x_id = B.x_id where B.x_id is null;
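A concrete sketch of the not in rewrite (user_event, blacklist, and the column names are hypothetical placeholders):
-- anti-join replacement for: where A.x_id not in (select x_id from blacklist)
select A.x_id, A.event_name
from user_event A
left join blacklist B
  on A.x_id = B.x_id
where B.x_id is null;
One extra benefit: unlike not in, the anti-join form is not silently emptied out when B.x_id contains NULL values.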
9. hive insert overwrite / into reports no error, but no data is written
Cause unknown >.<
Maybe the partition column must not contain NULL? After filtering the partition column to non-NULL values the data seemed to get written in? <Verified: this is not the problem>
Now suspecting that certain set hive parameters make a query that should return rows return nothing, so nothing gets inserted. Confirmed!!! Removing the parameter settings brings the data back, so the parameters are the problem!!! Probably one of the following:
- set hive.strict.checks.cartesian.product = false;
- set hive.mapred.mode = nonstrict;
- set hive.support.quoted.identifiers = none;
Will pin down tomorrow exactly which parameter it is. What a trap: it all started with using in to save effort, which requires turning on these settings; once they were on and forgotten, no data ever got written (in fact, nothing was being selected in the first place)!!! A huge, huge trap!!! Must get to the bottom of this tomorrow!!!
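Given the finding above, a defensive sketch (the table names are placeholders): enable the relaxed settings only for the statement that needs them, then reset the session so the insert runs under default settings (reset restores every overridden parameter to its default in Hive CLI / Beeline).
set hive.support.quoted.identifiers=none;   -- needed only for the regex-column select
-- ... run the query that needed the relaxed settings here ...

reset;                                      -- back to Hive defaults before writing

insert overwrite table dw.target_table      -- hypothetical target table
select app_id, app_name
from dw.source_table
where dayno = '20200429';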