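At work we use ElasticSearch as our full-text search engine, serving both end users and the back office. The version is a rather dated 2.3.2.
We recently received an optimization request: a search for a single Chinese character should match every document containing that character, while a search for a multi-character word should not fall back to matching character by character. Taking products as an example, searching for the character “酒” should match all documents about “啤酒” (beer), “白酒” (baijiu), “红酒” (red wine) and so on, but searching for the word “啤酒” should match only “啤酒”. In addition, results that match the whole keyword should rank first and results that match only through segmented tokens should come after, with hits ordered by relevance and then by sales volume.
Our first idea was to use different IK analyzers at index time and at search time for the fields with this requirement: finest-grained segmentation when indexing, smart segmentation when searching. The mapping is set as follows: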
~ curl -s -H 'Content-Type:application/json' \
-XPUT 'es0:9200/index/_mapping/type?pretty=true' -d '{
"properties": {
"productTitle": {
"type": "string",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
}
}
}'
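However, even with ik_max_word, many single characters are not emitted as separate tokens. So we added a single-character dictionary of about 12,000 characters to the custom dictionaries. With it in place, ik_max_word splits out every single character. For example: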
~ curl -s -H 'Content-Type: application/json' \
-XGET 'es0:9200/_analyze?pretty' -d '{
"analyzer" : "ik_max_word",
"text": "中华人民共和国"
}'
{
"tokens" : [ {
"token" : "中华人民共和国",
"start_offset" : 0,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 0
}, {
"token" : "中华人民",
"start_offset" : 0,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
}, {
"token" : "中华",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 2
}, {
"token" : "中",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_WORD",
"position" : 3
}, {
"token" : "华人",
"start_offset" : 1,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 4
}, {
"token" : "华",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 5
}, {
"token" : "人民共和国",
"start_offset" : 2,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 6
}, {
"token" : "人民",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 7
}, {
"token" : "人",
"start_offset" : 2,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 8
}, {
"token" : "民",
"start_offset" : 3,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 9
}, {
"token" : "共和国",
"start_offset" : 4,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 10
}, {
"token" : "共和",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 11
}, {
"token" : "共",
"start_offset" : 4,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 12
}, {
"token" : "和",
"start_offset" : 5,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 13
}, {
"token" : "国",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 14
} ]
}
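The single-character dictionary actually ships with the ES IK plugin; see https://github.com/medcl/elasticsearch-analysis-ik/blob/master/config/extra_single_word_full.dic
After adding the single-character dictionary to IKAnalyzer.cfg.xml, restart ES for it to take effect (we did not set up hot reloading, shame on us).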
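To see why this combination gives the behavior we wanted, here is a minimal sketch in plain Java that imitates the lookup with a toy inverted index. It does not call ES or IK: the class name `SingleCharSearchSketch` is made up for illustration, and the token lists are hand-written stand-ins for what ik_max_word (with the single-character dictionary) would emit at index time, so treat them as assumptions. The strings are written as Unicode escapes to keep the source ASCII-only: `\u9152` is “酒”, `\u5564\u9152` is “啤酒”, `\u767d\u9152` is “白酒” and `\u7ea2\u9152` is “红酒”.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class SingleCharSearchSketch {
    // Toy inverted index: token -> ids of the documents that emitted it at index time.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index a document from a hand-written token list (assumed stand-in for ik_max_word output).
    public void index(int docId, List<String> indexTokens) {
        for (String token : indexTokens) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Look up one search-time token (assumed stand-in for what ik_smart yields for the keyword).
    public Set<Integer> search(String queryToken) {
        return postings.getOrDefault(queryToken, Collections.emptySet());
    }

    // Three hypothetical product titles: beer, baijiu and red wine.
    public static SingleCharSearchSketch sample() {
        SingleCharSearchSketch idx = new SingleCharSearchSketch();
        idx.index(1, Arrays.asList("\u5564\u9152", "\u5564", "\u9152"));
        idx.index(2, Arrays.asList("\u767d\u9152", "\u767d", "\u9152"));
        idx.index(3, Arrays.asList("\u7ea2\u9152", "\u7ea2", "\u9152"));
        return idx;
    }

    public static void main(String[] args) {
        SingleCharSearchSketch idx = sample();
        // The single character is its own index token, so every title matches.
        System.out.println(idx.search("\u9152"));       // prints [1, 2, 3]
        // The whole word stays one search token, so only the beer title matches.
        System.out.println(idx.search("\u5564\u9152")); // prints [1]
    }
}
```

The point of the sketch is that the index side must contain both the word and its characters, while the search side must keep a word in one piece; if the keyword were also split into single characters at search time, a search for “啤酒” would hit every title containing “酒”.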
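The ranking rule mentioned earlier is comparatively simple: the term query just needs to outrank the match-phrase query. That is easy to achieve with boost, which sets the relative weight of each clause, and slop, which lets the phrase query tolerate gaps between tokens: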
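// Either clause may match; each clause that matches adds to the score.
BoolQueryBuilder currentBuilder = QueryBuilders.boolQuery();
// Exact match on the keyword as a single, unanalyzed term: higher boost.
currentBuilder.should(QueryBuilders.termQuery("productTitle", keyword).boost(6.5f));
// Phrase match on the analyzed keyword, with a slop of 4: lower boost.
currentBuilder.should(QueryBuilders.matchPhraseQuery("productTitle", keyword).slop(4).boost(2.5f));
// ......
// Paging.
requestBuilder.setFrom(start).setSize(limit);
// Sort by relevance score first, then by sales volume, both descending.
requestBuilder.addSort(SortBuilders.scoreSort().order(SortOrder.DESC));
requestBuilder.addSort(SortBuilders.fieldSort("soldNum").order(SortOrder.DESC));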
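The effect of the two sort clauses can be sketched outside ES as an ordinary two-level comparator. This is only an illustration under stated assumptions: `RankingSketch` and `Hit` are hypothetical names, and the scores are made-up numbers rather than anything Lucene computed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RankingSketch {
    // Hypothetical search hit: a relevance score plus the soldNum field.
    public static final class Hit {
        final String id;
        final double score;
        final long soldNum;

        public Hit(String id, double score, long soldNum) {
            this.id = id;
            this.score = score;
            this.soldNum = soldNum;
        }
    }

    // Order by score descending, then by soldNum descending; return the ids in that order.
    public static List<String> rank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(
            Comparator.comparingDouble((Hit h) -> h.score).reversed()
                .thenComparing(Comparator.comparingLong((Hit h) -> h.soldNum).reversed()));
        List<String> ids = new ArrayList<>();
        for (Hit h : sorted) {
            ids.add(h.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("A", 6.5, 10));   // high score, few sales
        hits.add(new Hit("B", 2.5, 900));  // low score, many sales
        hits.add(new Hit("C", 6.5, 300));  // same score as A, more sales
        System.out.println(rank(hits));    // prints [C, A, B]
    }
}
```

Note that B stays last despite its sales: the second sort key only decides between hits whose scores are exactly equal.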