Original post by: Zheng Jiawei (郑佳伟)
I've been working on search and recommendation lately, so I'm writing up some related material to share.
A key tool in search is Elasticsearch (ES for short). ES builds an inverted index: you send it the terms you want to search for and it returns the matching articles. As for why we don't just query the database directly: full-text lookups in a database are too slow. For the details, see 《为什么需要 Elasticsearch》 (Why you need Elasticsearch): https://zhuanlan.zhihu.com/p/73585202.
In a typical application the data lives in a database, so before you can search it in ES you have to import it from the database into ES. This article introduces a tool for that job, Monstache. It covers how to install and use Monstache, and shares a Docker image and a docker-compose file you can use directly.
Steps to get searchable content into ES
- Sync the data in MongoDB to Elasticsearch (ES)
- Rebuild the index in ES
Prerequisites
Create a MongoDB replica set: see the companion article docker 部署 MongoDB 副本集 (deploying a MongoDB replica set with Docker).
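A replica set is required because Monstache tails the MongoDB oplog (or change streams), which only exist on replica sets; a standalone mongod will not work. For a quick local test, a single-member replica set is enough. A minimal sketch (the image tag and the 27011 port mapping are my assumptions, adjust to your setup):
docker run -d --name mongo-rs -p 27011:27017 mongo:4.2 --replSet rs0
docker exec -it mongo-rs mongo --eval 'rs.initiate()' # initialize the single-member replica set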
Set up ES and install the analyzer plugin
Method: pull an image that already bundles ES 7.5 and the hanlp analyzer.
Note: the ES version and the hanlp plugin version must match.
tomczhen/elasticsearch-with-hanlp:7.5.0 # image name
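For reference, one way to start the container (a minimal sketch: single-node discovery and the standard 9200/9300 ports are my assumptions, adjust to your environment):
docker run -d --name es-hanlp -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" tomczhen/elasticsearch-with-hanlp:7.5.0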
Syncing data from MongoDB to Elasticsearch
Quick method
The full process is fairly involved, so for readers who don't have the patience for the whole article, here is the docker-compose file up front.
For installing docker-compose, see lianglin: docker-compose的安装; for how to use it, see 纯洁的微笑: Docker(四):Docker 三剑客之 Docker Compose.
version: '3.7'
services:
  monstache:
    # image source
    image: zhengjiawei001/monstache
    container_name: zjw_monstache
    restart: always
    command: bash -c "source /etc/profile && cd /usr/local/monstache && monstache --mongo-url mongodb://10.30.89.124:27011 --elasticsearch-url http://10.30.89.124:9200 -f config.toml"
networks:
  default:
    external:
      name: serving-database_default
Note 1: any parameter that goes in config.toml can also be supplied on the command line, e.g. --mongo-url mongodb://10.30.89.124:27011 sets the MongoDB address and --elasticsearch-url http://10.30.89.124:9200 sets the ES address.
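With the file saved as docker-compose.yml, bring the service up and follow its logs (standard docker-compose usage):
docker-compose up -d
docker-compose logs -f monstache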
The full set of parameters (monstache's help output):
Usage of monstache:
  -change-stream-namespace value
      A list of change stream namespaces
  -cluster-name string
      Name of the monstache process cluster
  -config-database-name string
      The MongoDB database name that monstache uses to store metadata
  -debug
      True to enable verbose debug information
  -delete-index-pattern string
      An Elasticsearch index-pattern to restrict the scope of stateless deletes
  -delete-strategy value
      Strategy to use for deletes. 0=stateless, 1=stateful, 2=ignore
  -direct-read-bounded
      True to limit direct reads to the docs present at query start time
  -direct-read-concur int
      Max number of direct-read-namespaces to read concurrently. By default all given are read concurrently
  -direct-read-dynamic-exclude-regex string
      A regex to use for excluding namespaces when using dynamic direct reads
  -direct-read-dynamic-include-regex string
      A regex to use for including namespaces when using dynamic direct reads
  -direct-read-namespace value
      A list of direct read namespaces
  -direct-read-no-timeout
      True to set the no cursor timeout flag for direct reads
  -direct-read-split-max int
      Max number of times to split a collection for direct reads
  -direct-read-stateful
      True to mark direct read namespaces as complete and not sync them in future runs
  -disable-change-events
      True to disable listening for changes. You must provide direct-reads in this case
  -disable-delete-protection
      True to disable delete protection and allow multiple deletes in Elasticsearch per event in MongoDB
  -disable-file-pipeline-put
      True to disable auto-creation of the ingest plugin pipeline
  -dropped-collections
      True to delete indexes from dropped collections (default true)
  -dropped-databases
      True to delete indexes from dropped databases (default true)
  -elasticsearch-client-timeout int
      Number of seconds before a request to Elasticsearch is timed out
  -elasticsearch-max-bytes int
      Number of bytes to hold before flushing to Elasticsearch
  -elasticsearch-max-conns int
      Elasticsearch max connections
  -elasticsearch-max-docs int
      Number of docs to hold before flushing to Elasticsearch
  -elasticsearch-max-seconds int
      Number of seconds before flushing to Elasticsearch
  -elasticsearch-password string
      The elasticsearch password for basic auth
  -elasticsearch-pem-file string
      Path to a PEM file for secure connections to elasticsearch
  -elasticsearch-retry
      True to retry failed request to Elasticsearch
  -elasticsearch-url value
      A list of Elasticsearch URLs
  -elasticsearch-user string
      The elasticsearch user name for basic auth
  -elasticsearch-validate-pem-file
      Set to boolean false to not validate the Elasticsearch PEM file (default true)
  -elasticsearch-version string
      Specify elasticsearch version directly instead of getting it from the server
  -enable-easy-json
      True to enable easy-json serialization
  -enable-http-server
      True to enable an internal http server
  -enable-oplog
      True to enable direct tailing of the oplog
  -enable-patches
      True to include a json-patch field on updates
  -env-delimiter string
      A delimiter to use when splitting environment variable values (default ",")
  -exit-after-direct-reads
      True to exit the program after reading directly from the configured namespaces
  -f string
      Location of configuration file
  -fail-fast
      True to exit if a single _bulk request fails
  -file-downloaders int
      GridFs download go routines
  -file-highlighting
      True to enable the ability to highlight search matches for a file query
  -file-namespace value
      A list of file namespaces
  -graylog-addr string
      Send logs to a Graylog server at this address
  -gzip
      True to enable gzip for requests to Elasticsearch
  -http-server-addr string
      The address the internal http server listens on
  -index-as-update
      True to index documents as updates instead of overwrites
  -index-files
      True to index gridfs files into elasticsearch. Requires the elasticsearch mapper-attachments (deprecated) or ingest-attachment plugin
  -index-oplog-time
      True to add date/time information from the oplog to each document when indexing
  -index-stats
      True to index stats in elasticsearch
  -mapper-plugin-path string
      The path to a .so file to load as a document mapper plugin
  -max-file-size int
      GridFs file content exceeding this limit in bytes will not be indexed in Elasticsearch
  -merge-patch-attribute string
      Attribute to store json-patch values under
  -mongo-config-url string
      MongoDB config server connection URL
  -mongo-oplog-collection-name string
      Override the collection name which contains the mongodb oplog
  -mongo-oplog-database-name string
      Override the database name which contains the mongodb oplog
  -mongo-url string
      MongoDB server or router server connection URL
  -namespace-drop-exclude-regex string
      A regex which is matched against a drop operation's namespace (<database>.<collection>). Only drop operations which do not match are synched to elasticsearch
  -namespace-drop-regex string
      A regex which is matched against a drop operation's namespace (<database>.<collection>). Only drop operations which match are synched to elasticsearch
  -namespace-exclude-regex string
      A regex which is matched against an operation's namespace (<database>.<collection>). Only operations which do not match are synched to elasticsearch
  -namespace-regex string
      A regex which is matched against an operation's namespace (<database>.<collection>). Only operations which match are synched to elasticsearch
  -oplog-date-field-format string
      Format to use for the oplog date
  -oplog-date-field-name string
      Field name to use for the oplog date
  -oplog-ts-field-name string
      Field name to use for the oplog timestamp
  -patch-namespace value
      A list of patch namespaces
  -pipe-allow-disk
      True to allow MongoDB to use the disk for pipeline options with lots of results
  -post-processors int
      Number of post-processing go routines
  -pprof
      True to enable pprof endpoints
  -print-config
      Print the configuration and then exit
  -prune-invalid-json
      True to omit values which do not serialize to JSON such as +Inf and -Inf and thus cause errors
  -relate-buffer int
      Number of relates to queue before skipping and reporting an error
  -relate-threads int
      Number of threads dedicated to processing relationships
  -replay
      True to replay all events from the oplog and index them in elasticsearch
  -resume
      True to capture the last timestamp of this run and resume on a subsequent run
  -resume-from-earliest-timestamp
      Automatically select an earliest timestamp to resume syncing from
  -resume-from-timestamp int
      Timestamp to resume syncing from
  -resume-name string
      Name under which to load/store the resume state. Defaults to 'default'
  -resume-strategy value
      Strategy to use for resuming. 0=timestamp, 1=token
  -resume-write-unsafe
      True to speed up writes of the last timestamp synched for resuming at the cost of error checking
  -routing-namespace value
      A list of namespaces that override routing information
  -stats
      True to print out statistics
  -stats-duration string
      The duration after which stats are logged
  -stats-index-format string
      time.Time supported format to use for the stats index names
  -time-machine-direct-reads
      True to index the results of direct reads into any time machine indexes
  -time-machine-index-prefix string
      A prefix to prepend to time machine indexes
  -time-machine-index-suffix string
      A suffix to append to time machine indexes
  -time-machine-namespace value
      A list of time machine namespaces
  -tpl
      True to interpret the config file as a template
  -v
      True to print the version number
  -verbose
      True to output verbose messages
  -worker string
      The name of this worker in a multi-worker configuration
  -workers value
      A list of worker names
Another very important parameter is direct-read-namespaces, which specifies the collections to sync. Because this image was packaged for my own needs, it syncs the zhuanlan and articles collections of the zhihu_new database by default.
Note 2: the reference article passes the ES address as --elasticsearch-urls, but the version used here takes --elasticsearch-url (the config-file key is still elasticsearch-urls); this is probably a version difference.
Detailed method
If you want to set everything up yourself, follow these steps:
I. Set up the Monstache environment
1. Install Go and configure the environment variables
(1) Download and unpack the Go tarball
wget https://dl.google.com/go/go1.14.4.linux-amd64.tar.gz
tar -C /usr/local -xzf go1.14.4.linux-amd64.tar.gz
(2) Configure the environment variables
Open the profile with vim /etc/profile and append the lines below. GOPROXY points at the Alibaba Cloud Go module proxy.
export GOROOT=/usr/local/go
export GOPATH=/home/go/
export PATH=$PATH:$GOROOT/bin:$GOPATH/bin
export GOPROXY=https://mirrors.aliyun.com/goproxy/
(3) Apply the environment variables
source /etc/profile
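A quick check that the toolchain is now on the PATH:
go version
# expected output: go version go1.14.4 linux/amd64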
2. Install Monstache
(1) Enter the install path
cd /usr/local/
(2) Clone the repository from GitHub
git clone https://github.com/rwynn/monstache.git
Note: if cloning from GitHub is slow, a Gitee (码云) mirror is usually much faster.
(3) Enter the monstache directory
cd monstache
(4) Switch to the right branch
This article uses the rel5 branch as an example.
git checkout rel5
(5) Install Monstache
go install
(6) Check the Monstache version
monstache -v
# expected output
# 5.5.5
II. Sync the data in MongoDB to Elasticsearch
Method: real-time sync with Monstache
(1) Enter the Monstache install directory and open the config file
cd /usr/local/monstache/
vim config.toml
(2) Edit the config file along the lines of the example below.
Note:
- mongo-url, elasticsearch-urls, and direct-read-namespaces will likely need changing
- if the contents of direct-read-namespaces change, update the [[mapping]] sections to match
# connection settings
# connect to MongoDB using the following URL
mongo-url = "mongodb://10.30.89.124:27011" # update as needed
# connect to the Elasticsearch REST API at the following node URLs
elasticsearch-urls = ["http://10.30.89.124:9200"] # update as needed
# frequently required settings
# if you need to seed an index from a collection and not just listen and sync changes events
# you can copy entire collections or views from MongoDB to Elasticsearch
direct-read-namespaces = ["zhihu_new.zhuanlan","zhihu_new.articles"] # update as needed
# if you want to use MongoDB change streams instead of legacy oplog tailing use change-stream-namespaces
# change streams require at least MongoDB API 3.6+
# if you have MongoDB 4+ you can listen for changes to an entire database or entire deployment
# in this case you usually don't need regexes in your config to filter collections unless you target the deployment.
# to listen to an entire db use only the database name. For a deployment use an empty string.
#change-stream-namespaces = ["mydb.col"]
# additional settings
# if you don't want to listen for changes to all collections in MongoDB but only a few
# e.g. only listen for inserts, updates, deletes, and drops from mydb.mycollection
# this setting does not initiate a copy, it is only a filter on the change event listener
#namespace-regex = '^mydb\.col$'
# compress requests to Elasticsearch
#gzip = true
# generate indexing statistics
#stats = true
# index statistics into Elasticsearch
#index-stats = true
# use the following PEM file for connections to MongoDB
#mongo-pem-file = "/path/to/mongoCert.pem"
# disable PEM validation
#mongo-validate-pem-file = false
# use the following user name for Elasticsearch basic auth
elasticsearch-user = "elastic"
# use the following password for Elasticsearch basic auth
elasticsearch-password = "<your_es_password>"
# use 4 go routines concurrently pushing documents to Elasticsearch
elasticsearch-max-conns = 4
# use the following PEM file to connections to Elasticsearch
#elasticsearch-pem-file = "/path/to/elasticCert.pem"
# validate connections to Elasticsearch
#elasticsearch-validate-pem-file = true
# propagate dropped collections in MongoDB as index deletes in Elasticsearch
dropped-collections = true
# propagate dropped databases in MongoDB as index deletes in Elasticsearch
dropped-databases = true
# do not start processing at the beginning of the MongoDB oplog
# if you set the replay to true you may see version conflict messages
# in the log if you had synced previously. This just means that you are replaying old docs which are already
# in Elasticsearch with a newer version. Elasticsearch is preventing the old docs from overwriting new ones.
#replay = false
# resume processing from a timestamp saved in a previous run
resume = true
# do not validate that progress timestamps have been saved
#resume-write-unsafe = false
# override the name under which resume state is saved
#resume-name = "default"
# use a custom resume strategy (tokens) instead of the default strategy (timestamps)
# tokens work with MongoDB API 3.6+ while timestamps work only with MongoDB API 4.0+
resume-strategy = 0
# exclude documents whose namespace matches the following pattern
#namespace-exclude-regex = '^mydb\.ignorecollection$'
# turn on indexing of GridFS file content
#index-files = true
# turn on search result highlighting of GridFS content
#file-highlighting = true
# index GridFS files inserted into the following collections
#file-namespaces = ["users.fs.files"]
# print detailed information including request traces
verbose = true
# enable clustering mode
cluster-name = 'es-cn-mp91kzb8m00******'
# do not exit after full-sync, rather continue tailing the oplog
#exit-after-direct-reads = false
[[mapping]] # update the mappings to match direct-read-namespaces
namespace = "zhihu_new.articles"
index = "articles"
type = "collection"
[[mapping]] # update the mappings to match direct-read-namespaces
namespace = "zhihu_new.zhuanlan"
index = "zhuanlan"
type = "collection"
Details of each setting can be found in the parameter list above.
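Before kicking off a long sync, it can be worth asking Monstache to echo back the configuration it actually parsed, using the -print-config flag from the parameter list above:
monstache -f config.toml -print-config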
(3) Run Monstache to start syncing MongoDB to ES
monstache -f config.toml
# Note: -f explicitly specifies the configuration file; with verbose on, all debug logs are printed (including request traces to ES).
# Time reference for the migration: about 0.303 GB took roughly 13 minutes.
(4) Prebuilt image (optional)
To skip the setup steps above, you can use the prepackaged image directly.
Image name: zhengjiawei001/monstache
docker run -it --network=serving-database_default zhengjiawei001/monstache /bin/bash
# the network must match mongo's bridge network (serving-database_default in this article's example)
Inside the container:
source /etc/profile # apply the environment variables
cd /usr/local/monstache/ # enter the monstache directory
vim config.toml # set the ES address, the mongo address, and the databases to migrate
monstache -f config.toml # run Monstache
Rebuilding the index in ES
Tool: Kibana
With ES's default analyzer, Chinese text is indexed character by character; here we switch to word-level indexing with the hanlp analyzer.
(1) Recreate the index (adding the analyzer)
GET zhuanlan/_mapping # view the existing mapping and copy it
PUT zhuanlan_new # after each "type" : "text" that needs word segmentation, specify the analyzer: "analyzer" : "hanlp"
{
"mappings" : {
"properties" : {
"accept_submission" : {
"type" : "boolean"
},
"articles_count" : {
"type" : "long"
},
"column_type" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"analyzer" : "hanlp"
},
"comment_permission" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"analyzer" : "hanlp"
},
"created" : {
"type" : "long"
},
"description" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"analyzer" : "hanlp"
},
"followers" : {
"type" : "long"
},
"id" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"image_url" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"intro" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"title" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"analyzer" : "hanlp"
},
"type" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"analyzer" : "hanlp"
},
"updated" : {
"type" : "long"
},
"url" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"url_token" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"analyzer" : "hanlp"
}
}
}
}
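Before copying any data, it is worth checking that the hanlp analyzer is actually resolvable on the new index. A quick sanity check in Kibana (the sample text is arbitrary); if the plugin is installed correctly, the returned tokens are words rather than single characters:
GET zhuanlan_new/_analyze
{
  "analyzer" : "hanlp",
  "text" : "搜索推荐系统"
}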
(2) Copy the old index into the new one (the example below uses the articles index; zhuanlan is handled the same way)
POST /_reindex?wait_for_completion=false
{
"source": {
"index": "articles",
"size":5000
},
"dest": {
"index": "articles_new",
"op_type": "create"
},
"conflicts": "proceed"
}
Parameter notes:
* If the reindex will take a long time, add wait_for_completion=false so that a taskId is returned immediately; you can then use the taskId to check on the task's progress.
* op_type controls how write conflicts are handled. With op_type set to create, a document that already exists in the new index raises a version conflict error; otherwise it is created. With op_type set to index (the _reindex default), every document is indexed, overwriting any existing copy.
* conflicts: when a version conflict occurs, _reindex aborts by default (so the copy may be left incomplete). Setting conflicts to proceed lets the copy from the old index to the new one continue past conflicts.
* size: the batch size; by default each batch is 1000 documents.
Output:
{
"task" : "bEECFrEzTv-zaWKADdtRWw:29118"
}
# 29118 is the task id
Note: if the source index is large (millions of documents), the _reindex can take a long time to finish. There is no need to sit and wait for the result; if you want to check progress along the way, query the _tasks API.
GET /_tasks/bEECFrEzTv-zaWKADdtRWw:29118
When the task has finished, completed is true.
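The response has roughly the following shape (the values here are illustrative, not real output):
{
  "completed" : true,
  "task" : {
    "node" : "bEECFrEzTv-zaWKADdtRWw",
    "id" : 29118,
    "status" : {
      "total" : 120000,
      "created" : 120000,
      "batches" : 24
    }
  }
}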
(3) Delete the old index
DELETE /zhuanlan
(4) Point an alias at the new index
POST /_aliases
{
"actions":[
{
"add":{
"index":"zhuanlan_new",
"alias":"zhuanlan"
}
}
]
}
(5) Test it: search with a word
GET _search
{
"query":{
"term":{
"content":{
"value":"运营"
}
}
}
}
The search returns the matching documents.
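Note that term does not analyze the query string, so the search above only matches if hanlp indexed 运营 as a single token. For ordinary full-text search, a match query, which does analyze its input, is usually the better choice; a sketch against the description field from the mapping above:
GET zhuanlan/_search
{
  "query" : {
    "match" : {
      "description" : "内容运营"
    }
  }
}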