Integrating ES with Spring Boot for Full-Text Search of Disk Files

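A friend recently asked me how to search a large volume of files on disk by directory, file name, and file content. The solution had to be simple, efficient, easy to maintain, and cheap. After some thought I concluded that ES is a suitable choice for indexing and searching the documents, so I wrote some code to implement it. This article walks through the design and the implementation.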

Overall architecture

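Because the files are spread across different machines, the system is built around disk-scanning agents. A scanning service is deployed as an agent on each server that hosts a target disk and runs there as a scheduled task, while all index entries are written to one shared ES, which is deployed as a distributed, highly available cluster. The search service is deployed together with the scanning agent, which simplifies the architecture and still lets the system scale out.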

[Figure: architecture for fast disk-file search]

Deploying ES

ES (Elasticsearch) is the only third-party software this project depends on. ES can be deployed with Docker; the steps are as follows.

docker pull docker.elastic.co/elasticsearch/elasticsearch:6.3.2
docker run -e ES_JAVA_OPTS="-Xms256m -Xmx256m" -d -p 9200:9200 -p 9300:9300 --name es01 docker.elastic.co/elasticsearch/elasticsearch:6.3.2

Once it is deployed, open http://localhost:9200 in a browser. If the page loads and shows the response below, ES has been deployed successfully.

[Figure: ES response page]

Project structure

Dependencies

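In addition to the basic Spring Boot starters, the project needs the ES-related packages and jmimemagic listed below. The code in this article also imports Lombok, fastjson, Apache POI, and Swagger, so those have to be declared as dependencies as well.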

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
        </dependency>
        <dependency>
            <groupId>io.searchbox</groupId>
            <artifactId>jest</artifactId>
            <version>5.3.3</version>
        </dependency>
        <dependency>
            <groupId>net.sf.jmimemagic</groupId>
            <artifactId>jmimemagic</artifactId>
            <version>0.1.4</version>
        </dependency>
    </dependencies>

Configuration file

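The ES address has to be configured in application.yml. To keep the program simple, the root directory of the disk to scan (index-root) is configured there as well; the scan task later walks this directory recursively and visits every indexable file under it.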

server:
  port: @elasticsearch.port@
spring:
  application:
    name: @project.artifactId@
  profiles:
    active: dev
  elasticsearch:
    jest:
      uris: http://127.0.0.1:9200
index-root: /Users/crazyicelee/mywokerspace

Index document definition

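The directory a file lives in, its file name, and its content must all be searchable, so each of them is defined as an index field. The id field is annotated with @JestId, as the ES client (Jest) requires.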

package com.crazyice.lee.accumulation.search.data;

import io.searchbox.annotations.JestId;
import lombok.Data;

@Data
public class Article {
    @JestId
    private Integer id;
    private String author;
    private String title;
    private String path;
    private String content;
    private String fileFingerprint;
}

Scanning the disk and creating the index

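Every file under the configured directory has to be scanned, so the directory is traversed recursively, and files that have already been processed are marked to save work. Two methods of file-type detection are available: a more accurate one based on file content (Magic) and a rough one based on the file extension. This is the core component of the whole system.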
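To make the extension-based option concrete, here is a minimal sketch of a helper that extracts and normalizes the extension. The class and method names (FileTypeByExtension, extensionOf, isIndexable) are my own and not part of the project; the set of extensions is copied from the switch statement in the scanner. Unlike a plain split on ".", it treats names without an extension as having no type and ignores case.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper for illustration; not part of the original project.
public class FileTypeByExtension {

    // Extensions the scanner can read, taken from the switch statement in the article.
    private static final Set<String> INDEXABLE =
            new HashSet<>(Arrays.asList("java", "c", "cpp", "txt", "doc", "docx"));

    // Returns the lower-cased extension, or empty when the name has none.
    public static Optional<String> extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        // dot <= 0 covers "README" (no dot) and ".gitignore" (leading dot only);
        // a trailing dot ("notes.") has no extension either.
        if (dot <= 0 || dot == fileName.length() - 1) {
            return Optional.empty();
        }
        return Optional.of(fileName.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    // True when the extension is one the scanner knows how to read.
    public static boolean isIndexable(String fileName) {
        return extensionOf(fileName).map(INDEXABLE::contains).orElse(false);
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("Report.DOCX")); // prints Optional[docx]
        System.out.println(extensionOf("README"));      // prints Optional.empty
        System.out.println(isIndexable("Main.java"));   // prints true
        System.out.println(isIndexable("photo.png"));   // prints false
    }
}
```

The extension is only a hint: a renamed file keeps its old content, which is why the content-based Magic method is the more accurate of the two.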

A small trick

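An MD5 value is computed over the content of each file and stored in an ES index field as the file's fingerprint. Each time the index is rebuilt, the scanner checks whether that MD5 already exists; if it does, the file is not indexed again. This prevents duplicate index entries and avoids re-indexing every file after the system restarts.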
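The Md5CaculateUtil helper used by the scanner is not listed in this article. As a rough idea of what such a helper might do, here is a minimal sketch that uses only the JDK. The class name FileFingerprint and the method md5Hex are assumptions for illustration, not the project's actual code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for the project's Md5CaculateUtil, which is not shown.
public class FileFingerprint {

    // Streams the file through MD5 and returns the digest as 32 lower-case hex characters.
    public static String md5Hex(Path file) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 is not available", e);
        }
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            // Read in chunks so large files are never loaded into memory at once.
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder(32);
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("fingerprint", ".txt");
        Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(md5Hex(tmp)); // prints 5d41402abc4b2a76b9719d911017c592
        Files.delete(tmp);
    }
}
```

One consequence of fingerprinting by content is worth keeping in mind: two files with identical content share a fingerprint, so only the first one found is indexed, and a file that is merely moved or renamed is not indexed again under its new path.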

package com.crazyice.lee.accumulation.search.service;

import com.alibaba.fastjson.JSONObject;
import com.crazyice.lee.accumulation.search.data.Article;
import com.crazyice.lee.accumulation.search.utils.Md5CaculateUtil;
import io.searchbox.client.JestClient;
import io.searchbox.core.Index;
import io.searchbox.core.Search;
import io.searchbox.core.SearchResult;
import lombok.extern.slf4j.Slf4j;
import net.sf.jmimemagic.*;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

@Component
@Slf4j
public class DirectoryRecurse {

    @Autowired
    private JestClient jestClient;

    // Read the file content and convert it to a string
    private String readToString(File file, String fileType) {
        StringBuilder result = new StringBuilder();
        switch (fileType) {
            case "text/plain":
            case "java":
            case "c":
            case "cpp":
            case "txt":
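                try (FileInputStream in = new FileInputStream(file)) {
                    byte[] filecontent = new byte[(int) file.length()];
                    // A single read() may return fewer bytes than requested, so loop until the buffer is full
                    int offset = 0;
                    int n;
                    while (offset < filecontent.length
                            && (n = in.read(filecontent, offset, filecontent.length - offset)) > 0) {
                        offset += n;
                    }
                    result.append(new String(filecontent, 0, offset, "UTF-8"));
                } catch (IOException e) {
                    // Also covers FileNotFoundException, which is a subclass of IOException
                    log.error("{}", e.getLocalizedMessage());
                }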
                break;
            case "doc":
                // Use WordExtractor from the HWPF component to extract the text of a Word document
                try (FileInputStream in = new FileInputStream(file)) {
                    WordExtractor extractor = new WordExtractor(in);
                    result.append(extractor.getText());
                } catch (Exception e) {
                    log.error("{}", e.getLocalizedMessage());
                }
                break;
            case "docx":
                try (FileInputStream in = new FileInputStream(file); XWPFDocument doc = new XWPFDocument(in)) {
                    XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
                    result.append(extractor.getText());
                } catch (Exception e) {
                    log.error("{}", e.getLocalizedMessage());
                }
                break;
        }
        return result.toString();
    }

    // Check whether the file has already been indexed
    private JSONObject isIndex(File file) {
        JSONObject result = new JSONObject();
        // Use the MD5 as the file fingerprint and search the index for it
        String fileFingerprint = Md5CaculateUtil.getMD5(file);
        result.put("fileFingerprint", fileFingerprint);
        // Default to "not indexed" so callers never read a missing value
        result.put("isIndex", false);
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.query(QueryBuilders.termQuery("fileFingerprint", fileFingerprint));
        Search search = new Search.Builder(searchSourceBuilder.toString()).addIndex("diskfile").addType("files").build();
        try {
            // Execute the search
            SearchResult searchResult = jestClient.execute(search);
            // The search fails when the index does not exist yet (first run), so check before reading the total
            if (searchResult.isSucceeded() && searchResult.getTotal() != null && searchResult.getTotal() > 0) {
                result.put("isIndex", true);
            }
        } catch (IOException e) {
            log.error("{}", e.getLocalizedMessage());
        }
        return result;
    }

    // Create an index entry for the file's directory, name, and content
    private void createIndex(File file, String method) {
        // Skip temporary files, whose names start with ~$
        if (file.getName().startsWith("~$")) return;

        String fileType = null;
        switch (method) {
            case "magic":
                Magic parser = new Magic();
                try {
                    MagicMatch match = parser.getMagicMatch(file, false);
                    fileType = match.getMimeType();
                } catch (MagicParseException e) {
                    //log.error("{}",e.getLocalizedMessage());
                } catch (MagicMatchNotFoundException e) {
                    //log.error("{}",e.getLocalizedMessage());
                } catch (MagicException e) {
                    //log.error("{}",e.getLocalizedMessage());
                }
                break;
            case "ext":
                String filename = file.getName();
                String[] strArray = filename.split("\\.");
                int suffixIndex = strArray.length - 1;
                fileType = strArray[suffixIndex];
        }

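        // No type could be determined (for example the magic lookup failed), so skip the file;
        // switching on a null String would throw a NullPointerException
        if (fileType == null) return;

        switch (fileType) {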
            case "text/plain":
            case "java":
            case "c":
            case "cpp":
            case "txt":
            case "doc":
            case "docx":
                JSONObject isIndexResult = isIndex(file);
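                log.info("File: {}, type: {}, MD5: {}, already indexed: {}", file.getPath(), fileType, isIndexResult.getString("fileFingerprint"), isIndexResult.getBoolean("isIndex"));

                // getBooleanValue returns false when the key is missing, which avoids a NullPointerException on unboxing
                if (isIndexResult.getBooleanValue("isIndex")) break;
                //1. Build the document to store in ES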
                Article article = new Article();
                article.setTitle(file.getName());
                article.setAuthor(file.getParent());
                article.setPath(file.getPath());
                article.setContent(readToString(file, fileType));
                article.setFileFingerprint(isIndexResult.getString("fileFingerprint"));
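                //2. Build the index request
                Index index = new Index.Builder(article).index("diskfile").type("files").build();
                try {
                    //3. Execute it; isSucceeded() is safe even when the response carries no id
                    if (jestClient.execute(index).isSucceeded()) {
                        log.info("Document indexed successfully");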
                    }
                } catch (IOException e) {
                    log.error("{}", e.getLocalizedMessage());
                }
                break;
        }
    }

    public void find(String pathName) throws IOException {
        // Get the File object for pathName
        File dirFile = new File(pathName);

        // If the file or directory does not exist, log it and return
        if (!dirFile.exists()) {
            log.info("{} does not exist", pathName);
            return;
        }

        // If it is not a directory but a regular file, index it directly
        if (!dirFile.isDirectory()) {
            if (dirFile.isFile()) {
                createIndex(dirFile, "ext");
            }
            return;
        }

        // List the names of all files and directories in this directory
        String[] fileList = dirFile.list();
        // list() returns null when the directory cannot be read
        if (fileList == null) {
            log.warn("cannot list {}", pathName);
            return;
        }

        for (String name : fileList) {
            File file = new File(dirFile.getPath(), name);
            // Recurse into subdirectories and index everything else
            if (file.isDirectory()) {
                find(file.getCanonicalPath());
            } else {
                createIndex(file, "ext");
            }
        }
    }
}
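The hand-written recursion above could also be expressed with java.nio.file.Files.walk, which returns the whole tree as a stream. The following is only a sketch of that alternative, not what the project does; the class name DirectoryWalkSketch and the method regularFiles are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical alternative to the recursive find() method; for illustration only.
public class DirectoryWalkSketch {

    // Collects every regular file under root, skipping Office lock files such as "~$report.docx".
    public static List<Path> regularFiles(Path root) throws IOException {
        // Files.walk keeps directory handles open, so the stream must be closed.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> !p.getFileName().toString().startsWith("~$"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the directory to scan as the first argument; defaults to the current directory.
        Path root = java.nio.file.Paths.get(args.length > 0 ? args[0] : ".");
        for (Path file : regularFiles(root)) {
            System.out.println(file);
        }
    }
}
```

Files.walk does not follow symbolic links unless asked to, which sidesteps link cycles. Note that an unreadable subdirectory surfaces as an UncheckedIOException while the stream is consumed, so a production scanner would likely need to catch that or use Files.walkFileTree with a visitor instead.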

Scan task

A scheduled task scans the configured directory, so the index is built incrementally as files are added or changed.

package com.crazyice.lee.accumulation.search.service;

import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.io.IOException;

@Configuration
@Component
@Slf4j
public class CreateIndexTask {
    @Autowired
    private DirectoryRecurse directoryRecurse;

    @Value("${index-root}")
    private String indexRoot;

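    // Fire at second 0 of every fifth minute; "*" in the seconds field would fire every second of those minutes
    @Scheduled(cron = "0 0/5 * * * ?")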
    private void addIndex(){
        try {
            directoryRecurse.find(indexRoot);
        } catch (IOException e) {
            log.error("{}",e.getLocalizedMessage());
        }
    }
}

Search service

The search service is exposed as a RESTful API. Matched keywords are highlighted for the front-end UI, and the browser can render the returned JSON.

package com.crazyice.lee.accumulation.search.web;

import com.alibaba.fastjson.JSONObject;
import com.crazyice.lee.accumulation.search.data.Article;
import io.searchbox.client.JestClient;
import io.searchbox.core.Search;
import io.searchbox.core.SearchResult;
import io.swagger.annotations.ApiImplicitParam;
import io.swagger.annotations.ApiImplicitParams;
import io.swagger.annotations.ApiOperation;
import lombok.extern.slf4j.Slf4j;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.lang.NonNull;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

@RestController
@Slf4j
public class Controller {
    @Autowired
    private JestClient jestClient;

    @RequestMapping(value = "/search/{keyword}",method = RequestMethod.GET)
    @ApiOperation(value = "Search all fields for a keyword",notes = "ES verification")
    @ApiImplicitParams(
            @ApiImplicitParam(name = "keyword",value = "Full-text search keyword",required = true,paramType = "path",dataType = "String")
    )
    public List search(@PathVariable String keyword){
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.query(QueryBuilders.queryStringQuery(keyword));

        HighlightBuilder highlightBuilder = new HighlightBuilder();
        // Highlight the path field
        HighlightBuilder.Field highlightPath = new HighlightBuilder.Field("path");
        highlightPath.highlighterType("unified");
        highlightBuilder.field(highlightPath);
        // Highlight the title field
        HighlightBuilder.Field highlightTitle = new HighlightBuilder.Field("title");
        highlightTitle.highlighterType("unified");
        highlightBuilder.field(highlightTitle);
        // Highlight the content field
        HighlightBuilder.Field highlightContent = new HighlightBuilder.Field("content");
        highlightContent.highlighterType("unified");
        highlightBuilder.field(highlightContent);

        // Apply the highlight configuration
        searchSourceBuilder.highlighter(highlightBuilder);

        log.info("Search request: {}",searchSourceBuilder.toString());

        // Build the search against the same index and type the scanner writes to
        Search search = new Search.Builder(searchSourceBuilder.toString()).addIndex("diskfile").addType("files").build();
        try {
            // Execute the search
            SearchResult result = jestClient.execute( search );
            return result.getHits(Article.class);
        } catch (IOException e) {
            log.error("{}",e.getLocalizedMessage());
        }
        return null;
    }
}
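One thing to watch with queryStringQuery is that the keyword is parsed with the query_string syntax, so characters such as ":" or "/" in user input are treated as operators and can make the request fail. The following is a minimal sketch of an escaping helper that could be applied to the keyword before it is passed to the query. The class name QueryStringEscaper is hypothetical, and the character list is my reading of the reserved characters in the Elasticsearch documentation.

```java
// Hypothetical helper for illustration; not part of the original project.
public class QueryStringEscaper {

    // Characters the query_string syntax reserves as operators.
    private static final String RESERVED = "+-=&|!(){}[]^\"~*?:\\/";

    // Prefixes reserved characters with a backslash; '<' and '>' cannot be escaped, so they are dropped.
    public static String escape(String keyword) {
        StringBuilder out = new StringBuilder(keyword.length() + 8);
        for (int i = 0; i < keyword.length(); i++) {
            char c = keyword.charAt(i);
            if (c == '<' || c == '>') {
                continue;
            }
            if (RESERVED.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("src/main:App")); // prints src\/main\:App
        System.out.println(escape("C++"));          // prints C\+\+
        System.out.println(escape("a<b"));          // prints ab
    }
}
```

This only neutralizes punctuation. Words such as AND, OR, and NOT are still interpreted as operators, so if users should never be able to write query syntax at all, a different query type may be the better fit.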

Testing the RESTful search results

The API is tested through Swagger. The keyword parameter is the term to search for in the full-text index.

[Figure: search results]

Generating the UI with Thymeleaf

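With the Thymeleaf template engine integrated, the search results can be rendered directly as web pages. The templates consist of a main search page and a results page, implemented with the @Controller annotation and a Model object.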

<body>
    <div class="container">
        <div class="header">
            <form action="./search" class="parent">
                <input type="text" name="keyword" th:value="${keyword}">
                <input type="submit" value="Search">
            </form>
        </div>

        <div class="content" th:each="article,memberStat:${articles}">
            <div class="c_left">
                <p class="con-title" th:text="${article.title}"/>
                <p class="con-path" th:text="${article.path}"/>
                <p class="con-preview" th:utext="${article.highlightContent}"/>
                <a class="con-more">More</a>
            </div>
            <div class="c_right">
                <p class="con-all" th:utext="${article.content}"/>
            </div>
        </div>

        <script language="JavaScript">
            document.querySelectorAll('.con-more').forEach(item => {
                item.onclick = () => {
                    item.style.cssText = 'display: none';
                    item.parentNode.querySelector('.con-preview').style.cssText = 'max-height: none;';
                };
            });
        </script>
    </div>
</body>
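The template reads article.highlightContent, a property the Article class shown earlier does not define, so the @Controller presumably hands the template a view object that carries the highlighted fragments. Here is a minimal sketch of how such a preview string might be assembled, assuming the fragments of a hit are available as a Map from field name to a list of strings. The class name HighlightPreview and the method preview are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for building the preview text shown in the results page.
public class HighlightPreview {

    // Joins the highlighted fragments of one field, or returns the fallback when there are none.
    public static String preview(Map<String, List<String>> highlight, String field, String fallback) {
        List<String> fragments = highlight == null ? null : highlight.get(field);
        if (fragments == null || fragments.isEmpty()) {
            return fallback;
        }
        return String.join(" ... ", fragments);
    }

    public static void main(String[] args) {
        Map<String, List<String>> highlight = new HashMap<>();
        highlight.put("content", Arrays.asList("uses <em>Jest</em> to index", "the <em>Jest</em> client"));

        System.out.println(preview(highlight, "content", "(no preview)"));
        // prints: uses <em>Jest</em> to index ... the <em>Jest</em> client
        System.out.println(preview(highlight, "title", "(no preview)"));
        // prints: (no preview)
    }
}
```

Because the template renders these values with th:utext, any markup inside the indexed files reaches the browser unescaped. That is what makes the em tags around matches visible, but it also means file content should be treated as untrusted before it is rendered this way.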

Last edited: 2019.08.26 16:24:07
