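Over the weekend I found some time to write a bit of word-frequency counting code for a classmate's thesis, so here is a short summary. Out of professional habit, the project is built with Spring Boot + Maven, is programmed against interfaces, and is triggered through a web request.
- Dependency configuration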
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
        <exclusions>
            <exclusion>
                <groupId>org.junit.vintage</groupId>
                <artifactId>junit-vintage-engine</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox -->
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.18</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
    </dependency>
    <dependency>
        <groupId>org.ansj</groupId>
        <artifactId>ansj_seg</artifactId>
        <version>5.1.1</version>
    </dependency>
</dependencies>
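- Recursively walk the folder (I blanked on how to write code this simple in an interview and am still kicking myself) to collect every file whose word frequencies need to be counted.
- For convenience, the root folder path and the keywords to count are both maintained in this class. Code: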
package com.zz.jxh.utils;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

/**
 * @Author: zhouzhen
 * @email: zhouzhen0517@foxmail.com
 * @Description Recursively collects the paths of all files under a root folder
 * @Date: Create in 13:11 2020/4/25
 */
public class ScanningFiles {
    // Shared result list; every call to getfiles appends to it
    public static List<String> filelists = new ArrayList<>();
    public static final String ROOT_FILE_PATH = "E:\\WeChat Files\\megumi_ka_to\\FileStorage\\File\\2020-04\\样本文件";
    // Keywords to count; kept in Chinese because they must match the text of the documents
    public static final String[] KEYWORDS = {"营业收入","商誉减值准备","应收账款","存货","关联关系","毛利率","行业平均","变动趋势","年审会计师","会计准则"};

    public static List<String> getfiles(String filepath) {
        File root = new File(filepath);
        File[] files = root.listFiles();
        // listFiles() returns null when the path is not a directory or cannot be read
        if (files == null) {
            return filelists;
        }
        for (File file : files) {
            // If it is a directory, recurse into it
            if (file.isDirectory()) {
                getfiles(file.getAbsolutePath());
            } else {
                filelists.add(file.getAbsolutePath());
            }
        }
        return filelists;
    }
}
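As an aside, the same recursive traversal can be written without a shared static list by using `Files.walk` from `java.nio.file`. The following is only a minimal sketch of that alternative, not the code the project uses; the class name `FileWalker` and the method `listFiles` are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical alternative to ScanningFiles.getfiles; not part of the original project.
class FileWalker {

    /** Returns the absolute paths of all regular files under root, at any depth. */
    static List<String> listFiles(String root) throws IOException {
        // Files.walk visits the root and everything below it; the stream holds
        // open directory handles, so it is closed with try-with-resources.
        try (Stream<Path> paths = Files.walk(Paths.get(root))) {
            return paths
                    .filter(Files::isRegularFile)            // skip directories
                    .map(p -> p.toAbsolutePath().toString())
                    .sorted()                                // stable, predictable order
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory when no argument is given
        String root = args.length > 0 ? args[0] : ".";
        listFiles(root).forEach(System.out::println);
    }
}
```

Because each call builds and returns a fresh list, calling it repeatedly does not accumulate results from earlier scans.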
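- Create the Service interface and its implementation. The Service layer exposes two methods: one takes a file path and returns the file's content, the other takes the content plus the keywords and returns the word frequencies.
- PDF parsing is factored out into its own method, getTextFromPdf.
- Interface: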
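package com.zz.jxh.service;

import java.util.Map;

/**
 * @Author: zhouzhen
 * @email: zhouzhen0517@foxmail.com
 * @Description Reads document content and counts keyword frequencies
 * @Date: Create in 12:38 2020/4/25
 */
public interface JxhService {
    // Takes a file path and returns the extracted text content
    String getFile(String filepath) throws Exception;

    // Takes the content and the keywords and returns the title plus the per-keyword counts
    Map<String, Object> WordFrequencyStatistics(String content, String[] keywords);
}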
- Implementation:
package com.zz.jxh.service.impl;

import com.zz.jxh.service.JxhService;
import lombok.extern.slf4j.Slf4j;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.springframework.stereotype.Service;
import org.springframework.util.StringUtils;

import java.io.File;
import java.util.HashMap;
import java.util.Map;

/**
 * @Author: zhouzhen
 * @email: zhouzhen0517@foxmail.com
 * @Description PDF-backed implementation of JxhService
 * @Date: Create in 12:43 2020/4/25
 */
@Service
@Slf4j
public class JxhServiceImpl implements JxhService {

    @Override
    public String getFile(String filepath) throws Exception {
        String content = getTextFromPdf(filepath);
        if (!StringUtils.hasText(content)) {
            throw new Exception("Extracted content is empty, please check the file " + filepath);
        }
        return content;
    }

    @Override
    public Map<String, Object> WordFrequencyStatistics(String content, String[] keywords) {
        Map<String, Integer> frequencies = new HashMap<>();
        Map<String, Object> result = new HashMap<>();
        String title = getTitle(content);
        log.info("Processing {}", title);
        for (String keyword : keywords) {
            // Number of occurrences of the current keyword
            int times = 0;
            // Index of the current match
            int index;
            // Position at which the next search starts
            int next = 0;
            // indexOf returns -1 once no further match exists. An empty keyword is
            // skipped because it would match forever without advancing.
            while (!keyword.isEmpty() && (index = content.indexOf(keyword, next)) != -1) {
                times++;
                next = index + keyword.length();
            }
            frequencies.put(keyword, times);
        }
        // Output keys: "标题" = title, "词频统计" = word-frequency statistics
        result.put("标题", title);
        result.put("词频统计", frequencies);
        return result;
    }

    private String getTitle(String content) {
        // Titles in the sample documents run from "关于" ("regarding") to "问询函" ("inquiry letter")
        int start = content.indexOf("关于");
        int end = start < 0 ? -1 : content.indexOf("问询函", start);
        if (end < 0) {
            // No recognizable title; return an empty one instead of letting substring throw
            return "";
        }
        return content.substring(start, end + "问询函".length());
    }

    /**
     * @param pdfFilePath
     * @return java.lang.String
     * @author zhouzhen
     * @Description Takes a file path and returns the text content of the corresponding PDF
     * @date 2020/4/25 13:35
     */
    private String getTextFromPdf(String pdfFilePath) {
        // try-with-resources closes the document even when extraction fails
        try (PDDocument document = PDDocument.load(new File(pdfFilePath))) {
            if (!document.isEncrypted()) {
                PDFTextStripper textStripper = new PDFTextStripper();
                String exposeContent = textStripper.getText(document);
                // Strip line breaks and spaces so keywords split across lines still match
                return exposeContent.replace("\r", "").replace("\n", "").replace(" ", "");
            }
        } catch (Exception e) {
            log.error("Failed to read PDF file {}", pdfFilePath, e);
        }
        return "";
    }
}
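To make the counting rule easy to check without any PDF, the `indexOf` loop can be pulled into a standalone class. This is a minimal sketch of the same non-overlapping count; the names `KeywordCounter`, `count` and `countAll` are hypothetical and the sample sentence is invented for the demo.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical standalone version of the counting loop; not part of the original project.
class KeywordCounter {

    /** Counts non-overlapping occurrences of keyword in content. */
    static int count(String content, String keyword) {
        if (keyword.isEmpty()) {
            return 0; // an empty keyword would otherwise loop forever
        }
        int times = 0;
        int next = 0;   // where the next search starts
        int index;      // where the current match was found
        while ((index = content.indexOf(keyword, next)) != -1) {
            times++;
            next = index + keyword.length(); // jump past this match
        }
        return times;
    }

    /** Counts every keyword; LinkedHashMap keeps the keywords in the order given. */
    static Map<String, Integer> countAll(String content, String... keywords) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String keyword : keywords) {
            result.put(keyword, count(content, keyword));
        }
        return result;
    }

    public static void main(String[] args) {
        String content = "存货增加,存货周转率下降,应收账款上升";
        System.out.println(countAll(content, "存货", "应收账款", "毛利率"));
        // prints {存货=2, 应收账款=1, 毛利率=0}
    }
}
```

Since the search resumes after the end of each match, overlapping hits are not counted twice: `count("aaaa", "aa")` gives 2, not 3.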
- Controller:
package com.zz.jxh.controller;

import com.zz.jxh.service.JxhService;
import com.zz.jxh.utils.ScanningFiles;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import javax.annotation.Resource;
import java.util.HashMap;
import java.util.Map;

/**
 * @Author: zhouzhen
 * @email: zhouzhen0517@foxmail.com
 * @Description Web entry point that runs the word-frequency count over all scanned files
 * @Date: Create in 13:36 2020/4/25
 */
@RestController
@Slf4j
@RequestMapping("/start")
public class TestController {

    @Resource
    private JxhService jxhService;

    @GetMapping("/run")
    public ResponseEntity<Map<String, Map>> startMain() throws Exception {
        // File path -> statistics for that file
        Map<String, Map> maps = new HashMap<>();
        // Reset the shared static list so repeated requests do not accumulate duplicate paths
        ScanningFiles.filelists.clear();
        ScanningFiles.getfiles(ScanningFiles.ROOT_FILE_PATH);
        for (String filename : ScanningFiles.filelists) {
            String content = jxhService.getFile(filename);
            maps.put(filename, jxhService.WordFrequencyStatistics(content, ScanningFiles.KEYWORDS));
        }
        return ResponseEntity.ok(maps);
    }
}
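- Test: open http://localhost:8080/start/run in a browser.
- The test cases used financial reports from the Shenzhen Stock Exchange and the Shanghai Stock Exchange.
TIPS: To handle files in other formats, modify the parsing method getTextFromPdf.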
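One way to act on that tip is to pick the parser by file extension. The sketch below is only an illustration under assumptions: the class `TextExtractors` and its methods are hypothetical, plain-text files are assumed to be UTF-8, and the PDF branch is left as a placeholder where the existing getTextFromPdf logic would go.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical dispatcher; not part of the original project.
class TextExtractors {

    /** Returns the lower-cased extension without the dot, or "" if there is none. */
    static String extension(String path) {
        int dot = path.lastIndexOf('.');
        return dot < 0 ? "" : path.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Extracts text according to the file extension. */
    static String getText(String path) throws IOException {
        switch (extension(path)) {
            case "txt":
                // Assumption: plain-text files are encoded as UTF-8
                return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
            case "pdf":
                // Placeholder: the PDFBox-based getTextFromPdf logic would be called here
                throw new UnsupportedOperationException("PDF parsing is not wired into this sketch");
            default:
                throw new IllegalArgumentException("Unsupported file type: " + path);
        }
    }
}
```

Rejecting unknown extensions up front gives a clearer error than handing a non-PDF file to the PDF parser and getting empty content back.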