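Before figuring out why garbled text (mojibake) shows up, first understand what a character set is and what an encoding is.
In my view they are not the same thing: a character set assigns each character a number, and an encoding turns those numbers into bytes. Two links:
https://www.zhihu.com/question/23374078 What is the difference between Unicode and UTF-8
http://about.uuspider.com/2015/07/20/decode.html Character encodings and UTF-8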
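To make the distinction concrete, here is a small sketch (the class and the helper name hex are made up for illustration). It prints the Unicode code point of one character and then the bytes two different encodings produce for it. The character is written as a Unicode escape so the example does not depend on the source file's own encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetVsEncoding {
    // Formats bytes as space-separated uppercase hex, e.g. "E6 98 AF".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u662F"; // the character 是
        // Character set: Unicode assigns the character one number.
        System.out.println(String.format("U+%04X", s.codePointAt(0))); // U+662F
        // Encoding: the same number becomes different bytes.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));   // E6 98 AF
        System.out.println(hex(s.getBytes(Charset.forName("GBK"))));   // CA C7
    }
}
```

One number, two byte sequences: that gap is where garbled text comes from.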
1. GET requests
When we type http://localhost?param=是啥 into the browser,
the request that actually goes out is http://localhost?param=%E6%98%AF%E5%95%A5
Let's demonstrate with code
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class URLCode {
    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(URLEncoder.encode("是啥", "utf-8"));
        // prints: %E6%98%AF%E5%95%A5
        System.out.println(URLEncoder.encode("是啥", "gbk"));
        // prints: %CA%C7%C9%B6
    }
}
In the spirit of getting to the bottom of things, let's look at what the encode method (java.net.URLEncoder) actually does
public static String encode(String s, String enc)
        throws UnsupportedEncodingException {
    // flag: does anything need to change (be encoded)?
    boolean needToChange = false;
    StringBuffer out = new StringBuffer(s.length());
    Charset charset;
    CharArrayWriter charArrayWriter = new CharArrayWriter();
    if (enc == null)
        throw new NullPointerException("charsetName");
    try {
        charset = Charset.forName(enc);
    } catch (IllegalCharsetNameException e) {
        throw new UnsupportedEncodingException(enc);
    } catch (UnsupportedCharsetException e) {
        throw new UnsupportedEncodingException(enc);
    }
    for (int i = 0; i < s.length();) {
        // read the char as an int (a UTF-16 code unit, not only ASCII)
        int c = (int) s.charAt(i);
        // is it in the set of characters that need no encoding?
        if (dontNeedEncoding.get(c)) {
            // a space becomes '+'
            if (c == ' ') {
                c = '+';
                // mark that the output differs from the input
                needToChange = true;
            }
            out.append((char)c);
            i++;
        } else {
            do {
                charArrayWriter.write(c);
                // special handling for surrogate pairs
                // see http://www.cnblogs.com/lanelim/p/4964947.html
                if (c >= 0xD800 && c <= 0xDBFF) {
                    if ( (i+1) < s.length()) {
                        int d = (int) s.charAt(i+1);
                        if (d >= 0xDC00 && d <= 0xDFFF) {
                            charArrayWriter.write(d);
                            i++;
                        }
                    }
                }
                i++;
            } while (i < s.length() && !dontNeedEncoding.get((c = (int) s.charAt(i))));
            charArrayWriter.flush();
            String str = new String(charArrayWriter.toCharArray());
            byte[] ba = str.getBytes(charset);
            for (int j = 0; j < ba.length; j++) {
                out.append('%');
                // high four bits
                char ch = Character.forDigit((ba[j] >> 4) & 0xF, 16);
                if (Character.isLetter(ch)) {
                    ch -= caseDiff; // a-f: subtract 32 to get the uppercase letter
                }
                out.append(ch);
                // low four bits
                ch = Character.forDigit(ba[j] & 0xF, 16);
                if (Character.isLetter(ch)) {
                    ch -= caseDiff;
                }
                out.append(ch);
            }
            charArrayWriter.reset();
            needToChange = true;
        }
    }
    return (needToChange ? out.toString() : s);
}
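How the encoding is carried out is not really the point (and note it is encoding, not encryption). Only one line matters
byte[] ba = str.getBytes(charset);
Once the characters that need encoding have been picked out, they are encoded with the given charset, i.e. turned into bytes.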
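Stripped of the bookkeeping, the idea fits in a few lines. The sketch below is a deliberate simplification, not the JDK implementation: it percent-encodes every byte, including ASCII ones that URLEncoder would leave alone, and the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PercentEncodeSketch {
    // Turns the string into bytes with the given charset, then writes each
    // byte as '%' followed by two uppercase hex digits.
    static String percentEncode(String s, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(charset)) {
            out.append('%').append(String.format("%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "\u662F\u5565"; // 是啥, escaped to avoid source-encoding issues
        System.out.println(percentEncode(s, StandardCharsets.UTF_8));   // %E6%98%AF%E5%95%A5
        System.out.println(percentEncode(s, Charset.forName("GBK")));   // %CA%C7%C9%B6
    }
}
```

The charset decides the bytes; the percent signs are only a way of writing those bytes in ASCII.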
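Recall the browser behavior from the start of the article: it encodes the URL (including the appended parameters).
For requests issued from a page (a form submission, for example), we can control the charset used for that encoding.
Here is an HTML5 page fragment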
<head>
    <meta charset="GBK">
    <title>Title</title>
</head>
This tells the browser: use GBK when you encode.
Look at the following example
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
    <form action="http://127.0.0.1:17788" method="GET">
        <input type="hidden" name="param" value="是啥">
        <button type="submit">Submit</button>
    </form>
</body>
</html>
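Click the submit button, then look in the debugger.
This is part of what the browser does when the page encoding is UTF-8 (captured with the browser's developer tools)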
Request URL:http://127.0.0.1:17788/?param=%E6%98%AF%E5%95%A5
Request Method:GET
Status Code:200 OK
Remote Address:127.0.0.1:17788
Referrer Policy:no-referrer-when-downgrade
Now we change the page encoding to GBK
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="GBK">
    <title>Title</title>
</head>
<body>
    <form action="http://127.0.0.1:17788" method="GET">
        <input type="hidden" name="param" value="是啥">
        <button type="submit">Submit</button>
    </form>
</body>
</html>
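Click the submit button, then look in the debugger.
This is part of what the browser does when the page encoding is GBK (captured with the browser's developer tools)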
Request URL:http://127.0.0.1:17788/?param=%CA%C7%C9%B6
Request Method:GET
Status Code:200 OK
Remote Address:127.0.0.1:17788
Referrer Policy:no-referrer-when-downgrade
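This is how the parameters travel to the servlet container.
So the servlet container, taking Tomcat as an example, needs this in its configuration file (on the Connector element in server.xml):
URIEncoding="UTF-8". The encoding used by the page and the URIEncoding value must agree.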
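Here is a sketch of what goes wrong when the two disagree. URLDecoder stands in for the container's decoding step (the container has its own parser; this only mimics the effect), and the class and helper names are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class QueryDecodeMismatch {
    // Decodes a percent-encoded value with the named charset. The checked
    // exception is wrapped so callers stay short.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String fromUtf8Page = "%E6%98%AF%E5%95%A5"; // sent by the UTF-8 page
        String fromGbkPage = "%CA%C7%C9%B6";        // sent by the GBK page

        // Page encoding and server-side decoding agree: the text comes back.
        System.out.println(decode(fromUtf8Page, "UTF-8"));
        System.out.println(decode(fromGbkPage, "GBK"));

        // GBK page, UTF-8 decoding: the bytes are not valid UTF-8, so the
        // result is garbled.
        System.out.println(decode(fromGbkPage, "UTF-8"));
    }
}
```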
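2. POST requests
Once you understand the essence, the encoding itself stops being mysterious.
The transport only carries bytes. When no charset is declared, servlet containers decode the request body as ISO-8859-1 by default (as far as I know; corrections welcome).
I want to use a problem I ran into at work to cover a few points.
I think most of our Java web projects have a characterEncodingFilter configuration like this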
<filter>
    <filter-name>characterEncodingFilter</filter-name>
    <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
    <init-param>
        <param-name>encoding</param-name>
        <!-- depends on the project's encoding: GBK or UTF-8 -->
        <param-value>utf-8</param-value>
    </init-param>
</filter>
With Spring Boot,
application.properties is configured like this
spring.http.encoding.enabled=true
spring.http.encoding.charset=UTF-8
spring.http.encoding.force=true
The framework automatically registers the CharacterEncodingFilter bean
@Configuration
@EnableConfigurationProperties(HttpEncodingProperties.class)
@ConditionalOnClass(CharacterEncodingFilter.class)
@ConditionalOnProperty(prefix = "spring.http.encoding", value = "enabled", matchIfMissing = true)
public class HttpEncodingAutoConfiguration {
    @Autowired
    private HttpEncodingProperties httpEncodingProperties;

    @Bean
    @ConditionalOnMissingBean(CharacterEncodingFilter.class)
    public CharacterEncodingFilter characterEncodingFilter() {
        CharacterEncodingFilter filter = new OrderedCharacterEncodingFilter();
        filter.setEncoding(this.httpEncodingProperties.getCharset().name());
        filter.setForceEncoding(this.httpEncodingProperties.isForce());
        return filter;
    }
}
The default encoding is spelled out in HttpEncodingProperties: it is UTF-8
@ConfigurationProperties(prefix = "spring.http.encoding")
public class HttpEncodingProperties {
    public static final Charset DEFAULT_CHARSET = Charset.forName("UTF-8");
    private Charset charset = DEFAULT_CHARSET;
    private boolean force = true;

    public Charset getCharset() {
        return this.charset;
    }

    public void setCharset(Charset charset) {
        this.charset = charset;
    }

    public boolean isForce() {
        return this.force;
    }

    public void setForce(boolean force) {
        this.force = force;
    }
}
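So whether or not you write this configuration in application.properties, a CharacterEncodingFilter is registered by default.
And here comes the pitfall.
New projects generally use UTF-8,
but some old banks and third-party payment providers send their callbacks as GBK-encoded POST requests (if it were GET, the pain would be even worse).
And our project uses characterEncodingFilter.
So by the time the HttpServletRequest reaches the controller we wrote, the GBK bytes have already been decoded as UTF-8. Request parameters are parsed once, on first access, with whatever encoding is set at that moment.
So we call request.setCharacterEncoding("GBK") in the controller,
and request.getParameter() still returns garbled text, because setCharacterEncoding only takes effect before the parameters are read.
Remember: once GBK bytes have been decoded as UTF-8, they generally cannot be encoded back, and the same holds the other way round (letters and digits excepted; the two articles linked at the top explain why).
I think a code example makes this easier to understand, and it deepens my own understanding too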
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class RoundTrip {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String testString = "你猜啊";
        // bytes after UTF-8 encoding
        byte[] UTF8Byte = testString.getBytes("UTF-8");
        // bytes after GBK encoding
        byte[] GBKByte = testString.getBytes("GBK");
        // result: [-28, -67, -96, -25, -116, -100, -27, -107, -118]
        System.out.println(Arrays.toString(UTF8Byte));
        // result: [-60, -29, -78, -62, -80, -95]
        System.out.println(Arrays.toString(GBKByte));
        // UTF-8 bytes decoded and re-encoded as GBK: 你猜�?
        String UTF8String = new String(new String(UTF8Byte, "GBK").getBytes("GBK"), "UTF-8");
        // GBK bytes decoded and re-encoded as UTF-8: 锟斤拷掳锟�
        String GBKString = new String(new String(GBKByte, "UTF-8").getBytes("UTF-8"), "GBK");
        // UTF-8 bytes through ISO-8859-1 and back: 你猜啊
        String UTF8String1 = new String(new String(UTF8Byte, "iso-8859-1").getBytes("iso-8859-1"), "UTF-8");
        // GBK bytes through ISO-8859-1 and back: 你猜啊
        String GBKString1 = new String(new String(GBKByte, "iso-8859-1").getBytes("iso-8859-1"), "GBK");
    }
}
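The code above shows why ISO-8859-1 is the safe choice in transit: every byte value maps to exactly one character, so decoding and re-encoding never damages the original bytes.
GBK stores ASCII in 1 byte and Chinese characters in 2 bytes.
UTF-8 is variable-length; most Chinese characters take 3 bytes.
That is why UTF-8 bytes that were decoded and re-encoded as GBK can still mostly be decoded: GBK consumes the bytes in pairs, and the pairs here survive the round trip. But 9 bytes leave one unpaired byte at the end, which is replaced by '?', so the last character is lost.
In the other direction, GBK bytes decoded as UTF-8 and then re-encoded lose data: the invalid sequences are replaced by U+FFFD and the whole text is corrupted.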
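The same point can be checked directly on the bytes rather than by eyeballing garbled strings. A minimal sketch (the class name and the helper survives are made up for illustration): decode the bytes with some intermediate charset, encode them back with the same charset, and compare with the original array.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripCheck {
    // True if decoding the bytes with the intermediate charset and encoding
    // them back yields exactly the original bytes.
    static boolean survives(byte[] original, Charset intermediate) {
        byte[] back = new String(original, intermediate).getBytes(intermediate);
        return Arrays.equals(original, back);
    }

    public static void main(String[] args) {
        String text = "\u4F60\u731C\u554A"; // 你猜啊
        byte[] gbkBytes = text.getBytes(Charset.forName("GBK"));
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);

        // ISO-8859-1 is lossless for any byte sequence.
        System.out.println(survives(gbkBytes, StandardCharsets.ISO_8859_1));  // true
        System.out.println(survives(utf8Bytes, StandardCharsets.ISO_8859_1)); // true
        // UTF-8 rejects the GBK bytes and substitutes U+FFFD, so data is lost.
        System.out.println(survives(gbkBytes, StandardCharsets.UTF_8));       // false
    }
}
```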
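Back to the problem I ran into at work.
Clearly the framework's built-in filter not only fails to meet the requirement, it actively trips us up.
Here is my solution for Spring Boot (that's what the company uses now; "SB" is just the abbreviation, I swear I'm not insulting the company).
Step 1: write our own filter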
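import java.io.IOException;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.springframework.core.Ordered;
import org.springframework.web.filter.CharacterEncodingFilter;

public class MyCharacterEncodingFilter extends CharacterEncodingFilter implements Ordered {
    private int order = Ordered.HIGHEST_PRECEDENCE;

    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response, FilterChain filterChain) throws ServletException, IOException {
        // endpoints that need the compatibility treatment go here: a bank's callback endpoint
        if ("/xxxx/callback".equals(request.getServletPath())) {
            request.setCharacterEncoding("GBK");
        } else {
            request.setCharacterEncoding("UTF-8");
        }
        filterChain.doFilter(request, response);
    }

    @Override
    public int getOrder() {
        return this.order;
    }

    public void setOrder(int order) {
        this.order = order;
    }
}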
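If more than one legacy endpoint turns up, the if/else can give way to a lookup table. The sketch below shows only the lookup, with no servlet dependencies. The class name, the map, and the method charsetFor are hypothetical; the only path in the table is the placeholder used above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class CallbackCharsets {
    // Servlet path -> charset the caller is known to send. Hypothetical table;
    // add one entry per legacy callback endpoint.
    private static final Map<String, Charset> LEGACY = new HashMap<>();

    static {
        LEGACY.put("/xxxx/callback", Charset.forName("GBK"));
    }

    // Returns the charset for the given path, falling back to UTF-8.
    static Charset charsetFor(String servletPath) {
        return LEGACY.getOrDefault(servletPath, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(charsetFor("/xxxx/callback")); // GBK
        System.out.println(charsetFor("/anything/else")); // UTF-8
    }
}
```

Inside the filter, the call would then be request.setCharacterEncoding(charsetFor(request.getServletPath()).name()).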
Step 2: get the framework to register and use our filter
@Configuration
@EnableConfigurationProperties(HttpEncodingProperties.class)
@ConditionalOnClass(CharacterEncodingFilter.class)
@ConditionalOnProperty(prefix = "spring.http.encoding", value = "enabled", matchIfMissing = true)
public class MyHttpEncodingAutoConfiguration {
    @Autowired
    private HttpEncodingProperties httpEncodingProperties;

    @Bean
    @ConditionalOnMissingBean(CharacterEncodingFilter.class)
    public CharacterEncodingFilter characterEncodingFilter() {
        // our own encoding filter
        CharacterEncodingFilter filter = new MyCharacterEncodingFilter();
        filter.setEncoding(this.httpEncodingProperties.getCharset().name());
        filter.setForceEncoding(this.httpEncodingProperties.isForce());
        return filter;
    }
}
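3. Responses
When I first wrote this article I didn't consider that responses can be garbled too, but it does happen.
Just yesterday a colleague got garbled text when returning a string to the page.
The cause is probably this: you made the response bytes UTF-8,
but you never told the browser which charset to decode them with.
So besides this line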
response.setCharacterEncoding("UTF-8");
you also have to tell the browser: the bytes I'm returning are in this charset
response.setHeader("Content-Type", "text/html;charset=UTF-8");
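To see why the header matters, here is a toy simulation of the receiving side. It is not how a real browser picks a charset (browsers also look at meta tags and sniff content); the class name, the helper charsetFrom, and the GBK fallback are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ResponseCharsetDemo {
    // Extracts the charset parameter from a Content-Type value, or returns
    // the fallback when the header carries none.
    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    return Charset.forName(p.substring("charset=".length()).trim());
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // The server writes UTF-8 bytes for the text 你好.
        byte[] body = "\u4F60\u597D".getBytes(StandardCharsets.UTF_8);
        Charset guess = Charset.forName("GBK"); // assumed client-side fallback

        // Header names the charset: the client decodes correctly.
        System.out.println(new String(body, charsetFrom("text/html;charset=UTF-8", guess)));
        // Header has no charset: the client falls back to its guess and garbles the text.
        System.out.println(new String(body, charsetFrom("text/html", guess)));
    }
}
```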
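If you've read the two articles I recommended, this should mostly make sense by now. Garbled text in databases works the same way.
That's it. Writing articles really is tiring.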