背景
MySQL从5.7.17升级到MySQL5.7.32后,业务反馈隔一段时间就有几百笔交易调用数据库响应超过200ms触发报警
环境配置
主从复制类型:半同步复制
mysql_version:oracle mysql-5.7.32
max_binlog_size = 1G
max_relay_log_size = 1G
原因
mysql-5.7.25-relnotes:
The value returned by a [`SHOW SLAVE STATUS`](https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/show-slave-status.html) statement for the total combined size of all existing relay log files (`Relay_Log_Space`) could become much larger than the actual disk space used by the relay log files.
The I/O thread did not lock the variable while it updated the value, so the SQL thread could automatically delete a relay log file and write a reduced value before the I/O thread finished updating the value.
The I/O thread then wrote its original size calculation, ignoring the SQL thread's update and so adding back the space for the deleted file. The `Relay_Log_Space` value is now locked during updates to prevent concurrent updates and ensure an accurate calculation. (Bug #26997096, Bug #87832)
MySQL为了解决 [Bug #87832] show slave status不准的问题,在io线程写relaylog的时候增加了一把log_space_lock日志空间锁,当relaylog切换的时候SQL线程也会持有该锁,当relaylog切换的时候整个流程如下:
1.sql thread get the log_space_lock
......
2.删除relaylog物理文件并更改当前relaylog占用总空间
mysql_file_delete(key_file_binlog, log_info.log_file_name, MYF(0)) function delete the relay_log file
if (decrease_log_space)
*decrease_log_space-= s.st_size;
......
3.sql thread release the log_space_lock
io thread 在1-3期间无法写入 event(写event需要获取日志空间锁)无法给主库返回ack导致业务响应升高
该问题相当于解决一个bug又引入了另外一个bug,对应bug已经提交给官方,官方已经确认(https://bugs.mysql.com/bug.php?id=103943)
为什么删除1G文件需要200ms左右
文件系统存在缓存,当删除此盘文件的时候需要将cache中的数据先清理出去,然后删除磁盘文件,类似于数据库drop 表时清理bufferpool,经测试1G文件不清空缓存删除文件150ms,删除缓存后再删除文件6ms
如何解决这个问题
方案一
给relaylog做硬连接,让mysql清理relaylog时只删除inode,剩下的垃圾文件写脚本删除
方案2
更改mysql源码,将锁范围缩小
1.删除relaylog物理文件
mysql_file_delete(key_file_binlog, log_info.log_file_name, MYF(0)) function delete the relay_log file
2.sql thread get the log_space_lock
3.更改当前relaylog占用总空间
if (decrease_log_space)
*decrease_log_space-= s.st_size;
4.sql thread release the log_space_lock