MySQL relaylog切换导致业务响应身高

背景

MySQL从5.7.17升级到MySQL5.7.32后，业务反馈隔一段时间就有几百笔交易调用数据库响应超过200ms触发报警

环境配置

主从复制类型：半同步复制
mysql_version：oracle mysql-5.7.32
max_binlog_size = 1G
max_relay_log_size = 1G

原因

mysql-5.7.25-relnotes:
The value returned by a [`SHOW SLAVE STATUS`](https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/show-slave-status.html) statement for the total combined size of all existing relay log files (`Relay_Log_Space`) could become much larger than the actual disk space used by the relay log files. 
The I/O thread did not lock the variable while it updated the value, so the SQL thread could automatically delete a relay log file and write a reduced value before the I/O thread finished updating the value. 
The I/O thread then wrote its original size calculation, ignoring the SQL thread's update and so adding back the space for the deleted file. The `Relay_Log_Space` value is now locked during updates to prevent concurrent updates and ensure an accurate calculation. (Bug #26997096, Bug #87832)

MySQL为了解决 [Bug #87832] show slave status不准的问题，在io线程写relaylog的时候增加了一把log_space_lock日志空间锁，当relaylog切换的时候SQL线程也会持有该锁，当relaylog切换的时候整个流程如下：
1.sql thread get the log_space_lock
......
2.删除relaylog物理文件并更改当前relaylog占用总空间
mysql_file_delete(key_file_binlog, log_info.log_file_name, MYF(0)) function delete the relay_log file
if (decrease_log_space)
    *decrease_log_space-= s.st_size;
......
3.sql thread release the log_space_lock

io thread 在1-3期间无法写入 event(写event需要获取日志空间锁)无法给主库返回ack导致业务响应升高
该问题相当于解决一个bug又引入了另外一个bug，对应bug已经提交给官方，官方已经确认（https://bugs.mysql.com/bug.php?id=103943）

为什么删除1G文件需要200ms左右

文件系统存在缓存，当删除此盘文件的时候需要将cache中的数据先清理出去，然后删除磁盘文件，类似于数据库drop 表时清理bufferpool，经测试1G文件不清空缓存删除文件150ms，删除缓存后再删除文件6ms

如何解决这个问题

方案一
给relaylog做硬连接，让mysql清理relaylog时只删除inode，剩下的垃圾文件写脚本删除
方案2
更改mysql源码，将锁范围缩小
1.删除relaylog物理文件
mysql_file_delete(key_file_binlog, log_info.log_file_name, MYF(0)) function delete the relay_log file
2.sql thread get the log_space_lock
3.更改当前relaylog占用总空间
if (decrease_log_space)
    *decrease_log_space-= s.st_size;
4.sql thread release the log_space_lock

最后编辑于：2021.08.22 08:14:30