概要
之前看过《高性能mysql》对mysql数据库有了系统化的理解,虽然没能达到精通,但有了概念,遇到问题时会有逻辑条理的分析;
问题
问题:公司xxx页面调用某个接口时,loading数秒,之后返回结果集不完整;
分析思路
查看服务A端日志,日志并没有发现error;但发现该API接口还调用了另外一个服务B的API接口,并且返回内容和预期返回结果不匹配(返回结果不完整),将问题定位到服务B端; 查看服务B端日志发现,偶发性的报如下error,但不报error时返回结果集依然存在不完整;
org.apache.catalina.connector.ClientAbortException: java.io.IOException: Broken pipe at org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java:321) at org.apache.catalina.connector.OutputBuffer.flush(OutputBuffer.java:284) at org.apache.catalina.connector.CoyoteOutputStream.flush(CoyoteOutputStream.java:118) at com.fasterxml.jackson.core.json.UTF8JsonGenerator.flush(UTF8JsonGenerator.java:1054) at com.fasterxml.jackson.databind.ObjectWriter.writeValue(ObjectWriter.java:953) at org.springframework.http.converter.json.AbstractJackson2HttpMessageConverter.writeInternal(AbstractJackson2HttpMessageConverter.java:286) at org.springframework.http.converter.AbstractGenericHttpMessageConverter.write(AbstractGenericHttpMessageConverter.java:106) at org.springframework.web.servlet.mvc.method.annotation.AbstractMessageConverterMethodProcessor.writeWithMessageConverters(AbstractMessageConverterMethodProcessor.java:231) 其余内容省略
(具体过程略,因为重点不讲这个)通过google 查询分析推测是由于Client 端主动关闭连接导致的,根据服务B日志发现,处理服务A请求花了12~16s之间,而服务端A的OkHttp connection timeout为10s;于是相关开发人员将超时时间调到了60s,发现问题解决了。然后就干其他事情去了;
虽然看上去问题是已经解决了,但其实并没有解决根本问题;本人之前系统化的看过sql相关书籍所以非常感兴趣,利用工作业余时间去尝试探知分析一下,随之仔细排查,发现处理的地方应该是服务B请求过长问题;根据服务B Api接口发现,主要操作为Sql语句;为此我整理出里相关sql语句;如下
-- ①select count(p.id) from poeinfo p,recordregistryinfo r where p.record_registry_info_id=r.id and r.contract_address=?-- ②select sum(a.oss_volume) from (select p.url ,p.oss_volume from poehistoryinfo p,recordregistryinfo r where p.record_registry_info_id=r.id and r.contract_address=? group by p.url,p.oss_volume) a -- ③select p.record_data,p.cooperation_name,p.cooperation_id,p.block_time,p.oss_volume as userOSSCapacity,p.txhash,p.seq_id,p.pre_seq_id,p.custom_string FROM poehistoryinfo p LEFT JOIN recordregistryinfo r ON p.record_registry_info_id=r.id LEFT JOIN identityinfo i ON i.account_id=r.mgmt_account_id where r.contract_address=? GROUP BY r.contract_address,p.record_data,p.cooperation_name,p.cooperation_id,p.block_time,p.txhash,p.seq_id,p.pre_seq_id,p.custom_string,p.oss_volume ORDER BY p.block_time DESC limit ?,?-- ④select count(r.id) FROM poehistoryinfo p LEFT JOIN recordregistryinfo r ON p.record_registry_info_id=r.id LEFT JOIN identityinfo i ON i.account_id=r.mgmt_account_id where r.contract_address=?
首先非常庆幸服务B使用的不是JPA,如果是使用JPA而且没有用@query 原生态sql的话那就排查比较麻烦了,不过还好能看到sql语句很亲切,但sql语句有些很长,我在每条sql上进行标号为方便后续讲解;
第一步:需要自己模拟一个数据库和数据进行测试,毕竟线上的东西是不能碰滴;
第二步: 在VirtualBox 上搭建环境如下:
名称 | 描述 |
---|---|
mysql | 5.7,MySQL APT Repository 方式安装 |
系统环境 | linux 16.04 |
内存 | 预设6G |
硬盘 | 预设80G |
又因为在执行该接口时,服务B ECS无异常波动,RDS cup、内存也无异常波动,所以排除间歇性问题;而着重去考虑sql优化问题;
说实在的,这里的sql语句确实渣渣了点,也无需先做explain、profile 分析,先直接来优化完在按执行计划分析;
优化原则:
- 索引覆盖
- 索引,多条件下坚持使用向左原则,必要时建立复合索引
- 筛选大范围,再逐层筛选
优化方式:
- 使用explain 查看执行计划,关注Type、rows、Extra字段进行优化;
- 还不够理想,可以使用如下方式,排查
set profiling = 1; show profiles; show profile for query 1;
- 使用
show status
分析计数器
-- ① 使用左连接select count(p.id) from poeinfo p left join on p.record_registry_info_id=r.id where r.contract_address=?;-- ② 使用 去除临时表,用distinct,使用内连接SELECT distinct p.url ,p.oss_volume from poehistoryinfo p INNER JOIN recordregistryinfo r ON p.record_registry_info_id = r.id where r.contract_address = ?-- ③ 去除group by 发现与业务无关,无实际意义; 分页查询先使用全盖索覆引获取ID,在根据ID获取结果集SELECT p.record_data,p.cooperation_name,p.cooperation_id,p.block_time,p.oss_volume as userOSSCapacity,p.txhash,p.seq_id,p.pre_seq_id,p.custom_string FROM poehistoryinfo p INNER JOIN (SELECT p1.id FROM poehistoryinfo p1 LEFT JOIN recordregistryinfo r ON p1.record_registry_info_id=r.id LEFT JOIN identityinfo i ON i.account_id=r.mgmt_account_id where r.contract_address=? ORDER BY p1.block_time DESC limit ?,?) a ON a.id = p.id ;-- ④ 无需优化select count(r.id) FROM poehistoryinfo p LEFT JOIN recordregistryinfo r ON p.record_registry_info_id=r.id LEFT JOIN identityinfo i ON i.account_id=r.mgmt_account_id where r.contract_address=?
下面对③进行详细分析; 使用执行计划得
explain select p.record_data,p.cooperation_name,p.cooperation_id,p.block_time,p.oss_volume as userOSSCapacity,p.txhash,p.seq_id,p.pre_seq_id,p.custom_string FROM poehistoryinfo p LEFT JOIN recordregistryinfo r ON p.record_registry_info_id=r.id LEFT JOIN identityinfo i ON i.account_id=r.mgmt_account_id where r.contract_address='85b692c81d16bfa49ce0b8a166458c71f522fd5b' GROUP BY r.contract_address,p.record_data,p.cooperation_name,p.cooperation_id,p.block_time,p.txhash,p.seq_id,p.pre_seq_id,p.custom_string,p.oss_volume ORDER BY p.block_time DESC limit 1,25;+----+-------------+-------+------------+------+-----------------------------+-----------------------------+---------+-----------------------------+--------+----------+----------------------------------------------------+| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |+----+-------------+-------+------------+------+-----------------------------+-----------------------------+---------+-----------------------------+--------+----------+----------------------------------------------------+| 1 | SIMPLE | r | NULL | ALL | PRIMARY | NULL | NULL | NULL | 65 | 10.00 | Using where; Using temporary; Using filesort || 1 | SIMPLE | i | NULL | ref | identityinfo_account_id_ind | identityinfo_account_id_ind | 153 | sync_test.r.mgmt_account_id | 1 | 100.00 | Using where; Using index || 1 | SIMPLE | p | NULL | ALL | NULL | NULL | NULL | NULL | 226366 | 10.00 | Using where; Using join buffer (Block Nested Loop) |+----+-------------+-------+------------+------+-----------------------------+-----------------------------+---------+-----------------------------+--------+----------+----------------------------------------------------+3 rows in set, 1 warning (0.00 sec)set profiling=1; select p.record_data,p.cooperation_name,p.cooperation_id,p.block_time,p.oss_volume as userOSSCapacity,p.txhash,p.seq_id,p.pre_seq_id,p.custom_string FROM poehistoryinfo p LEFT JOIN recordregistryinfo r ON p.record_registry_info_id=r.id LEFT JOIN identityinfo i ON i.account_id=r.mgmt_account_id where r.contract_address='85b692c81d16bfa49ce0b8a166458c71f522fd5b' GROUP BY r.contract_address,p.record_data,p.cooperation_name,p.cooperation_id,p.block_time,p.txhash,p.seq_id,p.pre_seq_id,p.custom_string,p.oss_volume ORDER BY p.block_time DESC limit 1,25;show profiles;show profile for query 1;+---------------------------+----------+| Status | Duration |+---------------------------+----------+| starting | 0.000246 || checking permissions | 0.000014 || checking permissions | 0.000004 || checking permissions | 0.000007 || Opening tables | 0.000043 || init | 0.000088 || System lock | 0.000020 || optimizing | 0.000026 || statistics | 0.000083 || preparing | 0.000047 || Creating tmp table | 0.000068 || Sorting result | 0.000010 || executing | 0.000002 || Sending data | 0.125882 || converting HEAP to ondisk | 0.015473 || Sending data | 0.773666 || Creating sort index | 0.004720 || end | 0.015822 || query end | 0.000057 || removing tmp table | 0.000526 || query end | 0.000015 || closing tables | 0.000020 || freeing items | 0.000069 || cleaning up | 0.000035 |+---------------------------+----------+24 rows in set, 1 warning (0.00 sec)
type: 2个ALL全表查询,一个索引访问;而最后一个全表rows有226366,需要优化;
Using where:索引覆盖查询
Using index:索引查询
Using join buffer (Block Nested Loop):关联查询
Using filesort:外部索引排序
Using temporary:查询结果排序时的临时表;
花费时间最多的是2个 sending data;
目前可以总结分析: 优化点有2个:
- 2个ALL全表查询,特别是rows为226366的
- 结合实际业务发现
GROUP BY r.contract_address,p.record_data,p.cooperation_name,p.cooperation_id,p.block_time,p.txhash,p.seq_id,p.pre_seq_id,p.custom_string,p.oss_volume
无实际意义 - 优化Limit分页;我们通常会使用LIMIT加上偏移量的办法实现,同时加上合适的ORDER BY字句。如果有对应的索引,通常效率会不错,否则,MySQL需要做大量的文件排序操作。But 在偏移量非常大的时候,那么代价也是非常大的;优化此类分页查询的一个最简单办法就是尽可能地使用索引覆盖扫描,而不是查询所有的列;
优化后
SELECT p.record_data,p.cooperation_name,p.cooperation_id,p.block_time,p.oss_volume as userOSSCapacity,p.txhash,p.seq_id,p.pre_seq_id,p.custom_string FROM poehistoryinfo p INNER JOIN (SELECT p1.id FROM poehistoryinfo p1 LEFT JOIN recordregistryinfo r ON p1.record_registry_info_id=r.id LEFT JOIN identityinfo i ON i.account_id=r.mgmt_account_id where r.contract_address=? ORDER BY p1.block_time DESC limit ?,?) a ON a.id = p.id ;
explain SELECT p.record_data,p.cooperation_name,p.cooperation_id,p.block_time,p.oss_volume as userOSSCapacity,p.txhash,p.seq_id,p.pre_seq_id,p.custom_string FROM poehistoryinfo p INNER JOIN (SELECT p1.id FROM poehistoryinfo p1 LEFT JOIN recordregistryinfo r ON p1.record_registry_info_id=r.id LEFT JOIN identityinfo i ON i.account_id=r.mgmt_account_id where r.contract_address='85b692c81d16bfa49ce0b8a166458c71f522fd5b' ORDER BY p1.block_time DESC limit 1,25) a ON a.id = p.id ;+----+-------------+------------+------------+--------+-----------------------------+-----------------------------+---------+--------------------------------------+------+----------+--------------------------+| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |+----+-------------+------------+------------+--------+-----------------------------+-----------------------------+---------+--------------------------------------+------+----------+--------------------------+| 1 | PRIMARY || NULL | ALL | NULL | NULL | NULL | NULL | 26 | 100.00 | NULL || 1 | PRIMARY | p | NULL | eq_ref | PRIMARY | PRIMARY | 4 | a.id | 1 | 100.00 | NULL || 2 | DERIVED | p1 | NULL | index | NULL | poeinfo_block_time_ind | 6 | NULL | 259 | 100.00 | NULL || 2 | DERIVED | r | NULL | eq_ref | PRIMARY | PRIMARY | 4 | sync_test.p1.record_registry_info_id | 1 | 10.00 | Using where || 2 | DERIVED | i | NULL | ref | identityinfo_account_id_ind | identityinfo_account_id_ind | 153 | sync_test.r.mgmt_account_id | 1 | 100.00 | Using where; Using index |+----+-------------+------------+------------+--------+-----------------------------+-----------------------------+---------+--------------------------------------+------+----------+--------------------------+5 rows in set, 1 warning (0.00 sec)
一看问题基本已解决执行结果发现优化后的速度比优化前块了47倍;但是放回到阿里云RDS上时,出现2个问题:
- 在不优化的情况下自建数据库执行③slq语句比RDS数据库要快5倍以上;
- 优化之后的sql语句放到RDS上,比之前未优化的sql语句还慢; 这时的我就一脸懵了,what happen ? what should I do ?
-- 优化前执行时间25 rows in set (0.47 sec)-- 优化后执行时间25 rows in set (0.01 sec)
在运维的建议下,我打算提工单咨询一下阿里云RDS工作人员; 具体内容等出结果再写;