MYSQL分组后只要前四条 mysql分组后组内排序

转载

mob64ca13fdd43c 2024-02-20 07:25:45

文章标签 MYSQL分组后只要前四条 mysql group by order by 组内排序 文章分类 MySQL 数据库

一、前言

需求是：获取用户充值总数最多的前100名，并且获取最新充值的时间和充值数量

分析下:

（1）首先是要获取用户充值的总数，然后进行排序取前100；
（2）要获取用户的最新一次充值时间和金额，那么要以用户作为分组，然后通过聚合取最新充值时间和数量

分析之后，大概就了解了，要先group by分组，然后进行组内排序。

二、解决过程

1、先按大意写一条sql

主要字段说明：

pay_time :支付时间
pay_money:支付金额
receiver_id:用户id

mysql> explain select max(pay_time) as latest_time,pay_money,sum(pay_money) as total_pay_money,
		receiver_id from pay_info  group by receiver_id order by total_pay_money desc limit 100 ;
+----+-------------+----------+------------+------+-------------------------+-------------+---------+-------+------+----------+---------------------------------------------------------------------+
| id | select_type | table    | partitions | type | possible_keys           | key         | key_len | ref   | rows | filtered | Extra                                                               |
+----+-------------+----------+------------+------+-------------------------+-------------+---------+-------+------+----------+---------------------------------------------------------------------+
|  1 | SIMPLE      | pay_info | NULL       | ref  | receiver_id,order_state | order_state | 1       | const |  424 |    11.11 | Using index condition; Using where; Using temporary; Using filesort |
+----+-------------+----------+------------+------+-------------------------+-------------+---------+-------+------+----------+---------------------------------------------------------------------+

很明显效率很低，这也是咱们碰到组内排序最经常遇到的问题，就是效率问题。其次就是这个pay_money的问题，相当于是在组内取的这个值，问题这个字段并没有通过聚合函数来拿，所以这个值是不准确的，算是随机返回的一个值，所以说不成立。

2、explain中新增的filtered字段和partitions 字段是啥意思

咱们在执行计划中发现几个不怎么熟的字段，先了解一下：

1、 filtered：这个值是越小越好，代表查询操作筛选的更彻底，获取的结果集更小，如果是100%那就该优化了（如果表很小，且使用到主键的话，100%也可以的，不用优化）。
使用explain extended时会出现这个列，5.7之后的版本默认就有这个字段，不需要使用explain extended了。这个字段表示存储引擎返回的数据在server层过滤后，剩下多少满足查询的记录数量的比例，注意是百分比，不是具体记录数。
2、 partitions ：版本5.7以前，该项是explain partitions显示的选项，5.7以后成为了默认选项。该列显示的为分区表命中的分区情况。非分区表该字段为空（null）。
关于分区参考：https://x125858805.iteye.com/blog/2068120 （这个参数对于没有分区的表来说，不用在意）

3、先优化group by ，加联合索引

（1） 加联合索引

正确的创建联合索引：

正确的：create index total_index on pay_info(order_state,pay_time,receiver_id);
	正确的： ALTER TABLE `table` ADD INDEX index_name (`column1`, `column2`, `column3`)
	删除索引：drop index total_index_2 on pay_info;

网上百度的时候，发现好多写法都是错误的，以下是错误的部分，大家参考下：

错误的： 	alert table `table` add index total_index(order_state,pay_time,receiver_id);
			错误的：		 alert table pay_info add index `total_index` (`order_state`,`pay_time`,`receiver_id`);

（2） 因为这里多了个聚合后的字段排序，因此还是会使用文件排序，临时表，优化无效。

4、下下策

（1） 解决不了排序问题，就把排序拿出来，反正数据就100条，数组操作不会很耗费性能

mysql>  explain select max(pay_time) as latest_time,pay_money,sum(pay_money) as total_pay_money,receiver_id from pay_info force index(receiver_id) where 1 and order_state = 1 and server_unique_flag=1 and pay_time between 1557990000 and 1558681199  group by receiver_id  limit 100 ;
+----+-------------+----------+------------+-------+---------------+-------------+---------+------+------+----------+-------------+
| id | select_type | table    | partitions | type  | possible_keys | key         | key_len | ref  | rows | filtered | Extra       |
+----+-------------+----------+------------+-------+---------------+-------------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | pay_info | NULL       | index | receiver_id   | receiver_id | 8       | NULL |  635 |     0.16 | Using where |
+----+-------------+----------+------------+-------+---------------+-------------+---------+------+------+----------+-------------+

这里可以使用force index(你要用到的索引名),也可以根据搜索条件加联合索引。比如这里我们可以给receiver_id和order_state两个字段加个索引：

create index total_index on pay_info(order_state,receiver_id);

查看执行计划：

mysql> explain select max(pay_time) as latest_time,pay_money,sum(pay_money) as total_pay_money,receiver_id from pay_info  where 1 and order_state = 1 and server_unique_flag=1 and pay_time between 1557990000 and 1558681199  group by receiver_id  limit 100 ;
+----+-------------+----------+------------+------+-------------------------------------+-------------+---------+-------+------+----------+------------------------------------+
| id | select_type | table    | partitions | type | possible_keys                       | key         | key_len | ref   | rows | filtered | Extra                              |
+----+-------------+----------+------------+------+-------------------------------------+-------------+---------+-------+------+----------+------------------------------------+
|  1 | SIMPLE      | pay_info | NULL       | ref  | receiver_id,order_state,total_index | total_index | 1       | const |  424 |     1.11 | Using index condition; Using where |
+----+-------------+----------+------------+------+-------------------------------------+-------------+---------+-------+------+----------+------------------------------------+

解析器是先解析group by，后解析order by，这边group by的步骤算是优化好了，order by 直接使用group by组内的数据就行。如果order by的字段在group by组内，那么不会产生临时表和文件排序，我本地测试过的，如果order by的字段是这个聚合后产生的字段，那么就会出现临时表和文件排序

（2）、 后续处理，通过php实现排序：

$last_names = array_column($arr_data,'total_pay_money');  // 对 total_pay_money进行排序处理
  array_multisort($last_names,SORT_DESC,$arr_data);

大概解释：通过array_column取出要排序的字段数组，然后通过array_multisort函数先对取出的列进行排序，然后再作用到整个数组排序。

参考：

问题： 本来想着分开，就算效率低了点，也能接受，后来在测试数据的时候，发现无法保证pay_money是最新充值的那条记录。因为在group by之前select的字段，只能保证聚合部分是自己想要的，对于普通的字段，会随机给一个，而我们是要最后一次充值的时间和最后一次充值的金额，因此这种方案也pass掉。

其实如果分开排序，再优化下这个sql也是可以的，但是把排序分开已经是很屈辱了，完全不能忍，必须写在一个sql里面！

三、最终方案

explain select a.pay_time,total_pay_money,a.receiver_id,a.pay_money from pay_info as a right join( 
select max(pay_id) as max_pay_id,sum(pay_money) as total_pay_money,receiver_id from pay_info force index(receiver_id) where 1 and order_state = 1 and server_unique_flag > 0 and pay_date between '2019-05-13 00:00:00' and '2019-05-27 23:59:59' group by receiver_id 
) as b on a.pay_id = b.max_pay_id where 1 and a.order_state = 1 and server_unique_flag > 0 and
 pay_date between '2019-05-13 00:00:00' and '2019-05-27 23:59:59' order by total_pay_money desc limit 100;
+----+-------------+------------+------------+--------+------------------------------------------+-------------+---------+--------------+------+----------+----------------+
| id | select_type | table      | partitions | type   | possible_keys                            | key         | key_len | ref          | rows | filtered | Extra          |
+----+-------------+------------+------------+--------+------------------------------------------+-------------+---------+--------------+------+----------+----------------+
|  1 | PRIMARY     | <derived2> | NULL       | ALL    | NULL                                     | NULL        | NULL    | NULL         |    5 |   100.00 | Using filesort |
|  1 | PRIMARY     | a          | NULL       | eq_ref | PRIMARY,order_state,pay_date,total_index | PRIMARY     | 8       | b.max_pay_id |    1 |     5.00 | Using where    |
|  2 | DERIVED     | pay_info   | NULL       | index  | receiver_id,total_index                  | receiver_id | 8       | NULL         |  635 |     0.37 | Using where    |
+----+-------------+------------+------------+--------+------------------------------------------+-------------+---------+--------------+------+----------+----------------+

上面的sql看起来复杂，但确实是实现了咱们的需求。下面咱们把where条件去掉看看：

explain select a.pay_time,total_pay_money,a.receiver_id,a.pay_money from pay_info as a right join
( 
select max(pay_id) as max_pay_id,sum(pay_money) as total_pay_money  from pay_info force index(receiver_id) group by receiver_id 
) as b 
on a.pay_id = b.max_pay_id where 1 order by total_pay_money desc limit 100;

分析：

1、首先思路是用一个子查询，先实现分组的效果，按用户分组，然后获取用户充值总额，用户最近一次充值时间等。考虑到表的id是自增的，所以子查询中select 一个最大的pay_id，pay_id最大，代表用户是最后一次充值。

2、子查询获取结果后，我们再用pay_info表去关联子查询的结果，关键的外键是pay_id，也就说我们根据这个pay_id关键的数据就能获取每个用户最近一次充值的数据以及每个用户的充值总数，通过关键，我们能轻松拿到pay_time,pay_money等数据
3、接下来是进行排序，使用total_pay_money（用户总充值）排序，获取充值排名前100的用户数据。
4、最后测试，explain查看执行计划，还算不错

注意： 使用子查询实现组内排序，顺便给group by的字段加强制索引，提升效率。对于子查询来说，子查询里面的句子如果不加where条件的话，代表全表扫描，非常耗费性能，所以要把外面的where条件加进去 。外面本来的where条件还是要保留的，防止关键的字段可以有重复值的话，会造成数据重复。比如这里，pay_id是主键，不存在重复的情况，在子查询外也是通过这个字段关联的，所以子查询外的where条件是可以去掉的。如果关联的字段是可以重复的值，那么此时就不能保证数据全部符合我们要的筛选条件，所以最好还是在子查询内外加上where条件。

============================ 2019年6月02日20时更新 ==========================

mysql> explain select a.pay_time,a.server_unique_flag,total_pay_money,a.receiver_id,a.pay_money from pay_info as a right join( select max(pay_id) as max_pay_id,sum(pay_money) as total_pay_money,receiver_id from pay_info force index(receiver_id) where 1 and order_state > 0 and server_unique_flag > 0 and pay_date between '2019-05-14 00:00:00' and '2019-05-29 23:59:59' group by receiver_id order by total_pay_money desc limit 100 ) as b on a.pay_id = b.max_pay_id where 1 and a.order_state > 0 and server_unique_flag > 0 and pay_date between '2019-05-14 00:00:00' and '2019-05-29 23:59:59' order by total_pay_money desc limit 100;
+----+-------------+------------+------------+--------+------------------------------------------+-------------+---------+--------------+------+----------+----------------------------------------------+
| id | select_type | table      | partitions | type   | possible_keys                            | key         | key_len | ref          | rows | filtered | Extra                                        |
+----+-------------+------------+------------+--------+------------------------------------------+-------------+---------+--------------+------+----------+----------------------------------------------+
|  1 | PRIMARY     | <derived2> | NULL       | ALL    | NULL                                     | NULL        | NULL    | NULL         |    7 |   100.00 | Using filesort                               |
|  1 | PRIMARY     | a          | NULL       | eq_ref | PRIMARY,order_state,pay_date,total_index | PRIMARY     | 8       | b.max_pay_id |    1 |     5.00 | Using where                                  |
|  2 | DERIVED     | pay_info   | NULL       | index  | receiver_id,total_index                  | receiver_id | 8       | NULL         |  635 |     1.23 | Using where; Using temporary; Using filesort |
+----+-------------+------------+------------+--------+------------------------------------------+-------------+---------+--------------+------+----------+----------------------------------------------+

把order by和limit加入到子查询中，能有效减少子查询的结果集数量。因为业务上只需要获取前100条数据，所以这里把限制条件加进子查询确实能提升一些速度。经测试，数据量在100W的时候，速度从4s提升到了3s。数据量小于10w的话，提升并不明显

============================ 2019年6月06日01时更新 ==========================

以上的方案还是有些问题，用force index没问题，关键是用的时候要考虑该索引起到的筛选作用有多大，我们的查询优化，本质上就是扫描的行尽量少一点，获取的结果集尽量少一点，这样速度自然就提升了。

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。