Fresh on the job: optimizing a MySQL table with 100 million rows

1. Background

An XX instance (one master, one slave) triggered an xxx SLA alarm every morning, meaning there was measurable master-slave replication delay. (If a master-slave switchover happened during that window, the switch would take a long time to complete, and the delay would have to be caught up to guarantee master-slave data consistency.)
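Replication delay like this is typically confirmed on the slave. For context, a minimal check (standard MySQL, not taken from the original post):

# On the slave: Seconds_Behind_Master > 0 means the slave is lagging
SHOW SLAVE STATUS\G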

The XX instance also had the largest number of slow queries (any SQL running longer than 1s is recorded), and the XX application ran a nightly job deleting data older than one month.

2. Analysis

Use pt-query-digest to analyze mysql-slow.log for the last week:

pt-query-digest --since=148h mysql-slow.log | less

Results part one:

Over the last week, the total recorded slow-query execution time was 25,403s; the longest single slow SQL took 266s; the average slow SQL took 5s and scanned 17.66 million rows on average.

Results part two:

select arrival_record slow queries were recorded up to 40,000 times, with an average response time of 4s; delete arrival_record slow queries were recorded 6 times, with an average response time of 258s.

3. The select arrival_record statement

The select arrival_record slow queries all take the following form; the fields in the WHERE clause are the same, only the parameter values differ:

select count(*) from arrival_record where product_id=26 and receive_time between '2019-03-25 14:00:00' and '2019-03-25 15:00:00' and receive_spend_ms>=0\G

The select arrival_record statements scanned up to 56 million rows, 1.72 million rows on average; the long execution time is presumably due to the large number of rows scanned.

To view the execution plan:

explain select count(*) from arrival_record where product_id=26 and receive_time between '2019-03-25 14:00:00' and '2019-03-25 15:00:00' and receive_spend_ms>=0\G; 
*************************** 1. row *************************** 
id: 1 
select_type: SIMPLE 
table: arrival_record 
partitions: NULL 
type: ref 
possible_keys: IXFK_arrival_record 
key: IXFK_arrival_record 
key_len: 8 
ref: const 
rows: 32261320 
filtered: 3.70 
Extra: Using index condition; Using where 
1 row in set, 1 warning (0.00 sec)

The index IXFK_arrival_record is used, but the estimated number of rows scanned is more than 30 million:

show index from arrival_record; 
+----------------+------------+---------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ 
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | 
+----------------+------------+---------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ 
| arrival_record | 0 | PRIMARY | 1 | id | A | 107990720 | NULL | NULL | | BTREE | | | 
| arrival_record | 1 | IXFK_arrival_record | 1 | product_id | A | 1344 | NULL | NULL | | BTREE | | | 
| arrival_record | 1 | IXFK_arrival_record | 2 | station_no | A | 22161 | NULL | NULL | YES | BTREE | | | 
| arrival_record | 1 | IXFK_arrival_record | 3 | sequence | A | 77233384 | NULL | NULL | | BTREE | | | 
| arrival_record | 1 | IXFK_arrival_record | 4 | receive_time | A | 65854652 | NULL | NULL | YES | BTREE | | | 
| arrival_record | 1 | IXFK_arrival_record | 5 | arrival_time | A | 73861904 | NULL | NULL | YES | BTREE | | | 
+----------------+------------+---------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+ 
 
show create table arrival_record; 
.......... 
arrival_spend_ms bigint(20) DEFAULT NULL, 
total_spend_ms bigint(20) DEFAULT NULL, 
PRIMARY KEY (id), 
KEY IXFK_arrival_record (product_id,station_no,sequence,receive_time,arrival_time) USING BTREE, 
CONSTRAINT FK_arrival_record_product FOREIGN KEY (product_id) REFERENCES product (id) ON DELETE NO ACTION ON UPDATE NO ACTION 
) ENGINE=InnoDB AUTO_INCREMENT=614538979 DEFAULT CHARSET=utf8 COLLATE=utf8_bin |

① The table holds roughly 100 million rows and has only one composite index. The product_id column has very low cardinality and poor selectivity (a quick way to measure this is sketched after this list).

② The filter conditions passed in:

where product_id=26 and receive_time between '2019-03-25 14:00:00' and '2019-03-25 15:00:00' and receive_spend_ms>=0 

contain no station_no field, so the predicate cannot use the station_no, sequence, or receive_time columns of IXFK_arrival_record (product_id, station_no, sequence, receive_time, arrival_time).

③ By the leftmost-prefix rule, the select arrival_record statement can use only the first column of the composite index IXFK_arrival_record, product_id, and that column's selectivity is very poor, hence the huge number of scanned rows and the long execution time.

④ The receive_time column has high cardinality and good selectivity; indexed on its own, it would serve the select arrival_record SQL well.
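As a quick sanity check on ① and ③, selectivity can be estimated as distinct values divided by total rows; the show index output above gives a cardinality of only 1344 for product_id against roughly 108 million rows. A sketch (expensive on a table this size, so better run on a replica):

# Values near 0 mean the column filters poorly on its own
SELECT COUNT(DISTINCT product_id) / COUNT(*)   AS product_id_selectivity,
       COUNT(DISTINCT receive_time) / COUNT(*) AS receive_time_selectivity
FROM arrival_record;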

We now know that the WHERE clauses of the select arrival_record slow queries pass in product_id, receive_time, and receive_spend_ms. But is the table also queried with filters on other fields?

This is where tcpdump shines: capture the select statements hitting this table for a while:

tcpdump -i bond0 -s 0 -l -w - dst port 3316 | strings | grep select | egrep -i 'arrival_record' >/tmp/select_arri.log

Extract the WHERE condition after FROM in each captured select statement:

# Split on newlines only, then print everything after "from" in each line
IFS_OLD=$IFS 
IFS=$'\n' 
for i in `cat /tmp/select_arri.log`; do echo ${i#*'from'}; done | less 
IFS=$IFS_OLD 
arrival_record arrivalrec0_ where arrivalrec0_.sequence='2019-03-27 08:40' and arrivalrec0_.product_id=17 and arrivalrec0_.station_no='56742' 
arrival_record arrivalrec0_ where arrivalrec0_.sequence='2019-03-27 08:40' and arrivalrec0_.product_id=22 and arrivalrec0_.station_no='S7100' 
arrival_record arrivalrec0_ where arrivalrec0_.sequence='2019-03-27 08:40' and arrivalrec0_.product_id=24 and arrivalrec0_.station_no='V4631' 
arrival_record arrivalrec0_ where arrivalrec0_.sequence='2019-03-27 08:40' and arrivalrec0_.product_id=22 and arrivalrec0_.station_no='S9466' 
arrival_record arrivalrec0_ where arrivalrec0_.sequence='2019-03-27 08:40' and arrivalrec0_.product_id=24 and arrivalrec0_.station_no='V4205' 
arrival_record arrivalrec0_ where arrivalrec0_.sequence='2019-03-27 08:40' and arrivalrec0_.product_id=24 and arrivalrec0_.station_no='V4105' 
arrival_record arrivalrec0_ where arrivalrec0_.sequence='2019-03-27 08:40' and arrivalrec0_.product_id=24 and arrivalrec0_.station_no='V4506' 
arrival_record arrivalrec0_ where arrivalrec0_.sequence='2019-03-27 08:40' and arrivalrec0_.product_id=24 and arrivalrec0_.station_no='V4617' 
arrival_record arrivalrec0_ where arrivalrec0_.sequence='2019-03-27 08:40' and arrivalrec0_.product_id=22 and arrivalrec0_.station_no='S8356' 
arrival_record arrivalrec0_ where arrivalrec0_.sequence='2019-03-27 08:40' and arrivalrec0_.product_id=22 and arrivalrec0_.station_no='S8356'

The WHERE conditions of these selects filter on product_id, station_no, and sequence, so they can use the first three columns of the composite index IXFK_arrival_record.

In conclusion, the optimization is as follows:

Drop the composite index IXFK_arrival_record

Create a new composite index idx_sequence_station_no_product_id

Create a standalone index idx_receive_time(receive_time)
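For reference, these changes as a single DDL statement, matching what section 6 actually executes (note the composite index is implemented there as idx_product_id_sequence_station_no; since the captured predicates are all equality comparisons, the exact column order matters little):

ALTER TABLE arrival_record
  DROP FOREIGN KEY FK_arrival_record_product,
  DROP INDEX IXFK_arrival_record,
  ADD INDEX idx_product_id_sequence_station_no (product_id, sequence, station_no),
  ADD INDEX idx_receive_time (receive_time);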

4. The delete arrival_record statement

The delete operation scanned 110 million rows on average, with an average execution time of 262s.

The delete statement is as follows; each recorded slow query passes a different parameter value:

delete from arrival_record where receive_time < STR_TO_DATE('2019-02-23', '%Y-%m-%d')\G

Execution plan (using the equivalent select):

explain select * from arrival_record where receive_time < STR_TO_DATE('2019-02-23', '%Y-%m-%d')\G 
*************************** 1. row *************************** 
id: 1 
select_type: SIMPLE 
table: arrival_record 
partitions: NULL 
type: ALL 
possible_keys: NULL 
key: NULL 
key_len: NULL 
ref: NULL 
rows: 109501508 
filtered: 33.33 
Extra: Using where 
1 row in set, 1 warning (0.00 sec)

The delete statement used no index (none was suitable) and performed a full table scan, hence the long execution time.

The fix is the same: create a standalone index idx_receive_time(receive_time) on receive_time.

5. Test

Copy the arrival_record table to a test instance to rehearse dropping the index and re-indexing.

arrival_record table information on the XX instance:

du -sh /datas/mysql/data/3316/cq_new_cimiss/arrival_record* 
12K    /datas/mysql/data/3316/cq_new_cimiss/arrival_record.frm 
48G    /datas/mysql/data/3316/cq_new_cimiss/arrival_record.ibd 
 
select count(*) from cq_new_cimiss.arrival_record; 
+-----------+ 
| count(*)  | 
+-----------+ 
| 112294946 | 
+-----------+ 

More than 100 million records

SELECT 
table_name, 
CONCAT(FORMAT(SUM(data_length) / 1024 / 1024,2),'M') AS dbdata_size, 
CONCAT(FORMAT(SUM(index_length) / 1024 / 1024,2),'M') AS dbindex_size, 
CONCAT(FORMAT(SUM(data_length + index_length) / 1024 / 1024 / 1024,2),'G') AS 'table_size(G)', 
AVG_ROW_LENGTH,table_rows,update_time 
FROM 
information_schema.tables 
WHERE table_schema = 'cq_new_cimiss' and table_name='arrival_record'; 

+----------------+-------------+--------------+------------+----------------+------------+---------------------+ 
| table_name | dbdata_size | dbindex_size | table_size(G) | AVG_ROW_LENGTH | table_rows | update_time | 
+----------------+-------------+--------------+------------+----------------+------------+---------------------+ 
| arrival_record | 18,268.02M | 13,868.05M | 31.38G | 175 | 109155053 | 2019-03-26 12:40:17 | 
+----------------+-------------+--------------+------------+----------------+------------+---------------------+

The table occupies 48G on disk while MySQL reports it as 31G, so there is roughly 17G of fragmentation, mostly left behind by the deletes (rows deleted, space not reclaimed).
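As a cross-check, the unreclaimed space can be read directly from information_schema; a sketch (DATA_FREE reports the free space inside the tablespace, assuming innodb_file_per_table):

SELECT table_name,
       ROUND(data_free / 1024 / 1024 / 1024, 2) AS data_free_gb
FROM information_schema.tables
WHERE table_schema = 'cq_new_cimiss' AND table_name = 'arrival_record';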

Back up the table, restore it to a new instance, then drop the original composite index and rebuild the indexes for testing.

mydumper parallel compression backup:

user=root 
passwd=xxxx 
socket=/datas/mysql/data/3316/mysqld.sock 
db=cq_new_cimiss 
table_name=arrival_record 
backupdir=/datas/dump_$table_name 
mkdir -p $backupdir 
 
nohup echo `date +%T` && mydumper -u $user -p $passwd -S $socket -B $db -c -T $table_name -o $backupdir -t 32 -r 2000000 && echo `date +%T` &

The parallel compressed backup took 52s and produced 1.2G of output (this table occupies 48G on disk, so mydumper's parallel compression ratio is quite high):

Started dump at: 2019-03-26 12:46:04 
........ 
 
Finished dump at: 2019-03-26 12:46:56 
 
du -sh   /datas/dump_arrival_record/ 
1.2G    /datas/dump_arrival_record/

Copy the dump data to the test node:

scp -rp /datas/dump_arrival_record root@10.230.124.19:/datas 

Import the data with multiple threads:

time myloader -u root -S /datas/mysql/data/3308/mysqld.sock -P 3308 -p root -B test -d /datas/dump_arrival_record -t 32 
 
real 126m42.885s 
user 1m4.543s 
sys 0m4.267s

Disk space after logical import of the table:

du -h -d 1 /datas/mysql/data/3308/test/arrival_record.* 
12K /datas/mysql/data/3308/test/arrival_record.frm 
30G /datas/mysql/data/3308/test/arrival_record.ibd 

No fragmentation; this matches the table size reported by MySQL.

Back up the datadir so each DDL test starts from the same data:

cp -rp /datas/mysql/data/3308 /datas

Use online DDL and the pt-osc (pt-online-schema-change) tool, respectively, to drop and rebuild the indexes.

Note that the composite index cannot be dropped while the foreign key still exists: the foreign key column (product_id) is the first column of the composite index, and MySQL requires an index on a foreign-key column.
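Concretely, the required order looks like this (a sketch; the first statement fails with MySQL error 1553, "Cannot drop index ... needed in a foreign key constraint"):

# Fails while the foreign key still depends on the index
ALTER TABLE arrival_record DROP INDEX IXFK_arrival_record;

# Works: drop the foreign key in the same statement (or beforehand)
ALTER TABLE arrival_record
  DROP FOREIGN KEY FK_arrival_record_product,
  DROP INDEX IXFK_arrival_record;

The timed test run: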

nohup bash /tmp/ddl_index.sh & 
2019-04-04-10:41:39 begin stop mysqld_3308 
2019-04-04-10:41:41 begin rm -rf datadir and cp -rp datadir_bak 
2019-04-04-10:46:53 start mysqld_3308 
2019-04-04-10:46:59 online ddl begin 
2019-04-04-11:20:34 online ddl stop 
2019-04-04-11:20:34 begin stop mysqld_3308 
2019-04-04-11:20:36 begin rm -rf datadir and cp -rp datadir_bak 
2019-04-04-11:22:48 start mysqld_3308 
2019-04-04-11:22:53 pt-osc begin 
2019-04-04-12:19:15 pt-osc stop 

Online DDL took 34 minutes and pt-osc took 57 minutes; online DDL needed roughly half the time of the pt-osc tool.


6. Implementation

Since this is a master-slave pair and the application connects through a VIP, online DDL is used to drop and rebuild the indexes.
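One optional hardening, not in the original script: state ALGORITHM and LOCK explicitly, so MySQL raises an error instead of silently falling back to a blocking table copy if the change cannot run online. For example:

ALTER TABLE arrival_record
  ADD INDEX idx_receive_time (receive_time),
  ALGORITHM=INPLACE, LOCK=NONE;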

Stop replication, run the DDL on the slave first (without writing binlog), then switch roles and run it on the new slave, again without writing binlog:

# $passwd, $port, $db_, $table_name and $log_file are set earlier in the full script

# Timestamped logging and exit-on-error helpers
function red_echo () { 
        local what="$*" 
        echo -e "$(date +%F-%T)  ${what}" 
} 
 
function check_last_comm(){ 
    if [ "$1" != "0" ];then 
        red_echo "$2" 
        echo "exit 1" 
        exit 1 
    fi 
} 
 
red_echo "stop slave" 
mysql -uroot -p$passwd --socket=/datas/mysql/data/${port}/mysqld.sock -e"stop slave" 
check_last_comm "$?" "stop slave failed" 
 
red_echo "online ddl begin" 
mysql -uroot -p$passwd --socket=/datas/mysql/data/${port}/mysqld.sock -e"set sql_log_bin=0;select now() as ddl_start;ALTER TABLE $db_.\`${table_name}\` DROP FOREIGN KEY FK_arrival_record_product,drop index IXFK_arrival_record,add index idx_product_id_sequence_station_no(product_id,sequence,station_no),add index idx_receive_time(receive_time);select now() as ddl_stop" >>${log_file} 2>&1 
red_echo "online ddl stop" 
 
red_echo "add foreign key" 
mysql -uroot -p$passwd --socket=/datas/mysql/data/${port}/mysqld.sock -e"set sql_log_bin=0;ALTER TABLE $db_.${table_name} ADD CONSTRAINT _FK_${table_name}_product FOREIGN KEY (product_id) REFERENCES cq_new_cimiss.product (id) ON DELETE NO ACTION ON UPDATE NO ACTION;" >>${log_file} 2>&1 
check_last_comm "$?" "add foreign key error" 
red_echo "add foreign key stop" 
 
red_echo "start slave" 
mysql -uroot -p$passwd --socket=/datas/mysql/data/${port}/mysqld.sock -e"start slave" 
check_last_comm "$?" "start slave failed" 

Execution time:

2019-04-08-11:17:36 stop slave 
mysql: [Warning] Using a password on the command line interface can be insecure. 
ddl_start 
2019-04-08 11:17:36 
ddl_stop 
2019-04-08 11:45:13 
2019-04-08-11:45:13 online ddl stop 
2019-04-08-11:45:13 add foreign key 
mysql: [Warning] Using a password on the command line interface can be insecure. 
2019-04-08-12:33:48 add foreign key stop 
2019-04-08-12:33:48 start slave 

Dropping and rebuilding the indexes took 28 minutes; adding the foreign key constraint took another 48 minutes.

Check the execution plans of the delete and select statements again:

explain select count(*) from arrival_record where receive_time < STR_TO_DATE('2019-03-10', '%Y-%m-%d')\G 
*************************** 1. row *************************** 
id: 1 
select_type: SIMPLE 
table: arrival_record 
partitions: NULL 
type: range 
possible_keys: idx_receive_time 
key: idx_receive_time 
key_len: 6 
ref: NULL 
rows: 7540948 
filtered: 100.00 
Extra: Using where; Using index 
 
explain select count(*) from arrival_record where product_id=26 and receive_time between '2019-03-25 14:00:00' and '2019-03-25 15:00:00' and receive_spend_ms>=0\G; 
*************************** 1. row *************************** 
id: 1 
select_type: SIMPLE 
table: arrival_record 
partitions: NULL 
type: range 
possible_keys: idx_product_id_sequence_station_no,idx_receive_time 
key: idx_receive_time 
key_len: 6 
ref: NULL 
rows: 291448 
filtered: 16.66 
Extra: Using index condition; Using where

The idx_receive_time index is now used, and the number of rows scanned has dropped dramatically.

7. After index optimization

delete still takes 77s:

delete from arrival_record where receive_time < STR_TO_DATE('2019-03-10', '%Y-%m-%d')\G

Even through the receive_time index, deleting more than 3 million rows still took 77s.

8. Optimizing the large delete into small batches

The application side was changed to delete 10 minutes' worth of data per run (each run takes about 1s), and the xxx SLA (master-slave delay) alarm no longer fires.
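Each such run amounts to something like the following (hypothetical window boundaries; with receive_time indexed, each run touches only a small range):

DELETE FROM arrival_record
WHERE receive_time >= '2019-02-23 00:00:00'
  AND receive_time <  '2019-02-23 00:10:00';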

Another method is to delete 20,000 rows at a time in primary-key order (a looped version is sketched after the snippet below):

# Get the maximum primary key id satisfying the time condition (run once),
# so the deletes can scan small batches in primary-key order
SELECT MAX(id) INTO @need_delete_max_id FROM `arrival_record` WHERE receive_time<'2019-03-01';
DELETE FROM arrival_record WHERE id<@need_delete_max_id LIMIT 20000;
select ROW_COUNT();  # returns 20000
 
# After each small-batch delete, ROW_COUNT() returns the number of rows deleted.
# The program checks it: if non-zero, run the delete again; if zero, the
# deletion is complete and the loop exits.
DELETE FROM arrival_record WHERE id<@need_delete_max_id LIMIT 20000;
select ROW_COUNT();
# Program sleeps 0.5s between batches
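For completeness, the loop above written as a MySQL stored procedure (a minimal sketch; the procedure name is made up, and in practice the application usually drives the loop and the sleep itself):

DELIMITER $$
CREATE PROCEDURE batch_delete_arrival_record()
BEGIN
  DECLARE affected INT DEFAULT 1;
  # Boundary query, run once
  SELECT MAX(id) INTO @need_delete_max_id
  FROM arrival_record WHERE receive_time < '2019-03-01';
  WHILE affected > 0 DO
    DELETE FROM arrival_record
    WHERE id < @need_delete_max_id LIMIT 20000;
    SET affected = ROW_COUNT();
    DO SLEEP(0.5);  # throttle to limit master pressure and replication lag
  END WHILE;
END$$
DELIMITER ;

CALL batch_delete_arrival_record();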

9. Summary

When a table grows very large, pay attention not only to the response time of queries against it, but also to its maintenance cost (for example, DDL on the table takes too long, and so does deleting historical data).

When running DDL on a large table, consider the table's actual situation (for example, whether it has foreign keys) to choose an appropriate DDL method.

Delete large volumes of data in small batches to reduce the pressure on the master instance and keep master-slave delay down.

Original link: https://www.cnblogs.com/YangJiaXin/p/10828244.html
