Fault analysis | effectively solve the timeout problem of MySQL row lock waiting [recommended collection]

By xuty Source: original contribution *Aikesheng is produced by the open source community. The original content cannot be used without authorization. Please contact the editor for reprint and indicate the source.

1, Background

#### 20191219 10:10:10,234 | com.alibaba.druid.filter.logging.Log4jFilter.statementLogError(Log4jFilter.java:152) | ERROR |  {conn-10593, pstmt-38675} execute error. update operation_service set offlinemark = ? , resourcestatus = ?  where RowGuid = ?
com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Lock wait timeout exceeded; try restarting transaction

The above error, more or less, should have been encountered by students contacting MySQL. Professionally speaking, we call this error a lock waiting timeout. According to the type of lock, it is mainly divided into:

  • Row lock wait timeout When SQL times out due to waiting for row lock, it is row lock waiting timeout, which often occurs in multi concurrent transaction scenarios.

  • Metadata lock wait timeout When SQL times out due to waiting for metadata lock, it is called metadata lock waiting timeout, which usually occurs during DDL operation.

This article only introduces how to effectively solve row lock wait timeout, because most projects are such errors, metadata lock wait timeout is not covered.

2, Row lock waiting

Before introducing how to solve the problem of row lock waiting, this paper briefly introduces the causes of this kind of problem. Brief description of the causes: when multiple transactions operate (add, delete, and change) a row of data at the same time, MySQL will use the form of lock to prevent multiple transactions from operating a row of data at the same time, so as to avoid data inconsistency. Only the transaction allocated to the row lock has the right to operate the data row. Until the transaction ends, the row lock will be released, while other transactions not allocated to the row lock will generate row lock wait. If the waiting time exceeds the configured value (that is, the value of InnoDB ﹣ lock ﹣ wait ﹣ timeout parameter, which is configured as 5s for personal habits and 50s for MySQL by default), a row lock waiting timeout error will be thrown.

As shown in the figure above, transaction A and transaction B will Insert A piece of data with A primary key value of 1 at the same time. Because transaction A first obtains A row lock with A primary key value of 1, transaction B will wait for the row lock because it cannot be acquired. After transaction A commits, transaction B will acquire the row lock and complete the commit.

The concept of row lock is emphasized here. Although transaction B repeatedly inserts the primary key, before acquiring the row lock, the transaction is always in the state of row lock waiting. Only after acquiring the row lock, the primary key conflict error will be reported.

Of course, this problem of inter row lock conflict is relatively rare. It only appears in a large number of concurrent insert scenarios. The real common problem in projects is row lock waiting between update and delete. Here is just for example, and the principle is the same.

3, Causes

According to the problems I have been exposed to before, they can be roughly divided into the following reasons:

1. Transaction is suspended due to non database interaction in the program If the non database interaction such as interface call or file operation is embedded in the SQL transaction code, the whole transaction is likely to be suspended (interface does not wait for timeout or upload and download large attachments).

2. The transaction contains query SQL with poor performance There is a slow query in the transaction, which causes other DML S in the same transaction to fail to release the occupied row lock in time, resulting in row lock waiting.

3. A single transaction contains a large number of SQL It is usually caused by adding for loop in transaction code. Although a single SQL runs very fast, when the number of SQL is large, the transaction will be very slow.

4. Long execution time of cascaded update SQL This kind of SQL is easy to give people the illusion, for example: update A set... Where... In (select b) this kind of cascade update will not only occupy the row lock on table A, but also the row lock on table B. when the SQL executes for A long time, it is easy to cause the row lock waiting on table B.

5. Transaction pending due to disk problems In rare cases, for example, when the storage is suddenly offline, SQL execution will be stuck in the steps of the kernel calling the disk, waiting all the time, and the transaction cannot be committed. In conclusion, if the transaction has not been committed for a long time and DML operation is included in the transaction, it is possible to generate row lock waiting and error reporting.

4, Positioning difficulties

When a row lock timeout error occurs in the web log, many developers will come to me to troubleshoot the problem. Here's the difficulty of problem location!

  1. MySQL itself does not actively record the information about row lock waiting, so it is unable to effectively carry out post analysis.
  2. There are many reasons for lock contention, so it is difficult to determine which kind of problem scenario is after the event, especially when the problem cannot be reproduced after the event.
  3. After finding the Problem SQL, the development can not effectively mine the complete transaction from the code, which is also related to the company framework product project architecture. It needs to rely on the DBA to collect the complete transaction SQL afterwards for analysis.

5, Common methods

First, it introduces how individuals usually solve such problems. Here, the premise of problem solving is that the problem can be repeated. As long as it doesn't appear once, and then it doesn't appear again, the source of the problem can generally be found.

There are two scenarios for the recurrence of the problem:

  1. Manual recurrence As long as you follow a certain operation, you can repeat the error report. This scenario is simpler!

  2. Random recurrence I don't know when the error will be reported suddenly, and it can't be repeated manually. This kind of scene is difficult!

Next, write down a unified simulation scenario to reproduce the row lock timeout problem for your understanding:

--Table structure
CREATE TABLE `emp` (
`id` int(11) NOT NULL,
KEY `idx_id` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4

//Insert a 100w row record from 1 to 100w.

--Test process:
//Transaction 1:
start transaction;
delete from emp where id = 1;
select * from emp where id in (select id from emp);   -->Simulate slow query, which takes a long time to execute, so the transaction is not committed and the row lock is not released.
commit;

//Transaction 2:
start transaction;
delete from emp where id < 10;   --> Waiting id=1 When the row lock timeout is reached (here I have configured the timeout to be 5 s)After that, an error is reported when the row lock is over
rollback;

5.1 manual recurrence scenario

In this scenario, we usually only need to know the current innodb row lock waiting situation of MySQL through the innodb row lock waiting script. For example, we simulate the above error reporting scenario (simulating page operation) on the one hand, and use script query on the other hand (we need to query before the timeout, otherwise we will not see the error after the timeout).

/*innodb Line lock wait script*/
SELECT r.trx_mysql_thread_id waiting_thread,r.trx_query waiting_query,
concat(timestampdiff(SECOND,r.trx_wait_started,CURRENT_TIMESTAMP()),'s') AS duration,
b.trx_mysql_thread_id blocking_thread,t.processlist_command state,b.trx_query blocking_current_query,e.sql_text blocking_last_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
JOIN performance_schema.threads t on t.processlist_id = b.trx_mysql_thread_id
JOIN performance_schema.events_statements_current e USING(thread_id)

As shown above, we can see that the thread id of transaction 2 is 76, which has been blocked by transaction 1, that is, transaction with thread id 75 for 3s, and we can see that the currently executed SQL of transaction 1 is a SELECT. It also explains that many developers often ask me why SELECT also blocks other sessions?

If this happens, it's very easy to deal with it. It's good to optimize the SELECT, but it can't be optimized. Just throw the query out of the transaction, and even don't need to dig out the whole transaction.

In fact, the simulation of the above problem corresponds to the second point in the third section (query SQL with poor performance is included in the transaction). Next, we simulate the first point (transaction suspension caused by non database interaction in the program) and compare the following phenomena.

We just need to change the process of transaction 1 to the following.

Transaction 1:
start transaction;
delete from emp where id = 1;
select * from emp where id in (select id from emp);
Wait for 60s (don't do anything) - > the analog interface call times out, the transaction is rammed, and then the commit is executed.
commit;

Using script again, we can see that the phenomenon is different. The difference is that the blocking transaction is in sleep state, that is, the transaction is not running SQL at present. From the perspective of DBA, the eight achievements of this kind of phenomenon can be concluded that the code embeds other interactive operations in the transaction to cause the transaction suspension (in addition, it may also be the transaction deadlock caused by network problems), because the program is not like human beings, it will not be lazy, it will not appear the transaction execution to half, take a rest and commit again.

If it is a problem of this kind of phenomenon, because the essence is not the transaction suspension caused by slow SQL, we must find the corresponding point in the code to see what interaction is stuck.

Here, we need to develop to check the code to find the source, but the only available information is the last SQL executed by the transaction, which is the last column in the figure above. From my previous experience (in the vast majority), it is difficult to find the specific location of the code from this SQL alone, especially when this SQL is a very common one SQL, more difficult!

In the face of this situation, it is necessary for DBA s to find out all the SQL executed by this transaction, and then let developers check the code, which should be much less difficult. Here, you need to use the general log of MySQL, which is used to record all the running SQL in MySQL.

--See general_log Open or not, and file name
mysql> show variables like '%general_log%';
+------------------+--------------------------------------+
| Variable_name    | Value                                |
+------------------+--------------------------------------+
| general_log      | OFF                                  |
| general_log_file | /data/mysql_data/192-168-188-155.log |
+------------------+--------------------------------------+

--Temporarily on general_log
mysql> set global general_log = 1;
Query OK, 0 rows affected (0.00 sec)

--Temporarily closed general_log
mysql> set global general_log = 0;
Query OK, 0 rows affected (0.00 sec)

After opening the general log, wait for the thread ID in the query result of the script through the innodb row lock during manual replication, and go to the general log to find the corresponding transaction analysis, as follows:

According to the thread ID, it is easy to find the transaction operation corresponding to the time point from the general? Log (in the actual scenario, it may need to be filtered through the pipeline command). As shown in the figure above, all the SQL of transaction 1 and transaction 2 can be found, and then the corresponding location can be found in the code through these SQL. For example, for the transaction with thread ID 111 in the figure above, after executing select * from emp where id in (select id from emp), the reason is that the SQL is either Queries are slow, or the code is performing other interactions.

PS: general log will record all SQL, so it has a great impact on MySQL performance and is prone to skyrocketing, so it is only temporarily turned on during troubleshooting. Please turn it off in time after troubleshooting!

5.2 random recurrence scenarios

Compared with the manual recurrence scenario, because of its randomness, this scenario can't simulate error reporting and query the specific blocking situation through scripts at the same time. Therefore, other ways are needed to monitor the blocking situation of MySQL. I usually run the monitoring script (InnoDB ﹣ lock ﹣ monitor. SH) in the background on Linux to record MySQL blocking. The script is as follows:

#!/bin/bash

#Account, password and monitoring log
user="root"
password="Gepoint"
logfile="/root/innodb_lock_monitor.log"

while true
do
num=`mysql -u${user} -p${password} -e "select count(*) from information_schema.innodb_lock_waits" |grep -v count`
if [[ $num -gt 0 ]];then
date  >> /root/innodb_lock_monitor.log
mysql -u${user} -p${password} -e  "SELECT r.trx_mysql_thread_id waiting_thread,r.trx_query waiting_query, \
concat(timestampdiff(SECOND,r.trx_wait_started,CURRENT_TIMESTAMP()),'s') AS duration,\
b.trx_mysql_thread_id blocking_thread,t.processlist_command state,b.trx_query blocking_query,e.sql_text \
FROM information_schema.innodb_lock_waits w \
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id \
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id \
JOIN performance_schema.threads t on t.processlist_id = b.trx_mysql_thread_id \
JOIN performance_schema.events_statements_current e USING(thread_id) \G " >> ${logfile}
fi
sleep 5
done

View again

--Use nohup Command to run the monitoring script in the background
[root@192-168-188-155 ~]# nohup sh innodb_lock_monitor.sh  &
[2] 31464
nohup: ignoring input and appending output to 'nohup.out'

--See nohup.out Whether there is an error
[root@192-168-188-155 ~]# tail -f nohup.out
mysql: [Warning] Using a password on the command line interface can be insecure.
mysql: [Warning] Using a password on the command line interface can be insecure.
mysql: [Warning] Using a password on the command line interface can be insecure.

--Regularly check whether the monitoring log has output(If there is no output, this log will not be generated!)
[root@192-168-188-155 ~]# tail -f innodb_lock_monitor.log
Wed Feb  5 11:30:11 CST 2020
*************************** 1\. row ***************************
waiting_thread: 112
waiting_query: delete from emp where id < 10
duration: 3s
blocking_thread: 111
state: Sleep
blocking_query: NULL
sql_text: select * from emp where id in (select id from emp)

When the monitoring log has output blocking information, the subsequent solution is consistent with the previous manual replication scenario.

  • If the transaction is stuck in slow SQL, you need to optimize SQL.

  • If the transaction is suspended, analyze the transaction through the general log, and then find the specific code location.

PS: after troubleshooting, please close the background monitoring process in time, and directly close it by kill+pid!

6, Performance? Schema

The previous method is not very convenient, because the general log needs to access the server, and filtering analysis is also difficult. It needs a certain MySQL foundation and Linux foundation to apply, so we want to find a more simple method.

6.1 method introduction

My idea is to use the performance ﹣ schema performance engine provided by MySQL 5.5 to analyze. Performance ﹣ schema is a feature provided by Mysql to monitor the performance of MySQL server at the bottom of the system. It provides a large number of monitoring items, including lock, IO, transaction, memory usage, etc.

Introduce the following main principles:

  1. The main tables used are 2 events, transactions, history, and events, statements, history, long.
  2. Transactions > History > long records historical transaction information, and events > statements > History > long records historical SQL.
  3. Get the thread ID of the rollback transaction from transactions [history] long, then filter out the suspicious transaction according to the time range, and finally get the SQL corresponding to the transaction from events [statements] history [long, and find out which is the source.

advantage:

  1. You do not need to get the transaction SQL through the general? Log.
  2. There is no need to monitor the script to get the row lock wait.
  3. It can be implemented only by accessing MySQL, without accessing the server.
  4. The performance overhead is small and will not skyrocket, because it is written by loop coverage.
  5. You can know the run time of each SQL.

Disadvantages:

  1. By default, there are only a few records in the history ﹣ long related tables, which may brush off useful data, especially in systems where SQL runs more.
  2. If you want to increase the maximum number of reserved rows in the history_long related table, you need to restart MySQL and cannot modify the parameters online.
  3. The time in the history ﹣ long related table records is relative time, that is, the time to start MySQL. It doesn't seem very convenient.
  4. The history long related table will not actively record the information of row lock waiting, so it can only select suspicious transactions according to the time range first, and then analyze them further, which is not as accurate as script monitoring and positioning.
/*To enable performance ﹣ schema related monitoring items, you need to enable performance ﹣ schema in advance*/
UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES' where name = 'transaction';
UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' where name like '%events_transactions%';
UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' where name like '%events_statements%';

/*Check the rollback transaction SQL to confirm whether it is the transaction with error in the log*/
SELECT a.THREAD_ID
,b.EVENT_ID
,a.EVENT_NAME
,CONCAT (b.TIMER_WAIT / 1000000000000,'s') AS trx_druation
,CONCAT (a.TIMER_WAIT / 1000000000000,'s') sql_druation
,a.SQL_TEXT,b.STATE,a.MESSAGE_TEXT
FROM performance_schema.events_statements_history_long a
JOIN performance_schema.events_transactions_history_long b ON a.THREAD_ID = b.THREAD_ID
AND (a.NESTING_EVENT_ID = b.EVENT_ID OR a.EVENT_ID = b.NESTING_EVENT_ID)
WHERE b.autocommit = 'NO' AND a.SQL_TEXT IS NOT NULL AND b.STATE = 'ROLLED BACK'

/*View the transaction SQL of suspicious transactions over 5s in this time period. Here, the default InnoDB? Lock? Wait? Timeout is 5s*/
SELECT a.THREAD_ID
,b.EVENT_ID
,a.EVENT_NAME
,CONCAT (b.TIMER_WAIT / 1000000000000,'s') AS trx_druation
,CONCAT (a.TIMER_WAIT / 1000000000000,'s') sql_druation
,a.SQL_TEXT,b.STATE,a.MESSAGE_TEXT,a.ROWS_AFFECTED,a.ROWS_EXAMINED,a.ROWS_SENT
FROM performance_schema.events_statements_history_long a
JOIN performance_schema.events_transactions_history_long b ON a.THREAD_ID = b.THREAD_ID
AND (a.NESTING_EVENT_ID = b.EVENT_ID OR a.EVENT_ID = b.NESTING_EVENT_ID)
WHERE b.autocommit = 'NO' AND SQL_TEXT IS NOT NULL AND b.STATE = 'COMMITTED'
AND b.TIMER_WAIT / 1000000000000  > 5
AND b.TIMER_START < (SELECT TIMER_START FROM performance_schema.events_transactions_history_long
WHERE THREAD_ID = 70402  /*Thread ID in the above SQL query results*/
AND EVENT_ID = 518)      /*Event ID in the above SQL query results*/
AND b.TIMER_END > ( SELECT TIMER_END FROM performance_schema.events_transactions_history_long
WHERE THREAD_ID = 70402  /*Thread ID in the above SQL query results*/
AND EVENT_ID = 518)     /*Event ID in the above SQL query results*/
ORDER BY a.THREAD_ID

6.2 test simulation

If this method is used, there is no need to split the dynamic or random recurrence. The operation methods are the same. How to operate it is simulated as follows:

  1. First, use the above method to open performance ﹣ schema related monitoring items, which will take effect directly without restarting MySQL.
  2. Then repeat the problem. Here it's better to manually repeat (because if you don't check it in time, the monitoring data may be brushed off). If you can't, you can only wait for random repeat.
  3. After the problem is repeated, query whether there is a rollback transaction through the above script (that is, the transaction rolled back due to the timeout of row lock).

  1. Then, according to the thread ID and event ID of the rollback transaction, it is brought into the last script to view the suspicious transaction for analysis.

Because this is a test environment simulation, the results are very clear. There may be many actual output results on the project. It is necessary to analyze whether there is a problem in the transaction one by one!

7, Summary

After the actual test, it is found that there are many restrictions to check the row lock waiting timeout by performance ﹣ schema, and the final analysis is also a technical work, which is not as simple as initially thought, which is a bit contrary. Three points are the most difficult to deal with when troubleshooting problems through performance ﹣ schema: 1. How to convert relative time into absolute time? There is no good way to solve this problem. 2. The downlink lock waiting information will not be recorded actively, so it can only be further analyzed after swiping the time node. 3. The problem of records being flushed is that it is a memory table, which is easy to overflow due to large settings, and easy to be flushed out due to small settings.

Tags: Database SQL MySQL less

Posted on Tue, 21 Apr 2020 02:02:36 -0700 by oyse