Technology Sharing | Optimizing InnoDB's Primary Key

Author: Yves Trudeau Translator: Guan Changlong

Preface

As Percona's chief architect, one of my main responsibilities is to optimize the performance of our customers' databases, which makes the work both complex and interesting. In this article, I want to discuss one of the most important topics: choosing the best InnoDB primary key.

What's special about InnoDB primary keys?

InnoDB is an index-organized storage engine: the primary key is a B-Tree that stores the data, that is, the table rows, in its leaf pages. This means InnoDB must always have a primary key. If a table does not define one, InnoDB adds a hidden, automatically incremented 6-byte counter and uses it as the primary key. The hidden primary key has drawbacks, one being that the counter is global, shared by all tables that lack an explicit primary key, so you should always define an explicit primary key on your tables and access all InnoDB rows by primary key value. InnoDB secondary indexes are also B-Trees: the search key is composed of the index columns, and the values stored in the leaves are the primary key values of the matching rows. A search through a secondary index therefore normally results in an implicit search by primary key.
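
To make this concrete, here is a minimal sketch; the table and column names are made up for illustration. The full rows live in the leaf pages of the PRIMARY B-Tree, while the secondary index stores (email, id) entries, so a lookup by email finds the id first and then descends the primary key B-Tree:

	CREATE TABLE users (
	  id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
	  email VARCHAR(100) NOT NULL,
	  name  VARCHAR(100) NOT NULL,
	  PRIMARY KEY (id),        -- clustered index: the leaf pages hold the full rows
	  KEY idx_email (email)    -- secondary index: its leaf entries store (email, id)
	) ENGINE=InnoDB;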

What is B-Tree?

A B-Tree is a data structure optimized for operations on block devices. Block devices, i.e. disks, have a significant data access latency, especially mechanical hard disks. Retrieving a single byte at a random location does not take much less time than retrieving a larger chunk of data such as a full page; that is the fundamental premise behind B-Trees, and InnoDB uses data pages of 16 KB. Let's try to describe a B-Tree simply. A B-Tree is organized around a key, and that key is used to search the data in the tree. A B-Tree normally has multiple levels. The data is stored only in the bottom level, the leaf nodes; the pages of the other levels, the node pages, contain only keys and pointers to pages in the next level. To access the data for a key value, you start from the top node, the root node, compare the keys it contains with the search value, and so find the page to visit at the next level. The process is repeated until you reach the last level, the leaf nodes. In theory, every B-Tree level adds one disk read per lookup; in practice, the node pages are almost always cached in memory, because they are few in number and frequently accessed.

[Figure: a simple three-level B-Tree structure]
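
If you want a rough look at how an existing index splits into leaf and node pages, the persistent statistics can help. This is just a sketch: it assumes persistent statistics are enabled (the default in modern MySQL); 'size' is the total number of pages in the index and 'n_leaf_pages' counts only the leaves, so the difference is the number of node pages:

	SELECT index_name, stat_name, stat_value
	FROM mysql.innodb_index_stats
	WHERE database_name = 'sbtest'
	  AND table_name = 'sbtest1'
	  AND stat_name IN ('size', 'n_leaf_pages');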

Ordered Insertion Example

Let's consider the following sysbench table:

	mysql> show create table sbtest1\G
	*************************** 1. row ***************************
	       Table: sbtest1
	Create Table: CREATE TABLE `sbtest1` (
	  `id` int(11) NOT NULL AUTO_INCREMENT,
	  `k` int(11) NOT NULL DEFAULT '0',
	  `c` char(120) NOT NULL DEFAULT '',
	  `pad` char(60) NOT NULL DEFAULT '',
	  PRIMARY KEY (`id`),
	  KEY `k_1` (`k`)
	) ENGINE=InnoDB AUTO_INCREMENT=3000001 DEFAULT CHARSET=latin1
	1 row in set (0.00 sec)
	
	mysql> show table status like 'sbtest1'\G
	*************************** 1. row ***************************
	           Name: sbtest1
	         Engine: InnoDB
	        Version: 10
	     Row_format: Dynamic
	           Rows: 2882954
	 Avg_row_length: 234
	    Data_length: 675282944
	Max_data_length: 0
	   Index_length: 47775744
	      Data_free: 3145728
	 Auto_increment: 3000001
	    Create_time: 2018-07-13 18:27:09
	    Update_time: NULL
	     Check_time: NULL
	      Collation: latin1_swedish_ci
	       Checksum: NULL
	 Create_options:
	        Comment:
	1 row in set (0.00 sec)

The Data_length value is the size of the primary key B-Tree, and Index_length is the size of the secondary index B-Tree, the k_1 index. Because the id primary key is auto-incremented, the sysbench rows were inserted in primary key order. When rows are inserted in primary key order, InnoDB fills its pages with only up to about 15 KB of data, even if innodb_fill_factor is set to 100, leaving some free space. This spare room lets rows grow or be updated after the initial insert without immediately forcing page splits. Pages also contain a header and a footer. If a page is too full to accept more data, it is split in two. Conversely, if two adjacent pages are each less than 50% full, InnoDB merges them. For example, this is the sysbench table inserted in ID order:

	mysql> select count(*), TABLE_NAME,INDEX_NAME, avg(NUMBER_RECORDS), avg(DATA_SIZE) from information_schema.INNODB_BUFFER_PAGE 
	    -> WHERE TABLE_NAME='`sbtest`.`sbtest1`' group by TABLE_NAME,INDEX_NAME order by count(*) desc;
	+----------+--------------------+------------+---------------------+----------------+
	| count(*) | TABLE_NAME         | INDEX_NAME | avg(NUMBER_RECORDS) | avg(DATA_SIZE) |
	+----------+--------------------+------------+---------------------+----------------+
	|    13643 | `sbtest`.`sbtest1` | PRIMARY    |             75.0709 |     15035.8929 |
	|       44 | `sbtest`.`sbtest1` | k_1        |           1150.3864 |     15182.0227 |
	+----------+--------------------+------------+---------------------+----------------+
	2 rows in set (0.09 sec)
	
	mysql> select PAGE_NUMBER,NUMBER_RECORDS,DATA_SIZE,INDEX_NAME,TABLE_NAME from information_schema.INNODB_BUFFER_PAGE 
	    -> WHERE TABLE_NAME='`sbtest`.`sbtest1`' order by PAGE_NUMBER limit 1;
	+-------------+----------------+-----------+------------+--------------------+
	| PAGE_NUMBER | NUMBER_RECORDS | DATA_SIZE | INDEX_NAME | TABLE_NAME         |
	+-------------+----------------+-----------+------------+--------------------+
	|           3 |             35 |       455 | PRIMARY    | `sbtest`.`sbtest1` |
	+-------------+----------------+-----------+------------+--------------------+
	1 row in set (0.04 sec)
	
	mysql> select PAGE_NUMBER,NUMBER_RECORDS,DATA_SIZE,INDEX_NAME,TABLE_NAME from information_schema.INNODB_BUFFER_PAGE 
	    -> WHERE TABLE_NAME='`sbtest`.`sbtest1`' order by NUMBER_RECORDS desc limit 3;
	+-------------+----------------+-----------+------------+--------------------+
	| PAGE_NUMBER | NUMBER_RECORDS | DATA_SIZE | INDEX_NAME | TABLE_NAME         |
	+-------------+----------------+-----------+------------+--------------------+
	|          39 |           1203 |     15639 | PRIMARY    | `sbtest`.`sbtest1` |
	|          61 |           1203 |     15639 | PRIMARY    | `sbtest`.`sbtest1` |
	|          37 |           1203 |     15639 | PRIMARY    | `sbtest`.`sbtest1` |
	+-------------+----------------+-----------+------------+--------------------+
	3 rows in set (0.03 sec)

This table does not fit entirely in the buffer pool, but the queries still give us a good picture. The pages of the primary key B-Tree contain an average of 75 records and store a bit under 15 KB of data. Although the k values are inserted in random order, the k_1 index pages are nearly full: sysbench creates that index after inserting the rows, and InnoDB builds it with a sort file. You can also easily estimate the number of levels in the InnoDB B-Tree. The table above needs about 40k leaf pages (3M / 75). With a four-byte integer primary key, each node page holds about 1200 pointers, so there are about 35 node pages above the leaves (40k / 1200) and then a single root node (PAGE_NUMBER = 3): three levels in total.
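
If it helps to see the arithmetic, here is the same estimate written as a throwaway SQL expression; the 3M, 75 and 1200 figures come straight from the numbers above:

	SELECT
	    CEILING(3000000 / 75)        AS leaf_pages,  -- ~40,000 leaf pages
	    CEILING(3000000 / 75 / 1200) AS node_pages,  -- node pages above the leaves (roughly 35)
	    1                            AS root_pages;  -- a single root page: three levels in total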

An example of random insertion

If you are a keen observer, you noticed that when rows are inserted in random order of primary key values, the pages end up discontiguous and the average fill factor drops to only about 65-75%. I modified sysbench to insert the rows in random ID order and created a table with 3M rows. The resulting table is much larger:

	mysql> show table status like 'sbtest1'\G
	*************************** 1. row ***************************
	           Name: sbtest1
	         Engine: InnoDB
	        Version: 10
	     Row_format: Dynamic
	           Rows: 3137367
	 Avg_row_length: 346
	    Data_length: 1088405504
	Max_data_length: 0
	   Index_length: 47775744
	      Data_free: 15728640
	 Auto_increment: NULL
	    Create_time: 2018-07-19 19:10:36
	    Update_time: 2018-07-19 19:09:01
	     Check_time: NULL
	      Collation: latin1_swedish_ci
	       Checksum: NULL
	 Create_options:
	        Comment:
	1 row in set (0.00 sec)

While the B-Tree primary key built by inserting in ID order is 644 MB, the one built by inserting in random order is about 1 GB, an increase of 60%. Obviously, the page fill factor is lower:

	mysql> select count(*), TABLE_NAME,INDEX_NAME, avg(NUMBER_RECORDS), avg(DATA_SIZE) from information_schema.INNODB_BUFFER_PAGE 
	    -> WHERE TABLE_NAME='`sbtestrandom`.`sbtest1`' group by TABLE_NAME,INDEX_NAME order by count(*) desc;
	+----------+--------------------------+------------+---------------------+----------------+
	| count(*) | TABLE_NAME               | INDEX_NAME | avg(NUMBER_RECORDS) | avg(DATA_SIZE) |
	+----------+--------------------------+------------+---------------------+----------------+
	|     4022 | `sbtestrandom`.`sbtest1` | PRIMARY    |             66.4441 |     10901.5962 |
	|     2499 | `sbtestrandom`.`sbtest1` | k_1        |           1201.5702 |     15624.4146 |
	+----------+--------------------------+------------+---------------------+----------------+
	2 rows in set (0.06 sec)

With random-order inserts, the primary key pages now hold only about 10 KB of data (a fill factor of roughly 66%). This is a normal and expected result, but for some workloads it is bad news.
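
If you want to check the fill factor of your own tables, a rough approximation can be computed directly from the buffer pool. Treat the number as an estimate: only the pages currently cached are counted, and dividing by the full 16 KB ignores the page header and footer:

	SELECT INDEX_NAME,
	       ROUND(AVG(DATA_SIZE) / 16384 * 100, 1) AS approx_fill_pct
	FROM information_schema.INNODB_BUFFER_PAGE
	WHERE TABLE_NAME = '`sbtestrandom`.`sbtest1`'
	GROUP BY INDEX_NAME;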

Determine the type of workload

The first step is to determine the type of workload. An insert-intensive workload typically looks like this: the top query forms are inserts into a few large tables, and the database writes heavily to disk. If you repeat "show processlist;" in the MySQL client, you see those inserts over and over. This is typical of applications that log large amounts of data: many data collectors all wait to insert their data, and if the wait gets too long, some data may be lost. If you have a tight SLA on insert time and a loose SLA on read time, you clearly have an insert-oriented workload, and you should insert rows in primary key order.

You may also have a healthy insert rate on large tables where the inserts are queued and applied in batches. Nobody is really waiting for these inserts to complete, and the server easily keeps up with their volume. What matters for your application is the large number of read queries hitting the big tables, not the inserts. You have already been through query tuning, and even with good indexes, the database reads from disk at a very high rate. When you look at the MySQL process list, you see the same forms of select queries against the large tables again and again. The only option seems to be adding more memory to reduce disk reads, but the tables are growing quickly and you cannot keep adding memory forever.

If you are not sure whether you have a heavy insert or a heavy read workload, you probably just do not have a heavy workload. In that case, the default is to use ordered inserts, and the best way to achieve that with InnoDB is an auto-increment integer primary key. This is the default behavior of many ORMs.
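
For a quick, point-in-time look at which statement forms currently dominate the server, you can group the live process list instead of eyeballing repeated "show processlist;" output. This is only a rough sketch, not a substitute for a proper slow-log analysis with pt-query-digest:

	SELECT COUNT(*)       AS threads,
	       LEFT(INFO, 60) AS query_prefix
	FROM information_schema.PROCESSLIST
	WHERE COMMAND = 'Query' AND INFO IS NOT NULL
	GROUP BY query_prefix
	ORDER BY threads DESC;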

Read-intensive workload

I have seen many read-intensive workloads, mainly in online games and social networking applications. On top of that, some games have social networking features, such as showing friends' scores during the game. Before we go any further, we need to confirm that the reads are inefficient. Reads are inefficient when the top select query forms access a number of distinct InnoDB pages close to the number of rows they examine. I analyze the MySQL slow query log with the pt-query-digest tool; both quantities are reported when the slow log verbosity level includes "InnoDB". Here is an example of the output (I deleted some lines):

	# Query 1: 2.62 QPS, 0.00x concurrency, ID 0x019AC6AF303E539E758259537C5258A2 at byte 19976
	# This item is included in the report because it matches --limit.
	# Scores: V/M = 0.00
	# Time range: 2018-07-19T20:28:02 to 2018-07-19T20:28:23
	# Attribute    pct   total     min     max     avg     95%  stddev  median
	# ============ === ======= ======= ======= ======= ======= ======= =======
	# Count         48      55
	# Exec time     76    93ms   637us     3ms     2ms     2ms   458us     2ms
	# Lock time    100    10ms    72us   297us   182us   247us    47us   176us
	# Rows sent    100   1.34k      16      36   25.04   31.70    4.22   24.84
	# Rows examine 100   1.34k      16      36   25.04   31.70    4.22   24.84
	# Rows affecte   0       0       0       0       0       0       0       0
	# InnoDB:
	# IO r bytes     0       0       0       0       0       0       0       0
	# IO r ops       0       0       0       0       0       0       0       0
	# IO r wait      0       0       0       0       0       0       0       0
	# pages distin 100   1.36k      18      35   25.31   31.70    3.70   24.84
	# EXPLAIN /*!50100 PARTITIONS*/
	select * from friends where user_id = 1234\G

The friends table is defined as:

	CREATE TABLE `friends` (
	  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
	  `user_id` int(10) unsigned NOT NULL,
	  `friend_user_id` int(10) unsigned NOT NULL,
	  `created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
	  `active` tinyint(4) NOT NULL DEFAULT '1',
	  PRIMARY KEY (`id`),
	  UNIQUE KEY `uk_user_id_friend` (`user_id`,`friend_user_id`),
	  KEY `idx_friend` (`friend_user_id`)
	) ENGINE=InnoDB AUTO_INCREMENT=144002 DEFAULT CHARSET=latin1

I built this simple example on a test server. The table easily fits in memory, so there are no disk reads. What matters here is the relationship between "pages distin" and "Rows examine". As you can see, the ratio is close to 1, which means InnoDB reads roughly one distinct page per row it examines: for a given user_id value, the matching rows are scattered all over the primary key B-Tree. We can confirm this by looking at the output of the sample query:

	mysql> select * from friends where user_id = 1234 order by id limit 10;
	+-------+---------+----------------+---------------------+--------+
	| id    | user_id | friend_user_id | created             | active |
	+-------+---------+----------------+---------------------+--------+
	|   257 |    1234 |             43 | 2018-07-19 20:14:47 |      1 |
	|  7400 |    1234 |           1503 | 2018-07-19 20:14:49 |      1 |
	| 13361 |    1234 |            814 | 2018-07-19 20:15:46 |      1 |
	| 13793 |    1234 |            668 | 2018-07-19 20:15:47 |      1 |
	| 14486 |    1234 |           1588 | 2018-07-19 20:15:47 |      1 |
	| 30752 |    1234 |           1938 | 2018-07-19 20:16:27 |      1 |
	| 31502 |    1234 |            733 | 2018-07-19 20:16:28 |      1 |
	| 32987 |    1234 |           1907 | 2018-07-19 20:16:29 |      1 |
	| 35867 |    1234 |           1068 | 2018-07-19 20:16:30 |      1 |
	| 41471 |    1234 |            751 | 2018-07-19 20:16:32 |      1 |
	+-------+---------+----------------+---------------------+--------+
	10 rows in set (0.00 sec)

Consecutive matching rows are often thousands of ID values apart. Although the rows are small, about 30 bytes, an InnoDB page does not contain more than about 500 of them. As the application becomes popular and the number of users grows, the table size grows roughly with the square of the number of users. Once the table far exceeds the InnoDB buffer pool, MySQL starts reading from disk, and worse, with nothing cached, we need one read IOPS per friend. If these queries arrive at an average rate of 300 per second and each user has 100 friends, MySQL needs to access up to 30,000 pages per second. Obviously, this does not fit any long-term plan.

We need to determine all the ways the table is accessed. For that, I use pt-query-digest again, raising the limit on the number of query forms returned (a performance_schema alternative is sketched after the list). Suppose I find that:

  • 93% of accesses use user_id
  • 5% use friend_user_id
  • 2% use id
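
If you prefer to stay inside MySQL rather than parse the slow log, performance_schema keeps similar per-digest counters. This is only a sketch and assumes performance_schema is enabled (the default in recent MySQL versions):

	SELECT DIGEST_TEXT,
	       COUNT_STAR        AS executions,
	       SUM_ROWS_EXAMINED AS rows_examined
	FROM performance_schema.events_statements_summary_by_digest
	WHERE DIGEST_TEXT LIKE '%friends%'
	ORDER BY COUNT_STAR DESC
	LIMIT 10;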

These proportions are quite common. When such a clear access pattern exists, we can do something about it. The friends table represents a many-to-many relationship. With InnoDB, we should define the table like this:

	CREATE TABLE `friends` (
	  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
	  `user_id` int(10) unsigned NOT NULL,
	  `friend_user_id` int(10) unsigned NOT NULL,
	  `created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
	  `active` tinyint(4) NOT NULL DEFAULT '1',
	  PRIMARY KEY (`user_id`,`friend_user_id`),
	  KEY `idx_friend` (`friend_user_id`),
	  KEY `idx_id` (`id`)
	) ENGINE=InnoDB AUTO_INCREMENT=144002 DEFAULT CHARSET=latin1

Now the rows in the primary key B-Tree are grouped and sorted by user_id, but they are inserted in random order. In other words, we slowed down the inserts to benefit the select statements on this table. To insert a row, InnoDB may need one disk read to fetch the page where the new row belongs and one disk write to save it back to disk. We also made the table bigger: InnoDB pages are less full, the secondary indexes are bigger because the primary key they store is larger, and we added a secondary index. There is now less data in the InnoDB buffer pool. Should we panic because less data fits in the buffer pool? No, because now when InnoDB reads a page from disk, it does not get just one matching row but potentially hundreds of them. The number of IOPS is no longer tied to the number of friends multiplied by the select rate; it depends only on the rate of incoming select statements. The impact of not having enough memory to cache the whole table is greatly reduced, as long as the storage can deliver more IOPS than the select statement rate. With the modified table, the relevant rows of the pt-query-digest output are:

	# Attribute    pct   total     min     max     avg     95%  stddev  median
	# ============ === ======= ======= ======= ======= ======= ======= =======
	# Rows examine 100   1.23k      16      34   23.72   30.19    4.19   22.53
	# pages distin 100     111       2       5    2.09    1.96    0.44    1.96

With the new primary key, instead of 30k page reads per second, MySQL only needs to perform about 588 read IOPS (roughly 300 * 1.96). That is a much easier workload to handle. The inserts are more expensive, but if they arrive at a rate of 100 per second, the worst case is 100 read IOPS plus 100 write IOPS (a quick sanity check of this arithmetic follows the list below). This strategy works well whenever there is a clear access pattern; here are some other cases that usually have an obvious access pattern:

  • Game leaderboards (by user)
  • User preferences (by user)
  • Messaging applications (by sender or by receiver)
  • User object or file stores (by user)
  • Likes on items (by item)
  • Comments on items (by item)
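
Here is the sanity check promised above, written as a throwaway SQL expression using the rates assumed in this example (300 selects per second, 100 friends per user, 100 inserts per second):

	SELECT 300 * 100         AS read_iops_old_pk,       -- ~30,000 page reads/s with the id primary key
	       ROUND(300 * 1.96) AS read_iops_new_pk,       -- ~588 page reads/s with (user_id, friend_user_id)
	       100 + 100         AS worst_case_insert_iops; -- 100 inserts/s: up to 100 reads + 100 writes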

What can you do when there is no clear access pattern? One option is a covering index. A covering index must include all the columns the query needs, and the order of the columns matters: the first one must be the value the rows should be grouped by. Another option is to use partitioning to create a hot spot in the data set that is small enough to be cached. In this article we looked at a common strategy for taming read-intensive workloads. The strategy does not always work, since the data must be accessed through a dominant pattern, but when it does and you choose a good InnoDB primary key, you become the hero of the day!
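
As an illustration of the covering-index option, here is a minimal sketch against the original friends table; the index name is hypothetical. It relies on the fact that InnoDB appends the primary key (id) to every secondary index entry, so a query such as "select * from friends where user_id = 1234" can be answered from the index alone:

	-- Hypothetical covering index: it holds user_id, friend_user_id, created and active,
	-- and InnoDB implicitly appends the primary key (id), so no primary key lookups are needed.
	ALTER TABLE friends
	  ADD INDEX idx_friends_covering (user_id, friend_user_id, created, active);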

Links to the original text: https://www.percona.com/blog/2018/07/26/tuning-innodb-primary-keys/
