Why does the number of Map buckets exceed 8 before turning into a red black tree?

Many people should be confused when they see this problem. Because most of the articles analyze how the linked list is converted into a red black tree, but they do not explain why the conversion action is only done when the length of the linked list is 8. The author's first reaction is the same, only a tentative guess is due to the trade-off between time and space.

To understand this problem, we should first understand why we need to transform. This problem is relatively simple, because the element initialization of bucket in Map is saved by linked list, and its search performance is O(n), while the tree structure can improve the search performance to O(log(n)). When the length of the linked list is very small, even if it is traversed, the speed is very fast. But when the length of the linked list keeps growing, it will definitely have a certain impact on the query performance, so it needs to be converted into a tree. As for why the threshold is 8, I think it should be the most reliable way to find the answer in the source code.

8 this threshold value is defined in HashMap, as shown below. This comment only shows that 8 is the threshold value of bin (bin is the bucket, where the elements with the same hashCode value in HashMap are saved) from the linked list to the tree, but does not explain why it is 8:

 * The bin count threshold for using a tree rather than list for a
 * bin.  Bins are converted to trees when adding an element to a
 * bin with at least this many nodes. The value must be greater
 * than 2 and should be at least 8 to mesh with assumptions in
 * tree removal about conversion back to plain bins upon shrinkage.
static final int TREEIFY_THRESHOLD = 8;

Let's go on to see that there is an Implementation notes in HashMap. The author excerpted several important descriptions. The first paragraph is as follows. It roughly means that when bin becomes large, it will be converted into bin in TreeNodes. Its structure is similar to TreeMap, that is, red black tree:

This map usually acts as a binned (bucketed) hash table, but
when bins get too large, they are transformed into bins of TreeNodes,
each structured similarly to those in java.util.TreeMap

Continue to look down, TreeNodes takes twice as much space as ordinary Nodes, so only when bin contains enough Nodes will it be converted to TreeNodes, and whether it is enough is determined by the value of treeify? Treehold. When the number of Nodes in Bin becomes less, it will be converted to normal bin. And when we check the source code, we find that when the length of the list reaches 8, it turns into a red black tree, and when the length drops to 6, it turns into a normal bin.

This explains why it is not converted to TreeNodes at the beginning, but requires a certain number of nodes to convert to TreeNodes. To put it bluntly, it is trade-off, a trade-off between space and time:

Because TreeNodes are about twice the size of regular nodes, we
use them only when bins contain enough nodes to warrant use
(see TREEIFY_THRESHOLD). And when they become too small (due to
removal or resizing) they are converted back to plain bins.  In
usages with well-distributed user hashCodes, tree bins are
rarely used.  Ideally, under random hashCodes, the frequency of
nodes in bins follows a Poisson distribution
(http://en.wikipedia.org/wiki/Poisson_distribution) with a
parameter of about 0.5 on average for the default resizing
threshold of 0.75, although with a large variance because of
resizing granularity. Ignoring variance, the expected
occurrences of list size k are (exp(-0.5)*pow(0.5, k)/factorial(k)). 
The first values are:
0:    0.60653066
1:    0.30326533
2:    0.07581633
3:    0.01263606
4:    0.00157952
5:    0.00015795
6:    0.00001316
7:    0.00000094
8:    0.00000006
more: less than 1 in ten million

When the hashCode is very discrete, the probability of using tree bin is very small, because the data is evenly distributed in each bin, and almost no bin chain length will reach the threshold. However, under the random hashCode, the discreteness may be poor. However, JDK can not prevent users from implementing this bad hash algorithm, so it may lead to uneven data distribution. However, in an ideal case, the distribution frequency of all bin nodes in the random hashCode algorithm will follow the Poisson distribution. We can see that the probability of a bin chain length reaching 8 elements is 0.00000006, which is almost impossible. Therefore, the reason why we choose 8 is not to pat the bottom, but to make a decision based on probability and statistics. It can be seen that every change and optimization of Java developed for 30 years is very rigorous and scientific.

Answers found elsewhere:

The average search length of the red black tree is log(n), if the length is 8, the average search length is log(8)=3, the average search length of the linked list is n/2, when the length is 8, the average search length is 8 / 2 = 4, which is necessary to convert into a tree; if the length of the linked list is less than or equal to 6, 6 / 2 = 3, and log(6)=2.6, although the speed is very fast, the time to convert into a tree structure and generate a tree is not It will be too short.

Published 20 original articles, won praise 17, visited 20000+
Private letter follow

Tags: less Java JDK

Posted on Sun, 15 Mar 2020 23:43:54 -0700 by sebthib55