ThreadDump Analysis Notes (2): Reading the Stack

Recently I was tracking down and analyzing some system problems. While going through old notes and books, I stumbled across what I had written before, ThreadDump Analysis Notes (1). Ha ha, OK, time to make up part two while I'm at it. The thread stack is data we rely on constantly for troubleshooting, and it is highly valuable, but a raw dump is long-winded and messy. Last time I covered the basic concepts; today let's talk about how to actually read one.

##0x01 What can I do with the thread stack

The thread stack mainly reflects what each thread in the system is doing at this moment. It can be analyzed from several perspectives:

Per-thread stack information (the detailed stack of each individual thread)

It records the call hierarchy of each thread: which method the thread is currently executing, what state the thread is in, which locks it currently holds, and which locks it is waiting for.

Stack statistics (an aggregate view over all the thread stacks)

From these you can see whether there is heavy lock contention or a deadlock, what state most threads are in, and the total number of threads.

Comparison of multiple stack dumps

The advantage of taking multiple dumps is that they can be compared; it is often only through this comparison that a problem becomes visible.

For example, if a thread appears in the same context (executing the same piece of code) in every dump you take, you should consider whether that is reasonable or whether it points to a bottleneck.

Likewise, if a thread is waiting for the same lock across multiple dumps, you need to find out why that lock is never released.
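
As an aside, the same information a jstack dump contains (per-thread state, stack frames, and lock ownership) can also be read from inside the JVM through the standard java.lang.management API. A minimal sketch (the class name is mine):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Minimal sketch: print each thread's name, state, the lock it is waiting
    // on (if any), and its stack frames - the raw material of stack analysis.
    public class StackSnapshot {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            // lockedMonitors=true, lockedSynchronizers=true: include lock ownership details
            for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
                System.out.printf("\"%s\" state=%s waitingOn=%s%n",
                        info.getThreadName(), info.getThreadState(), info.getLockName());
                for (StackTraceElement frame : info.getStackTrace()) {
                    System.out.println("    at " + frame);
                }
            }
        }
    }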

The approach to analyzing problems with thread stacks is always much the same. Below we work through some common cases, which basically cover the phenomena that show up frequently in production environments.

##0x02 Infinite loop analysis (comparing multiple dumps)

A bug in the code can cause an infinite loop, which drives the CPU up significantly. In that case you can hunt for the suspicious thread using the multiple-dump comparison described above (a minimal example of this kind of bug is sketched after the list). The specific steps are as follows:

  • Run jstack PID > pid.dump to export the first thread stack file
  • Wait tens of seconds, then run jstack again and export a second dump file
  • Compare the two files and exclude threads in the WAITING and SLEEPING states, since those states do not consume CPU
  • Find the threads that stayed active over this period. If the same thread is in the same context in both dumps, it should be the focus of troubleshooting: check against the code whether the context the thread is executing really belongs to something that should run for a long time.
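
As a made-up illustration, here is the kind of bug this comparison catches: a thread that busy-waits instead of blocking stays RUNNABLE in the same frame in every dump and keeps one core pegged.

    // Hypothetical busy-wait bug: the worker never blocks, so consecutive
    // jstack dumps show it RUNNABLE in the same context while it burns a core.
    public class BusySpinBug {
        private static volatile boolean done = false;

        public static void main(String[] args) {
            new Thread(() -> {
                while (!done) {
                    // BUG: no wait()/sleep()/poll timeout here - the loop just spins
                }
            }, "spin-worker").start();
            // nothing ever sets done = true, so the spin never ends
        }
    }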

If the comparison turns up no suspicious thread, consider whether the high CPU comes from unreasonable memory settings instead: for example, frequent full GC because the Metaspace size was never configured, or frequent GC because the young generation is too small and large numbers of objects get promoted prematurely. Either way, the GC itself drives the CPU up.
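
Before reaching for jstat or the GC logs, a quick in-process check of GC activity can help rule this in or out. A rough sketch using the standard GarbageCollectorMXBean API (sample twice and compare the deltas):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Rough sketch: if collection counts/times climb sharply between the two
    // samples, frequent GC (e.g. an undersized Metaspace or young generation)
    // is the likely source of the CPU load rather than application threads.
    public class GcActivityCheck {
        public static void main(String[] args) throws InterruptedException {
            printGcStats("before");
            Thread.sleep(10_000);   // sample window
            printGcStats("after");
        }

        private static void printGcStats(String label) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("[%s] %s: collections=%d totalTimeMs=%d%n",
                        label, gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }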

##0x03 CPU consumption analysis

An infinite loop that drives the CPU up is a bug, but in most cases it is heavy, legitimate CPU-intensive calculation that pushes the CPU up. These problems are a bit less convenient to analyze, because the thread stack alone is not enough; we also need a few small Linux commands.

Our application runs on Linux as a process, and each process contains many threads; what we work with in Java are those threads. On Linux we can use the top command to check CPU percentages. The key fact is that the native threads on Linux (identified by their LWP id) map one-to-one to Java threads. top displays the thread ID in decimal, while the nid field in the Java thread dump is the same ID in hexadecimal. (The underlying concepts of kernel-level threads (KLT), lightweight processes (LWP), and user threads (UT) are worth reading up on separately.)

The specific analysis steps are as follows:

  • Run top -Hp PID to get statistics for all threads in the current process. The PID column in this view is the thread ID (LWP id), and it corresponds to the nid (native thread id) in the Java thread dump, except that nid is in hexadecimal. For example, PID=221 corresponds to nid=0xdd; you can convert the ID with printf "%x\n" 221, or with the small Java helper sketched after this list.

  • Search the Java thread dump for nid=0xdd
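
If printf is not handy, the same conversion is a one-liner in Java (the class name is mine):

    // Convert the decimal LWP id shown by top into the hexadecimal form used
    // by the nid= field of a jstack dump, e.g. "java NidConverter 221" prints nid=0xdd.
    public class NidConverter {
        public static void main(String[] args) {
            long lwpId = Long.parseLong(args[0]);
            System.out.println("nid=0x" + Long.toHexString(lwpId));
        }
    }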

If the thread found for that nid is executing Java code, the high CPU is caused by Java code, for example:

    "Thread-444" prio=1 tid=0xa4853568 nid=0xdd runnable [0xafcf7000..0xafcf8680] 
        //The code currently executing is pure Java code
         at org.apache.commons.collections.ReferenceMap.getEntry(Unknown Source) 
        at org.apache.commons.collections.ReferenceMap.get(Unknown Source)

If the thread found is executing native code, the code causing the high CPU is inside a JNI call, for example:

    "Thread-609" prio=5 tid=0x01583d88 nid=0xdd runnable [7a680000..7a6819c0] 
        //CheckLicense is the Native method, indicating that the problem code that causes the CPU to be too high is in the local code.
         at meetingmgr.conferencemgr.Operation.CheckLicense(Native method) 
        at meetingmgr.MeetingAdapter.prolongMeeting(MeetingAdapter.java:171)    

This approach usually pins down the problem in one shot: whatever is driving the CPU up can be found this way. It is also cheap to run, which makes it very suitable for production environments.

##0x04 Resource shortage analysis

A shortage of a system resource is like a stretch of highway narrowing into a single-lane bridge: no matter how wide the rest of the road is, it doesn't help. This is the bucket (weakest-link) effect. For example, if a service provider in a Dubbo call chain doesn't have enough database connections, the threads requesting a connection contend for it and end up blocked or suspended; with a poor system design this can even snowball into a system-wide avalanche. In this sense, resource shortage and system bottleneck are the same kind of problem.

When a system resource is insufficient, a large number of threads end up waiting for it, and a thread dump shows this signature clearly: a large number of threads stopped on the same method call. That resource is then the bottleneck of the system.

For example, in the dump below a large number of http-8082-Processor threads are all stopped on org.apache.commons.pool.impl.GenericObjectPool.borrowObject, which means they are all waiting for a pooled resource (here, a database connection). That resource is the bottleneck of the system.

 http-8082-Processor84" daemon prio=10 tid=0x0887c000 nid=0x5663 in Object.wait() 
       java.lang.Thread.State: WAITING (on object monitor)
       at java.lang.Object.wait(Object.java:485) 
       at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(Unknown Source)
       - locked <0x75132118> (a org.apache.commons.dbcp.AbandonedObjectPool)
       at org.apache.commons.dbcp.AbandonedObjectPool.borrowObject()
       - locked <0x75132118> (a org.apache.commons.dbcp.AbandonedObjectPool) 
       at org.apache.commons.dbcp.PoolingDataSource.getConnection()
       at org.apache.commons.dbcp.BasicDataSource.getConnection(BasicDataSource.java:312)  
      at dbAccess.FailSafeConnectionPool.getConnection(FailSafeConnectionPool.java:162) 
      at servlets.ControllerServlet.doGet(ObisControllerServlet.java:93)
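
The pattern is easy to reproduce with any bounded resource. A toy sketch (class and thread names are made up) that parks dozens of threads on the same acquire call, producing exactly this kind of dump:

    import java.util.concurrent.Semaphore;

    // Toy reproduction of a resource bottleneck: 50 workers compete for a
    // "pool" of only 2 permits, so a jstack dump shows ~48 threads WAITING in
    // the same frame inside Semaphore.acquire().
    public class TinyPoolDemo {
        private static final Semaphore pool = new Semaphore(2);

        public static void main(String[] args) {
            for (int i = 0; i < 50; i++) {
                new Thread(() -> {
                    try {
                        pool.acquire();                 // most threads block right here
                        try {
                            Thread.sleep(60_000);       // simulate a slow query holding the resource
                        } finally {
                            pool.release();
                        }
                    } catch (InterruptedException ignored) {
                    }
                }, "http-worker-" + i).start();
            }
        }
    }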

Combined with the stack information, the causes of resource shortage usually fall into the following categories:

  • The connection pool is simply configured too small

  • A sudden increase in the system's concurrency exhausts the resource

  • The resource is held for too long before being released, e.g. a badly written SQL statement misses the index (or there is no index at all), so each database access is slow and connections stay tied up

  • After use, on some exceptional code path the resource is never closed or returned to the pool, so available resources leak away or shrink, which in turn causes contention (see the sketch after this list)
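
For that last cause, the usual defence is to guarantee the return of the resource on every code path, for instance with try-with-resources. A sketch (the class name and query are made up):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import javax.sql.DataSource;

    // Sketch: try-with-resources closes the ResultSet, Statement and Connection
    // even when an exception is thrown, so the connection always goes back to
    // the pool and cannot leak.
    public class SafeQuery {
        private final DataSource dataSource;   // hypothetical pool-backed DataSource

        public SafeQuery(DataSource dataSource) {
            this.dataSource = dataSource;
        }

        public int countOrders() throws SQLException {
            try (Connection conn = dataSource.getConnection();
                 PreparedStatement ps = conn.prepareStatement("select count(*) from orders");
                 ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getInt(1);
            }
        }
    }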

From a performance point of view, resource shortage is a performance problem: it tends to slow the whole system down and lead to abnormally long responses.

##0x05 Lock chain analysis

First, a reminder: two or more deadlocked threads do not consume CPU. Only an infinite loop whose body is CPU-intensive can push a core to 100%; I/O operations such as socket or database calls consume very little CPU.

A lock chain is not a deadlock. Sometimes a dump shows many threads waiting for different locks, and the contention on one lock is in fact caused by contention on another lock object; in that case we have to trace back to the root cause. In the following stack, more than 40 threads are waiting for lock 0xbef17078 and more than 10 threads are waiting for lock 0xbc7b4110.

"Thread-1021" prio=5 tid=0x0164eac0 nid=0x41e waiting for monitor entry[...] 
at meetingmgr.timer.OnMeetingExec.monitorExOverNotify(OnMeetingExec.java:262)
- waiting to lock <0xbef17078> (a [B) //Wait for lock 0xbef17078  
at meetingmgr.timer.OnMeetingExec.execute(OnMeetingExec.java:189)  
at util.threadpool.RunnableWrapper.run(RunnableWrapper.java:131)  
at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(...)  
at java.lang.Thread.run(Thread.java:534)  

"Thread-196" prio=5 tid=0x01054830 nid=0xe1 waiting for monitor entry[...]  
at meetingmgr.conferencemgr.Operation.prolongResource(Operation.java:474)  
- waiting to lock <0xbc7b4110> (a [B) //Wait for lock 0xbc7b4110  
at meetingmgr.MeetingAdapter.prolongMeeting(MeetingAdapter.java:171)  
at meetingmgr.FacadeForCallBean.applyProlongMeeting(FacadeFroCallBean.java:190) 
at meetingmgr.timer.OnMeetingExec.monitorExOverNotify(OnMeetingExec.java:278) 
- locked <0xbef17078> (a [B) //Possession lock 0xbef17078  
at meetingmgr.timer.OnMeetingExec.execute(OnMeetingExec.java:189) 
at util.threadpool.RunnableWrapper.run(RunnableWrapper.java:131)  
at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(...)  
at java.lang.Thread.run(Thread.java:534)  


"Thread-609" prio=5 tid=0x01583d88 nid=0x280 runnable [7a680000..7a6819c0] 
at java.net.SocketInputStream.socketRead0(Native method)  
at oracle.jdbc.ttc7.Oall7.recieve(Oall7.java:369) 
at net.sf.hiberante.impl.QueryImpl.list(QueryImpl.java:39) | 
at meetingmgr.conferencemgr.Operation.prolongResource(Operation.java:481) 
- locked <0xbc7b4110> (a [B) //Possession lock 0xbc7b4110  
at meetingmgr.MeetingAdapter.prolongMeeting(MeetingAdapter.java:171) 
at meetingmgr.timer.OnMeetingExec.execute(OnMeetingExec.java:189) 
at util.threadpool.RunnableWrapper.run(RunnableWrapper.java:131) 
at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(...)
at java.lang.Thread.run(Thread.java:534)
  1. More than 40 threads are waiting for lock 0xbef17078, so first find the thread that currently holds that lock: "Thread-196".

  2. "Thread-196" holds lock 0xbef17078 but is itself waiting for lock 0xbc7b4110, so next find the thread that holds that one: "Thread-609".

  3. "Thread-609", the owner of lock 0xbc7b4110, is therefore the root of the problem. The next step is to find out why it holds the lock for so long; typically the code it executes while holding the lock performs poorly (here it is blocked in a JDBC socket read), so the lock is held too long. A minimal reproduction of this kind of chain is sketched below.
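
The chain is easy to reproduce. A minimal sketch (class, thread and lock names are all made up) in which the thread holding lockB is slow, a second thread holds lockA while waiting for lockB, and everyone else piles up on lockA:

    import java.util.concurrent.TimeUnit;

    // Minimal lock-chain reproduction: ~40 threads wait for lockA, the holder
    // of lockA waits for lockB, and the slow holder of lockB is the root cause.
    public class LockChainDemo {
        private static final Object lockA = new Object();   // plays the role of 0xbef17078
        private static final Object lockB = new Object();   // plays the role of 0xbc7b4110

        public static void main(String[] args) {
            // Root of the chain: holds lockB for a long time
            // (the sleep stands in for a slow database call).
            new Thread(() -> {
                synchronized (lockB) {
                    sleepSeconds(60);
                }
            }, "slow-holder-of-B").start();

            sleepSeconds(1);   // let the slow thread grab lockB first

            // Middle of the chain: holds lockA, then blocks waiting for lockB.
            new Thread(() -> {
                synchronized (lockA) {
                    synchronized (lockB) {
                        // not reached until lockB is released
                    }
                }
            }, "holder-of-A-waiting-for-B").start();

            sleepSeconds(1);

            // The visible symptom: many threads waiting for lockA.
            for (int i = 0; i < 40; i++) {
                new Thread(() -> {
                    synchronized (lockA) { /* short critical section */ }
                }, "waiter-" + i).start();
            }
            // A jstack taken now shows the three layers of the chain described above.
        }

        private static void sleepSeconds(long seconds) {
            try {
                TimeUnit.SECONDS.sleep(seconds);
            } catch (InterruptedException ignored) {
            }
        }
    }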

##0x06 What the thread stack cannot tell you

Thread stack analysis can only locate problems that leave a trace on the current threads, such as deadlocks, hangs, and other performance problems caused by locks. For problems that leave no trace it is of little help, for example:

  • Thread leaks: why a thread was never properly reclaimed
  • Concurrency (race-condition) bugs: they leave no trace in the thread stack, so the stack cannot help
  • A database locking a table because a transaction was never committed or rolled back: this information does not show up in the stack either, so stack analysis does not help

