When Tomcat meets Netty

Story background

Ding ~ ding ~ ding ~, production incident: memory leak!

Yesterday afternoon, I suddenly got a message from the operations team: a system at one of our branch offices had a memory leak in production, so I went over to help look into it.

Troubleshooting process

Step 1: check the logs

The exception log the branch team provided looked roughly like this (since company policy forbids screenshots and sending any information outside, here is a similar error report I found on the Internet):

LEAK: ByteBuf.release() was not called before it's garbage-collected. See http://netty.io/wiki/reference-counted-objects.html for more information.

Recent access records: 

#1: 
    io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:273)
    io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
    io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
    io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:646)
    io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:581)
    io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
    io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460)
    io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
    java.lang.Thread.run(Thread.java:748)

Plain as day: a ByteBuf was not released, which caused a memory leak.

Step 2: check the memory metrics

Now that we knew there was a memory leak, I asked the operations team to look at the memory usage, especially the off-heap (direct) memory usage, since Netty was involved. Their feedback: on-heap memory usage was normal, but off-heap memory stayed high.

OK, at this point we could clearly assert that there was an off-heap memory leak.

At this point the work split into two tracks: one was to swap the gateway for Zuul and run a load test to observe, and the other was to track down the memory leak itself.

Step 3: look at the code

I asked the person in charge of this branch project to show me the code. I opened it, and it was dead simple: a plain Spring Cloud Gateway project containing just two classes, an AuthFilter for permission checks and an XssFilter for XSS protection.

Spring Cloud Gateway runs on Netty, while Zuul 1.x runs on Tomcat. (This article is from the WeChat public account "Tongge reads source code".)

Step 4: initial suspicion

I quickly scanned both classes and spotted ByteBuf-related code in XssFilter, but there was no obvious sign of a ByteBuf not being released. Simple enough: comment out this class first and see whether the leak goes away.

But how do we check whether there is still a memory leak? We can't just delete the class and try it in production.

Step 5: tweak parameters and add monitoring

It's actually quite simple. If you have read Netty's source code, you know that by default Netty implements ByteBuf with pooled direct memory, i.e. PooledDirectByteBuf. So, to make debugging easier, the first step is to turn off pooling.

Direct memory is off-heap memory.

Why turn off pooling?

Because pooling is a form of memory caching: it allocates 16MB at a time and does not release it immediately. With pooling on, it is hard to observe what is actually happening unless you step through very slowly.
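As an aside, that 16MB is the chunk size of Netty's pooled allocator, which is pageSize << maxOrder; with the defaults of the Netty 4.1.x versions of that era (8192 and 11) that works out to 16MB. A quick sketch of my own to check it against your version:

import io.netty.buffer.PooledByteBufAllocator;

public class ChunkSizeDemo {
    public static void main(String[] args) {
        // Chunk size = pageSize << maxOrder; with the old defaults (8192, 11) this prints 16777216, i.e. 16MB.
        int chunkSize = PooledByteBufAllocator.defaultPageSize() << PooledByteBufAllocator.defaultMaxOrder();
        System.out.println(chunkSize);
    }
}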

So, how do we turn off pooling?

In Netty, every ByteBuf is created through a ByteBufAllocator. The ByteBufAllocator interface defines a default allocator; find that default allocator, then find where it is created, and you arrive at the relevant code.

public interface ByteBufAllocator {

    ByteBufAllocator DEFAULT = ByteBufUtil.DEFAULT_ALLOCATOR;
}
public final class ByteBufUtil {

    static final ByteBufAllocator DEFAULT_ALLOCATOR;

    static {
        // Read the allocator type from the io.netty.allocator.type system property (default: pooled, or unpooled on Android)
        String allocType = SystemPropertyUtil.get(
                "io.netty.allocator.type", PlatformDependent.isAndroid() ? "unpooled" : "pooled");
        allocType = allocType.toLowerCase(Locale.US).trim();

        ByteBufAllocator alloc;
        if ("unpooled".equals(allocType)) {
            alloc = UnpooledByteBufAllocator.DEFAULT;
            logger.debug("-Dio.netty.allocator.type: {}", allocType);
        } else if ("pooled".equals(allocType)) {
            alloc = PooledByteBufAllocator.DEFAULT;
            logger.debug("-Dio.netty.allocator.type: {}", allocType);
        } else {
            alloc = PooledByteBufAllocator.DEFAULT;
            logger.debug("-Dio.netty.allocator.type: pooled (unknown: {})", allocType);
        }

        DEFAULT_ALLOCATOR = alloc;
    }

}

As you can see, this is controlled by the io.netty.allocator.type system property.

OK, add this parameter to the JVM startup arguments and set it to unpooled.
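Concretely, that means launching the gateway with something like this (the jar name is just a placeholder):

java -Dio.netty.allocator.type=unpooled -jar gateway.jar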

With pooling turned off, we also need a way to observe in real time whether memory is leaking. What to do?

This is also very simple: Netty's PlatformDependent class keeps a count of all direct memory usage.

Recently I have been digging into Netty's source code, so I am fairly clear on these details. I am still organizing the material; once it is done I will start writing the Netty column.

So we just need a scheduled task that prints this statistic periodically. Here is the code:

import io.netty.util.internal.PlatformDependent;
import org.springframework.stereotype.Component;
import javax.annotation.PostConstruct;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

@Component
public class Metrics {

    @PostConstruct
    public void init() {
        ScheduledExecutorService scheduledExecutorService = Executors.newSingleThreadScheduledExecutor();
        // Print Netty's direct memory counter once per second
        scheduledExecutorService.scheduleAtFixedRate(() -> {
            System.out.println("used direct memory: " + PlatformDependent.usedDirectMemory());
        }, 1, 1, TimeUnit.SECONDS);
    }
}

Just drop it into the same package as the startup class, or a sub-package of it.

With pooling off and monitoring in place, it's time to debug.

Step 6: preliminary debugging

Run the startup class directly and watch the logs:

used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0

At startup, direct memory was normal, staying at 0.

I sent an arbitrary request, which came back 404, and watched the direct memory: no change, still 0. So an arbitrary request is not enough to simulate the problem; it is blocked directly by Spring and never gets to Netty.

Step 7: modify the configuration

Since an arbitrary request simply fails, we have to simulate normal request forwarding. I quickly spun up a Spring Boot project, defined an endpoint in it (sketched after the route config below), and modified the gateway's configuration to forward to it:

spring:
  cloud:
    gateway:
      routes:
      - id: test
        uri: http://localhost:8899/test
        predicates:
        - Path=/test
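The downstream Spring Boot project only needs a trivial endpoint; something like this minimal sketch (the class name and response body are made up, only port 8899 and path /test come from the route above):

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@SpringBootApplication
@RestController
public class DownstreamApplication {

    @GetMapping("/test")
    public String test() {
        return "ok";
    }

    public static void main(String[] args) {
        // Assumes server.port=8899 is set in application.yml to match the route above.
        SpringApplication.run(DownstreamApplication.class, args);
    }
}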

Step 8: debug again

After modifying the configuration, start both projects, the gateway and the Spring Boot service, send a request, and observe the direct memory usage:

used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 1031
used direct memory: 1031
used direct memory: 1031

Sure enough, the memory was not released.

Step 9: delete XssFilter

To verify the earlier suspicion about XssFilter, I deleted it, restarted the project, sent a request, and observed the direct memory again:

used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 1031
used direct memory: 1031
used direct memory: 1031

The problem remained, and the leak was exactly the same size as before.

Here's the thing: Netty decides how much memory to allocate for each read by guessing, and the initial value of this guess is 1024.

@Override
public ByteBuf allocate(ByteBufAllocator alloc) {
    return alloc.ioBuffer(guess());
}
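This guess() comes from the channel's RecvByteBufAllocator handle; with the default AdaptiveRecvByteBufAllocator, the first guess is 1024 bytes. A small sketch of my own (not from the project) that shows the initial allocation size:

import io.netty.buffer.ByteBuf;
import io.netty.buffer.UnpooledByteBufAllocator;
import io.netty.channel.AdaptiveRecvByteBufAllocator;
import io.netty.channel.RecvByteBufAllocator;

public class GuessDemo {
    public static void main(String[] args) {
        // The adaptive allocator starts with a guess of 1024 bytes and then grows or
        // shrinks the guess depending on how much data subsequent reads actually return.
        RecvByteBufAllocator.Handle handle = new AdaptiveRecvByteBufAllocator().newHandle();
        ByteBuf buf = handle.allocate(UnpooledByteBufAllocator.DEFAULT);
        System.out.println(buf.capacity()); // expected: 1024
        buf.release();
    }
}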

Bet you didn't expect Netty to have such a cute side. Ahem, off topic, let's get back to it!

On top of that, there are 7 bytes holding carriage-return and line-feed characters which are deliberately never released:

private static final byte[] ZERO_CRLF_CRLF = { '0', CR, LF, CR, LF };
// 2B
private static final ByteBuf CRLF_BUF = unreleasableBuffer(directBuffer(2).writeByte(CR).writeByte(LF));
// 5B
private static final ByteBuf ZERO_CRLF_CRLF_BUF = unreleasableBuffer(directBuffer(ZERO_CRLF_CRLF.length)
        .writeBytes(ZERO_CRLF_CRLF));
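So the numbers add up exactly: 1024 (the initial read-buffer guess) + 2 (CRLF_BUF) + 5 (ZERO_CRLF_CRLF_BUF) = 1031 bytes, which matches the figure in the monitoring output.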

Interesting. Since the problem isn't in XssFilter, could it be in AuthFilter?

Step 10: delete AuthFilter

So be it: delete AuthFilter, restart the project, send a request, and observe the direct memory:

used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 1031
used direct memory: 1031
used direct memory: 1031

The problem was still there, with the same familiar memory size.

At this point my thinking went off the rails, and what follows is the detour I took.

Step 11: thinking

With XssFilter and AuthFilter removed, only the startup class was left, plus of course the newly added monitoring class.

Could the problem be in Spring Cloud Gateway itself? Hey, it felt like I had discovered a new continent: if I found a bug in Spring Cloud Gateway, I would have something to brag about later.

Since memory was being allocated but never released, the plan was to find the place where the memory gets allocated and set a breakpoint there.

From the earlier analysis we already know the allocator in use is UnpooledByteBufAllocator, so the breakpoint should go in its newDirectBuffer() method, because it is direct memory that is leaking.

Step 12: step-by-step debugging

Following the idea from step 11, I set a breakpoint in UnpooledByteBufAllocator's newDirectBuffer() method, stepped through, and eventually landed in this method:

// io.netty.handler.codec.ByteToMessageDecoder.channelRead
@Override
public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
    if (msg instanceof ByteBuf) {
        CodecOutputList out = CodecOutputList.newInstance();
        try {
            first = cumulation == null;
            // 1. Accumulate the incoming msg (itself a ByteBuf) into cumulation
            cumulation = cumulator.cumulate(ctx.alloc(),
                    first ? Unpooled.EMPTY_BUFFER : cumulation, (ByteBuf) msg);
            // 2. Decode
            callDecode(ctx, cumulation, out);
        } catch (DecoderException e) {
            throw e;
        } catch (Exception e) {
            throw new DecoderException(e);
        } finally {
            if (cumulation != null && !cumulation.isReadable()) {
                numReads = 0;
                // 3. Free memory
                cumulation.release();
                cumulation = null;
            } else if (++ numReads >= discardAfterReads) {
                // We did enough reads already try to discard some bytes so we not risk to see a OOME.
                // See https://github.com/netty/netty/issues/4275
                numReads = 0;
                discardSomeReadBytes();
            }

            int size = out.size();
            firedChannelRead |= out.insertSinceRecycled();
            // 4. Read the remaining values in out
            fireChannelRead(ctx, out, size);
            // 5. Recycle out
            out.recycle();
        }
    } else {
        ctx.fireChannelRead(msg);
    }
}

This took several hours in the middle, especially because whenever I accidentally stepped past something in the channel pipeline I had to start all over again. You really can only take it one step at a time.

This method mainly converts a ByteBuf into a Message; a Message here can simply be understood as a plain Java object. The main logic is annotated in the code above.

As you can see, there is a cumulation.release() there, and that is exactly where the memory should be released. But it wasn't: before this line ran, the reference count of msg (which is the same object as cumulation) was 4, and after release() it was 2, so references were still outstanding and the buffer could not be reclaimed.

After steps 4 and 5 in the code, out gets recycled, but msg still does not. The problem had to be right here.
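For context, Netty's ByteBuf is reference-counted and its memory is only freed when the count drops to 0. A minimal sketch of my own showing how the counting behaves:

import io.netty.buffer.ByteBuf;
import io.netty.buffer.UnpooledByteBufAllocator;

public class RefCntDemo {
    public static void main(String[] args) {
        ByteBuf buf = UnpooledByteBufAllocator.DEFAULT.directBuffer(16);
        System.out.println(buf.refCnt()); // 1: newly allocated
        buf.retain();                     // someone else keeps a reference -> 2
        buf.release();                    // back to 1, direct memory NOT freed yet
        buf.release();                    // 0, direct memory is actually freed
    }
}

So a refCnt that is still 2 after release() means something retained the buffer and never released it.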

I struggled here for a long time, including reading the code inside decode over and over. The object that the unreleased msg had been decoded into was a DefaultHttpContent, which represents an HTTP body; in this case it was the body of the HTTP response being returned.

This also puzzled me. From what I could tell, the HTTP request body never went through this logic, yet I kept hunting for where the request body was handled. After a long time there was still no progress.

By 9:00 p.m. there was nobody left in the office and the lights were off (during the epidemic only a few people from each department could come in each day), so I packed up and went home.

Step 13: the taxi ride home

In the taxi I kept turning the problem over. Replaying the whole process, could it be that I had been looking in the wrong direction?

Spring Cloud Gateway has been around for a long time, and I had never heard of it leaking memory. At this point I began to doubt myself.

No, I had to write a project of my own when I got home and run it on Spring Cloud Gateway.

Step 14: write a project using Spring Cloud Gateway

Back home, I quickly turned on my computer and wrote one project using Spring Cloud Gateway and another using Spring Boot. I added the monitoring, turned off pooling, started both projects, sent a request, and observed the direct memory:

used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0

Nani?! So it clearly isn't a Spring Cloud Gateway problem. Then what is it?

It had to be something about how the company project was set up. But there was nothing else in it: I had deleted every class except the startup class.

Hold on. The pom file.

I opened the jump server, logged in to the company machine, and checked pom.xml. Most of the dependencies were just Spring Boot and Spring Cloud themselves.

But no, there was also a common package written by the division itself. I clicked into it: three jars inside, and one of them leapt out at me, tomcat!!!!

Oh my goodness. At that moment I really wanted to swear out loud. Seriously~~

In fact, back when I deleted AuthFilter I should already have thought about the pom. At the time I was too busy daydreaming about finding a bug in Spring Cloud Gateway, so I went down that hole first.

We know that Spring Cloud Gateway uses Netty as its server to receive requests and then forward them to downstream systems. So what happens when Tomcat is also pulled in here? Now that's interesting.

Step 15: get rid of Tomcat

In the pom file, exclude the Tomcat jar (see the sketch below).
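A typical exclusion looks like the following sketch. Note this is only illustrative: in the real project the Tomcat jar came in through the division's own common package, whose coordinates I can't reproduce here, so spring-boot-starter-web stands in as the offending dependency:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
    <exclusions>
        <exclusion>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-tomcat</artifactId>
        </exclusion>
    </exclusions>
</dependency>

With Tomcat excluded, restart the project, send a request, and observe the direct memory: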

used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0

Ha, no problem anymore. It really was Tomcat.

So how does Tomcat make trouble here? With Tomcat on the classpath the requests are still served: they get forwarded and the responses come back to the client, and internally it is still Netty doing the responding, which is a bit magical.

Step 16: discovering a new world

To verify this, I logged out of the jump server, went back to my own computer, added Tomcat to the pom, and started the project. Sure enough, it started up and ran just fine~~

Were Tomcat and Netty both up, listening on the same port at the same time?

Observe the project startup log:

Connected to the target VM, address: '127.0.0.1:52162', transport: 'socket'

  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::        (v2.2.6.RELEASE)

2020-05-19 08:50:04.448  INFO 7896 --- [           main] com.alan.test.Application                : No active profile set, falling back to default profiles: default
2020-05-19 08:50:04.829  INFO 7896 --- [           main] o.s.cloud.context.scope.GenericScope     : BeanFactory id=082e67ca-d4c7-3a8c-b051-e806722fd225
2020-05-19 08:50:04.998  INFO 7896 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat initialized with port(s): 8080 (http)
2020-05-19 08:50:05.006  INFO 7896 --- [           main] o.apache.catalina.core.StandardService   : Starting service [Tomcat]
2020-05-19 08:50:05.006  INFO 7896 --- [           main] org.apache.catalina.core.StandardEngine  : Starting Servlet engine: [Apache Tomcat/9.0.33]
2020-05-19 08:50:05.859  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [After]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Before]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Between]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Cookie]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Header]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Host]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Method]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Path]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Query]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [ReadBodyPredicateFactory]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [RemoteAddr]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [CloudFoundryRouteService]
2020-05-19 08:50:06.731  INFO 7896 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 8080 (http) with context path ''
2020-05-19 08:50:07.304  INFO 7896 --- [           main] com.alan.test.Application                : Started Application in 4.271 seconds (JVM running for 5.0)

Only Tomcat was started. So how does it hand requests over to Netty for processing?

Step 17: Tomcat -> Netty

If you have studied NIO, you know it has two kinds of socket channels: ServerSocketChannel and SocketChannel. A ServerSocketChannel is created when the service starts and listens for incoming client connections, while a SocketChannel represents an established connection between client and server.

If you have read the NIO source code, you also know that on the server side a SocketChannel is created via the ServerSocketChannel.
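In plain JDK NIO terms, that relationship looks like this minimal sketch (my own, not from the project):

import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class NioAcceptDemo {
    public static void main(String[] args) throws Exception {
        // The listening channel is created once at startup...
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));
        // ...and each accepted client connection is handed back as a SocketChannel.
        SocketChannel client = server.accept(); // blocks until a client connects
        System.out.println("accepted: " + client.getRemoteAddress());
        client.close();
        server.close();
    }
}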

If you have read Netty's source code, you also know that Netty splits these channels by transport into NioXxxChannel, EpollXxxChannel, and so on, and that each transport again has server and client variants such as NioServerSocketChannel and NioSocketChannel.

On Windows, the NioXxxChannel family is used by default. Following the logic above, a NioSocketChannel should be created via a NioServerSocketChannel, and when Netty is used normally that is indeed what happens.
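For reference, "normal" Netty usage looks roughly like this minimal sketch of my own: the server channel is a NioServerSocketChannel, and every accepted connection shows up in the child initializer as a NioSocketChannel:

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.logging.LoggingHandler;

public class NettyServerDemo {
    public static void main(String[] args) throws Exception {
        EventLoopGroup boss = new NioEventLoopGroup(1);
        EventLoopGroup worker = new NioEventLoopGroup();
        try {
            ServerBootstrap b = new ServerBootstrap();
            b.group(boss, worker)
             .channel(NioServerSocketChannel.class)              // the listening channel
             .childHandler(new ChannelInitializer<SocketChannel>() {
                 @Override
                 protected void initChannel(SocketChannel ch) {  // ch is a NioSocketChannel
                     ch.pipeline().addLast(new LoggingHandler());
                 }
             });
            b.bind(8080).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            worker.shutdownGracefully();
        }
    }
}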

Under normal Netty usage, the thread stack at the moment a NioSocketChannel is created shows exactly this path (screenshot omitted).

But our current scenario is Tomcat + Netty. What happens then?

So this time, set a breakpoint in the constructor of NioSocketChannel, send a request, and when the breakpoint is hit, look at the thread stack (reading from bottom to top; screenshot omitted):

As you can see, the call chain goes Tomcat -> Spring -> Reactor -> Reactor Netty -> Netty, and at the end a NioSocketChannel is created.

The situation here is a bit involved; I will analyze it in detail in a later article.

Step 18: memory leak

So from the above, Tomcat ultimately hands request processing over to Netty. But then why does memory leak? That question remains.

After comparing the two setups, I found the problem still lies in the code from step 12. When Netty serves a request normally, fireChannelRead() ends up adding a task to the NioEventLoop, namely MonoSendMany.SendManyInner.AsyncFlush:

final class AsyncFlush implements Runnable {
    @Override
    public void run() {
        if (pending != 0) {
            ctx.flush();
        }
    }
}

This task writes out (flushes) the data sitting in the write buffer once reading and writing are done, and the buffered data is cleaned up in the process; in other words, the client only receives the response after this runs. With Tomcat + Netty, however, this task is never executed, yet the data still reaches the client (my guess is that it goes out over the Tomcat connection rather than the NioSocketChannel itself). That is a leftover question I will study later; right now my head is a bit of a mess.

Summary

Although the code involved in this production incident was quite simple, the investigation dragged on for a long time. A few takeaways:

  1. Don't be too quick to suspect open source frameworks, especially ones as widely used as Spring; suspecting them too readily can easily lead you astray, though that doesn't mean you should never suspect them;
  2. When you can't find the cause of a problem, take a break, relax, and come at it from a different angle;
  3. Why are Tomcat and Netty allowed to coexist in a Spring Cloud Gateway application at all? That is itself a problem; we should raise an issue upstream suggesting that when both are detected, the application should simply refuse to start.

At the moment I am preparing the Netty column; hang in there, folks, and wait for me.

Tags: Programming Netty Tomcat Spring Java
