64830 – HTTP2 : GOAWAY sent with Protocol Error and Frame Size Error
Bug 64830 - HTTP2 : GOAWAY sent with Protocol Error and Frame Size Error
Summary: HTTP2 : GOAWAY sent with Protocol Error and Frame Size Error
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 9
Classification: Unclassified
Component: Catalina (show other bugs)
Version: 9.0.39
Hardware: HP Linux
: P2 regression (vote)
Target Milestone: -----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
: 64828 64829 (view as bug list)
Depends on:
Blocks:
 
Reported: 2020-10-20 06:05 UTC by Arshiya
Modified: 2021-01-22 16:53 UTC (History)
0 users



Attachments
FRAME_SIZE_ERROR PCAP (164.63 KB, application/vnd.tcpdump.pcap)
2020-10-20 06:06 UTC, Arshiya
Details
PROTOCOL ERROR (55.95 KB, application/vnd.tcpdump.pcap)
2020-10-20 06:07 UTC, Arshiya
Details
JMX file to reproduce issue (99.74 KB, application/xml)
2020-10-20 06:08 UTC, Arshiya
Details
Sample application to reproduce the issue (492.82 KB, application/x-zip-compressed)
2020-10-20 06:10 UTC, Arshiya
Details
PCAP with Goaway (167.97 KB, application/vnd.tcpdump.pcap)
2020-12-11 04:26 UTC, Arshiya
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Arshiya 2020-10-20 06:05:00 UTC
Sub-Component - Coyote

OS : Redhat Linux

Overview:

Embedded Tomcat version 9.0.39 is implemented to transport http/2 packets between 2 systems (h2c connection).
Http2 packets of payload size 55KB sent from the client are processed asynchronously .
The content-type of request/response is application/json. For few requests Tomcat sends GOAWAY with reason that seems irrelevant .

1)GOAWAY with PROTOCOL ERROR : The content length header value [55294] does not agree with the size of the data received [55295]
Where, both the content length header value and the size of the data received is 55295 bytes  
2)GOAWAY with FRAME_SIZE_ERROR : The payload is [6386210] bytes long but the max frame size is [16384]
Where , the payload size is only around 55 KB
Please find attached the PCAPs for reference.

We have tried and reproduced these exceptions with a sample code implementing embedded tomcat version 9.0.39 with payload size of around 55KB . Please find below the steps to reproduce. I have attached the sample code / input JMX file to reproduce the issue. 
 
Steps to Reproduce:

With JMeter as simulation client, on configuring 700 threads (700 connections) to connect towards Tomcat Server 9.0.39 embedded in our system and on sending 20 requests per second with payload of 55KB (same request with just one json value sent uniquely via the Random number generator)and on letting the test run in an infinite loop , few requests are getting timed out . On analysing the PCAP for the particular request we see that tomcat sends GOAWAY with  PROTOCOL ERROR / FRAME_SIZE_ERROR where the reason seems to be incorrect.

Client:
JMeter 5.3 with additional HTTP2 sampler.
No of threads: 700
Ramp-up period:10 seconds	
Loop:Infinite
Payload size: around 55KB
Constant Throughput Timer added to limit the tps to 20.
Random Variable Generator added to the JSON request to uniquely identify for which request the exception is printed and to map it in the PCAP collected .
Response Timeout : 5000 ms

SERVER:
Sample application attached .
MaxThreads configured in tomcat is 200(all other parameters are the tomcat defaults).
The input requests are processed asynchronously with 40 threads.

Build:
Embedded tomcat 9.0.39

Resolution/Clarification requested:
Can you please explain the root cause for this . 

Thanks in advance.
Comment 1 Arshiya 2020-10-20 06:06:01 UTC
Created attachment 37513 [details]
FRAME_SIZE_ERROR PCAP
Comment 2 Arshiya 2020-10-20 06:07:33 UTC
Created attachment 37514 [details]
PROTOCOL ERROR
Comment 3 Arshiya 2020-10-20 06:08:37 UTC
Created attachment 37515 [details]
JMX file to reproduce issue
Comment 4 Arshiya 2020-10-20 06:10:37 UTC
Created attachment 37516 [details]
Sample application to reproduce the issue
Comment 5 Arshiya 2020-10-22 04:36:37 UTC
Any update on this please !
Comment 6 Remy Maucherat 2020-10-22 06:53:39 UTC
Pinging for updates on BZs is not allowed. This is not the first time you are doing this.
Comment 7 Remy Maucherat 2020-10-23 12:59:10 UTC
*** Bug 64829 has been marked as a duplicate of this bug. ***
Comment 8 Remy Maucherat 2020-10-23 13:24:57 UTC
*** Bug 64828 has been marked as a duplicate of this bug. ***
Comment 9 Mark Thomas 2020-10-29 17:25:20 UTC
What hardware are you running this on? I've been running the test case locally for 15+ mins and I haven't seen a single error reported. Details are hardware used, memory allocated to Tomcat and JMeter, network between client and server would likely be helpful. If we can't recreate the issue, we can't debug it.
Comment 10 Mark Thomas 2020-10-29 19:17:22 UTC
Moving to NEEDINFO as more information about the provided test case is required to reproduce the issue.
Comment 11 Arshiya 2020-11-05 05:12:36 UTC
Mark ,
In production we see the issue reported in this bug and bug 64828 on Linux.
I had reproduced the issue with the attached source code on my local windows machine (Windows 10 - 64 bit / 16 GB RAM).
JMeter version :5.3 with Xmx:5g
Embedded tomcat application :Default Xmx

Thanks in advance !!
Comment 12 Mark Thomas 2020-11-05 11:42:55 UTC
Thanks. I can try testing on a similar spec machine (although it will be a VM).

How long does the test case have to be running before you start to see errors?
Comment 13 Arshiya 2020-11-05 16:30:21 UTC
Thanks Mark .
Within 15 to 20 minutes of running the case I was able to see the errors , Please let us know if you need more information .
Comment 14 Mark Thomas 2020-11-09 19:15:28 UTC
Success. This was trivial to recreate on Windows. Must be a timing thing. Fix on the way.
Comment 15 Mark Thomas 2020-11-09 19:29:04 UTC
Fixed in:
- 10.0.x for 10.0.0-M10 onwards
- 9.0.x for 9.0.40 onwards
- 8.5.x for 8.5.60 onwards
Comment 16 Arshiya 2020-11-10 04:20:42 UTC
Thank you soo much Mark for the fix ..

One clarification please, Were you able to reproduce and fix the payload related issue reported in bug 64828 as well ?

Thanks in advance !
Comment 17 Mark Thomas 2020-11-10 09:33:23 UTC
I didn't see any payload errors reported. The other errors occurred after a few seconds on Windows so I fixed those then ran a longer test (20 mins) where no errors where observed.

I don't immediately see how the issues I fixed could led to the payload errors described. It would be informative if you were to test the latest 9.0.x code and report back.
Comment 18 Arshiya 2020-12-11 04:24:16 UTC
Hi Mark , 
I tried running the same test program with the latest tomcat 9.0.41 jars on windows machine .

1.)I am still able to see the incomplete payload related exceptions and GOAWAY (PROTOCOL_ERROR and FRAME_SIZE_ERROR) .

The test (payload size 55KB) ran for about 8 minutes where few requests had timed-out (could see in JMeter GUI). On filtering the requests in the PCAP captured with the unique identifier , I was able to find the trace for 3 requests with the below reason for GOAWAY (GOAWAY.pcap attached) 
 *) FRAME_SIZE_ERROR:The payload is [2128653] bytes long but the maximum frame size is [16384]
 *) PROTOCOL_ERROR  :Connection [7004], Stream [1], The content length header value [56,465] does not agree with the size of the data received [56,466]
 *) PROTOCOL_ERROR  :Connection [7092], Stream [1], The content length header value [56,466] does not agree with the size of the data received [56,466]


Specs:
Windows 10 
Processor:Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz 1.90GHz
RAM:16 GB
System Type : 64 bit

Please let me know in case of further inputs.
Thanks in advance!
Comment 19 Arshiya 2020-12-11 04:26:35 UTC
Created attachment 37600 [details]
PCAP with Goaway
Comment 20 Mark Thomas 2020-12-11 17:45:25 UTC
Looks like there is still an issue related to the HPACK decoder. I can see a debug log where the decoder reports a content-length that is not consistent with the value shown in the Wireshark trace. Still trying to figure out how this is happening.
Comment 21 Mark Thomas 2021-01-14 17:08:44 UTC
Found it. It wasn't in the HPACK decoder at all.

I've applied a patch and I can no longer get the test case to fail. Could you retest with a new 9.0.x build?
Comment 22 Mark Thomas 2021-01-22 16:53:17 UTC
I am going to assume that the recent commit fixed this. If this is not the case please do re-open but I think we'd need a new test case in that case as I can no longer trigger any errors with this one.