Set a new default for the Puma thread count #50450
As discussed on Campfire, I've been saying for a long time that 5 threads is way too many for what I consider a well-optimized application (as you mention: no slow queries, no N+1s, no 3rd-party API calls from the request cycle). Overall the choice will be a latency vs. throughput tradeoff, so there won't be a single setting that satisfies everyone, but we can likely find a better compromise than 5 threads per process (which supposes more than 80% I/O, which is nuts).
Totally. Maybe we can get someone from the community to help here. Let's run all the public benchmarks that are available at different ratios: 1:1, 1:2, 1:3. Then see what things look like in terms of latency and throughput. In our testing, 1:1 proved to be the best on both, but that was with HEY and Basecamp. So let's see what might be true with other benchmarks. That marks this issue open for collaboration. If you're interested in helping us find the right default ratio, please post test results here. Thanks for helping!
@noahgibbs worked on this for many years (although his work stopped ~3 years ago after he lost sponsorship). One benchmark of his that I recall showed that throughput on Discourse improved 25% when going from 1 to 5 threads. I would certainly consider Discourse a highly optimized Rails app, maybe one of the most highly optimized in the world, and it still spends 25% of its time waiting on I/O. Noah's result accords pretty well with Amdahl's Law. Amdahl's Law shows that modest throughput gains can be obtained even when only small parts of the program can be done in parallel: for example, a Rails application that spends 30% of its time waiting on I/O will have 18% higher throughput with 2 threads than with 1. Of course, the second thread is the most costly thread of all: now you're multi-threaded, with all the bugs that can cause. But perhaps it is better to keep Rails multi-threaded by default, so that those bugs occur, and get seen and fixed, often and early? What bugs would slip through if Rails became single-threaded by default again? @byroot: I chose the default of 5 in Puma based on the benefit for a Rails app with 25-50% of its time spent in I/O wait (based on my experience looking at 100+ Rails apps' perf dashboards, this covers 80% or more of prod apps). These apps see a 30% to 65% improvement (respectively) in throughput with 5 threads, with roughly a 25% increase in memory usage. If I round the numbers a bit, Noah's benchmark result shows roughly the speedup Amdahl's Law predicts for an application that waits on I/O 25% of the time (in the formula, 25% wait-on-I/O translates to a 25% throughput improvement with 5 threads). If you work back through Amdahl's Law, you can produce the following table, to give you an idea of the tradeoffs here even in a low-I/O app:
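Working Amdahl's Law backwards as described, a table like that can be regenerated with a few lines of Ruby (a sketch; the I/O fractions and thread counts shown are illustrative, not necessarily the original rows):

```ruby
# Amdahl's Law: speedup(n) = 1 / ((1 - p) + p / n), where p is the
# parallelizable fraction (time spent waiting on I/O) and n the thread count.
def speedup(io_fraction, threads)
  1.0 / ((1.0 - io_fraction) + io_fraction / threads)
end

puts "I/O wait | 2 threads | 5 threads"
[0.25, 0.30, 0.50].each do |p|
  gains = [2, 5].map { |n| format("%4.0f%%", (speedup(p, n) - 1) * 100) }
  puts format("%7.0f%% | %9s | %9s", p * 100, *gains)
end
# 25% I/O wait => 14% gain with 2 threads, 25% with 5 (Noah's Discourse number)
# 30% I/O wait => 18% gain with 2 threads, 32% with 5
# 50% I/O wait => 33% gain with 2 threads, 67% with 5
```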
Something doesn't add up though. Even assuming 50% IO wait, that means max throughput would be achieved with 2 threads, not 5.
Imagine you have 10 requests to process. Each request is 0.5 seconds of a db call, followed by 0.5 seconds of CPU burn. 1 thread would take 10 seconds to process all those requests, one after the other. With 2 threads and one process, would it take 5 seconds? If you had 10 threads, you could fire all the db calls at once, wait 0.5 seconds, then do 5 seconds of serial CPU work, processing all requests in 5.5 seconds. That's an 81% speedup. If you had 2 threads, you would have to spend 2.5 seconds doing all the DB calls, and 5 seconds doing CPU work, for a total of 7.5 seconds. 5 threads: 6 seconds. That's where I got 5 threads from: about 80% of the maximum possible benefit for a 50% I/O wait app. Amdahl's Law assumes you could keep adding threads and make the db calls go faster, which is where the analogy breaks down, but I find it's still quite accurate and a useful heuristic here.
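The arithmetic in this model is easy to check mechanically; a sketch under the same assumptions (all requests arrive at once, I/O overlaps up to the thread count, CPU is serialized by the GVL):

```ruby
REQUESTS = 10   # each request: 0.5s of DB I/O, then 0.5s of CPU burn
IO_TIME  = 0.5
CPU_TIME = 0.5

def total_time(threads)
  io_waves = (REQUESTS.to_f / threads).ceil # batches of concurrent DB calls
  io_waves * IO_TIME + REQUESTS * CPU_TIME  # CPU work is serial under the GVL
end

[1, 2, 5, 10].each { |n| puts format("%2d threads: %4.1fs", n, total_time(n)) }
#  1 thread: 10.0s, 2 threads: 7.5s, 5 threads: 6.0s, 10 threads: 5.5s
```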
Yeah, I see what you mean. But I think a bit differently. In my mental model, a process has a capacity of 1s of CPU time per second. So with 5 threads and 50% IO you are queuing 2.5 times as much work as you can chew. Which is fine for handling small spikes, but degrades terribly if you are over capacity. Also, in your simplified model all the requests are received at once and do 0.5s of IO followed by 0.5s of CPU time; in practice requests are received continuously, and IO and CPU are much more intertwined than that. All this to say, you're not wrong, but I think you are focusing too much on the throughput impact, and not enough on the latency impact (which itself impacts throughput negatively). But again, it's of course a tradeoff, and depends on what your priorities are.
Our testing on both HEY and Basecamp showed that when we have a higher thread count, we end up with some requests that really spike in latency. It was much easier to ensure consistent latency with a lower thread count. But actually, now that I think about it, maybe that latency isn't from the threading itself, but from GC'ing? Would be nice to get to the bottom of that. In theory, I dig the idea of a higher thread count, and if we can keep that without the latency swings, that's certainly better. What's the next step to verify these theories?
I understand that for your business a single thread works well, but I think Basecamp and Shopify running a single thread (I think Shopify uses Unicorn's forking model, right?) is a risk for letting new threading bugs emerge unnoticed. I think the default should be at least 2, even if for average apps 1 is better.
It's the same thing. In this context GC is just CPU work; it pauses all threads.
That's my whole point: the higher the thread count, the higher the tail latency. You have to choose your preferred middle ground between maximizing throughput (hence reducing hosting cost) and minimizing latency (hence improving user experience). You can't have both.
That's my recommendation, yes.
You may want to look into @peterzhu2118 / Shopify's https://github.com/shopify/autotuner (presented at RailsWorld :) ). It collects GC data between requests and recommends adjustments to the GC configuration. Aside from that, IMHO when discussing Puma's default thread count it's crucial to acknowledge its impact not just on web requests but also on background jobs, which may have different latency and IO characteristics. People often model things based on Rails/Puma defaults. With this change, should the default connection pool size change as well?
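For flavor, the core idea of collecting GC data between requests can be sketched as a Rack middleware. This is not autotuner's actual code, and `GC.stat(:time)` assumes Ruby 3.1+ with total-time measurement enabled:

```ruby
# A minimal sketch, not autotuner's implementation: diff GC.stat around each
# request to see how many GC runs, and how much GC time, the request paid for.
class GcPerRequest
  def initialize(app)
    @app = app
  end

  def call(env)
    runs_before = GC.stat(:count)
    ms_before   = GC.stat(:time) # cumulative GC time in ms (Ruby 3.1+)
    response    = @app.call(env)
    runs = GC.stat(:count) - runs_before
    warn "GC during request: #{runs} runs, #{GC.stat(:time) - ms_before}ms" if runs > 0
    response
  end
end
```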
Yeah, this discussion is just for web workers. Background job workers are much less latency sensitive by definition, and maximizing throughput at the expense of latency makes sense there.
Yeah, I don't want to regress on thread safety. Who knows, maybe one day the great GVL is going to be gone 😄. But it just seems that 5 is too much for most people, if you care about latency more than throughput, and I think most should? It's easier to deal with throughput using multiple processes than it is to deal with latency. So maybe we just start with a change to 2? Then we can continue to document the factors that might lead someone to change that.
I actually found peak throughput with 6 threads/process at the time for Discourse -- and that benchmark would have been in 2017 or 2018. But yeah, that many threads is usually not great for a production app trying to keep latency low. It was purely optimising for big-batch throughput.
Yeah, for us, the latency tail just got worse and worse the more threads we added. But it would be nice to get scientific about this!
Do we need to line up any other thread pool counts if we make the change from 5 to 2?
I've updated the TechEmpower benchmarks to use 3 threads instead of 5.
On my app at work I found that if I decreased the thread count from 5 to 4, response times improved. I then decreased them from 4 to 3 and they improved again. 3 to 2 did not improve response times, so I left it at 3. For our app, 3 threads results in about twice as fast response times as 5 threads while retaining the same throughput. Our app has fast DB queries and avoids web requests in our controller actions (offloading those to Active Job). We have a JSON API, a bunch of ERB pages, and a CDN in front of our site, so we proxy a lot of our Active Storage images. I know that your ideal thread count highly depends on what you're doing, but I think 1 is too low of a default. 3 has served us well.
@p8 I think whatever benchmarks we run have to include a histogram of individual request latency. That's the main issue here. The original default of 5 was optimized primarily for throughput, not tail latency. So we should find a way to quantify the tail latency.
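A crude way to get that distribution without extra tooling (a sketch; the URL, thread count, and request count are placeholders, and a real run would use a load tool such as wrk or hey):

```ruby
require "net/http"

url = URI("http://localhost:3000/") # placeholder endpoint under test
samples = 20.times.map do
  Thread.new do
    50.times.map do
      t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      Net::HTTP.get_response(url)
      Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
    end
  end
end.flat_map(&:value).sort

# Report percentiles, not averages: tail latency is what hurts users.
{ "p50" => 0.50, "p90" => 0.90, "p99" => 0.99 }.each do |name, q|
  puts format("%s: %6.1fms", name, samples[(samples.size * q).floor] * 1000)
end
```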
I don't think TechEmpower is particularly interesting for this discussion, as it's way too simple to reflect reality: https://github.com/TechEmpower/FrameworkBenchmarks/blob/6e4fa674519771a1833e792d5d69f0043e5bebf3/frameworks/Ruby/rails/app/controllers/hello_world_controller.rb But overall, choosing an application to benchmark is not all; there is also the setup. When doing trivial SQL queries, a local database vs. one on another host on the LAN will make a major difference. So in the end, I think we could make this decision purely based on a synthetic application that can be configured to have a certain IO/CPU ratio.
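A synthetic app like that fits in a one-file rackup; a sketch where `IO_MS` and `CPU_MS` are made-up knobs for the ratio:

```ruby
# config.ru -- synthetic endpoint with a tunable IO/CPU split (a sketch).
# Run with e.g.: IO_MS=50 CPU_MS=50 puma -t 3:3 config.ru
IO_S  = Integer(ENV.fetch("IO_MS", "50")) / 1000.0
CPU_S = Integer(ENV.fetch("CPU_MS", "50")) / 1000.0

run lambda { |_env|
  sleep(IO_S) # simulated I/O: releases the GVL, like a DB or HTTP call
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + CPU_S
  n = 0
  n += 1 while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline # busy-loop holds the GVL
  [200, { "content-type" => "text/plain" }, ["burned #{n} iterations\n"]]
}
```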
The other thing to bear in mind is that the average Rails app doesn't have heavily tuned database queries. From being in some Rails-related Slacks, missing indexes and inefficient queries seem to be the norm. Basecamp preferring 1 thread feels like a real outlier, due to having knowledgeable staff able to make those queries return incredibly quickly. The average Rails app benefits from more threads to offset the inefficient queries. Then as they improve their queries, they need fewer threads, because there's less IO wait.
There's also a huge misunderstanding of Ruby's concurrency model among newcomers. So often in these Slacks devs complain about slow response times. I ask how many threads they're running and sometimes they say numbers as high as 30. They want more throughput, so they crank up the thread count. I'm not sure how to best educate devs on how to tune their thread count (other than recommending Nate Berkopec's courses), but I feel that setting the thread count to 1 will definitely make devs want to crank that number up (rightfully so), without any idea how high it should actually be.
I'm not actually advocating for a default of 1. I'd like to put some documentation into our default puma config explaining some of the very basics here, and at least propose 1 as a setting for folks. But it sounds like we're essentially talking about whether the new default should be 2 or 3. I don't have a strong opinion there, except that 5 isn't it.
Yeah, I think we're pretty much in agreement for either 2 or 3.
That is true, and that's probably also why the popular belief that Ruby performance doesn't matter much because it's all IO anyway is so commonplace. When you start optimizing a Rails application, the biggest problems you uncover are always query (or generally IO) related. But even in applications with such issues, I've never really witnessed enough IO to justify 5 threads. The only use case that would justify it would be an application whose main endpoint is essentially a proxy to another API. I've seen this once or twice, and there, yes, you can crank up the threads (or move to async). But I don't think we should really be optimizing for this kind of situation, and 2 or 3 threads should still be plenty to keep a high enough utilization for apps with a few N+1s and a couple of slow queries. Also, it's really just a default; what's more, it's generated when you create the app, so it's really easy to edit.
I'd personally vote for 3, because on my app 2 threads had identical latency to 3, but 3 had more throughput. I'm very performance minded, so I find it hard to believe that the average app running defaults would have low enough IO to prefer just 2 threads. As you said, N+1 queries are incredibly common (sometimes in loops that become like 5N+1) and a great way to rack up the IO. I don't think the defaults should be optimized for apps doing HTTP requests in a controller action. That should be highly discouraged, as it ruins throughput and is a good way to get DDoSed if the 3rd-party site goes offline (happened to me once many years ago, never again).
It's not that low at all, really. If you assume a 50% IO ratio in the average request, you only need two concurrent requests to saturate one process, so two threads. It of course depends on how much data transformation you are doing at the Ruby layer, but view templating or JSON serialization (plus all the Active Record deserialization) can easily take more time than several decently indexed DB queries. But of course it's all about our personal experiences and perception, as there's only a handful of open source Rails apps out there to look at, and nothing guarantees that they are representative of what is being done in the wild. More importantly, having the source is only one part; you need the whole production setup and traffic pattern to really measure the IO/CPU ratio. So, all this to say that whatever value we choose will be largely based on a guess.
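That saturation argument generalizes to a simple heuristic (a sketch of the arithmetic, not an official formula): each thread is on-CPU `(1 - io)` of the time, so it takes about `1 / (1 - io)` threads to keep one process busy.

```ruby
def threads_to_saturate(io_fraction)
  (1.0 / (1.0 - io_fraction)).ceil
end

[0.25, 0.50, 0.66].each do |io|
  puts format("%2d%% I/O wait -> %d threads", (io * 100).round, threads_to_saturate(io))
end
# 25% -> 2, 50% -> 2, 66% -> 3: even generous I/O estimates land on 2-3 threads.
```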
Maybe we could reach out to someone at AppSignal (cc @thijsc) and get some aggregated metrics on this from a set of commercial Rails projects. WDYT?
If they have the data I'd love to look at it, but we'd need to be very careful about how it's collected, because GVL contention is generally mistaken for IO time by most instrumentation. So apps that use too many threads can appear more IO-heavy than they actually are.
The main change is that the default number of threads is reduced from 5 to 3, as discussed in rails#50450. Pending a potential future "Rails tuning" guide, I tried to include in comments the gist of the tradeoffs involved. I also removed the pidfile except for development. It's useful to prevent booting the server twice there, but I don't think it makes much sense in production, especially [since Puma no longer supports daemonization and instead recommends using a process monitor](https://github.com/puma/puma/blob/99f83c50fbb712b0987667f3533cce4ea925b2da/docs/deployment.md#should-i-daemonize). And it makes even less sense in a PaaS or containerized world.
Alright, I opened #50669 with this change plus a couple of smaller ones and a revamp of the comment that tries to give the gist of the tradeoffs involved. I'll leave it open for a bit in case there is extra feedback, but after that I believe we can finally close this issue.
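For reference, the generated `config/puma.rb` after that PR looks roughly like this (a paraphrase of the template, not a verbatim copy):

```ruby
# config/puma.rb (sketch of the post-#50669 defaults)
# Default to 3 threads: a compromise between latency (favors fewer threads)
# and throughput (favors more). Raise it for I/O-heavy apps, lower it for
# CPU-heavy ones.
threads_count = ENV.fetch("RAILS_MAX_THREADS", 3)
threads threads_count, threads_count

port ENV.fetch("PORT", 3000)

# Allow puma to be restarted by `bin/rails restart`.
plugin :tmp_restart

# Only write a pidfile when requested, e.g. in development.
pidfile ENV["PIDFILE"] if ENV["PIDFILE"]
```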
AFAICT, the benchmark only goes up to 10 concurrent requests (VUs? Please correct me if I'm wrong). You'd have to open up the throttle a bit to see falcon really fly. In any case, I have not made any misleading claims, and falcon is well capable of 5000 fibers if you make 5000 concurrent requests. In fact I gave a live demonstration of 1 million fibers connected to a single Falcon server process; as you can imagine it does start to choke a bit at that level, but it's definitely possible. That being said, coming back to puma, even a thread pool capable of up to a few hundred threads would be good enough. It would also potentially take advantage of the M:N scheduler (which uses the same coroutines that I implemented for fibers), and it also makes it easier for puma to support long-running streaming requests, e.g. WebSockets, without the need for a dedicated thread per connection.
Actually, it's both, since the fiber scheduler mostly avoids GVL contention and explicitly schedules work. I spent a huge effort optimising the request and response handling to ensure minimum latency.
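The fiber model being described can be tried directly with the async gem (a toy sketch, not Falcon's internals):

```ruby
require "async" # same scheduler family that powers falcon

Async do |task|
  # 5000 concurrent "requests": each sleeping fiber yields to the scheduler
  # instead of parking an OS thread, so this finishes in about 1 second.
  5_000.times.map { task.async { sleep 1 } }.each(&:wait)
end
```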
Lengthy discussion on default thread count, web concurrency, and pid files in "Set a new default for the Puma thread count" (rails/rails#50450) and "Update the default Puma configuration" (rails/rails#50669).
This guide explains major concurrency and performance principles for Puma and CRuby, when to possibly not use Puma, and how to do basic load testing for tuning production performance settings. Incorporates comments from Rails issue rails#50450, PR rails#50669 and feedback from Jean Boussier. Co-authored-by: Jean byroot Boussier <jean.boussier+github@shopify.com>
We hit this issue in production a while back. To me it feels like (multi-threaded Puma) Ruby is not prioritizing threads which have IO on them. Alas, as we'd found a fix and this ate all the available time we had, I wasn't able to put more time into the underlying cause. Benchmarking a minimal local reproduction to confirm this was pretty trivial: we run "eat CPU" requests (which simulate the production load of mixed IO- and CPU-bound work, ERB mostly) and simultaneously start up a benchmarking tool making many trivial IO-based requests (networking; we used a Redis ping). The runtime for threads vs. workers indicates something is really not okay with how things are being scheduled. We saw up to 100ms jitter in production Redis/SQL requests and confirmed with our cloud provider via telemetry that actual responses were < 8ms peak. Switching to multi-worker Puma, using a 1:3 thread ratio, and moving to memory-optimized instances solved it.
@lypanov 100ms jitter comes from the thread scheduler in Ruby. https://ivoanjo.me/blog/2023/07/23/understanding-the-ruby-global-vm-lock-by-observing-it/ is a great blog post/presentation about the issue, IIRC.
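A stdlib-only repro of that scheduler-induced jitter (a sketch along the lines of the benchmark described above):

```ruby
# One thread burns CPU while another measures how late its tiny "I/O" waits
# wake up. The burner holds the GVL for its full time slice, so the sleeper's
# wake-ups are delayed far beyond the requested 1ms.
burner = Thread.new { loop { 100_000.times { |i| i * i } } }

worst = 0.0
500.times do
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  sleep 0.001 # stands in for a fast Redis/SQL round trip
  late = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0 - 0.001
  worst = late if late > worst
end
burner.kill
puts format("worst wake-up delay: %.1fms", worst * 1000)
```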
With an I/O heavy application, we do this depending on available memory:
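(The commenter's exact settings aren't shown; a hypothetical sketch of what memory-based sizing can look like, with made-up constants:)

```ruby
# Hypothetical memory-based sizing in config/puma.rb (illustrative only;
# AVAILABLE_MEMORY_MB and the per-worker footprint are assumptions).
available_mb  = Integer(ENV.fetch("AVAILABLE_MEMORY_MB", "1024"))
mb_per_worker = 300 # assumed steady-state footprint of one process

workers [available_mb / mb_per_worker, 1].max
threads 5, 5 # I/O-heavy app, so more threads per worker
```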
Rails 7.2 sets the default max threads to 3 (https://github.com/rails/rails/blob/v7.2.0/railties/lib/rails/generators/rails/app/templates/config/puma.rb.tt#L23). For more details, see the discussion: rails/rails#50450
We currently set the default Puma thread count to 5. That's very high. It assumes you will massively benefit from a bunch of inline 3rd party calls or super slow DB queries. It's not a good default for apps with quick SQL queries and 3rd party calls running via jobs, which is the recommended way to make Rails applications.
At 37signals, we run a 1 worker to 1 thread ratio. That's after extensive testing. It provided the best latency, at some small cost to ultimate throughput. Maybe that's too much or has negative effects on resource-starved systems, like Heroku dynos. But it's clear that a default of 1:5 is not right.
So let's find a way to benchmark our way to a good, new default that works for most people, most of the time.
cc @byroot