-
Firstly, thanks for the great project! We are trying to move from Redis (KeyDB, to be more precise) to Dragonfly. We use Redis for 4 different purposes, one of which is Sidekiq. We moved to Dragonfly for all of them, and everything looked great; the improvement in resource usage was significant (12 CPUs on KeyDB vs 3 CPUs on Dragonfly). But then we ran into problems with Sidekiq: every once in a while Dragonfly started to drain memory (first showing OOM errors in Lua scripts, then stopping delivery of jobs to Sidekiq, and finally rejecting connections altogether), and only a restart could fix it. The dashboard during such an incident looks like this: INFO output in this situation, at 17:43:
I should note that we are using [...]. I started to investigate, and it looks like the root cause is the number of queues we have. We have about 20 queues with very uneven load (about 5 queues carry more than 80% of the traffic), plus more than 190 sidekiq-alive queues (one per Sidekiq instance), each of which only processes a single job from time to time. When the problem begins, I see continuous growth in scheduled and enqueued jobs, most of which are SidekiqAlive jobs, as our Sidekiq dashboard shows: Then Dragonfly starts writing these warnings to the log:
As I understand it, at this point the Sidekiq workers stop receiving new jobs, so the queues and memory consumption keep growing, leading to OOM errors: first in scripts
and then on other client commands and connections. My best guess right now is that something goes wrong when the number of queues grows (sidekiq-alive queues can appear and disappear as we scale), and then everything falls apart. Our Dragonfly configuration follows the recommendations in your docs (we tried increasing maxmemory and the number of threads, but that affects this case unpredictably and ultimately ends the same way):
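For context, a minimal sketch of the kind of startup flags involved; the values below are illustrative assumptions, not our exact production settings:

```sh
# Illustrative Dragonfly startup flags (values are assumptions, not our real config):
# --proactor_threads controls the number of worker threads,
# --maxmemory is the limit after which OOM errors are returned to clients.
dragonfly --proactor_threads=8 --maxmemory=12gb --port=6379
```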
For now we had to roll back to KeyDB for Sidekiq (we didn't have these problems there), and I'm sharing our situation here to ask for advice on how to handle our number of queues and the uneven load on them. Thanks in advance!
-
Kudos on providing all the info, this is probably the first time I've seen almost all the data needed to help with the issue :) A few comments:
I will provide more hints once you are able to fix these issues.
-
Ok, the latencies are really high. You can further increase `--interpreter_per_thread` until the `lua_blocked_total` statistic stops increasing heavily, but besides that I can not help much. This should improve latency, IMHO. It seems that your Lua scripts touch multiple hashtags that are spread across multiple threads, but I do not know why, or whether this can be fixed.
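For example, one way to keep an eye on that counter while tuning is to poll the server's INFO output; this is a rough sketch, assuming the `lua_blocked_total` statistic is exposed there in your Dragonfly version and that the server listens on the default port:

```sh
# Poll Dragonfly's INFO output and watch the Lua-related counters
# (assumes lua_blocked_total shows up in INFO on your Dragonfly version).
while true; do
  redis-cli -h 127.0.0.1 -p 6379 INFO | grep -i lua
  sleep 5
done
```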
The line `tx_with_freq:4887835546,13646581,1077753,277821242,23,57,52,2114309,0,4,4,0` shows that you have lots of multi-threaded transactions. These can have high latency, and if they are contended on the same queue, the latency starts to accumulate. As I said, that's the maximum (community) support we can give at this point.
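As an aside, here is roughly where such multi-shard transactions come from in a Sidekiq workload: the fetcher issues a single blocking pop across all the queue keys a worker listens to, and in Dragonfly each key can hash to a different shard/thread. A sketch with made-up queue names:

```sh
# Roughly the shape of a Sidekiq fetch: one blocking pop over many queue keys.
# Each key can hash to a different Dragonfly shard, so this single command
# becomes a multi-shard (multi-threaded) transaction. Queue names are illustrative.
redis-cli BRPOP queue:default queue:critical queue:mailers queue:sidekiq-alive-abc123 2
```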