PrometheusBasedResourceLimitChecks does not obey limits when the checked entry is not cached #3659

harism · 2024-10-07T07:32:10Z

PrometheusBasedResourceLimitChecks method isMessageLimitReached for example does not give any timeout for cache read needed situation on its code line 204 - this causes the byte sized payload limit to reset at the cache invalidate interval and the retrieve request will go to the else branch always and let payload pass through even over the limits.

Should this line 204 value.isDone() check be changed with something having a short wait timeout too? Or is this intentional not to wait at all for the cache entry getting available?

The text was updated successfully, but these errors were encountered:

sophokles73 · 2024-10-11T05:56:02Z

Should this line 204 value.isDone() check be changed with something having a short wait timeout too? Or is this intentional not to wait at all for the cache entry getting available?

Not waiting for the query to complete is intentional because we want to rather accept a few operations that would exceed the limit than blocking (and eventually failing) otherwise appropriate operations.

We could introduce a (very small) timeout instead of immediately falling back to accepting the operation but this value would need to be adapted to the concrete setup (and the expected responsiveness of Prometheus) by the admin. IMHO this will require quite some experience and testing to find a suitable value.

From our point of view this simply was not worth the effort when we created the code. WDYT?

harism · 2024-10-11T06:54:27Z

The problem is that this occurs now every time the cache entry is not available, by default after every 60 seconds. And after this minute there is again accepted requests over the limit until new cache entry is present. At a very slow pace tenant this limit check will never trigger actually with current defaults.

But if this is intentional, would it be ok to add a configuration flag to change the behaviour, and keep this short wait period for waiting that missing cache entry gets created what I started to implement already? I'm not so sure does the timeout value need to be optimally short, some second or two would suffice in my opinion similar value as the whole Prometheus query timeout is, which I would rather rely on as the backoff timeout here too, the Prometheus query triggers timeout and any other errors there might arise sooner.

sophokles73 · 2024-10-11T07:39:10Z

At a very slow pace tenant this limit check will never trigger actually with current defaults.

For such a client you will most likely not be concerned about exceeding the data limit ;-)
However, on the other hand, if you have a tenant with thousands of devices that send data at quite a high rate, you will block (potentially) hundreds to thousands of device requests which pile up (in memory) waiting for the cache entry to become available.

I would be ok to add a configuration option for a duration that will be waited for the cache entry to become available instead of immediately falling back to the default value. This option would need to be off by default then and should be documented appropriately, i.e. describing the impact that enabling this option might have (as I described above). Would that work for you?

harism · 2024-10-12T18:16:01Z

I need to admit that after pondering over the current implementation and the rationale for it, I've started to look into this issue from totally different perspective actually. And adding a new enforced shorter timeout value for the cache get isn't necessarily the best option for any of us.

But I didn't notice this tiny detail earlier on how Caffeine cache is initialised currently;

    .expireAfterWrite(cacheTimeout)
    .refreshAfterWrite(cacheTimeout.dividedBy(2));

And how do you see such an idea that refreshAfterWrite would be set to the current cacheTimeout value and expireAfterWrite would be some non-configurable hour(s) long duration instead? This would extend the limits exceeding cache entries live span, but also maintain the idea to reload cache entries after the configured cacheTimeout always.

sophokles73 · 2024-10-13T12:42:53Z

And how do you see such an idea that refreshAfterWrite would be set to the current cacheTimeout value and expireAfterWrite would be some non-configurable hour(s) long duration instead?

I'd rather not change the semantics of the cacheTimeout propery. However, I guess it would be helpful to be able to configure the divisor to use for the refresh interval. That way, you could configure the cache timeout to a large value (e.g. hours) and still let the value be refreshed after a reasonably short amount of time (if accessed). WDYT?

harism · 2024-10-13T21:46:00Z

However, I guess it would be helpful to be able to configure the divisor to use for the refresh interval. That way, you could configure the cache timeout to a large value (e.g. hours) and still let the value be refreshed after a reasonably short amount of time (if accessed). WDYT?

Yes this is something that suffices for my needs at least, to harden exceeding especially the transfer limit slightly.

I will prepare PR for this.

harism added a commit to harism/hono that referenced this issue Oct 7, 2024

[eclipse-hono#3659] Fix Prometheus limit check to use read timeout

34e21ae

harism added a commit to harism/hono that referenced this issue Oct 9, 2024

[eclipse-hono#3659] Fix Prometheus limit check to use read timeout

886b1c2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PrometheusBasedResourceLimitChecks does not obey limits when the checked entry is not cached #3659

PrometheusBasedResourceLimitChecks does not obey limits when the checked entry is not cached #3659

harism commented Oct 7, 2024 •

edited

Loading

sophokles73 commented Oct 11, 2024

harism commented Oct 11, 2024 •

edited

Loading

sophokles73 commented Oct 11, 2024

harism commented Oct 12, 2024 •

edited

Loading

sophokles73 commented Oct 13, 2024

harism commented Oct 13, 2024

PrometheusBasedResourceLimitChecks does not obey limits when the checked entry is not cached #3659

PrometheusBasedResourceLimitChecks does not obey limits when the checked entry is not cached #3659

Comments

harism commented Oct 7, 2024 • edited Loading

sophokles73 commented Oct 11, 2024

harism commented Oct 11, 2024 • edited Loading

sophokles73 commented Oct 11, 2024

harism commented Oct 12, 2024 • edited Loading

sophokles73 commented Oct 13, 2024

harism commented Oct 13, 2024

harism commented Oct 7, 2024 •

edited

Loading

harism commented Oct 11, 2024 •

edited

Loading

harism commented Oct 12, 2024 •

edited

Loading