Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PrometheusBasedResourceLimitChecks does not obey limits when the checked entry is not cached #3659

Open
harism opened this issue Oct 7, 2024 · 6 comments

Comments

@harism
Copy link
Contributor

harism commented Oct 7, 2024

PrometheusBasedResourceLimitChecks method isMessageLimitReached for example does not give any timeout for cache read needed situation on its code line 204 - this causes the byte sized payload limit to reset at the cache invalidate interval and the retrieve request will go to the else branch always and let payload pass through even over the limits.

Should this line 204 value.isDone() check be changed with something having a short wait timeout too? Or is this intentional not to wait at all for the cache entry getting available?

@sophokles73
Copy link
Contributor

Should this line 204 value.isDone() check be changed with something having a short wait timeout too? Or is this intentional not to wait at all for the cache entry getting available?

Not waiting for the query to complete is intentional because we want to rather accept a few operations that would exceed the limit than blocking (and eventually failing) otherwise appropriate operations.

We could introduce a (very small) timeout instead of immediately falling back to accepting the operation but this value would need to be adapted to the concrete setup (and the expected responsiveness of Prometheus) by the admin. IMHO this will require quite some experience and testing to find a suitable value.

From our point of view this simply was not worth the effort when we created the code. WDYT?

@harism
Copy link
Contributor Author

harism commented Oct 11, 2024

The problem is that this occurs now every time the cache entry is not available, by default after every 60 seconds. And after this minute there is again accepted requests over the limit until new cache entry is present. At a very slow pace tenant this limit check will never trigger actually with current defaults.

But if this is intentional, would it be ok to add a configuration flag to change the behaviour, and keep this short wait period for waiting that missing cache entry gets created what I started to implement already? I'm not so sure does the timeout value need to be optimally short, some second or two would suffice in my opinion similar value as the whole Prometheus query timeout is, which I would rather rely on as the backoff timeout here too, the Prometheus query triggers timeout and any other errors there might arise sooner.

@sophokles73
Copy link
Contributor

At a very slow pace tenant this limit check will never trigger actually with current defaults.

For such a client you will most likely not be concerned about exceeding the data limit ;-)
However, on the other hand, if you have a tenant with thousands of devices that send data at quite a high rate, you will block (potentially) hundreds to thousands of device requests which pile up (in memory) waiting for the cache entry to become available.

I would be ok to add a configuration option for a duration that will be waited for the cache entry to become available instead of immediately falling back to the default value. This option would need to be off by default then and should be documented appropriately, i.e. describing the impact that enabling this option might have (as I described above). Would that work for you?

@harism
Copy link
Contributor Author

harism commented Oct 12, 2024

I need to admit that after pondering over the current implementation and the rationale for it, I've started to look into this issue from totally different perspective actually. And adding a new enforced shorter timeout value for the cache get isn't necessarily the best option for any of us.

But I didn't notice this tiny detail earlier on how Caffeine cache is initialised currently;

    .expireAfterWrite(cacheTimeout)
    .refreshAfterWrite(cacheTimeout.dividedBy(2));

And how do you see such an idea that refreshAfterWrite would be set to the current cacheTimeout value and expireAfterWrite would be some non-configurable hour(s) long duration instead? This would extend the limits exceeding cache entries live span, but also maintain the idea to reload cache entries after the configured cacheTimeout always.

@sophokles73
Copy link
Contributor

And how do you see such an idea that refreshAfterWrite would be set to the current cacheTimeout value and expireAfterWrite would be some non-configurable hour(s) long duration instead?

I'd rather not change the semantics of the cacheTimeout propery. However, I guess it would be helpful to be able to configure the divisor to use for the refresh interval. That way, you could configure the cache timeout to a large value (e.g. hours) and still let the value be refreshed after a reasonably short amount of time (if accessed). WDYT?

@harism
Copy link
Contributor Author

harism commented Oct 13, 2024

However, I guess it would be helpful to be able to configure the divisor to use for the refresh interval. That way, you could configure the cache timeout to a large value (e.g. hours) and still let the value be refreshed after a reasonably short amount of time (if accessed). WDYT?

Yes this is something that suffices for my needs at least, to harden exceeding especially the transfer limit slightly.

I will prepare PR for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants