-
Notifications
You must be signed in to change notification settings - Fork 522
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Investigate serving Elastic APM and Jaeger inputs on the same port #3984
Comments
In theory we could use https://godoc.org/google.golang.org/grpc#Server.ServeHTTP, but by all accounts it is not well supported, and the grpc-go maintainers actively discourage its use. Seems to me that would be better than cmux as it would provide an easy way to switch on Content-Type to divert to a gRPC handler. Alas. If we create a cmux around a TLS listener, than the tls.Conn gets wrapped and net/http.Server will not set req.TLS, and will not trigger the TLSNextProto handler: https:/golang/go/blob/8696ae82c94f0a7707cbbbdf2cec44e93edf5b23/src/net/http/server.go#L1810 That means net/http's transparent HTTP/2 support would be broken, and we would have to explicitly handle the case the negotiated protocol is "h2" but not gRPC. e.g. Go's net/http client, which transparently negotiates "h2" by default. It would be possible to create an in-memory reverse proxy for http2, muxing on content-type. We could do this by defining our own TLSNextProto handler that reads/buffers http2 frames until we have the headers, similar to cmux, and then replaying buffered frames to either grpc.Server or http2.Server, depending on the content-type. I haven't seen this approach in the wild, so it could be a crackful idea. Overhead might also be an issue. For gRPC we would need to reimplement https:/grpc/grpc-go/blob/9106c3fff5236fd664a8de183f1c27682c66b823/credentials/tls.go#L114, to work with an existing tls.Conn. |
Can you elaborate a bit on the concrete motivation for this? What exactly is to be simplified, etc. |
One ~concrete example: sharing a single port means being able to reuse firewall and reverse proxy rules. This is not about simplifying APM Server, but its deployment. |
Would it make sense to preemptively ask the cloud team to allocate additional ports for each cluster in order to add additional endpoints/protocols in the future if multiplexing is clumsy? Allocating additional port would also mean an update of our docs so that users will have clear guidance to setup their firewalls including preparing these reserved ports. The change of firewall rules is definitively a pain we want to avoid but, for a transition and for a new protocol, I think we can afford this. |
I believe @graphaelli is going to do that in the near future. |
Expanding a bit on #3984 (comment), since others may look to this issue for a summary: The biggest technical challenges we have with using cmux (or something like it) are:
Re (1): because of RUM we default to neither requiring nor requesting TLS client certificates. Even if we request (but don't require) then RUM will prompt the browser user, which is an awful user experience. If we don't request, then we can't rely on TLS client certificate auth for other clients. This is important for Jaeger, where TLS client certificates are the primary auth mechanism. Re (2): we can address this by introducing a sort of in-memory HTTP/2 reverse proxy, as mentioned above. cmux does some of this internally [0], but I'm not convinced it's complete. There's a fair bit of complexity here, and something I would prefer to avoid at least in the near future. [0] https:/soheilhy/cmux/blob/8a8ea3c53959009183d7914522833c1ed8835020/matchers.go#L220
EDIT: the method names are generic, but the service names are not. My mistake, there's no concern here. |
btw on 1) with the proxy acting as a TLS mitm, it would need changes to support passing through client certs for auth if thats something that's on your roadmap. also for the following:
I would expect all connections coming from the cloud proxy will be HTTP/2 since https:/elastic/cloud/pull/61959 was merged. |
FYI related slack channel #proj-elastic-intake-protocols |
I spent a little time looking into this option: https:/axw/apm-server/blob/mux-grpc/beater/internal/h2mux/mux.go I have not tested the overhead yet. I expect it will be reasonable except per for very short-lived connections, but I would like to test it out. Aside from this, what remains is to implement muxing when TLS is disabled. The above relies on ALPN (a TLS feature). When TLS is disabled, Jaeger can still be used in h2c (HTTP/2 Cleartext) mode. In that case we can probably just use cmux as-is. Assuming all that goes well, I suggest the following path forward:
I think we should also consider deprecating |
I ended up doing something a bit simpler than using cmux: instead, we can rely on net/http's Hijacker interface to identify h2c prior knowledge requests, and send them straight through to the gRPC server. This makes the code for handling TLS/non-TLS neater. Side note: none of this will help us if we ever want to mux non-HTTP based protocols. Microbenchmarks show that the the overhead is pretty marginal for long-lived connections, a bit more pronounced for short-lived connections:
The "longrunning" benchmarks create a single connection and perform a series of RPCs. The "shortlived" benchmarks create a series of connections over which we perform a single RPC. |
@axw would it be interesting to see how the Opentelemetry Collector Receiver for the OTLP protocol multiplexes gRPC and HTTP on the same endpoint? See https:/open-telemetry/opentelemetry-collector/tree/master/receiver/otlpreceiver#writing-with-httpjson |
@cyrille-leclerc opentelemetry-collector does not serve HTTP and gRPC on a single port; you must define two different "endpoints" (host:port) for them. It previously used cmux, but it was removed because it prevented mTLS. See open-telemetry/opentelemetry-collector#1256 (I intend to extract my code into a separate repo, and offer it as a solution to that issue.) |
Good catch, I didn't connect the dots between the collector receiver for the Otel Protocol and the conversation we saw about the removal of multiplexing of HTTP and gRPC. |
master...axw:beater-gmux is functional, needs unit and system tests. I've done some basic manual testing, excluding configuring TLS.
|
@axw a lot of the above goes above my head - so apologies if I could have figured this one out. Can you clarify what the connection re-use semantics are? I.e., can the same connection be used for both apm server and jaeger requests? |
@bleskes no worries, it's not obvious. At the moment we claim an entire connection as either gRPC or not-gRPC. So no, it would not work to reuse the connection for both Jaeger and Elastic APM protocols. Sorry, I think I missed the point of your earlier questions around this. |
this will be problematic for the proxy, as it uses a common transport and pools upstream connections (and for HTTP/2 streams) across all inbound connections/requests. The original implementation of the proxy attempted to assign a dedicated transport to each inbound connection however this led to significant TLS overhead due to connection churn (see https:/elastic/cloud/issues/27706) and this was removed. We can put in something like https:/elastic/cloud/pull/75813#issuecomment-782479665 to segregate gRPC traffic in the short term, though we'd probably need to have a think on this in more detail cc @ralphm |
Hi @axw @cyrille-leclerc , is there a way to configure ECE´s APM instances to accept Jaeger protocol right now? (Not both or on the same protocol) thank you |
I don't believe there is as routing the separate port isn't hooked up. elastic/apm#212 covers both ESS and ECE availability for this. |
Currently the server can be configured to listen for requests from Elastic APM agents and Jaeger agents, but only on separate ports. To enable simplified deployment, we should consider serving them on a single port -- at least as an option. To do that, we could use something like https:/soheilhy/cmux
The text was updated successfully, but these errors were encountered: