Investigate serving Elastic APM and Jaeger inputs on the same port #3984

Closed
axw opened this issue Jul 16, 2020 · 19 comments · Fixed by #4618

@axw
Member

axw commented Jul 16, 2020

Currently the server can be configured to listen for requests from Elastic APM agents and Jaeger agents, but only on separate ports. To enable simplified deployment, we should consider serving them on a single port -- at least as an option. To do that, we could use something like https:/soheilhy/cmux

@axw
Member Author

axw commented Jul 26, 2020

In theory we could use https://godoc.org/google.golang.org/grpc#Server.ServeHTTP, but by all accounts it is not well supported, and the grpc-go maintainers actively discourage its use. Seems to me that would be better than cmux as it would provide an easy way to switch on Content-Type to divert to a gRPC handler. Alas.
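For reference, the Content-Type switch described above can be sketched with the standard library alone. This is a minimal illustration, not apm-server code: `isGRPCRequest`, `muxHandler`, `respond` and `route` are hypothetical names, the stand-in handlers take the place a `*grpc.Server` would occupy through its `ServeHTTP` method, and `application/x-ndjson` is just an example of a non-gRPC content type.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// isGRPCRequest reports whether r looks like a gRPC call: gRPC runs over
// HTTP/2 and uses a Content-Type beginning with "application/grpc".
func isGRPCRequest(r *http.Request) bool {
	return r.ProtoMajor == 2 &&
		strings.HasPrefix(r.Header.Get("Content-Type"), "application/grpc")
}

// muxHandler diverts gRPC requests to grpcHandler and everything else to
// httpHandler. Both handlers are stand-ins (assumption): a real server would
// pass something backed by a gRPC server as grpcHandler.
func muxHandler(grpcHandler, httpHandler http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if isGRPCRequest(r) {
			grpcHandler.ServeHTTP(w, r)
			return
		}
		httpHandler.ServeHTTP(w, r)
	})
}

// respond returns a hypothetical stand-in handler that writes a fixed body.
func respond(body string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, body)
	})
}

// route sends a synthetic request through the mux and returns the body
// written by whichever handler was chosen.
func route(protoMajor int, contentType string) string {
	req := httptest.NewRequest(http.MethodPost, "/", nil)
	req.ProtoMajor = protoMajor
	req.Header.Set("Content-Type", contentType)
	rec := httptest.NewRecorder()
	muxHandler(respond("grpc"), respond("http")).ServeHTTP(rec, req)
	return rec.Body.String()
}

func main() {
	fmt.Println(route(2, "application/grpc+proto"))
	fmt.Println(route(2, "application/x-ndjson"))
	fmt.Println(route(1, "application/grpc"))
}
```

Only the first request is routed to the gRPC stand-in; the other two fall through to the plain HTTP handler because either the content type or the protocol version does not match.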

If we create a cmux around a TLS listener, then the tls.Conn gets wrapped and net/http.Server will not set req.TLS, and will not trigger the TLSNextProto handler: https:/golang/go/blob/8696ae82c94f0a7707cbbbdf2cec44e93edf5b23/src/net/http/server.go#L1810
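A small sketch of why the wrapping matters, under the assumption that the server identifies TLS connections with a type assertion on the concrete connection type (as the linked line does). `wrappedConn`, `isTLSConn` and `checkConns` are hypothetical names; `wrappedConn` merely stands in for whatever connection type a multiplexer hands back from Accept.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
)

// wrappedConn stands in for the connection type a connection multiplexer
// returns: it embeds the original net.Conn, typically so it can replay the
// bytes it peeked at while matching.
type wrappedConn struct {
	net.Conn
}

// isTLSConn performs the same kind of check as the linked net/http code:
// a type assertion for *tls.Conn on the connection it was given.
func isTLSConn(c net.Conn) bool {
	_, ok := c.(*tls.Conn)
	return ok
}

// checkConns reports the result of isTLSConn for a bare *tls.Conn and for
// the same connection hidden behind a wrapper.
func checkConns() (direct, wrapped bool) {
	server, client := net.Pipe()
	defer server.Close()
	defer client.Close()

	// tls.Server only wraps the connection; no handshake is performed here.
	tlsConn := tls.Server(server, &tls.Config{})
	return isTLSConn(tlsConn), isTLSConn(wrappedConn{Conn: tlsConn})
}

func main() {
	direct, wrapped := checkConns()
	fmt.Println("bare *tls.Conn detected:", direct)
	fmt.Println("wrapped conn detected:", wrapped)
}
```

The wrapper still satisfies net.Conn, so everything keeps working at the byte level, but the assertion no longer succeeds and the TLS-specific code path is skipped.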

That means net/http's transparent HTTP/2 support would be broken, and we would have to explicitly handle the case where the negotiated protocol is "h2" but the client is not speaking gRPC, e.g. Go's net/http client, which transparently negotiates "h2" by default.

It would be possible to create an in-memory reverse proxy for http2, muxing on content-type. We could do this by defining our own TLSNextProto handler that reads/buffers http2 frames until we have the headers, similar to cmux, and then replaying buffered frames to either grpc.Server or http2.Server, depending on the content-type. I haven't seen this approach in the wild, so it could be a crackful idea. Overhead might also be an issue. For gRPC we would need to reimplement https:/grpc/grpc-go/blob/9106c3fff5236fd664a8de183f1c27682c66b823/credentials/tls.go#L114, to work with an existing tls.Conn.

@jalvz
Contributor

jalvz commented Aug 4, 2020

> To simplify things, we should...

Can you elaborate a bit on the concrete motivation for this? What exactly is to be simplified, etc.

@axw
Member Author

axw commented Aug 5, 2020

> Can you elaborate a bit on the concrete motivation for this? What exactly is to be simplified, etc.

One ~concrete example: sharing a single port means being able to reuse firewall and reverse proxy rules. This is not about simplifying APM Server, but its deployment.

@cyrille-leclerc
Contributor

cyrille-leclerc commented Aug 13, 2020

Would it make sense to preemptively ask the cloud team to allocate additional ports for each cluster in order to add additional endpoints/protocols in the future if multiplexing is clumsy? Allocating additional ports would also mean updating our docs so that users have clear guidance on setting up their firewalls, including preparing these reserved ports.

The change of firewall rules is definitely a pain we want to avoid but, for a transition and for a new protocol, I think we can afford this.

@axw
Member Author

axw commented Aug 14, 2020

I believe @graphaelli is going to do that in the near future.

@axw axw self-assigned this Aug 24, 2020
@axw
Member Author

axw commented Aug 25, 2020

Expanding a bit on #3984 (comment), since others may look to this issue for a summary:

The biggest technical challenges we have with using cmux (or something like it) are:

  1. differing needs for client TLS certificate auth across the multiplexed listeners
  2. serving both gRPC and plain old HTTP/2 auto-negotiated through ALPN

Re (1): because of RUM we default to neither requiring nor requesting TLS client certificates. Even if we request (but don't require) then RUM will prompt the browser user, which is an awful user experience. If we don't request, then we can't rely on TLS client certificate auth for other clients. This is important for Jaeger, where TLS client certificates are the primary auth mechanism.

Re (2): we can address this by introducing a sort of in-memory HTTP/2 reverse proxy, as mentioned above. cmux does some of this internally [0], but I'm not convinced it's complete. There's a fair bit of complexity here, and it's something I would prefer to avoid, at least in the near future.

[0] https:/soheilhy/cmux/blob/8a8ea3c53959009183d7914522833c1ed8835020/matchers.go#L220

Even if we can mux multiple listeners on one port, I still don't love the idea of muxing Jaeger, OpenTelemetry, and our own agent protocol. There's a potential for namespace collision there, between URL paths and gRPC method names (some of which in Jaeger and OpenTelemetry are quite generic).

EDIT: the method names are generic, but the service names are not. My mistake, there's no concern here.
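A quick illustration of why there is no collision: a gRPC request path has the form `/package.Service/Method`, so a generic method name is always qualified by its service. `splitFullMethod` is a hypothetical helper, and the two paths are the Jaeger and OTLP trace export methods as I understand them from their proto definitions.

```go
package main

import (
	"fmt"
	"strings"
)

// splitFullMethod splits a gRPC request path of the form
// "/package.Service/Method" into its service and method parts.
func splitFullMethod(path string) (service, method string, ok bool) {
	if !strings.HasPrefix(path, "/") {
		return "", "", false
	}
	i := strings.LastIndex(path, "/")
	if i == 0 || i == len(path)-1 {
		// Only the leading slash was found, or the method part is empty.
		return "", "", false
	}
	return path[1:i], path[i+1:], true
}

func main() {
	paths := []string{
		"/jaeger.api_v2.CollectorService/PostSpans",
		"/opentelemetry.proto.collector.trace.v1.TraceService/Export",
	}
	for _, p := range paths {
		service, method, ok := splitFullMethod(p)
		fmt.Printf("service=%q method=%q ok=%v\n", service, method, ok)
	}
}
```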

@franekrichardson

btw on 1) with the proxy acting as a TLS mitm, it would need changes to support passing through client certs for auth, if that's something that's on your roadmap.
We are looking at implementing support for something similar with stack PKI auth, so it would be worth understanding APM's asks around that so that we can cover both.

also for the following:

> That means net/http's transparent HTTP/2 support would be broken, and we would have to explicitly handle the case where the negotiated protocol is "h2" but the client is not speaking gRPC, e.g. Go's net/http client, which transparently negotiates "h2" by default.

I would expect all connections coming from the cloud proxy will be HTTP/2 since https:/elastic/cloud/pull/61959 was merged.

@cyrille-leclerc
Contributor

FYI related slack channel #proj-elastic-intake-protocols

@axw
Member Author

axw commented Sep 23, 2020

> It would be possible to create an in-memory reverse proxy for http2, muxing on content-type. We could do this by defining our own TLSNextProto handler that reads/buffers http2 frames until we have the headers, similar to cmux, and then replaying buffered frames to either grpc.Server or http2.Server, depending on the content-type. I haven't seen this approach in the wild, so it could be a crackful idea. Overhead might also be an issue. For gRPC we would need to reimplement https:/grpc/grpc-go/blob/9106c3fff5236fd664a8de183f1c27682c66b823/credentials/tls.go#L114, to work with an existing tls.Conn.

I spent a little time looking into this option: https:/axw/apm-server/blob/mux-grpc/beater/internal/h2mux/mux.go

I have not tested the overhead yet. I expect it will be reasonable except for very short-lived connections, but I would like to test it out. Aside from this, what remains is to implement muxing when TLS is disabled. The above relies on ALPN (a TLS feature). When TLS is disabled, Jaeger can still be used in h2c (HTTP/2 Cleartext) mode. In that case we can probably just use cmux as-is.

Assuming all that goes well, I suggest the following path forward:

  • deprecate apm-server.jaeger.grpc (Unless we want to provide an option to run them on separate ports? Seems more trouble than it's worth to me.)
  • predefine auth tags for passing secret token or API Key: elastic-secret-token, elastic-api-key. These would replace apm-server.jaeger.auth_key
  • make auth required for Jaeger when secret token or API Key auth is enabled in the server

I think we should also consider deprecating apm-server.jaeger.http and removing it in 8.0. It's really only useful in testing, and complicates the product.

@axw
Member Author

axw commented Sep 24, 2020

> I have not tested the overhead yet. I expect it will be reasonable except for very short-lived connections, but I would like to test it out. Aside from this, what remains is to implement muxing when TLS is disabled. The above relies on ALPN (a TLS feature). When TLS is disabled, Jaeger can still be used in h2c (HTTP/2 Cleartext) mode. In that case we can probably just use cmux as-is.

I ended up doing something a bit simpler than using cmux: instead, we can rely on net/http's Hijacker interface to identify h2c prior knowledge requests, and send them straight through to the gRPC server. This makes the code for handling TLS/non-TLS neater.

Side note: none of this will help us if we ever want to mux non-HTTP based protocols.

Microbenchmarks show that the overhead is pretty marginal for long-lived connections, a bit more pronounced for short-lived connections:

BenchmarkGRPCWithTLS/longrunning/direct-12         51954            115684 ns/op            8895 B/op        184 allocs/op
BenchmarkGRPCWithTLS/longrunning/muxed-12          51192            116640 ns/op            8853 B/op        183 allocs/op
BenchmarkGRPCWithTLS/shortlived/direct-12           4848           1384871 ns/op          336181 B/op       1948 allocs/op
BenchmarkGRPCWithTLS/shortlived/muxed-12            4327           1570301 ns/op          406596 B/op       2022 allocs/op
BenchmarkGRPCInsecure/longrunning/direct-12        49038            120205 ns/op            8732 B/op        179 allocs/op
BenchmarkGRPCInsecure/longrunning/muxed-12         50451            121998 ns/op            8746 B/op        179 allocs/op
BenchmarkGRPCInsecure/shortlived/direct-12         12973            445460 ns/op          230320 B/op        502 allocs/op
BenchmarkGRPCInsecure/shortlived/muxed-12          12391            480576 ns/op          245128 B/op        536 allocs/op

The "longrunning" benchmarks create a single connection and perform a series of RPCs. The "shortlived" benchmarks create a series of connections over which we perform a single RPC.
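To put numbers on "marginal" and "more pronounced", the relative ns/op overhead can be computed from the figures above. By my arithmetic it is under 1.5% for the long-running cases and roughly 8% to 13% for the short-lived ones. The helper below is just that arithmetic; `overheadPercent` and the `result` type are hypothetical names.

```go
package main

import "fmt"

// result holds the ns/op figures for one benchmark pair from the table above.
type result struct {
	name          string
	direct, muxed float64 // ns/op
}

// overheadPercent returns how much slower muxed is than direct, in percent.
func overheadPercent(direct, muxed float64) float64 {
	return (muxed - direct) / direct * 100
}

// results copies the ns/op column of the benchmark output.
var results = []result{
	{"GRPCWithTLS/longrunning", 115684, 116640},
	{"GRPCWithTLS/shortlived", 1384871, 1570301},
	{"GRPCInsecure/longrunning", 120205, 121998},
	{"GRPCInsecure/shortlived", 445460, 480576},
}

func main() {
	for _, r := range results {
		fmt.Printf("%-26s %+.1f%%\n", r.name, overheadPercent(r.direct, r.muxed))
	}
}
```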

@cyrille-leclerc
Contributor

@axw would it be interesting to see how the OpenTelemetry Collector Receiver for the OTLP protocol multiplexes gRPC and HTTP on the same endpoint? See https:/open-telemetry/opentelemetry-collector/tree/master/receiver/otlpreceiver#writing-with-httpjson

@axw
Member Author

axw commented Sep 29, 2020

@cyrille-leclerc opentelemetry-collector does not serve HTTP and gRPC on a single port; you must define two different "endpoints" (host:port) for them. It previously used cmux, but it was removed because it prevented mTLS. See open-telemetry/opentelemetry-collector#1256

(I intend to extract my code into a separate repo, and offer it as a solution to that issue.)

@cyrille-leclerc
Contributor

cyrille-leclerc commented Sep 29, 2020

Good catch, I didn't connect the dots between the collector receiver for the Otel Protocol and the conversation we saw about the removal of multiplexing of HTTP and gRPC.
It's a great idea to extract the multi-transport endpoint code.

@axw
Member Author

axw commented Jan 12, 2021

master...axw:beater-gmux is functional, needs unit and system tests.

I've done some basic manual testing, excluding configuring TLS.

  • Starting apm-server on port 8200 (default), jaeger-agent can connect and send data to APM Server when run with --reporter.grpc.host-port=localhost:8200 --reporter.grpc.tls.enabled=false.
  • After setting a secret token in apm-server.yml, jaeger-agent starts logging errors about being unauthenticated when attempting to send data. Adding --agent.tags "elastic-apm-auth=Bearer secret_token_value" to the jaeger-agent args makes it work again.
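For illustration, the tag check exercised in the second test could look roughly like the sketch below. This is an assumption-laden simplification, not the code in the branch: it handles only the `Bearer` scheme, takes the tags as a plain map, and `authorized` and `authTagKey` are hypothetical names.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"strings"
)

// authTagKey is the agent tag that carries the credentials, as used in the
// jaeger-agent flag above.
const authTagKey = "elastic-apm-auth"

// authorized reports whether tags carry "Bearer <secretToken>" under
// authTagKey. The comparison is constant-time to avoid leaking how much of
// the token matched.
func authorized(tags map[string]string, secretToken string) bool {
	value, ok := tags[authTagKey]
	if !ok {
		return false
	}
	const prefix = "Bearer "
	if !strings.HasPrefix(value, prefix) {
		return false
	}
	token := strings.TrimPrefix(value, prefix)
	return subtle.ConstantTimeCompare([]byte(token), []byte(secretToken)) == 1
}

func main() {
	tags := map[string]string{authTagKey: "Bearer secret_token_value"}
	fmt.Println(authorized(tags, "secret_token_value"))
	fmt.Println(authorized(tags, "another_value"))
	fmt.Println(authorized(map[string]string{}, "secret_token_value"))
}
```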

@bleskes

bleskes commented Feb 21, 2021

@axw a lot of the above goes above my head - so apologies if I could have figured this one out. Can you clarify what the connection re-use semantics are? I.e., can the same connection be used for both apm server and jaeger requests?

@axw
Member Author

axw commented Feb 22, 2021

@bleskes no worries, it's not obvious. At the moment we claim an entire connection as either gRPC or not-gRPC. So no, it would not work to reuse the connection for both Jaeger and Elastic APM protocols. Sorry, I think I missed the point of your earlier questions around this.

@franekrichardson

This will be problematic for the proxy, as it uses a common transport and pools upstream connections (and, for HTTP/2, streams) across all inbound connections/requests. The original implementation of the proxy attempted to assign a dedicated transport to each inbound connection; however, this led to significant TLS overhead due to connection churn (see https:/elastic/cloud/issues/27706) and it was removed.

We can put in something like https:/elastic/cloud/pull/75813#issuecomment-782479665 to segregate gRPC traffic in the short term, though we'd probably need to have a think on this in more detail cc @ralphm

@christianherweg0807

Hi @axw @cyrille-leclerc, is there a way to configure ECE's APM instances to accept Jaeger protocol right now? (Not both or on the same protocol)

thank you
Christian

@graphaelli
Member

I don't believe there is, as routing for the separate port isn't hooked up. elastic/apm#212 covers both ESS and ECE availability for this.
