Investigate serving Elastic APM and Jaeger inputs on the same port #3984

axw · 2020-07-16T00:39:20Z

Currently the server can be configured to listen for requests from Elastic APM agents and Jaeger agents, but only on separate ports. To enable simplified deployment, we should consider serving them on a single port -- at least as an option. To do that, we could use something like https:/soheilhy/cmux

axw · 2020-07-26T07:09:19Z

In theory we could use https://godoc.org/google.golang.org/grpc#Server.ServeHTTP, but by all accounts it is not well supported, and the grpc-go maintainers actively discourage its use. Seems to me that would be better than cmux as it would provide an easy way to switch on Content-Type to divert to a gRPC handler. Alas.
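For illustration, the Content-Type switch mentioned above could look roughly like the following. This is a minimal sketch using only net/http: `isGRPCRequest`, `muxHandler`, and `route` are hypothetical names, and both handlers are stand-ins. With grpc-go, the gRPC side would be the `*grpc.Server` itself through the `ServeHTTP` method linked above, which is exactly the path the maintainers discourage.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// isGRPCRequest reports whether r looks like a gRPC call: gRPC runs over
// HTTP/2 and uses a Content-Type that starts with "application/grpc".
func isGRPCRequest(r *http.Request) bool {
	return r.ProtoMajor == 2 &&
		strings.HasPrefix(r.Header.Get("Content-Type"), "application/grpc")
}

// muxHandler diverts gRPC requests to grpcHandler and everything else to
// httpHandler.
func muxHandler(grpcHandler, httpHandler http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if isGRPCRequest(r) {
			grpcHandler.ServeHTTP(w, r)
			return
		}
		httpHandler.ServeHTTP(w, r)
	})
}

// route is a demonstration helper: it sends one fake request through
// muxHandler and reports which side handled it.
func route(protoMajor int, contentType string) string {
	named := func(name string) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
			fmt.Fprint(w, name)
		})
	}
	req := httptest.NewRequest(http.MethodPost, "/", nil)
	req.ProtoMajor = protoMajor
	req.Header.Set("Content-Type", contentType)
	rec := httptest.NewRecorder()
	muxHandler(named("grpc"), named("http")).ServeHTTP(rec, req)
	return rec.Body.String()
}

func main() {
	fmt.Println(route(2, "application/grpc"))     // grpc
	fmt.Println(route(2, "application/x-ndjson")) // http
	fmt.Println(route(1, "application/grpc"))     // http
}
```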

If we create a cmux around a TLS listener, then the tls.Conn gets wrapped, so net/http.Server will not set req.TLS and will not trigger the TLSNextProto handler: https:/golang/go/blob/8696ae82c94f0a7707cbbbdf2cec44e93edf5b23/src/net/http/server.go#L1810

That means net/http's transparent HTTP/2 support would be broken, and we would have to explicitly handle the case where the negotiated protocol is "h2" but the request is not gRPC. One such client is Go's net/http client, which transparently negotiates "h2" by default.
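The wrapping problem comes down to a failed type assertion. The sketch below assumes a hypothetical `wrappedConn` standing in for whatever connection type a multiplexer hands out, and `isTLSConn` mirrors the kind of check net/http performs at the linked line; it is not net/http's actual code.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
)

// wrappedConn stands in for the connection type a multiplexer returns: it
// embeds the original conn so that sniffed bytes can be replayed later.
type wrappedConn struct {
	net.Conn
}

// isTLSConn mirrors the kind of type assertion net/http applies to an
// accepted connection before setting req.TLS and consulting TLSNextProto.
func isTLSConn(c net.Conn) bool {
	_, ok := c.(*tls.Conn)
	return ok
}

// demo runs the check against a bare *tls.Conn and against the same conn
// hidden behind a wrapper.
func demo() (direct, wrapped bool) {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()
	tlsConn := tls.Server(server, &tls.Config{})
	return isTLSConn(tlsConn), isTLSConn(wrappedConn{Conn: tlsConn})
}

func main() {
	direct, wrapped := demo()
	fmt.Println("direct:", direct, "wrapped:", wrapped)
}
```

The wrapper still satisfies net.Conn, so everything compiles and plain HTTP/1.1 keeps working; only the TLS-specific behaviour silently disappears.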

It would be possible to create an in-memory reverse proxy for http2, muxing on content-type. We could do this by defining our own TLSNextProto handler that reads/buffers http2 frames until we have the headers, similar to cmux, and then replaying buffered frames to either grpc.Server or http2.Server, depending on the content-type. I haven't seen this approach in the wild, so it could be a crackful idea. Overhead might also be an issue. For gRPC we would need to reimplement https:/grpc/grpc-go/blob/9106c3fff5236fd664a8de183f1c27682c66b823/credentials/tls.go#L114, to work with an existing tls.Conn.
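The buffer-and-replay part of that idea can be sketched with the standard library alone. This is only an illustration under assumptions: `replayConn`, `sniff`, and `demo` are hypothetical names, and the sketch reads a fixed number of bytes rather than parsing HTTP/2 frames or decoding the HPACK-encoded headers, which is where most of the real complexity would sit.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net"
)

// replayConn is a net.Conn whose reads first return bytes that were already
// consumed while sniffing, then continue with the underlying connection.
type replayConn struct {
	net.Conn
	r io.Reader
}

func (c *replayConn) Read(p []byte) (int, error) { return c.r.Read(p) }

// sniff reads up to n bytes from conn for inspection and returns them along
// with a conn that replays those bytes to whichever server is chosen.
func sniff(conn net.Conn, n int) ([]byte, net.Conn, error) {
	buf := make([]byte, n)
	read, err := io.ReadFull(conn, buf)
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		return nil, nil, err
	}
	buf = buf[:read]
	replay := io.MultiReader(bytes.NewReader(buf), conn)
	return buf, &replayConn{Conn: conn, r: replay}, nil
}

// demo pushes payload through an in-memory pipe, sniffs the first n bytes,
// and returns the sniffed prefix plus everything the replayed conn yields.
func demo(payload string, n int) (sniffed, replayed string, err error) {
	client, server := net.Pipe()
	go func() {
		client.Write([]byte(payload))
		client.Close()
	}()
	prefix, conn, err := sniff(server, n)
	if err != nil {
		return "", "", err
	}
	defer conn.Close()
	all, err := io.ReadAll(conn)
	return string(prefix), string(all), err
}

func main() {
	sniffed, replayed, err := demo("PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n<frames>", 24)
	fmt.Printf("%q\n%q\n%v\n", sniffed, replayed, err)
}
```

The server that finally receives the connection sees the full byte stream from the start, as if nothing had been read.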

jalvz · 2020-08-04T09:51:45Z

> To simplify things, we should...

Can you elaborate a bit on the concrete motivation for this? What exactly is to be simplified, etc.

axw · 2020-08-05T05:28:05Z

> Can you elaborate a bit on the concrete motivation for this? What exactly is to be simplified, etc.

One ~concrete example: sharing a single port means being able to reuse firewall and reverse proxy rules. This is not about simplifying APM Server, but its deployment.

cyrille-leclerc · 2020-08-13T14:14:23Z

Would it make sense to preemptively ask the cloud team to allocate additional ports for each cluster in order to add additional endpoints/protocols in the future if multiplexing is clumsy? Allocating additional port would also mean an update of our docs so that users will have clear guidance to setup their firewalls including preparing these reserved ports.

The change of firewall rules is definitively a pain we want to avoid but, for a transition and for a new protocol, I think we can afford this.

axw · 2020-08-14T06:27:20Z

I believe @graphaelli is going to do that in the near future.

axw · 2020-08-25T03:42:21Z

Expanding a bit on #3984 (comment), since others may look to this issue for a summary:

The biggest technical challenges we have with using cmux (or something like it) are:

1. differing needs for client TLS certificate auth across the multiplexed listeners
2. serving both gRPC and plain old HTTP/2 auto-negotiated through ALPN

Re (1): because of RUM we default to neither requiring nor requesting TLS client certificates. Even if we request (but don't require) then RUM will prompt the browser user, which is an awful user experience. If we don't request, then we can't rely on TLS client certificate auth for other clients. This is important for Jaeger, where TLS client certificates are the primary auth mechanism.
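The conflict can be made concrete with crypto/tls's client auth policies. In this sketch the listener kinds ("rum", "jaeger") and the helper names are hypothetical; only the `tls.ClientAuthType` constants come from the standard library. A single listener has a single `tls.Config`, so it can carry only one of these policies.

```go
package main

import (
	"crypto/tls"
	"fmt"
)

// clientAuthFor returns the client certificate policy each kind of traffic
// would want if it had a listener to itself.
func clientAuthFor(kind string) tls.ClientAuthType {
	switch kind {
	case "jaeger":
		// Client certificates are the primary auth mechanism.
		return tls.RequireAndVerifyClientCert
	default:
		// RUM and other browser traffic: never ask for a certificate.
		return tls.NoClientCert
	}
}

// requestsCert reports whether a server using policy t asks the client for a
// certificate during the handshake, which is what can trigger a browser prompt.
func requestsCert(t tls.ClientAuthType) bool {
	return t != tls.NoClientCert
}

// conflict reports whether two kinds of traffic want different policies and
// therefore cannot be satisfied by one shared tls.Config.
func conflict(a, b string) bool {
	return clientAuthFor(a) != clientAuthFor(b)
}

func main() {
	for _, kind := range []string{"rum", "jaeger"} {
		policy := clientAuthFor(kind)
		fmt.Printf("%-6s %-26v requests cert: %v\n", kind, policy, requestsCert(policy))
	}
	fmt.Println("conflict:", conflict("rum", "jaeger"))
}
```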

Re (2): we can address this by introducing a sort of in-memory HTTP/2 reverse proxy, as mentioned above. cmux does some of this internally [0], but I'm not convinced it's complete. There's a fair bit of complexity here, and it's something I would prefer to avoid, at least in the near future.

[0] https:/soheilhy/cmux/blob/8a8ea3c53959009183d7914522833c1ed8835020/matchers.go#L220
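As a rough sketch of where such a proxy would hook in, net/http lets a server take over connections that negotiated "h2" through ALPN by setting `TLSNextProto`. The names `newMuxServer` and `handleH2` are hypothetical; `handleH2` is where the frame buffering and the choice between gRPC and plain HTTP/2 would have to happen, and that part is not shown.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

// newMuxServer builds an http.Server whose "h2" connections are handed to
// handleH2 instead of net/http's built-in HTTP/2 implementation.
func newMuxServer(h http.Handler, handleH2 func(*tls.Conn)) *http.Server {
	return &http.Server{
		Handler: h,
		TLSConfig: &tls.Config{
			// Protocols offered to clients during ALPN.
			NextProtos: []string{"h2", "http/1.1"},
		},
		// A non-nil TLSNextProto disables net/http's automatic HTTP/2 setup,
		// so connections that negotiate "h2" are passed to this callback.
		TLSNextProto: map[string]func(*http.Server, *tls.Conn, http.Handler){
			"h2": func(_ *http.Server, conn *tls.Conn, _ http.Handler) {
				handleH2(conn)
			},
		},
	}
}

func main() {
	srv := newMuxServer(http.NotFoundHandler(), func(conn *tls.Conn) {
		// Placeholder: a real implementation would inspect the first
		// HTTP/2 frames here and pick a backend.
		conn.Close()
	})
	fmt.Println("ALPN protocols:", srv.TLSConfig.NextProtos)
	fmt.Println("h2 taken over:", srv.TLSNextProto["h2"] != nil)
}
```

HTTP/1.1 requests on the same listener still go to the regular handler, since only the "h2" entry is overridden.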

Even if we can mux multiple listeners on one port, I still don't love the idea of muxing Jaeger, OpenTelemetry, and our own agent protocol. There's a potential for namespace collision there, between URL paths and gRPC method names (some of which in Jaeger and OpenTelemetry are quite generic).

EDIT: the method names are generic, but the service names are not. My mistake, there's no concern here.
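To see why the service names rule out collisions: a gRPC call is addressed by an HTTP/2 path of the form `/<service>/<method>`, where the service name includes its protobuf package. The sketch below uses a hypothetical helper, `splitFullMethod`; the two example paths are the Jaeger and OTLP trace export methods as published in those projects' proto definitions.

```go
package main

import (
	"fmt"
	"strings"
)

// splitFullMethod splits a gRPC request path of the form
// "/<package>.<Service>/<Method>" into its service and method parts.
func splitFullMethod(path string) (service, method string, ok bool) {
	rest := strings.TrimPrefix(path, "/")
	if rest == path {
		return "", "", false // no leading slash
	}
	i := strings.LastIndex(rest, "/")
	if i <= 0 || i == len(rest)-1 {
		return "", "", false // missing service or method
	}
	return rest[:i], rest[i+1:], true
}

func main() {
	paths := []string{
		"/jaeger.api_v2.CollectorService/PostSpans",
		"/opentelemetry.proto.collector.trace.v1.TraceService/Export",
	}
	for _, p := range paths {
		service, method, ok := splitFullMethod(p)
		fmt.Printf("service=%s method=%s ok=%v\n", service, method, ok)
	}
}
```

A method name such as Export is generic, but the package-qualified service name in front of it is not.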

franekrichardson · 2020-08-27T11:38:54Z

btw on 1) with the proxy acting as a TLS mitm, it would need changes to support passing through client certs for auth, if that's something that's on your roadmap.
We are looking at implementing support for something similar with stack PKI auth, so it would be worth understanding APM's asks around that so that we can cover both.

also for the following:

> That means net/http's transparent HTTP/2 support would be broken, and we would have to explicitly handle the case the negotiated protocol is "h2" but not gRPC. e.g. Go's net/http client, which transparently negotiates "h2" by default.

I would expect all connections coming from the cloud proxy will be HTTP/2 since https:/elastic/cloud/pull/61959 was merged.

cyrille-leclerc · 2020-09-14T14:50:00Z

FYI related slack channel #proj-elastic-intake-protocols

axw · 2020-09-23T08:54:07Z

> It would be possible to create an in-memory reverse proxy for http2, muxing on content-type. We could do this by defining our own TLSNextProto handler that reads/buffers http2 frames until we have the headers, similar to cmux, and then replaying buffered frames to either grpc.Server or http2.Server, depending on the content-type. I haven't seen this approach in the wild, so it could be a crackful idea. Overhead might also be an issue. For gRPC we would need to reimplement https:/grpc/grpc-go/blob/9106c3fff5236fd664a8de183f1c27682c66b823/credentials/tls.go#L114, to work with an existing tls.Conn.

I spent a little time looking into this option: https:/axw/apm-server/blob/mux-grpc/beater/internal/h2mux/mux.go

I have not tested the overhead yet. I expect it will be reasonable except for very short-lived connections, but I would like to test it out. Aside from this, what remains is to implement muxing when TLS is disabled. The above relies on ALPN (a TLS feature). When TLS is disabled, Jaeger can still be used in h2c (HTTP/2 Cleartext) mode. In that case we can probably just use cmux as-is.

Assuming all that goes well, I suggest the following path forward:

- deprecate apm-server.jaeger.grpc (unless we want to provide an option to run them on separate ports? That seems like more trouble than it's worth to me.)
- predefine auth tags for passing the secret token or API Key: elastic-secret-token, elastic-api-key. These would replace apm-server.jaeger.auth_key
- make auth required for Jaeger when secret token or API Key auth is enabled in the server

I think we should also consider deprecating apm-server.jaeger.http and removing it in 8.0. It's really only useful in testing, and complicates the product.

axw · 2020-09-24T10:13:29Z

> I have not tested the overhead yet. I expect it will be reasonable except for very short-lived connections, but I would like to test it out. Aside from this, what remains is to implement muxing when TLS is disabled. The above relies on ALPN (a TLS feature). When TLS is disabled, Jaeger can still be used in h2c (HTTP/2 Cleartext) mode. In that case we can probably just use cmux as-is.

I ended up doing something a bit simpler than using cmux: instead, we can rely on net/http's Hijacker interface to identify h2c prior knowledge requests, and send them straight through to the gRPC server. This makes the code for handling TLS/non-TLS neater.
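A minimal sketch of the Hijacker approach, not the actual implementation: with prior knowledge, an h2c client opens the connection with the HTTP/2 preface, whose first line net/http parses as a request with method PRI. A handler can recognise that, hijack the connection, and pass it on with the consumed bytes put back. `divertH2C`, `prefixConn`, `looksLikeH2C`, and the `serveConn` callback are hypothetical names; how the connection reaches the gRPC server is left out.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net"
	"net/http"
	"strings"
)

// prefacePrefix is the part of the HTTP/2 client preface that net/http parses
// as a request line plus an empty header block. The rest ("SM\r\n\r\n") is
// still unread when the handler runs.
const prefacePrefix = "PRI * HTTP/2.0\r\n\r\n"

// isH2CPriorKnowledge reports whether r is the start of an HTTP/2 preface.
func isH2CPriorKnowledge(r *http.Request) bool {
	return r.Method == "PRI" && r.URL.Path == "*" &&
		r.Proto == "HTTP/2.0" && len(r.Header) == 0
}

// prefixConn reads from r (replayed preface plus buffered data) and uses the
// embedded conn for everything else.
type prefixConn struct {
	net.Conn
	r io.Reader
}

func (c *prefixConn) Read(p []byte) (int, error) { return c.r.Read(p) }

// divertH2C hands h2c prior-knowledge connections to serveConn, which takes
// ownership of the conn, and sends all other requests to next.
func divertH2C(serveConn func(net.Conn), next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !isH2CPriorKnowledge(r) {
			next.ServeHTTP(w, r)
			return
		}
		hj, ok := w.(http.Hijacker)
		if !ok {
			http.Error(w, "hijacking not supported", http.StatusInternalServerError)
			return
		}
		conn, rw, err := hj.Hijack()
		if err != nil {
			return
		}
		// Put the already-parsed bytes back in front of the buffered reader.
		replay := io.MultiReader(strings.NewReader(prefacePrefix), rw)
		serveConn(&prefixConn{Conn: conn, r: replay})
	})
}

// looksLikeH2C parses raw the way net/http would and applies the check.
func looksLikeH2C(raw string) bool {
	req, err := http.ReadRequest(bufio.NewReader(strings.NewReader(raw)))
	return err == nil && isH2CPriorKnowledge(req)
}

func main() {
	fmt.Println(looksLikeH2C("PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"))
	fmt.Println(looksLikeH2C("GET / HTTP/1.1\r\nHost: example\r\n\r\n"))
}
```

Because detection happens inside an ordinary handler, the same http.Server can front both TLS and non-TLS listeners, which matches the point about the code becoming neater.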

Side note: none of this will help us if we ever want to mux non-HTTP based protocols.

Microbenchmarks show that the overhead is pretty marginal for long-lived connections, a bit more pronounced for short-lived connections:

BenchmarkGRPCWithTLS/longrunning/direct-12         51954            115684 ns/op            8895 B/op        184 allocs/op
BenchmarkGRPCWithTLS/longrunning/muxed-12          51192            116640 ns/op            8853 B/op        183 allocs/op
BenchmarkGRPCWithTLS/shortlived/direct-12           4848           1384871 ns/op          336181 B/op       1948 allocs/op
BenchmarkGRPCWithTLS/shortlived/muxed-12            4327           1570301 ns/op          406596 B/op       2022 allocs/op
BenchmarkGRPCInsecure/longrunning/direct-12        49038            120205 ns/op            8732 B/op        179 allocs/op
BenchmarkGRPCInsecure/longrunning/muxed-12         50451            121998 ns/op            8746 B/op        179 allocs/op
BenchmarkGRPCInsecure/shortlived/direct-12         12973            445460 ns/op          230320 B/op        502 allocs/op
BenchmarkGRPCInsecure/shortlived/muxed-12          12391            480576 ns/op          245128 B/op        536 allocs/op

The "longrunning" benchmarks create a single connection and perform a series of RPCs. The "shortlived" benchmarks create a series of connections over which we perform a single RPC.
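For reference, the relative overhead can be worked out from the ns/op columns above. The helper name is hypothetical; the inputs are copied from the benchmark output. It comes to roughly 0.8% (TLS) and 1.5% (insecure) for long-running connections, and roughly 13.4% (TLS) and 7.9% (insecure) for short-lived ones.

```go
package main

import "fmt"

// overheadPercent returns how much slower the muxed path is than the direct
// path, as a percentage of the direct time.
func overheadPercent(directNsPerOp, muxedNsPerOp float64) float64 {
	return (muxedNsPerOp - directNsPerOp) / directNsPerOp * 100
}

func main() {
	results := []struct {
		name          string
		direct, muxed float64 // ns/op
	}{
		{"GRPCWithTLS/longrunning", 115684, 116640},
		{"GRPCWithTLS/shortlived", 1384871, 1570301},
		{"GRPCInsecure/longrunning", 120205, 121998},
		{"GRPCInsecure/shortlived", 445460, 480576},
	}
	for _, r := range results {
		fmt.Printf("%-26s %+.2f%%\n", r.name, overheadPercent(r.direct, r.muxed))
	}
}
```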

cyrille-leclerc · 2020-09-25T14:13:09Z

@axw would it be interesting to see how the OpenTelemetry Collector Receiver for the OTLP protocol multiplexes gRPC and HTTP on the same endpoint? See https:/open-telemetry/opentelemetry-collector/tree/master/receiver/otlpreceiver#writing-with-httpjson

axw · 2020-09-29T03:58:25Z

@cyrille-leclerc opentelemetry-collector does not serve HTTP and gRPC on a single port; you must define two different "endpoints" (host:port) for them. It previously used cmux, but it was removed because it prevented mTLS. See open-telemetry/opentelemetry-collector#1256

(I intend to extract my code into a separate repo, and offer it as a solution to that issue.)

cyrille-leclerc · 2020-09-29T12:52:02Z

Good catch, I didn't connect the dots between the collector receiver for the Otel Protocol and the conversation we saw about the removal of multiplexing of HTTP and gRPC.
It's a great idea to extract the multi-transport endpoint code.

axw · 2021-01-12T06:54:24Z

master...axw:beater-gmux is functional, needs unit and system tests.

I've done some basic manual testing, excluding configuring TLS.

Starting apm-server on port 8200 (default), jaeger-agent can connect and send data to APM Server when run with --reporter.grpc.host-port=localhost:8200 --reporter.grpc.tls.enabled=false.
After setting a secret token in apm-server.yml, jaeger-agent starts logging errors when attempting to send data about being unauthenticated. Adding --agent.tags "elastic-apm-auth=Bearer secret_token_value" to the jaeger-agent args makes it work again.
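The tag-based check exercised in that manual test can be sketched as follows. This is an assumption-laden illustration rather than the server's code: `authorized` is a hypothetical helper, the tags are modelled as a plain map, and only the Bearer form shown above is handled.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"strings"
)

// authTag is the agent tag used in the manual test above.
const authTag = "elastic-apm-auth"

// authorized reports whether the agent's tags carry the expected secret
// token. An empty secretToken means auth is not enabled on the server.
func authorized(tags map[string]string, secretToken string) bool {
	if secretToken == "" {
		return true
	}
	value, ok := tags[authTag]
	if !ok {
		return false
	}
	scheme, token, found := strings.Cut(value, " ")
	if !found || scheme != "Bearer" {
		return false
	}
	// Constant-time comparison avoids leaking the token through timing.
	return subtle.ConstantTimeCompare([]byte(token), []byte(secretToken)) == 1
}

func main() {
	tags := map[string]string{authTag: "Bearer secret_token_value"}
	fmt.Println(authorized(tags, "secret_token_value"))
	fmt.Println(authorized(map[string]string{}, "secret_token_value"))
}
```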

bleskes · 2021-02-21T21:07:17Z

@axw a lot of the above goes above my head - so apologies if I could have figured this one out. Can you clarify what the connection re-use semantics are? I.e., can the same connection be used for both apm server and jaeger requests?

axw · 2021-02-22T01:35:16Z

@bleskes no worries, it's not obvious. At the moment we claim an entire connection as either gRPC or not-gRPC. So no, it would not work to reuse the connection for both Jaeger and Elastic APM protocols. Sorry, I think I missed the point of your earlier questions around this.

franekrichardson · 2021-02-22T10:23:05Z

this will be problematic for the proxy, as it uses a common transport and pools upstream connections (and, for HTTP/2, streams) across all inbound connections/requests. The original implementation of the proxy attempted to assign a dedicated transport to each inbound connection; however, this led to significant TLS overhead due to connection churn (see https:/elastic/cloud/issues/27706), so it was removed.

We can put in something like https:/elastic/cloud/pull/75813#issuecomment-782479665 to segregate gRPC traffic in the short term, though we'd probably need to have a think on this in more detail cc @ralphm

christianherweg0807 · 2021-03-15T12:54:19Z

Hi @axw @cyrille-leclerc, is there a way to configure ECE's APM instances to accept Jaeger protocol right now? (Not both or on the same protocol)

thank you
Christian

graphaelli · 2021-03-15T13:29:46Z

I don't believe there is, as routing for the separate port isn't hooked up. elastic/apm#212 covers both ESS and ECE availability for this.

axw added the [zube]: Ready label Jul 23, 2020

axw self-assigned this Aug 24, 2020

axw added [zube]: In Review and removed [zube]: Ready labels Aug 24, 2020

axw added [zube]: In Progress and removed [zube]: In Review labels Sep 21, 2020

axw mentioned this issue Oct 7, 2020

Jaeger Intake on Elastic Cloud elastic/apm#212

Closed

axw added [zube]: Ready and removed [zube]: In Progress labels Nov 11, 2020

axw mentioned this issue Dec 10, 2020

Native OTLP intake #4503

Closed

axw added this to the 7.12 milestone Dec 22, 2020

axw added [zube]: In Progress and removed [zube]: Ready labels Jan 12, 2021

axw mentioned this issue Jan 13, 2021

beater: run gmuxed gRPC server by default #4618

Merged

1 task

axw added [zube]: In Review and removed [zube]: In Progress labels Jan 19, 2021

axw closed this as completed in #4618 Jan 20, 2021

zube bot added [zube]: Done and removed [zube]: In Review labels Jan 20, 2021

axw removed the [zube]: Done label Feb 2, 2021
