Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add WebSockets metrics #2649

Merged
merged 12 commits into from
Aug 15, 2024
Merged

feat: add WebSockets metrics #2649

merged 12 commits into from
Aug 15, 2024

Conversation

SgtPooki
Copy link
Member

@SgtPooki SgtPooki commented Aug 5, 2024

Title

feat: add WebSockets metrics

Description

Creates two new metric counter groups for WebSockets: libp2p_websockets_dialer_events_total and libp2p_websockets_listener_events_total.

The following new metrics should be added:

> rg '.increment|.update' packages/transport-websockets/src
packages/transport-websockets/src/index.ts
157:      this.metrics?.dialerEvents.increment({ error: true })
166:        this.metrics?.dialerEvents.increment({ abort: true })
168:        this.metrics?.dialerEvents.increment({ error: true })
179:    this.metrics?.dialerEvents.increment({ connect: true })

packages/transport-websockets/src/socket-to-conn.ts
64:        metrics?.increment({ [`${metricPrefix}close_abort`]: true })
73:        metrics?.increment({ [`${metricPrefix}close_success`]: true })
76:        metrics?.increment({ [`${metricPrefix}close_error`]: true })
95:    metrics?.increment({ [`${metricPrefix}close`]: true })

packages/transport-websockets/src/listener.ts
94:              this.metrics?.errors.increment({ [`${this.addr} inbound_upgrade`]: true })
104:            this.metrics?.errors.increment({ [`${this.addr} inbound_closing_failed`]: true })
140:        this.metrics?.status.update({
147:      this.metrics?.errors.increment({ [`${this.addr} listen_error`]: true })
153:      this.metrics?.status.update({

This aligns pretty closely to TCP metrics now:

> rg '.increment|.update' packages/transport-tcp/src
packages/transport-tcp/src/index.ts
224:        this.metrics?.dialerEvents.increment({ error: true })
231:        this.metrics?.dialerEvents.increment({ timeout: true })
240:        this.metrics?.dialerEvents.increment({ connect: true })
246:        this.metrics?.dialerEvents.increment({ abort: true })

packages/transport-tcp/src/socket-to-conn.ts
64:    metrics?.increment({ [`${metricPrefix}timeout`]: true })
79:    metrics?.increment({ [`${metricPrefix}close`]: true })
93:    metrics?.increment({ [`${metricPrefix}end`]: true })

packages/transport-tcp/src/listener.ts
154:          this.metrics?.status.update({
162:        this.metrics?.errors.increment({ [`${this.addr} listen_error`]: true })
166:        this.metrics?.status.update({
187:      this.metrics?.events.increment({ [`${this.addr} error`]: true })
202:      this.metrics?.errors.increment({ [`${this.addr} inbound_to_connection`]: true })
251:          this.metrics?.errors.increment({ [`${this.addr} inbound_upgrade`]: true })
268:          this.metrics?.errors.increment({ [`${this.addr} inbound_closing_failed`]: true })

Fixes #1915

Notes & open questions

Callouts

  • index.ts
    • There is no timeout metric because we don't have a socket timeout event already.
  • socket-to-conn.ts
    • There is no timeout metric for websockets because we don't have a socket timeout event already.
    • No end metric
    • Some metrics are prefixed with close_ because we don't have general abort/close/end handlers already.
  • listener.ts
    • No inbound_to_connection metric for websockets because socketToMaConn doesn't throw.
    • No events.increment({ [${this.addr} error] for websockets because there is no existing socket.on('error' handler.

TODO based on comments:

Some questions:

  • Do we want less events? These seem like a lot but I wanted to throw something together.
  • Do we want more listener events in the server.ts instead of in listener.ts ?
  • Is the socket_close_success in socket-to-conn.ts called even after an abort event? we may want to move it if so.
  • Do we want an error event for WS listener error events? i.e. in this.server.on('error', (err: Error) => {
  • For the listener, a metricPrefix is added which is currently set to ${listeningAddrDetails?.transport}_${listeningAddrDetails?.host}:${listeningAddrDetails?.port} because WebSocket listener may be listening on /tls/ws or /wss or /ws -- Do we need .transport ?
  • Is PR comments clarifying metrics and usage enough or is there another place I need to change? https:/libp2p/js-libp2p/blob/main/doc/METRICS.md doesn't list out metrics and explanations, so I figured this PR would be a good placeholder at the least.
  • Do we need to add any tests? I didn't see any in TCP: https:/search?q=repo%3Alibp2p%2Fjs-libp2p+path%3A%2F%5Epackages%5C%2Ftransport-tcp%5C%2Ftest%5C%2F%2F+metric&type=code

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation if necessary (this includes comments as well)
  • I have added tests that prove my fix is effective or that my feature works

@SgtPooki SgtPooki requested a review from a team as a code owner August 5, 2024 21:05
@SgtPooki SgtPooki self-assigned this Aug 5, 2024
@SgtPooki
Copy link
Member Author

SgtPooki commented Aug 6, 2024

seems like webrtc firefox test is failing: https:/libp2p/js-libp2p/actions/runs/10256017997/job/28381208851?pr=2649#step:5:2369, which shouldn't be related to these changes at all

@achingbrain
Copy link
Member

The Firefox thing is documented here - #2642 - unrelated to this PR, as you say.

@achingbrain
Copy link
Member

I think the various "start" events might be unnecessary? That is, we care about successes vs error/abort, the sum of which should be equal to the starts, so we can derive that metric if we need it but it doesn't seem particularly useful on its own?

For the maconn metrics, perhaps we should follow the pattern established by @libp2p/tcp - the metric names have a prefix which is the listening address for the listener or omitted (e.g. defaults to an empty string) for the dialer.

A good starting point apart from that, I think.

Copy link
Member

@achingbrain achingbrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments above

@SgtPooki
Copy link
Member Author

SgtPooki commented Aug 9, 2024

I think the various "start" events might be unnecessary? That is, we care about successes vs error/abort, the sum of which should be equal to the starts, so we can derive that metric if we need it but it doesn't seem particularly useful on its own?

The main thing I was thinking here was that with https:/libp2p/js-libp2p-amino-dht-bootstrapper, I ran into some issues where dials were completely failing and being closed early and I couldn't find them in the logs or events. If we don't have an event for first-touch on the receiving end, we don't have an easy way to confirm that the error occurred at the networking/docker/etc layer above js-libp2p.

On the other hand, if we do have a start event and nothing else (totally possible if OS drops that socket for whatever reason), we can investigate there further.. e.g. A lot of starts and no other events for the transport could indicate instability at the host layer.

If we have ZERO starts, but we know we tried to dial it.. we know libp2p never got the connection.

These may not be enough of a reason to emit all these metrics all the time, but that's what was in my mind while creating this PR.

I can remove the various "starts" if you feel like they're still unnecessary. let me know!


I will add a metricsPrefix and other related metrics changes to socketToMaConn to more closely match https:/libp2p/js-libp2p/blob/0edbfe7af1ccf4bd23dd78b2bcc29ecf54ea02eb/packages/transport-tcp/src/socket-to-conn.ts

@achingbrain
Copy link
Member

achingbrain commented Aug 9, 2024

The main thing I was thinking here was that with libp2p/js-libp2p-amino-dht-bootstrapper, I ran into some issues where dials were completely failing and being closed early and I couldn't find them in the logs or events. If we don't have an event for first-touch on the receiving end, we don't have an easy way to confirm that the error occurred at the networking/docker/etc layer above js-libp2p.

This sounds a bit weird. If our handler isn't being invoked but the connection is accepted by the server we should see the number of active handles increase. If the connection succeeds we should see the "success" metric increase, if it errors or times out we should see the "error" or "aborted" metrics increase. There shouldn't be any other scenarios?

If connections are not making it to the app it's probably a misconfiguration of ports in the docker file?

I'd say take the "start"s out for now, we can always add them later if they become necessary.

remove:
* upgrade_start
* socket_open_start
* maconn_open_start
* maconn_close_start

add:
* socket_open_error
@SgtPooki
Copy link
Member Author

@achingbrain starts removed in 05bb4dd (#2649).

NOTE: added socket_open_error. updating PR description now

added metricPrefix to `socket-to-conn.ts` and set in `listener.ts`

remove:
* maconn_abort - would have resulted in duplicate of `maconn_close_abort`
* maconn_open_success - No logic actually happens during maconn creation.. should be build/compile/run error and not a metric.

add: N/A

change:
* socket_close_success -> maconn_socket_close_success
Copy link
Member Author

@SgtPooki SgtPooki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self review

packages/transport-websockets/src/index.ts Outdated Show resolved Hide resolved
packages/transport-websockets/src/index.ts Outdated Show resolved Hide resolved
packages/transport-websockets/src/listener.ts Outdated Show resolved Hide resolved

await maConn.close().catch(err => {
this.log.error('inbound connection failed to close after upgrade failed', err)
})
})
metrics?.increment({ socket_open_success: true })
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could fire after the .catch handler. should remove

@@ -91,6 +100,7 @@ export function socketToMaConn (stream: DuplexWebSocket, remoteAddr: Multiaddr,
if (maConn.timeline.close == null) {
maConn.timeline.close = Date.now()
}
metrics?.increment({ [`${metricPrefix}maconn_socket_close_success`]: true })
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we may not want this one? but based on the comment on lines 97-99, it seems like it could be useful

@@ -57,6 +63,7 @@ export function socketToMaConn (stream: DuplexWebSocket, remoteAddr: Multiaddr,
const { host, port } = maConn.remoteAddr.toOptions()
log('timeout closing stream to %s:%s after %dms, destroying it manually',
host, port, Date.now() - start)
metrics?.increment({ [`${metricPrefix}maconn_close_abort`]: true })
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the this.abort(err), that calls stream.destroy will trigger this.. i'm trying to avoid duplicate abort and error events but might have missed something

packages/transport-websockets/src/listener.ts Outdated Show resolved Hide resolved
packages/transport-websockets/src/listener.ts Outdated Show resolved Hide resolved
Comment on lines 143 to 144
const conn = await options.upgrader.upgradeOutbound(maConn, options)
this.metrics?.dialerEvents.increment({ upgrade_success: true })
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need an upgrade_error somewhere?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, makes sense.

@SgtPooki SgtPooki mentioned this pull request Aug 14, 2024
1 task
@SgtPooki
Copy link
Member Author

i'm currently in the process of updating this PR to much more closely resemble TCP metrics.. including adding a status metric

@SgtPooki
Copy link
Member Author

FYI, take a look at the callouts in PR description and let me know if you want me to add any event handlers. it looks like this transport hasn't been changed much since the migration to typescript, so maybe it could use an overhaul, but I wanted to keep things limited to only adding metrics

Copy link
Member Author

@SgtPooki SgtPooki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self review. single callout comment about the WebSocketListenerStatusCode enum

events: CounterGroup
}

enum WebSocketListenerStatusCode {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I removed the paused status that the TCP transport had because it didn't seem to make sense here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Listener status was added to the TCP transport as it can shut itself down when under attack and restart later on. This feature isn't in the WebSocket transport so I think we don't need this right now.

We may add it later I guess but let's do that when we do that.

@achingbrain
Copy link
Member

Do we want more listener events in the server.ts instead of in listener.ts ?

I think if we follow TCP as a model then no. We can always add more later.

Is the socket_close_success in socket-to-conn.ts called even after an abort event? we may want to move it if so.

We also listen for the close event emitted by the socket and record a metric, so socket_close_success seems redundant.

Is PR comments clarifying metrics and usage enough or is there another place I need to change? main/doc/METRICS.md doesn't list out metrics and explanations, so I figured this PR would be a good placeholder at the least.

Seems fine. METRICS.md purposefully doesn't list out every single metric as it would almost certainly get out of date really quickly. Maybe it could list some core ones to get people started but I shudder when I think of the maintenance.

Do we need to add any tests?

For completeness probably, and they may be useful in answering questions like "Is the socket_close_success in socket-to-conn.ts called even after an abort event?" but I would be fine without them.


I've pushed a couple of commits, I'd like to get this in.

I removed the socket status as per #2649 (comment)

I also remove the close_abort and close_error events in favour of listening for the close event (error is a bit more complicated as the WebSocket implementation from ws doesn't function the same as node Duplex streams).

The idea is, we listen for events on the socket and log metrics when they occur (same as the TCP transport).

This is a bit different to the in initial PR, which logged metrics after calling methods on the socket. That is, the socket is closed when the close event is emitted, not when we call methods that cause it to close.

Without only relying solely on the events for logging metrics we could end up masking problems caused if we misuse the stream API for whatever reason.

@achingbrain achingbrain merged commit 1dfb74e into main Aug 15, 2024
24 checks passed
@achingbrain achingbrain deleted the feat/ws-metrics branch August 15, 2024 09:07
@achingbrain achingbrain mentioned this pull request Aug 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: Expose Metrics for Websockets
2 participants