Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[USDT] Unified Tracing API #10

Closed
Qard opened this issue Mar 20, 2015 · 18 comments
Closed

[USDT] Unified Tracing API #10

Qard opened this issue Mar 20, 2015 · 18 comments
Labels

Comments

@Qard
Copy link
Member

Qard commented Mar 20, 2015

I'm moving the discussion of unifying the native tracing systems to one API from nodejs/node#729 to here. @thekemkid and I are interested in seeing this go somewhere.

There seems to be rough consensus on a macro-based solution something like:

  • REPORT(int subsystem, int eventType, args...) to report events in, when enabled
  • INSTRUMENT(callback(int subsystem, int eventType, args...)) to listen for these events and process them in some way.

What remains unknown is:

  • Is there any prototyping effort underway?
  • What event types are people interested in?
  • What data do people hope to get in args?
@trevnorris
Copy link

Does this have to reported immediately, or can the results be reported asynchronously?

@Qard
Copy link
Member Author

Qard commented Mar 23, 2015

I think it should probably get to the INSTRUMENT(...) callback immediately, but what happens from there depends on what the module wants to do with that data.

Why? Do you have some particular ideas/concerns in mind?

@trevnorris
Copy link

@Qard Simply that the impl going into chromium stores data in a ring buffer and is read out from another thread periodically. so you can get statistics, but it's not immediate. I think we should properly define if the user should expect to receive results at the moment of execution, or at some point in the future.

BTW: the later is much better for performance.

@sam-github
Copy link

On Mon, Mar 23, 2015 at 3:53 PM, Trevor Norris [email protected]
wrote:

@Qard https:/Qard Simply that the impl going into chromium
stores data in a ring buffer and is read out from another thread
periodically. so you can get statistics, but it's not immediate. I think we
should properly define if the user should expect to receive results at the
moment of execution, or at some point in the future.

The chrome devs were pretty definite in their feedback about this: they'd
tried immediate availability, and found it worked badly.

@Qard
Copy link
Member Author

Qard commented Mar 24, 2015

For my uses, as long as the data has reasonably accurate timestamps, async should be fine. I'm not too familiar with dtrace/etw/lttng to know if those require sync events.

It might make sense for INSTRUMENT(...) to instead take a class instance that receives sync events and does some minor operations on them, but then defers the actual reporting part to a worker thread.

@trevnorris
Copy link

@Qard The timestamps are all recorded at the time they're recorded. It's that the data isn't aggregated/processed until some time in the future.

@billti
Copy link

billti commented Sep 1, 2015

Is there any progress happening on this effort? Some good tracing would be really helpful both to debugging Node itself, as well as potential for instrumenting user-side code.

I'm working on the Node Tools for Visual Studio and just hacked in some tracing to help troubleshoot a hairy debugger/networking race issue, and it was invaluable. The current tracing is quite limited and non-trivial to understand. I'm a big believer in good diagnostics though so would be happy to help make this something well supported in the product.

I'm quite familiar with ETW on Windows, but not so much on DTrace or the Linux options. I'd be happy to help out on a consistent cross-platform design and any Windows-specific implementation challenges.

I've seen a couple different threads on different repos for this. Let me know where this is active and if I can help. Thanks!

@Qard
Copy link
Member Author

Qard commented Sep 2, 2015

Unfortunately, no. Not a lot has happened on this. I would absolutely appreciate help getting this moving though. This is still a thing I care deeply about, but I only have so much time and ability to work on this alone. Perhaps we need to have another meeting to introduce the new folks such as yourself to the group and to facilitate some forward movement on these ideas?

@pmuellr
Copy link
Contributor

pmuellr commented Sep 2, 2015

I'm a newcomer here, lurking because I'm interested. Would be interested in another meeting.

Bill, wonder if you would be willing to do a little write-up on your hacked-in tracing for node/win. Assuming you use ETW, I'm curious to see what you did, how you consumed the trace data, what the trace providers/data/etc were, ...

@trevnorris
Copy link

IIRC @ofrobots said we could pull in the tracing API for experimentation. Don't remember the specifics though of how to start that process.

@billti
Copy link

billti commented Sep 2, 2015

@pmuellr Were you interested in just the implementation? Or an actual long form write-up of how to use ETW (perhaps specifically in the context of Node)?

For my diagnostics/experimentation, 2 commits may be of interest.

  • billti/node@3823c75 adds a way to call the generic EventWriteString method, exposed internally as DTrace_Msg. This wouldn't be how you'd add events for Node itself, but I could imagine exposing something like this for apps to use (e.g. process.trace or some such, wired up as the existing trace event calls are in InitDTrace). The events emitted will have the same ID, but the severity level and keyword (a bitmask) can be set when logging the event, and filtered on in the tooling.
  • billti/node@69d5522 adds a specific event to the manifest (logging when the debug agent is started and on what port), along with a description, data structure, keyword, level, etc... This is how you'd typically add "well known" events to Node itself. By adding them to the manifest you can efficiently pack data for events in binary form, and anyone can query on what events, keywords, etc... are available, and tools can query and format the log in a rich manner. (See Writing a manifest if interested in what is available here).

From what little I know about DTrace etc..., I believe similar capabilities exist for specifying levels, keywords/categories, etc... Within Node, the platform differences are abstracted away via header files or macros conditionally included at the top of node_dtrace.cc.

For my specific scenario (race condition between launching node from VS and attaching to the debug port) I just quickly used the generic tracing above to log the debug agent activities, added similar tracing on the Node Tools for VS side, and then captured a trace from both these providers plus the network driver using PerfView. As timing to the sub-millisecond may have been involved, the equivalent of console.log/printf across processes isn't granular or accurate enough to correlate. With this combined trace I could see exactly when the agent & libuv loop were started, when sockets were being opened/bound to, the ACK/RST packets at the TCP level, the debugger payload going back and forth over the connection, etc...

Sorry for running a little long on this thread. I've conflated both potential questions above (the implementation details in Node, and the tools used for analysis). Both are tricky to understand, but super powerful when used properly, and probably worth a blog post on their own.

@pmuellr
Copy link
Contributor

pmuellr commented Sep 2, 2015

@billti thanks for the info - no problem going long - nice to see a practical example

@ofrobots
Copy link
Contributor

ofrobots commented Sep 2, 2015

I posted some details about the trace-event API used by blink on this issue: #15

@joshgav
Copy link
Contributor

joshgav commented Feb 6, 2017

related: #83

@AndreasMadsen AndreasMadsen changed the title Unified Tracing API [USDT] Unified Tracing API Sep 22, 2017
@gireeshpunathil
Copy link
Member

should this issue remain open? [ I am trying to chase dormant issues to closure ] @Qard

@github-actions
Copy link

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

@github-actions github-actions bot added the stale label Jul 19, 2020
@Qard
Copy link
Member Author

Qard commented Jul 19, 2020

I think it's still relevant, though the code has changed a bunch since the discussion. Might be worth a bit of house cleaning discussion at the next diag meeting to identify which issues are still relevant and if there's any interest in working on them to justify leaving the issues open. 🤔

@mmarchini
Copy link
Contributor

I'm triaging all stale issues today, removing the label from those I think are still relevant, closing those I'm mostly sure aren't, and leaving the label on those which I'm not sure.

This one IMO has been incorporated in the Unified Tracing TSC Strategic Initiative, so it can be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

9 participants