CO-RE and SubprocessMonitor #44
If you already familiar with CO-RE, you can skip this section.
A BPF program is a piece of user-provided code which is injected straight into a kernel. Inevitably, a dependency on kernel data structures will be introduced. Unfortunately, the data structure of the kernel is not set in stone.
In such cases, how do you make sure you are not reading garbage data when some kernel added an extra field before the field you thought is, say, at offset 16 from the start of
struct task_struct
? Suddenly, for that kernel, you'll need to read data from, e.g., offset 24. And the problems don't end there: what if a field got renamed, as was the case withthread_struct
'sfs
field (useful for accessing thread-local storage), which got renamed tofsbase
between 4.6 and 4.7 kernels.
All this means that you can no longer compile your BPF program locally using kernel headers of your dev server and distribute it in compiled form to other systems, while expecting it to work and produce correct results. This is because kernel headers for different kernel versions will specify a different memory layout of data your program relies on.
One possible solution is to compile the BPF program in real-time on the actual running machine. This is what BCC did and our BccTracer
is based on it. However, this approach has a few drawbacks:
- Clang/LLVM combo is a big library, resulting in big fat binaries that need to be distributed with your application.(We do suffer from this problem.)
- Clang/LLVM combo is resource-heavy, so when you are compiling BPF code at start up, you'll use a significant amount of resources, potentially tipping over a carefully balanced production workfload. And vice versa, on a busy host, compiling a small BPF program might take minutes in some cases.(Yes, it's slowing down our startup and causes a large performance impact when popping containers at scale.)
- You are making a big bet that the target system will have kernel headers present, which most of the time is not a problem, but sometimes can cause a lot of headaches. (Yes, see how-to/run-with-docker)
- BPF program testing and development iteration is quite painful as well. (Yes, we haven't found a good way to test BPF programs:#47.)
BPF CO-RE, by making the kernel self-contained information, together with a BPF loader, allows BPF programs to be compiled offline, and then loaded into the kernel without the need to recompile them on the target system. This is a huge improvement over the status quo.
BPF CO-RE requires the following parts:
- BTF type information, which allows to capture crucial pieces of information about kernel and BPF program types and code, enabling all the other parts of BPF CO-RE puzzle;
- BPF loader (libbpf) ties BTFs from kernel and BPF program together to adjust compiled BPF code to specific kernel on target hosts;
- kernel, while staying completely BPF CO-RE-agnostic, provides advanced BPF features to enable some of the more advanced scenarios.
There are three language have its own CO-RE implementation:
- C: libbpf
- Rust: aya-rs
- Go: cilium/ebpf, it transforms C code to Go code and then compile it.
Feature | Kernel version | Commit |
---|---|---|
BPF Type Format (BTF) | 4.18 | 69b693f0aefa |
Note:
- If one
SubprocessTracer
not based onCO-RE
, it will not require BTF.
SubprocessMonitor
will provide support for subprocess-based tracer.Its primary goal is to support the CO-RE program. This is because CO-RE programs are often written through the underlying language(C/Rust). It's hard to write a CO-RE program in Python(CPython may support, but it's not a good idea). And It's hard for C/Rust to write a duetector
-like program. So SubprocessMonitor
will provide a way to write a CO-RE program in C/Rust and use it in Python. Instead of writing cumbersome programs in userspace, C/Rust users simply communicate with SubprocessMonitor
through a protocol.
Furthermore, SubprocessMonitor
also support non-BPF program, as long as it can be run in a subprocess and obey the protocol.
This is a very early experimental Protocol with a high probability of extreme variability, and at the same time, many of the fields are undefined
{
"proto": "duetector", // or opentelemetry if type is event
"type": "" // init, event, stop, stopped
"version": "0.1.0", // protocol version
"payload":{}
}
type: init
payload:
poll_time
: unit: secondskill_timeout
: unit, secondsconfig
: object, config of tracer
type: init
payload:
name
: string, name of tracerversion
: string, version of tracer
type: event
payload:
type: event
payload: object of tracking data or OpenTelemetry OTLP things
type: stop
payload:
type: stopped
payload:
After we intro OpenTelemetry in #25, we can support OpenTelemetry protocol to passthrough the Event
message.
- When a
SubprocessTracer
attatch
to theSubprocessHost
, host willPopen
a subprocess and sendinit
message to subprocess. - After the subprocess is started, subprocess will send a
init
message to host. - Subprocess will send
Event
message to host when it has something to report. - Host will poll the subprocess's stdout and stderr periodically and send
Event
message to subprocess. - When
detach
is called, host will send SIGTERM andstop
message to subprocess and wait for subprocess to exit, if subprocess doesn't exit inkill_timeout
seconds, host will send SIGKILL to subprocess. - When subprocess stops, it will send a
stopped
message to host.
In this case, subprocess becomes an orphan process. It will be adopted by init
process and continue to run. Subprocess will no longer receive Event
message from host. No message is received within twice or more poll time``, subprocess can be assumed that the host process has crashed. Subprocess will send a
stopped` message to host and exit.
In this case, host can check subprocess is alive or not. If subprocess is not alive, host should log an warming
for every poll.
Host can be configured to restart subprocess when subprocess crashes for a certain number of times.
- Reimplement
Tracer
in C/Rust/Go - Commit code to /CORE
- Implement
SubprocessTracer
induetector
Note:
- Need to add the corresponding Github Action to compile the code and upload the binary to Github Release.
- Need to change our docker image to include the binary.
- Support
SubprocessTracer
template inTracerManager