
Vector agent loses/does not send all metrics when being offline #21410

Open
freak12techno opened this issue Oct 3, 2024 · 1 comment

Comments


freak12techno commented Oct 3, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We are planning to integrate Vector into one of our projects. The idea is an architecture with multiple servers, each running a Vector agent that sends data to a central "server" Vector, which in turn forwards it to Prometheus remote write. The problem is that if the machine a Vector agent runs on goes offline for some period of time (which is often the case for us), Vector loses some metrics.
This happens almost every time a machine running a Vector agent loses its internet access.
Example:
[screenshot: graph showing the gap in metrics]
Here I disabled WiFi on the laptop running my agent from 18:17 to 12:55, and there is a gap in all metrics between ~19:57 and ~12:57, so effectively almost all the metrics for that period were lost.

I also tried writing to Prometheus remote write directly from the agent instead of sending to the server Vector, and it yielded the same result, so the server Vector does not seem to be the issue. It looks like the Vector agent either stops collecting metrics once it is offline, or sends them in a way that Prometheus does not record.

This is critical for us, and we would like to know whether we have misconfigured something or whether there is a bug in the Vector agent that causes this.
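
For reference, the server-side ("aggregator") Vector in this topology is essentially a vector source feeding a prometheus_remote_write sink. The sketch below only illustrates the setup described above, not our actual config; the listen address, endpoint, and token are placeholders:

# Illustrative aggregator config implied by the architecture above (placeholders only).
[sources.agents]
type = "vector"                # receives events from the agents' vector sinks
address = "0.0.0.0:bbb"        # placeholder listen address/port

[sinks.prometheus]
type = "prometheus_remote_write"
inputs = ["agents"]
endpoint = "https://x.y.z.a:bbb/api/v1/write"      # placeholder endpoint
auth = { strategy = "bearer", token = "xxx" }      # placeholder token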

Configuration

timezone = "UTC"

[sources.node_metrics]
type = "host_metrics"
namespace = "node"
scrape_interval_secs = 5

[sources.internal_metrics]
type = "internal_metrics"

[transforms.add_serial_metrics]
type = "remap"
inputs = ["node_metrics", "internal_metrics"]
source = """
.tags.host = get_env_var!("SERIAL")
.tags.hardware = "sierra"
"""


[sinks.vector_metrics]
type = "vector"
healthcheck.enabled = false
inputs = [ "add_serial_metrics" ]
address = "x.y.z.a:bbb"
compression = true
buffer.type = "disk"
buffer.max_size = 268435488
request.retry_max_duration_secs = 60

# Also tried this, to write to Prometheus directly, same result.
#[sinks.vector_metrics2]
#type = "prometheus_remote_write"
#healthcheck.enabled = false
#inputs = [ "add_serial_metrics" ]
#endpoint = "https://x.y.z.a:bbb/api/v1/write"
#auth = { strategy = "bearer", token = "xxx" }
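
For context, below is a sketch of the same sink with the buffering/acknowledgement options that seem most relevant to surviving offline periods. This is illustrative only; I have not confirmed whether these settings change the behavior described above.

[sinks.vector_metrics_durable]       # hypothetical component name, for illustration
type = "vector"
inputs = [ "add_serial_metrics" ]
address = "x.y.z.a:bbb"
compression = true
buffer.type = "disk"
buffer.max_size = 268435488          # same size as above
buffer.when_full = "block"           # apply back-pressure instead of dropping events
acknowledgements.enabled = true      # request end-to-end acknowledgements for delivered events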

Version

FROM timberio/vector:0.41.1-alpine

Debug Output

Once the internet is out, here is what happens in the logs (a lot of repeated messages like this):

2024-10-02T15:50:14.719369Z  WARN sink{component_kind="sink" component_id=vector_metrics component_type=vector}:request{request_id=489}: vector::sinks::util::retries: Retrying after error. error=Request failed: status: Unavailable, message: "error trying to connect: tcp connect error: Connection refused (os error 111)", details: [], metadata: MetadataMap { headers: {} } internal_log_rate_limit=true
2024-10-02T15:50:14.719382Z DEBUG sink{component_kind="sink" component_id=vector_metrics component_type=vector}:request{request_id=489}: vector::sinks::util::retries: Retrying request. delay_ms=40657
2024-10-02T15:50:15.365873Z DEBUG transform{component_kind="transform" component_id=add_serial_metrics component_type=remap}: vector::utilization: utilization=0.0003001317441953306
2024-10-02T15:50:24.831105Z DEBUG sink{component_kind="sink" component_id=vector_metrics component_type=vector}:request{request_id=398}: hyper::client::connect::http: connecting to x.y.z.a:bbb
2024-10-02T15:50:24.833180Z  WARN sink{component_kind="sink" component_id=vector_metrics component_type=vector}:request{request_id=398}: vector::sinks::util::retries: Retrying after error. error=Request failed: status: Unavailable, message: "error trying to connect: tcp connect error: Connection refused (os error 111)", details: [], metadata: MetadataMap { headers: {} } internal_log_rate_limit=true
2024-10-02T15:50:24.833240Z DEBUG sink{component_kind="sink" component_id=vector_metrics component_type=vector}:request{request_id=398}: vector::sinks::util::retries: Retrying request. delay_ms=7551
2024-10-02T15:50:25.361765Z DEBUG transform{component_kind="transform" component_id=add_serial_metrics component_type=remap}: vector::utilization: utilization=0.00026859163336423795
2024-10-02T15:50:30.367008Z DEBUG transform{component_kind="transform" component_id=add_serial_metrics component_type=remap}: vector::utilization: utilization=0.0002844311638970257
2024-10-02T15:50:32.390641Z DEBUG sink{component_kind="sink" component_id=vector_metrics component_type=vector}:request{request_id=398}: hyper::client::connect::http: connecting to x.y.z.a:bbb
2024-10-02T15:50:32.392662Z  WARN sink{component_kind="sink" component_id=vector_metrics component_type=vector}:request{request_id=398}: vector::sinks::util::retries: Internal log [Retrying after error.] is being suppressed to avoid flooding.

Then, once the machine is back online, there are a lot of repeated messages like these: https://gist.github.com/freak12techno/a79d04e226d7e33819162a6da76cb144

Example Data

No response

Additional Context

No response

References

No response

freak12techno added the type: bug (A code related bug.) label on Oct 3, 2024

jszwedko (Member) commented Oct 3, 2024

Hi @freak12techno ,

Thanks for filing this issue. I'm a little confused, though: when Vector is offline, it won't be able to collect metrics via the host_metrics and internal_metrics sources, as both of those sources are "realtime", so I think what you are seeing is expected behavior. Am I missing something? 🤔
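
If it helps to narrow this down, one way to check whether the agent is still producing events while the network is down is to tee the same transform into a console sink and watch the agent's stdout during an outage (a minimal sketch; the component name is arbitrary):

[sinks.debug_stdout]
type = "console"
inputs = [ "add_serial_metrics" ]
encoding.codec = "json"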
