Spool to disk causes Beats to hang if file is already locked #10653

Closed
JohnLyman opened this issue Feb 8, 2019 · 8 comments
Labels: bug, good first issue, libbeat, needs_backport, Stalled, Team:Elastic-Agent-Data-Plane

Comments

@JohnLyman

JohnLyman commented Feb 8, 2019

If a currently running Beat has a spool file configured (for example, 'queue.spool.file' => {}) and a config test is run against a configuration that specifies the same spool file, the config test hangs indefinitely.

I think this is due to the original process having an exclusive lock on the spool file. Here is the relevant strace output:

[pid 2218] lstat("/var/lib/filebeat/spool.dat", {st_mode=S_IFREG|0600, st_size=101916672, ...}) = 0
[pid 2218] openat(AT_FDCWD, "/var/lib/filebeat/spool.dat", O_RDWR|O_CREAT|O_CLOEXEC, 0600) = 5
[pid 2218] epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2808401664, u64=140491188653824}}) = -1 EPERM (Operation not permitted)
[pid 2218] epoll_ctl(4, EPOLL_CTL_DEL, 5, 0xc4204f101c) = -1 EPERM (Operation not permitted)
[pid 2218] flock(5, LOCK_EX <unfinished ...>
[pid 2219] <... pselect6 resumed> )    = 0 (Timeout)
[pid 2219] epoll_pwait(4, [], 128, 0, NULL, 6810877) = 0
[pid 2219] pselect6(0, NULL, NULL, NULL, {0, 10000000}, NULL) = 0 (Timeout)
[pid 2219] epoll_pwait(4, [], 128, 0, NULL, 6810877) = 0
[pid 2219] futex(0xc420096d48, FUTEX_WAKE, 1) = 1
[pid 2230] <... futex resumed> )       = 0
[pid 2219] pselect6(0, NULL, NULL, NULL, {0, 20000}, NULL <unfinished ...>
[pid 2230] futex(0xc420096d48, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 2219] <... pselect6 resumed> )    = 0 (Timeout)
[pid 2219] futex(0x20ee650, FUTEX_WAIT, 0, {60, 0} <unfinished ...>
[pid 2221] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 2221] futex(0x20ee650, FUTEX_WAKE, 1) = 1
[pid 2219] <... futex resumed> )       = 0
[pid 2221] futex(0xc420096d48, FUTEX_WAKE, 1 <unfinished ...>
[pid 2230] <... futex resumed> )       = 0
...

I am using filebeat & metricbeat 6.6.0 on CentOS 7.

I was not able to test the version that fixes a similar issue in #9874.
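The `flock(5, LOCK_EX <unfinished ...>` line in the strace output above is a blocking lock request that never returns while another process holds the lock. The following is a minimal Python sketch of that behavior, not Beats code: it uses a temporary file in place of `/var/lib/filebeat/spool.dat`, and a second file descriptor in the same process stands in for the second Beats process (on Linux, separate `open()` calls get separate lock owners under `flock`). The helper names `try_exclusive_lock` and `demo` are made up for illustration.

```python
import fcntl
import os
import tempfile


def try_exclusive_lock(fd):
    """Return True if an exclusive flock was obtained without blocking."""
    try:
        # LOCK_NB makes flock fail immediately instead of waiting forever.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False


def demo():
    """Show that a second opener cannot lock the file while the first holds it."""
    path = os.path.join(tempfile.mkdtemp(), "spool.dat")

    # Stands in for the running Beat: open the file and take the lock.
    holder = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    fcntl.flock(holder, fcntl.LOCK_EX)

    # Stands in for the config test: a plain LOCK_EX here would block,
    # so probe with the non-blocking variant instead.
    contender = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    blocked = not try_exclusive_lock(contender)

    # Once the holder releases the lock, the contender can take it.
    fcntl.flock(holder, fcntl.LOCK_UN)
    acquired_after_release = try_exclusive_lock(contender)

    os.close(contender)
    os.close(holder)
    return blocked, acquired_after_release
```

Without `LOCK_NB`, the second `flock` call would wait until the holder releases the lock, which matches the hang described in this report.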

@urso

urso commented Apr 17, 2019

The file must not be accessed by 2 processes at the same time. This is why the file gets locked. Otherwise we'd risk data corruption.

File locking is no bug, but a safety mechanism for users to not break the spool.

@urso urso closed this as completed Apr 17, 2019
@JohnLyman
Author

I'm surprised this was closed so dismissively without further discussion. I agree, file locking is not a bug, but having a test config option that does not work in certain cases is definitely a bug.

It's not unreasonable for a user to expect test config against a perfectly valid configuration to return gracefully and without errors. It's also not unreasonable to expect the behavior of test config to be the same whether beats is running or not.

I would be perfectly happy if the test just ignored the queue.spool setting with a warning. Even timing out with an error if it fails to obtain the lock would be better than the current behavior. At a bare minimum, the documentation should state that the command will hang indefinitely on valid configurations that include a configured spool file.

@urso

urso commented Apr 30, 2019

> I would be perfectly happy if the test just ignored the queue.spool setting with a warning

The way test config works it's currently not possible to ignore queue.spool.

> Even timing out with an error if it fails to obtain the lock would be better than the current behavior.

Agreed, we should not block but quit with an error.
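One way to "not block but quit with an error" is to poll with a non-blocking lock request until a deadline passes. The sketch below illustrates that idea in Python under stated assumptions; it is not the Beats implementation, and `SpoolLockedError`, `open_spool_locked`, and the timeout values are hypothetical names and defaults chosen for the example.

```python
import fcntl
import os
import tempfile
import time


class SpoolLockedError(RuntimeError):
    """Raised when the spool file is still locked after the deadline."""


def open_spool_locked(path, timeout=1.0, poll_interval=0.05):
    """Open path and take an exclusive flock, giving up after `timeout` seconds."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Non-blocking request: raises BlockingIOError if the lock is held.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            if time.monotonic() >= deadline:
                os.close(fd)
                raise SpoolLockedError(
                    f"{path} is still locked after {timeout}s"
                ) from None
            time.sleep(poll_interval)
```

A caller would catch `SpoolLockedError` and exit with a non-zero status and a readable message instead of hanging. Polling is only one option; the choice of timeout and interval here is arbitrary.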

@urso urso reopened this Apr 30, 2019
@urso urso changed the title from "Spool to disk causes test config to hang" to "Spool to disk causes Beats to hang if file is already locked" Apr 30, 2019
@urso urso added the good first issue label Apr 30, 2019
@dmlary

dmlary commented Aug 5, 2019

Is there any workaround for this? I want to run setup without stopping auditbeat first. I've tried all manner of overrides (-E queue.spool=nil, -E queue.spool=false) without any luck.

@andrewkroh
Member

I use -E queue.spool.enabled=false in my Ansible playbooks when I want to run test config. This might work for setup too.
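For automation outside Ansible, the same override can be wrapped in a small script that also enforces a deadline, so a hang shows up as a failure rather than a stuck job. This is a rough Python sketch, not an official tool: the helper names, the config path, and the timeout are assumptions, and whether the override avoids the lock on a given Beats version is taken from the comment above rather than verified here.

```python
import subprocess
import sys


def build_test_config_cmd(beat, config_path):
    """Build the 'test config' command line with the spool queue disabled."""
    return [
        beat, "test", "config",
        "-c", config_path,
        "-E", "queue.spool.enabled=false",  # override suggested in this thread
    ]


def run_with_deadline(cmd, timeout):
    """Run cmd and return its exit code, or None if it exceeded the timeout."""
    try:
        return subprocess.run(cmd, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging.
        return None


if __name__ == "__main__":
    # Hypothetical invocation; adjust the beat name and path for your host.
    cmd = build_test_config_cmd("filebeat", "/etc/filebeat/filebeat.yml")
    code = run_with_deadline(cmd, timeout=30)
    sys.exit(1 if code is None else code)
```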

@urso urso mentioned this issue Dec 11, 2019
39 tasks
@urso urso added the needs_backport label Jan 6, 2020
@botelastic

botelastic bot commented Jan 27, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@botelastic botelastic bot added the Stalled and needs_team labels Jan 27, 2022
@mtojek mtojek added the Team:Elastic-Agent-Data-Plane label Jan 28, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team and Stalled labels Jan 28, 2022
@botelastic

botelastic bot commented Jul 7, 2023

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:.
Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Jul 7, 2023
@botelastic botelastic bot closed this as completed Jan 3, 2024