Statsd aggregation issue on some metrics with multiple schedulers running #26601

Open

vDMG opened this issue Sep 22, 2022 · 12 comments

@vDMG
Contributor

vDMG commented Sep 22, 2022

Apache Airflow version

Other Airflow 2 version

What happened

When I am running multiple schedulers (>1), the statsd exporter does not correctly sum the running tasks from the different schedulers; instead it just takes the value reported by one scheduler and exposes that as the metric.

The metric name: airflow_executor_running_tasks
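
For illustration, a minimal sketch of the behaviour (simplified, with a hypothetical host and values, not the actual Airflow code): every scheduler pushes the same gauge name with nothing to distinguish it, so the statsd exporter keeps whichever value arrived last instead of a sum.

import statsd

# both schedulers talk to the same statsd exporter (host name is a placeholder)
client = statsd.StatsClient("statsd-exporter", 8125, prefix="airflow")

# scheduler 1 reports the tasks it is running
client.gauge("executor.running_tasks", 7)

# scheduler 2 reports its own count a moment later; this overwrites the
# previous value, so airflow_executor_running_tasks ends up as 3, not 10
client.gauge("executor.running_tasks", 3)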

What you think should happen instead

The Airflow statsd exporter should sum the running tasks of all existing and running schedulers when calculating airflow_executor_running_tasks.

How to reproduce

Run an airflow cluster with 2 schedulers and with statsd enabled.

Create enough tasks so that they are balanced between the 2 schedulers. Using the UI, you can check that the displayed number of running tasks and the airflow_executor_running_tasks metric exposed by statsd do not match.

Operating System

Debian GNU/Linux 11 (bullseye)

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==3.4.0
apache-airflow-providers-celery==2.1.4
apache-airflow-providers-cncf-kubernetes==4.0.2
apache-airflow-providers-docker==2.7.0
apache-airflow-providers-elasticsearch==3.0.3
apache-airflow-providers-ftp==2.1.2
apache-airflow-providers-google==7.0.0
apache-airflow-providers-grpc==2.0.4
apache-airflow-providers-hashicorp==2.2.0
apache-airflow-providers-http==2.1.2
apache-airflow-providers-imap==2.2.3
apache-airflow-providers-microsoft-azure==3.9.0
apache-airflow-providers-mysql==2.2.3
apache-airflow-providers-odbc==2.0.4
apache-airflow-providers-postgres==4.1.0
apache-airflow-providers-redis==2.0.4
apache-airflow-providers-sendgrid==2.0.4
apache-airflow-providers-sftp==2.6.0
apache-airflow-providers-slack==4.2.3
apache-airflow-providers-sqlite==2.1.3
apache-airflow-providers-ssh==2.4.4

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@vDMG vDMG added the area:core and kind:bug labels Sep 22, 2022
@boring-cyborg

boring-cyborg bot commented Sep 22, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

@Yaro1
Contributor

Yaro1 commented Mar 12, 2023

Please assign this to me.

@Yaro1
Contributor

Yaro1 commented Mar 18, 2023

Well, in my opinion we have two options:

  1. sum the metric across all schedulers
  2. send the metric with some scheduler id so users can sum it themselves (roughly as sketched below)
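
A rough sketch of option 2, assuming we had some scheduler_id available (a hypothetical variable; Airflow does not currently expose one, which is exactly the open question discussed further down):

from airflow.stats import Stats

# scheduler_id is hypothetical: Airflow has no such identifier today
scheduler_id = "scheduler-1"
num_running_tasks = 7  # whatever this particular scheduler is running

# emit a per-scheduler series; users would then sum the
# per-scheduler series on the monitoring side
Stats.gauge(f"executor.running_tasks.{scheduler_id}", num_running_tasks)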

Also, strangely, I found a duplicate of executor.running_tasks: scheduler.tasks.running.
It looks like it should be deleted, because it is always zero:

num_tasks_in_executor = 0

Stats.gauge("scheduler.tasks.running", num_tasks_in_executor)

The value never changes, and these are the only two usages of the variable num_tasks_in_executor.

What do you think? @potiuk @eladkal

@Yaro1
Contributor

Yaro1 commented Mar 18, 2023

By option 1 I mean that we would have to change the scheduler heartbeat logic, because right now the metric is sent by each scheduler independently.

@Yaro1
Contributor

Yaro1 commented Mar 18, 2023

Alternatively, we could have a separate process that reads the heartbeat from the DB and sends the metric; that way seems cleaner and better to me.
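
A rough sketch of that idea, assuming the aggregate is derived from the metadata database (so it reflects running task instances in the DB, which is close to, but not exactly, the executor's internal view); this is illustrative only, not existing Airflow code:

from airflow.models import TaskInstance
from airflow.stats import Stats
from airflow.utils.session import create_session
from airflow.utils.state import State

def emit_cluster_wide_running_tasks():
    # a single side-car process could call this periodically and publish one
    # aggregated gauge for the whole cluster instead of one value per scheduler
    with create_session() as session:
        running = (
            session.query(TaskInstance)
            .filter(TaskInstance.state == State.RUNNING)
            .count()
        )
    Stats.gauge("executor.running_tasks", running)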

@eladkal
Contributor

eladkal commented Mar 19, 2023

From my perspective, each scheduler should have its own metric, as it is a separate process. Summing/aggregating might hide an underlying issue. To my knowledge, most "metric display/tracker" tools are able to aggregate on their side, so I'd prefer to provide the data as raw as we can. However, this is not my area of expertise, so let's hear what others think. cc @ferruzzi WDYT?

@potiuk
Member

potiuk commented Mar 19, 2023

  1. Would be complex and brittle, and goes against the "distributed" system architecture: we would have to elect a "leader" (if we want one of the schedulers to publish it) or add a new entity to publish such metrics. That sounds complicated and smells of ZooKeeper (https://zookeeper.apache.org/), which is something we definitely want to avoid.

  2. The problem is that currently we have no unique scheduler id that we can rely on. The schedulers are "identical" copies of the same process, and we neither enumerate them nor have another way of distinguishing them, at least so far.

Yes, option 2 is better, but the fact that we cannot distinguish the schedulers makes it difficult both to generate the metrics and, even more so, to aggregate them. The problem is that when we restart a scheduler, the id (probably) should not change, otherwise whatever aggregates the metrics will see it as "another" scheduler. This might or might not be a problem, depending on the metric.

We have a few options there:

  1. Use a UUID or similar, generated when the scheduler starts, as its name. There are a few UUID candidates (https://docs.python.org/3/library/uuid.html), but none of them guarantees that the id will remain unchanged after a restart (especially in a containerised environment). It could be random or hardware-based (or both).

  2. Similarly to workers, use the hostname callable to get a unique id for the scheduler when it starts. This is not fool-proof either, because the hostname/IP might also (and often does) change after a restart.

The problem is that option 2) and non-random UUID generation do not guarantee uniqueness if a user chooses to run several schedulers on the same hardware without containerisation: two schedulers running on the same machine will end up with the same id (unless randomness is involved). This is not a very likely scenario, but it is possible, because running multiple scheduler processes on the same hardware is generally a good idea (they will be faster overall, given Python's multi-processing/GIL limitations). Probably the UUID calculation and the hostname should also incorporate the port on which the scheduler exposes its healthcheck, or something like that.

  3. Add a flag to specify the scheduler ID when starting it (this could override 1) or 2) in cases where the ID is expected to remain the same for a specific deployment).

I think the metrics in question should be fine to aggregate if the UUID is randomly allocated (I have not looked at the ones involved, so this needs to be checked metric by metric). However, it will result in the UUIDs changing over time as schedulers run and get restarted. I am not sure how statsd monitoring/aggregation software will handle that; this is something to be investigated by whoever makes such a change. Option 3) might be a way to limit that impact, but at the expense of complicating deployment. Currently we add a new scheduler by just starting another scheduler process. If we also add an option to specify the scheduler id at start-up, our Helm Chart should likely be updated with that option, and documentation should be written to cover it.

Generally doable, but I just wanted to make everyone in the discussion aware that this is quite a bit more complex than "just send the metrics with a unique id". It would be simpler if we already had such an id, but we don't have it currently, and introducing one has implications.
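
For illustration, a rough sketch of how such an identifier could be derived under the options above (a hypothetical helper, not existing Airflow code; the healthcheck port argument is just a placeholder):

import socket
import uuid

def get_scheduler_instance_id(explicit_id=None, healthcheck_port=None):
    # option 3: an explicit id passed at start-up wins, for deployments that
    # need the identity to stay stable across restarts
    if explicit_id:
        return explicit_id
    # option 2: hostname plus the healthcheck port, so two schedulers running
    # on the same machine without containers still get distinct ids; note that
    # the hostname/IP can change after a restart
    host = socket.getfqdn()
    if host and healthcheck_port:
        return f"{host}:{healthcheck_port}"
    # option 1: fall back to a random UUID (unique, but not stable across restarts)
    return str(uuid.uuid4())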

@Yaro1
Contributor

Yaro1 commented Mar 19, 2023

Okay, I will investigate the second option.

@ferruzzi
Contributor

Thanks for the tag, @eladkal. Yeah, I agree with how the conversation has progressed for sure. Separate in name (and add the name to the metric tags) seems like it would be ideal. @potiuk's response about the difficulties in that definitely makes it sound like a challenge.

Since I mentioned taking on the OTel integration, I've had one or two people tell me that not all of the metrics listed on our docs page work, so I planned to double-check as many as I can once I have OTel working and update that page or the code as appropriate, but that's a bit of a longer-term goal at this point. Thanks for pointing this one out, @vDMG, and thanks for taking it on @Yaro1 👍

github-actions bot commented

This issue has been automatically marked as stale because it has been open for 365 days without any activity. There have been several Airflow releases since the last activity on this issue. Kindly recheck the report against the latest Airflow version and let us know if the issue is still reproducible. The issue will be closed in the next 30 days if no further activity occurs from the issue author.

github-actions bot commented

This issue has been closed because it has not received a response from the issue author.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Apr 21, 2024
@kaxil kaxil reopened this Aug 22, 2024
@kaxil
Member

kaxil commented Aug 22, 2024

@ferruzzi Worth looking at this again as part of the workstream where you guys are looking to audit all metrics
