Lifx integration with many devices frequently goes unavailable #78876

Closed
mspinolo opened this issue Sep 21, 2022 · 137 comments · Fixed by #91157

@mspinolo

The problem

Since the latest integration update, my LIFX lights have frequently been becoming "unavailable".
This happens with most (but not all) of them (I have 20+ lights).

The behavior is not consistent throughout the day, which makes me suspect some relation to the Wi-Fi environment (I have 3 APs broadcasting the same SSID on channels 1, 6 and 11). However, according to the AP logs, it doesn't seem to be related to LIFX devices disconnecting from one AP and reconnecting to another.

I also see that they usually become unavailable for 10 s and then come back online. I wonder whether this has something to do with the integration's polling cycle, since the integration's discovery interval is 10 (seconds?):

"""Const for LIFX."""

import logging

DOMAIN = "lifx"

TARGET_ANY = "00:00:00:00:00:00"

DISCOVERY_INTERVAL = 10
MESSAGE_TIMEOUT = 1.65
MESSAGE_RETRIES = 5
OVERALL_TIMEOUT = 9
UNAVAILABLE_GRACE = 90

so could it be that discovery, in my environment, simply can't keep the pace and drops connections?
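
As a rough worked check of the timing question above, the sketch below multiplies out the quoted constants. It assumes (this is not confirmed anywhere in the thread) that each retry waits the full MESSAGE_TIMEOUT before the next attempt and that OVERALL_TIMEOUT caps the whole request. `worst_case_request_seconds` is a hypothetical helper, not part of the integration.

```python
# Constants as quoted from the integration's const.py above.
DISCOVERY_INTERVAL = 10
MESSAGE_TIMEOUT = 1.65
MESSAGE_RETRIES = 5
OVERALL_TIMEOUT = 9


def worst_case_request_seconds(timeout: float, retries: int, overall: float) -> float:
    """Longest a single request can take if every attempt times out.

    Assumption: attempts run back to back, each waiting `timeout` seconds,
    and the total is capped at `overall` seconds.
    """
    return min(timeout * retries, overall)


worst = worst_case_request_seconds(MESSAGE_TIMEOUT, MESSAGE_RETRIES, OVERALL_TIMEOUT)
headroom = DISCOVERY_INTERVAL - worst
print(f"worst case per request: {worst:.2f}s, headroom: {headroom:.2f}s")
```

Under those assumptions, a request that exhausts every retry takes about 8.25 s, which leaves less than 2 s before the next 10 s cycle starts.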

What version of Home Assistant Core has the issue?

2022.9.5

What was the last working version of Home Assistant Core?

the one before LIFX integration update

What type of installation are you running?

Home Assistant OS

Integration causing the issue

LIFX

Link to integration documentation on our website

https://www.home-assistant.io/integrations/lifx/

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

@probot-home-assistant

lifx documentation
lifx source
(message by IssueLinks)

@probot-home-assistant

Hey there @bdraco, @Djelibeybi, mind taking a look at this issue as it has been labeled with an integration (lifx) you are listed as a code owner for? Thanks!
(message by CodeOwnersMention)

@Djelibeybi
Contributor

It's more likely that your bulbs have always been doing this, we're just better at reporting it now than before. Does Home Assistant consistently re-establish connectivity to each bulb? Are they responsive to automation and manual control?

@mspinolo
Author

In general, they do re-establish their connection to HA.

In the past I never had automation or responsiveness issues, but now they sometimes happen when an action fires while the state is unavailable.

I'm not sure whether something changed on the HA side (e.g. an increased amount of broadcast traffic) that made the situation worse recently.
I have quite a lot of Wi-Fi devices in a single LAN segment, which could be the issue (I need to segregate them into VLANs at some stage, but I can't find the time for this).
Airtime shouldn't be an issue, as the devices are split across 3 APs (20-25 each).

@Djelibeybi
Contributor

Djelibeybi commented Sep 21, 2022

There shouldn't have been any significant increase in the amount of traffic, but we are interacting with the bulbs more than before. If you don't use HomeKit, it may be worth integrating your bulbs using Home Assistant's HomeKit Controller integration instead, as that uses local push, instead of polling the bulbs every 10 seconds.

If you do use HomeKit, you still can by connecting them to Home Assistant first, then exporting them to HomeKit from HASS.

@mspinolo
Author

Yes, I read about it and it is on my to-do list: it should be a much better way of controlling the bulbs.
Unfortunately I have some Z strips which are not HomeKit compliant, so for those I believe I will have to stick with the LIFX integration.

When you say "we are interacting more with the bulbs", what are you referring to in detail?

@Mincka

Mincka commented Oct 4, 2022

I have also seen a LOT more of this kind of error message since I started using the new integration.
The LED strip frequently becomes unavailable and I have this in the logs:

2022-10-04 16:40:08.771 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 16:57:58.259 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 17:30:03.269 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 17:43:10.296 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 17:48:25.046 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 17:50:58.265 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 18:06:03.258 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 18:11:40.260 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 18:14:02.258 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 18:26:12.260 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 18:52:03.934 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 18:54:15.258 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 18:57:23.262 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 19:02:22.266 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 19:14:55.286 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 19:19:45.260 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data

In my case, the Wi-Fi coverage is weak in this room, but that was never an issue when using the LED strip.
Before muting the integration in the logs, I am going to try the HomeKit integration. I just need to improve my Bluetooth coverage first. Thanks for the suggestion.

@Djelibeybi
Contributor

@bdraco I wonder if we shouldn't make the timeout a little less noisy? Perhaps only report if the device hasn't recovered after some amount of time? Most of my timeouts recover within the next 10 second window, for example. Usually much quicker.

@bdraco
Member

bdraco commented Oct 4, 2022

Does increasing the timeout allow it to go through? If we suppress the log message, the device will still be marked unavailable and lead to questions about why.

@Djelibeybi
Contributor

Let me test and get back to you on that.

@Djelibeybi
Contributor

Djelibeybi commented Oct 5, 2022

I'm not getting any timeouts when running the latest dev that uses extended multizone messages, and considering the issue is with a strip, I'd like to see if this is still an issue once that code is in a stable release.

TL;DR: this may already be fixed in dev via #79444

@alystair

alystair commented Oct 9, 2022

I'm having a similar issue... I only have Z / Z2 strips...
image

They only came back online after an HA hardware reboot; as a novice end user, I had no way to force some sort of manual check.

Is there a way to update temporarily to the dev build, or maybe it's not worth the effort... is there an approximate ETA for when the potential fix will hit stable? (are we talking weeks / months)

@alystair

alystair commented Oct 9, 2022

One LIFX-Z failed to come back on HA (it still worked without issues via Alexa / the LIFX app). I removed the device from the HA list, then tried looking for it again via discovery. It was discovered under a weird name (serial number/MAC?) and not the one I had set ('Bath' in this case). Even after adding it, not only does it not show up in the device list, but the LIFX integration is no longer finding it... a hardware reboot did not bring it back. Advice would be appreciated.

@tankdeer

I am having the same issue as well. Reloading the integration tends to fix the issue, but it often reoccurs a short time later

@mspinolo
Author

I ran various tests, changing things on my network to reduce multicast traffic as much as I could, but no luck.
My aim was to make polling all my LIFX bulbs/strips feasible within 10 s, which I think is too short a turnaround time.

I will try to move some bulbs to HomeKit and see whether that improves the situation, which is quite annoying at the moment.

@Djelibeybi
Contributor

If you haven't disconnected your bulbs from the LIFX Cloud, that's another thing you should do to reduce the CPU load on the bulbs themselves. This assumes you don't use themes or schedules defined in the LIFX app, as those require cloud connectivity.

@mspinolo
Author

They are disconnected, all of them.
Everything worked well with no hiccups for 2 years before the integration update.

@Djelibeybi
Contributor

I'm not denying there is something going on with the way the integration currently does discovery, but it's proving extremely difficult to isolate or reproduce in a controlled environment. Especially considering discovery is improved for most folks.

@mspinolo
Author

I know it is not; please don’t take my comment the wrong way.
My feeling is that an unfortunate mix of polling frequencies and retries leads to this.

I previously had experience with a (probably) faulty LIFX bulb which disconnected and hung on every HA restart (with the previous version of the integration), I think due to the burst of multicast/polling traffic HA was sending.
I believe LIFX bulbs have very poor bandwidth and go bananas when “flooded”.

In my environment I don’t see the same intermittent disconnections for all bulbs: I see more for the ones with a weaker signal (still decent though, around -70 dB).
So I think it is a mix of Wi-Fi radio environment, positioning and number of bulbs.

This probably just shows that polling is not a robust way of communicating.
I’m not sure whether anything different can be done within the LIFX integration.

I have now migrated a few lights to HomeKit: let’s see if it gets better.

@melbs2

melbs2 commented Oct 17, 2022

I am having the same issue as well. 2+ years of ~99.9% uptime, and now I'm experiencing multiple long dropouts across my 20 bulbs every day.

@Djelibeybi
Contributor

Yeah, I have a hypothesis as to the cause of this, I just need some spare time to refactor things to see if it's valid or not. I'm hoping to get to it this weekend.

@bdraco changed the title from "Issue with Lifx integration" to "Lifx integration with many devices frequently goes unavailable" on Nov 23, 2022
@bdraco
Member

bdraco commented Nov 23, 2022

There is a thundering herd problem with the coordinators that causes all polling to be aligned at microsecond 0. It is fixed in 2022.12.x, which might help with this issue.

@bdraco
Member

bdraco commented Nov 23, 2022

The thundering herd fix at 0 microseconds
#82233
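
To illustrate the general idea, here is a minimal sketch of staggering poll times so that many pollers with the same interval do not all fire at the same instant. It is not the code from the fix referenced above; `jittered_delay` and its parameters are hypothetical names chosen for this example.

```python
import random


def jittered_delay(interval: float, max_jitter: float, rng: random.Random) -> float:
    """Return the delay until the next poll: the interval plus a random offset.

    Without the offset, every poller scheduled with the same interval wakes at
    the same instant and all requests hit the network at once.
    """
    return interval + rng.uniform(0.0, max_jitter)


rng = random.Random(0)  # seeded only so the example is repeatable
delays = [jittered_delay(10.0, 0.5, rng) for _ in range(20)]

# Every poller still runs roughly every 10 s, but they no longer share a start time.
print(sorted(round(d, 3) for d in delays)[:3])
```

The trade-off is a slightly irregular polling period in exchange for spreading the burst of requests over a short window.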

@Djelibeybi
Contributor

I've been trying to track this down for ages. I'm really glad you found the cause.

@mspinolo
Author

In case this helps: I migrated all my LIFX lights to the HomeKit Controller integration. Since then (3+ weeks ago) I have had zero disconnections.

@melbs2

melbs2 commented Nov 24, 2022

Thank you all for your input. I will hold out for 2022.12, as this issue is still persisting. I will use @mspinolo's suggestion if the issue remains after the update 🙏

@Djelibeybi
Contributor

@melbs2 I have some stuff I'm testing on top of @bdraco's fix for the thundering herd that is showing a lot of promise. There is still an issue with very old devices (like Beams or Tiles) but otherwise, I'm quite happy with the way my flock of 60 devices is behaving.

@bdraco
Member

bdraco commented Apr 5, 2023

It's more likely that your bulbs have always been doing this, we're just better at reporting it now than before.

Yes, they have always been doing this. LIFX hardware is flaky. Reporting intermittent dropouts is not helpful and the previous suppression was intentional.

Can we please reinstate something like UNAVAILABLE_GRACE to fix the 10s flip-flop regression?

That seems like a good idea as I don't think we are going to be able to come up with a software fix for a hardware issue.
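
A minimal sketch of what a grace period of this kind could look like: a device is only reported as unavailable once it has been failing for longer than the grace window. `AvailabilityTracker` is a hypothetical class written for illustration, not the integration's code; the 90-second value is the UNAVAILABLE_GRACE constant quoted from const.py earlier in this thread.

```python
from typing import Optional

UNAVAILABLE_GRACE = 90  # seconds, value quoted from const.py earlier in this thread


class AvailabilityTracker:
    """Hypothetical helper: report unavailable only after a sustained outage."""

    def __init__(self, grace: float = UNAVAILABLE_GRACE) -> None:
        self._grace = grace
        self._first_failure: Optional[float] = None

    def record(self, success: bool, now: float) -> bool:
        """Record one poll result and return whether the device counts as available."""
        if success:
            self._first_failure = None  # any reply ends the outage
            return True
        if self._first_failure is None:
            self._first_failure = now  # start of a possible outage
        return (now - self._first_failure) < self._grace


tracker = AvailabilityTracker()
print(tracker.record(True, now=0))     # True
print(tracker.record(False, now=10))   # True: one missed poll is inside the grace window
print(tracker.record(False, now=100))  # False: failing for 90 s
print(tracker.record(True, now=110))   # True: a reply resets the outage
```

With a 10-second polling cycle, a single missed reply never flips the state; only an outage lasting the whole window does.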

@Djelibeybi
Contributor

Djelibeybi commented Apr 5, 2023

LIFX bulbs use UDP exclusively thus there is no concept of connection and thus there is nothing to drop out. Discussion of hardware flakiness aside, UDP has no retransmission built-in, so it's best effort.

We just need to stop raising exceptions when it doesn't reply and either retry ourselves or log an "oh, shucks we tried" message.
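
A minimal sketch of the "retry ourselves, then log instead of raising" approach described above. `request_with_retries` and `send_once` are hypothetical names and do not correspond to the aiolifx API; the timeout and retry count are placeholder values.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

_LOGGER = logging.getLogger(__name__)


async def request_with_retries(
    send_once: Callable[[], Awaitable[bytes]],
    timeout: float,
    retries: int,
) -> Optional[bytes]:
    """Try a best-effort request several times; return None instead of raising.

    UDP gives no delivery guarantee, so a missing reply is treated as an
    expected outcome rather than an error.
    """
    for attempt in range(1, retries + 1):
        try:
            return await asyncio.wait_for(send_once(), timeout)
        except asyncio.TimeoutError:
            _LOGGER.debug("No reply on attempt %d of %d", attempt, retries)
    _LOGGER.info("Device did not reply after %d attempts", retries)
    return None


async def _demo() -> Optional[bytes]:
    calls = 0

    async def flaky_send() -> bytes:
        # Simulated device: the first two requests get no reply in time.
        nonlocal calls
        calls += 1
        if calls < 3:
            await asyncio.sleep(1.0)
        return b"state"

    return await request_with_retries(flaky_send, timeout=0.05, retries=5)


result = asyncio.run(_demo())
print(result)
```

The caller can then decide what a `None` result means, for example counting it towards a grace period rather than marking the device unavailable immediately.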

The other other option is replacing aiolifx with Photons which eliminates the issue completely but is way more complex to work with. I'm writing an API shim for the framework just to simplify the implementation.

bdraco added a commit that referenced this issue Apr 5, 2023
These devices sometimes flakey and generate a lot of noise
from drop outs since communication is UDP best-effort. We
should only mark them unavailable if its not a momentary blip

fixes #78876
@bdraco
Member

bdraco commented Apr 5, 2023

It's more likely that your bulbs have always been doing this, we're just better at reporting it now than before.

Yes, they have always been doing this. LIFX hardware is flaky. Reporting intermittent dropouts is not helpful and the previous suppression was intentional.

Can we please reinstate something like UNAVAILABLE_GRACE to fix the 10s flip-flop regression?

#90872

Needs some tests but out of time to do that right now

@amelchio
Contributor

amelchio commented Apr 5, 2023

The other other option is replacing aiolifx with Photons which eliminates the issue completely [...]

I believe this is an optimistic view which implies that you have not yet accepted that LIFX hardware is flaky.

Yes, LIFX hardware might work almost fine in perfect conditions. Those conditions include things like abandoning the first few LIFX generations, using an idle wifi, pointing antennas just right, placing bulbs outside of lampshades and keeping neighbors from popping corn.

Take a look at the LIFX firmware release notes, each release features "improved connectivity". I bet the next one will too.

@Djelibeybi
Contributor

I believe this is an optimistic view which implies that you have not yet accepted that LIFX hardware is flaky.

No, it takes the view that the framework written by the LIFX employee to power the LIFX Cloud is probably the best thing to use to manage LIFX devices at scale.

@amelchio
Contributor

amelchio commented Apr 5, 2023

Both of our statements can be true.

@Djelibeybi
Contributor

Djelibeybi commented Apr 6, 2023

Both of our statements can be true.

Both statements are opinions, so they don't have to be. 😏 But point taken. Photons just takes a whole different approach to almost any other library in any language, which makes it far more robust, but doesn't fit well with Home Assistant's device/entity POV. I've (mostly) created a shim layer that presents an aiolifx-like API to Home Assistant powered by Photons. And then I (mostly) created another one that does the same using Photons Interactor.

Edited to add that "mostly" actually means "got it to work sufficiently to provide a path to MVP" but not actual MVP state.

@MSIMaker

MSIMaker commented Apr 6, 2023

On a slight tangent here....but for my own interest. What router/modem do you have if you experience this issue?

I have the ASUS AX11000

I can see the lights disappear in my router monitor and then come back. I have some set as static ip and some dhcp....both react the same.....dropped for a few seconds and then come back.

This is without HA even running. I am wondering if router brand or a setting is making any difference here and the issue is not with HA at all.

@Djelibeybi
Contributor

I have a Ubiquiti setup and yes, the wifi on LIFX devices is ... interesting. I suspect (though can't prove) that the microcontrollers are quietly rebooting fairly often, which results in the high number of DHCP requests. Certainly on the linear multizone devices (Z, Beam, Lightstrip) they're rebooting without triggering a resync.

@alexruffell

alexruffell commented Apr 6, 2023

This may have nothing to do with this issue but a couple of years ago I had highly unstable LIFX lights that kept dropping off HA. It turned out to be a botched mDNS implementation (possibly also related to my bulbs being on a different VLAN) on my Unifi networking gear. Once they fixed that, the instability vanished overnight.

Djelibeybi pushed a commit to Djelibeybi/home-assistant-core that referenced this issue Apr 7, 2023
These devices sometimes flakey and generate a lot of noise
from drop outs since communication is UDP best-effort. We
should only mark them unavailable if its not a momentary blip

fixes home-assistant#78876
Djelibeybi pushed the same commit to Djelibeybi/home-assistant-core again on Apr 8, 2023 and three more times on Apr 9, 2023, each referencing this issue.
@bdraco
Member

bdraco commented Apr 10, 2023

Can we please reinstate something like UNAVAILABLE_GRACE to fix the 10s flip-flop regression?

#91157 uses the value for UNAVAILABLE_GRACE

@bdraco
Member

bdraco commented Apr 10, 2023

#91157 should be ready for testing now

@bdraco
Member

bdraco commented Apr 10, 2023

I'll open PRs to aiolifx to fix some of the underlying issues in the library which should improve reliability:

@amelchio
Contributor

Good catch @bdraco!

bdraco added a commit that referenced this issue Apr 13, 2023
@bdraco bdraco mentioned this issue Apr 13, 2023
20 tasks
@bdraco
Member

bdraco commented Apr 13, 2023

2023.4.4 has the new version of aiolifx with fixes in it so it would be nice to know if it improves the situation for anyone.

@bdraco
Member

bdraco commented Apr 13, 2023

Also, #91157, which is the bigger change, isn't in a public build yet.

@MSIMaker

Installing 2023.4.4 right now and clearing the logs. We shall see how it goes.

But as an aside, I removed my ASUS GT11000 router and set my Telstra Smart Modem back to router mode to let it manage my home network. The dropouts have almost stopped completely and HA is more stable than it has ever been. So I suspect that there are issues within that router, as well as some LIFX issues, which together contribute here.

The ASUS router is going back to ASUS under RMA and if they replace it, I will try it again. But for now the SM3 is working a treat.

@github-actions github-actions bot locked and limited conversation to collaborators May 16, 2023