NRZ-2018-107: Very unstable config network #242

Closed
Informatic opened this issue Sep 1, 2018 · 58 comments

@Informatic
Contributor

Informatic commented Sep 1, 2018

Hey!

We've been working on getting the luftdaten project going in Warsaw, Poland. So far we've run into some pretty bad problems with initial configuration: wifi clients connecting to the Feinstaubsensor-... network get randomly disconnected, and when they finally do connect, we see an average of ~40-60% packet loss to the ESP8266 IP, and the web interface is (obviously) very unstable (i.e. it keeps loading endlessly in browsers, needs multiple refreshes, etc.). I've looked at the debug serial output and the sensor is not rebooting or crashing.

Once initial configuration is done, the sensors work perfectly fine, and we don't see any packet loss when pinging or accessing the web interface on the local network.

We've tested it on multiple different ESP8266 boards (4x NodeMCU Lolin v3 from the same batch, one random Wemos D1 Mini) and they all behave very similarly. On the other hand, software based on Sming/plain Espressif SDK seems to work fine on these, so this doesn't sound like a problem with the boards themselves, does it? (I've done some ESP8266 development myself before.)

We've tested it on:

  • Random Windows 7 laptop
  • Random consumer Lenovo Windows 10 laptop
  • Thinkpad x260 running (K)Ubuntu 16.04
  • Motorola Moto G5S running Android 7.1.1

This sounds pretty similar to issues stated in #88

@ricki-z
Member

ricki-z commented Sep 1, 2018

There seems to be a memory issue in access point mode. We need to shorten the config page. But I can't 'read' or change the Polish translation. Is there a chance to shorten some of the texts to save some RAM and flash?
I don't know why, but until the first measurement (or in AP mode) the config page seems to need too much RAM. In our tests the page was empty until the first measurement. Only after shortening the HTML source was the page also shown before the first measurement.

@Informatic
Contributor Author

It definitely looks like the Polish translation would benefit from a review. I'll look into it later.

Now, as far as I remember, the same issues happened on the English / German versions as well. I'll try to reproduce it again today. Maybe it was a power consumption issue all along? We'll see.

@Informatic
Contributor Author

I'm able to reproduce the same issue again on the latest_de.bin firmware. Interestingly, the Wemos D1 Mini board I first tested it on works fine now, and when I switched to a NodeMCU one, it's broken again... Same cable used, I just swapped the board on the microUSB connector side.

When opening the web interface I get "ERR_CONTENT_LENGTH_MISMATCH" on Chrome as well, but I think this is just caused by TCP connection closing before end of content. (caused by packet drops)
Again, average 40-60% packet drop.

I haven't played with AP mode on the ESP8266 lately, but on Android the AP only shows up after 10-20 seconds of refreshing the network list once the "Starting WiFiManager" message has been sent over serial. I recall it being much faster (like 5 seconds tops) on the Espressif SDK a year or so ago, but maybe that's just my failing memory... :)

@Informatic
Contributor Author

...and I've just disconnected USB-A side and connected it back again, opened serial monitor (picocom, not the Arduino one) and it worked perfectly fine now on the same board... 0% packet loss.

@Informatic
Contributor Author

...and back to being broken after another power cycle... :)

64 bytes from 192.168.4.1: icmp_seq=46 ttl=128 time=6.37 ms
64 bytes from 192.168.4.1: icmp_seq=47 ttl=128 time=21.4 ms
64 bytes from 192.168.4.1: icmp_seq=49 ttl=128 time=90.9 ms
64 bytes from 192.168.4.1: icmp_seq=50 ttl=128 time=4.92 ms
64 bytes from 192.168.4.1: icmp_seq=51 ttl=128 time=16.8 ms
64 bytes from 192.168.4.1: icmp_seq=56 ttl=128 time=4.14 ms
64 bytes from 192.168.4.1: icmp_seq=57 ttl=128 time=11.7 ms
64 bytes from 192.168.4.1: icmp_seq=71 ttl=128 time=4.39 ms
64 bytes from 192.168.4.1: icmp_seq=72 ttl=128 time=65.5 ms
64 bytes from 192.168.4.1: icmp_seq=76 ttl=128 time=5.11 ms
64 bytes from 192.168.4.1: icmp_seq=77 ttl=128 time=16.6 ms
64 bytes from 192.168.4.1: icmp_seq=79 ttl=128 time=3.27 ms
64 bytes from 192.168.4.1: icmp_seq=80 ttl=128 time=9.96 ms
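
The loss rate can be read straight off the gaps in the icmp_seq values. As a rough sketch in plain C++ (the function name `packetLossPercent` is made up for illustration and is not part of the firmware), assuming each sequence number appears at most once and only the span between the first and last reply is counted:

```cpp
#include <algorithm>
#include <vector>

// Percentage of echo requests that got no reply, judged only from the
// icmp_seq values of the replies that did arrive. Requests sent before the
// first or after the last listed reply are not counted.
// Assumption: every sequence number appears at most once.
double packetLossPercent(const std::vector<int>& receivedSeq) {
    if (receivedSeq.empty()) {
        return 0.0;  // nothing to judge from
    }
    const auto range = std::minmax_element(receivedSeq.begin(), receivedSeq.end());
    const int sent = *range.second - *range.first + 1;
    const int lost = sent - static_cast<int>(receivedSeq.size());
    return 100.0 * lost / sent;
}
```

For the excerpt above (13 replies between icmp_seq 46 and 80, so 35 requests) this gives 22 lost packets, roughly 63% loss.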

@ricki-z
Member

ricki-z commented Sep 2, 2018

Are you trying to connect to the sensor while it is powered from a USB port? Some USB ports limit the available current, which may not be enough for a stable wifi connection in AP mode. It's better to use a power supply rated for at least 1 A during configuration.
Also, the ESP8266 needs a complete restart (power off) after flashing.

@LuchtwachtersDelft
Contributor

During our workshops we have experienced the same difficulty with the initial connection in AP mode.

I am working on a branch that uses WiFiManager for the initial connection just to let the user configure the SSID and password in a captive portal instead of showing the full configuration page. The results of this approach so far are encouraging.

@ricki-z
Member

ricki-z commented Sep 10, 2018

Using WiFiManager would create a larger firmware image. At some point we couldn't update OTA anymore. But we could limit the config page to the wifi part only in AP mode. Users would then change everything else once the sensor is connected to the wifi network.

@LuchtwachtersDelft
Contributor

After doing some more tests it seems the size of the config page is OK. I think the real culprit is enabling STA and AP mode at the same time.

Go to airrohr-firmware.ino and try adding

WiFi.disconnect(true);

and/or

WiFi.mode(WIFI_AP);

at the top of the wifiConfig() function.

@LuchtwachtersDelft
Contributor

On Android, adding those lines has the added benefit of making the "Sign in to network" popup usable.

With

WiFi.disconnect(true);
WiFi.mode(WIFI_AP);

screenshot_20180910-171926

Without:

screenshot_20180910-172737

@ricki-z
Member

ricki-z commented Sep 10, 2018

I have inserted the mentioned lines. This seems to work. I have pushed this to the beta branch. If someone else could test this and it's okay I would push this change to the master branch and publish this as the new release firmware.

@Informatic
Contributor Author

Informatic commented Sep 10, 2018

Wow, this seems to work much better now!

Latest english beta build, in case anyone wanted to try this, without building it locally:
https://d.inf.re/b/3fb7ee190986454fb2a003cd7a6439b3.bin (...just noticed madavi.de update builds have been updated, disregard.)

@ricki-z
Member

ricki-z commented Sep 10, 2018

I have pushed the beta to our update server (needed some time for all language versions ;-) ). So you can also download them at: https://www.madavi.de/sensor/update/data/

@ricki-z
Member

ricki-z commented Sep 10, 2018

We have too few people looking at the code. So many thanks for your work @LuchtwachtersDelft . After 2 years of work on the firmware it's sometimes hard to find something like this.

@LuchtwachtersDelft
Contributor

Much appreciated. This fix means I can finally help the people who came to our workshops but never managed to connect their sensors. And I can go ahead with the next workshops.

While I was working with WiFiManager I noticed the captive portal feature. A captive portal improves the user experience, because users don't have to go to their browser and enter 192.168.4.1. Airrohr has the "Sign in to network" popup on Android, as shown above, but not on iOS. Do you think it's worth expanding the captive portal feature for Airrohr on iOS and should I open a new feature issue for this?

@ricki-z
Member

ricki-z commented Sep 10, 2018

If you know what iOS is looking for I can add it to the available paths. But I don't have an iOS device to test this. For iOS users this would be nice.

@LuchtwachtersDelft
Contributor

I'll open a separate issue for captive portal.

ricki-z added the bug label Sep 10, 2018

@LuchtwachtersDelft
Contributor

LuchtwachtersDelft commented Sep 11, 2018

Still seems not that stable after the fix when the AP has a password (FS_PWD). It shows the config page, but subsequent reloads of the config page turn up white.

@ricki-z
Member

ricki-z commented Sep 11, 2018

Does it also show a blank page if you comment out some parts (i.e. the sensor configs)?
To connect the sensor to a network the fields SSID and password should be enough. All other things could be configured later.

@ricki-z
Member

ricki-z commented Sep 11, 2018

Could you check the current beta? I have checked this version with my Android phone and had no blank pages (sensor as WPA access point).

@LuchtwachtersDelft
Contributor

LuchtwachtersDelft commented Sep 11, 2018

I was testing the upstream beta when I reported the white page issue.

Removing the optional parts from the config page seems to help. It improves user experience too, because the full config page tends to overwhelm and confuse new users. The OLED/LED section could remain, but I usually enable the correct screen in ext_def.h anyway.

But making the config page smaller doesn't explain why setting a password triggered the white page. I think it's a combination of causes, but in the past hours I haven't been able to come up with a solid explanation.

I suspected the WiFi.disconnect() calls. For some reason having 2 of them in wifiConfig() combined with setting fs_pwd and showing the full config page can cause the white page to appear on iOS.

@LuchtwachtersDelft
Contributor

LuchtwachtersDelft commented Sep 11, 2018

Maybe some other WiFi devices are interfering. I just noticed that my Android tablet keeps trying to connect to the AP, and fails because it doesn't have the correct password, while I'm busy with the iPhone. But then I'd expect more trouble when fs_pwd is empty.

The white page doesn't happen as often anymore now, even with the full config page, both WiFi.disconnect()'s, the fs_pwd not empty, and the Android tablet interfering. What triggers the white pages remains a mystery.

In any case, this fix is much more stable than no fix at all. Add the minimal config page and I think we've got a good candidate for a public release.

@LuchtwachtersDelft
Contributor

White page also happens on Android when fs_pwd is set. Check out the difference in signal quality.

image

image

@ricki-z
Member

ricki-z commented Sep 12, 2018

I have found two possible reasons for blank pages.

  1. We are (were) scanning for available networks after the config page is loaded. So there may be a change in wifi mode (AP -> STA/AP -> AP) while scanning. I have changed this in today's version so that the scan is done before going to station mode.
  2. By default the AP is set to channel 1. This may cause some problems if there are other APs on the same channel. At my location I can see 3 other APs on channel 1, but all of them with low signal strength.

Reason 1 should be avoided in today's version. So if someone could test this, I may not need to program a solution for reason 2 (searching for the channel with the lowest signals).

@ricki-z
Member

ricki-z commented Sep 12, 2018

Okay, a solution for reason 2 is implemented as well. The firmware should now select the channel with the lowest signal among channels 1, 6 and 11.
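
The thread does not show how the firmware scores a channel, so the following is only a sketch of one plausible reading of "lowest signal": for each of channels 1, 6 and 11, take the strongest neighbouring AP heard on it, then pick the channel where that strongest neighbour is weakest. `ScanEntry` and `pickQuietestChannel` are hypothetical names, and APs on other channels are simply ignored here, which real overlap between adjacent channels would not justify.

```cpp
#include <array>
#include <climits>
#include <vector>

// One scan result: the channel a neighbouring AP sits on and its RSSI in dBm
// (less negative means stronger). Hypothetical type, not from the firmware.
struct ScanEntry {
    int channel;
    int rssi;
};

// Returns whichever of channels 1, 6 and 11 has the weakest "strongest
// neighbour". A channel on which nothing was heard wins outright.
// Ties go to the lower channel number.
int pickQuietestChannel(const std::vector<ScanEntry>& scan) {
    const std::array<int, 3> candidates = {1, 6, 11};
    int bestChannel = candidates[0];
    int bestPeak = INT_MAX;
    for (int channel : candidates) {
        int peak = INT_MIN;  // INT_MIN means "nothing heard on this channel"
        for (const ScanEntry& e : scan) {
            if (e.channel == channel && e.rssi > peak) {
                peak = e.rssi;
            }
        }
        if (peak < bestPeak) {
            bestPeak = peak;
            bestChannel = channel;
        }
    }
    return bestChannel;
}
```

Summing the signal of all APs per channel would be an equally defensible scoring rule; the maximum is used here only because it keeps the sketch short.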

@LuchtwachtersDelft
Contributor

Could reason 2 explain also why WPA2 triggers blank pages more often than an unprotected connection and why a small page is able to pass through before too many packets are lost?

I'm in an area that is absolutely saturated with WiFi. About 20 APs here all the time.

@ricki-z
Member

ricki-z commented Sep 12, 2018

I don't know how much overhead WPA2 adds to the transmitted data. But the crypto functions are time and RAM consuming. All data needs twice the RAM while encrypting. WPA2 and also the HTTPS/TLS connections (i.e. data transmission to our servers) can cause memory problems.
So yes, the many wifi networks can cause blank pages.
The new betas are pushed to the update server: https://www.madavi.de/sensor/update/data/

The config page isn't changed. Maybe we don't need to shorten this page.

@ricki-z
Member

ricki-z commented Sep 13, 2018

@Informatic, @LuchtwachtersDelft just to say: many thanks for working on this issue! Solving this problem will help many users.
There are only a few people testing and developing the firmware right now, so any help is much appreciated.

@LuchtwachtersDelft
Contributor

The WiFi channel selection feature seems to work pretty well. With the Android app WiFi Analyzer I can see the ESP move to the best channel. No problem using the full config page.

Once I set a WPA2 password the signal strength decreases, also showing in WiFi Analyzer as a lower curve. And it becomes hard to connect again. So it might indeed be the case that WPA2 adds too much overhead.

@ricki-z
Member

ricki-z commented Sep 13, 2018

So we have two possibilities:

  1. Short config page (wifi + displays) in AP mode, both encrypted (WPA) and unencrypted
  2. Short page only for WPA, normal config page for unencrypted AP

@LuchtwachtersDelft
Contributor

LuchtwachtersDelft commented Sep 13, 2018 via email

@ricki-z
Member

ricki-z commented Sep 13, 2018

I have made some more changes (setting the PHY mode to 802.11g, setting max. signal strength) to force the right settings. With my system I can only see a small difference in signal strength between a WPA2 and an unencrypted AP.
Download https://www.madavi.de/sensor/update/data/

@LuchtwachtersDelft
Contributor

LuchtwachtersDelft commented Sep 13, 2018

I just compiled the upstream beta from source and got my first hardware exception

Exception (28):
epc1=0x40250a1f epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000039 depc=0x00000000

ctx: cont
sp: 3fff3480 end: 3fff3760 offset: 01a0

>>>stack>>>
3fff3620:  401063ea 3fff46b1 00000000 3fff46b7
3fff3630:  4023fefe 3fff6a8c 00000484 00000000
3fff3640:  3fff047c 4023facc 00000001 00000001
3fff3650:  3fff1624 3fff272c 00000000 000003fd
3fff3660:  4023fc45 00000001 00000001 3fff1624
3fff3670:  3fff6a8c 4023fcba 00000001 3fff3788
3fff3680:  40217689 00000002 00000002 402175fc
3fff3690:  0000001f 3fff3788 3fff24d0 40212290
3fff36a0:  00000000 00000000 00000000 3fff6a24
3fff36b0:  0000001f 00000017 3fff69fc 0000001f
3fff36c0:  00000017 00000511 00000511 4010020c
3fff36d0:  3fff1624 3fff3788 3fff3700 4010068c
3fff36e0:  00000000 00000000 00000000 3fff272c
3fff36f0:  3fff1624 3fff3788 3fff2600 40213bce
3fff3700:  00000000 00000000 00000000 00000000
3fff3710:  00000000 00000000 feefeffe feefeffe
3fff3720:  feefeffe feefeffe feefeffe feefeffe
3fff3730:  feefeffe feefeffe feefeffe 3fff272c
3fff3740:  3fffdad0 00000000 3fff2724 40222598
3fff3750:  feefeffe feefeffe 3fff2740 40100718
   :Error:28 -> LoadProhibited: CPU tried to load memory from a region which is protected against reads
   0x40250a1f netif_set_down
   :?:::0x401063ea:spi_flash_read
   :?:::0x4023fefe:system_param_load
   :?:::0x4023facc:wifi_get_opmode_default
   :?:::0x4023fc45:wifi_set_broadcast_if
   :?:::0x4023fcba:wifi_set_opmode
   0x40217689 ESP8266WiFiGenericClass
   0x402175fc ESP8266WiFiGenericClass
   0x40212290 connectWifi()
   0x4010020c _umm_free
   0x4010068c free
   :3483 (discriminator 2):::0x40213bce:setup
   0x40222598 loop_wrapper
   0x40100718 cont_norm

And the second time I think it hangs

VMDPV_1|1_Vmounting FS...
mounted file system...
config file not found ...
Starting Webserver... 0.0.0.0
output debug text to displays...
6
Connecting to Freifunk-disabled
............

@LuchtwachtersDelft
Contributor

Downloaded the precompiled beta. It works, but on iOS the captive portal isn't activated. Will do some more testing.

@ricki-z
Member

ricki-z commented Sep 13, 2018

Very important: power down or hard reset the NodeMCU after flashing. A soft reboot doesn't work after flashing. And this may not be the only side effect.
The implementation of the captive portal is the same as in the WiFiManager library. There the "not found" page is redirected to the config page in AP mode. But Apple seems to change the behavior from version to version :-( ...

@LuchtwachtersDelft
Contributor

LuchtwachtersDelft commented Sep 13, 2018

Woah, the captive portal is finally working on iOS, after using the Forget This Network feature and toggling WiFi!

Seriously, these kinds of weird behavior don't make developer life any easier...

@LuchtwachtersDelft
Contributor

It still seems less stable on iOS than on Android. Sometimes the captive portal disappears after a few seconds and the iOS Wi-Fi page shows a spinner next to Feinstaubsensor-1234567, indicating that it's scanning instead of being connected. This also happens without WPA and with a mini config page.

I've added

debug_out(server.uri(), DEBUG_MIN_INFO, 1);

to webserver_not_found() to check which URIs are requested when connecting.

scan for wifi networks...

output not found page...
/hotspot-detect.html
validate request auth...
output config page ...
output not found page...
/hotspot-detect.html
validate request auth...
output config page ...
output luftdaten.info logo...
output not found page...
/hotspot-detect.html
validate request auth...
output config page ...
wifi networks found: 26
output config page 2
output config page 3

@LuchtwachtersDelft
Contributor

Work in progress on the mini config page...
https:/LuchtwachtersDelft/sensors-software/tree/feature/mini-config-page

@ricki-z
Member

ricki-z commented Sep 15, 2018

Latest beta version is online. It includes a minimized config page.

@LuchtwachtersDelft
Contributor

Ouch, I didn't see that commit. Oh well. I will try it tomorrow.

@ricki-z
Member

ricki-z commented Sep 15, 2018

Changed the captive portal again. I hope that this will work ...

@LuchtwachtersDelft
Contributor

I tested the latest beta commit ba93ce3 and the previous c4c9a60 back and forth.

ba93ce3 does not trigger the captive portal even with Forget This Network. Tested several times, even erasing the whole flash before. c4c9a60 does, no need for Forget This Network.

WiFiManager also had this problem. tzapu/WiFiManager#296

@ricki-z
Member

ricki-z commented Sep 15, 2018

Changed back last modification on Captive portal.

@LuchtwachtersDelft
Contributor

In the list of WiFi networks on the config page there's suddenly one that's just a space and a star. It doesn't show on Android or iOS. Any idea?

WLAN Daten
Netzwerke gefunden: 20

Abcd12345  *   76%
Xyz123456  *   56%
Mnop0987  *    54%   
  *            54%
Rtyu57203  *   50%

And now without *.

Abcd12345  *   68%
               62%
Xyz123456  *   58%

@ricki-z
Member

ricki-z commented Sep 16, 2018

I haven't heard of such behavior until now. Is there a hidden wifi you know about? The scan class will find such networks, but I don't know what name is shown in this case.

@LuchtwachtersDelft
Contributor

Could it be related to changes in a recent commit? 7c7f0c2

Using wifiInfo[i].isHidden to print some debug info, the blank SSID turns out to belong to a hidden network.

@ricki-z
Member

ricki-z commented Sep 16, 2018

Showing hidden SSIDs wasn't changed. But I will include the mentioned test in the next version to avoid those lines. Excluding hidden SSIDs at scan is not an option as we need the wifi channels and RSSIs occupied by them.

@LuchtwachtersDelft
Contributor

I just tried https:/esp8266/Arduino/tree/master/libraries/DNSServer/examples/CaptivePortalAdvanced a couple of times on iOS and eventually it also disconnected, closing the captive portal and showing the spinner on the iOS WiFi page. Looks like there is a deeper problem, maybe in one of the other libraries or even in iOS itself.

@ricki-z
Member

ricki-z commented Sep 16, 2018

I think it's iOS. I have read how they try to detect the captive portal. They try to access different pages, partially with random paths. In some cases the call should give an error, in others not. So the solution "catch all 'not found' and redirect" doesn't really work, but special paths don't exist.

@LuchtwachtersDelft
Contributor

Still, I'm not fully convinced there's nothing we can do to fix it.

Yesterday I tried adding a check to prevent the captive portal from being triggered multiple times when it's already open. But it didn't seem to prevent the iPhone from disconnecting.

boolean captivePortal() {
	if (!isIp(server.hostHeader())) {
		debug_out(F("Request redirected to captive portal"), DEBUG_MIN_INFO, 1);
		server.sendHeader("Location", String("http://") + IPAddress2String(server.client().localIP()), true);
		server.send(302, "text/plain", ""); // Empty content inhibits Content-length header so we have to close the socket ourselves.
		server.client().stop(); // Stop is needed because we sent no content length
		return true;
	}
	return false;
}
	// At the start of a request handler: stop here if the request was redirected.
	if (captivePortal()) {
		return;
	}
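
The snippet above relies on an `isIp()` helper that is not shown in this thread. As a hedged illustration only, here is one way such a check could look in plain C++, using `std::string` instead of the Arduino `String` class; the wrapper `shouldRedirectToPortal` is a made-up name. The idea is that a client already talking to the portal addresses it by IP (e.g. 192.168.4.1), while a probe request carries a host name in its Host header.

```cpp
#include <string>

// True if the text consists only of digits and dots, e.g. "192.168.4.1".
// This is a loose check, not a full IPv4 validation: "1..2" and the empty
// string also pass.
bool isIp(const std::string& host) {
    for (char c : host) {
        if (c != '.' && (c < '0' || c > '9')) {
            return false;
        }
    }
    return true;
}

// Hypothetical wrapper: decides from the Host header whether a request
// should be bounced to the portal. Host names get redirected, bare IPs do not.
bool shouldRedirectToPortal(const std::string& hostHeader) {
    return !isIp(hostHeader);
}
```

With this sketch, a request for `captive.apple.com` would be redirected, while a request addressed to `192.168.4.1` would be served directly.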

@LuchtwachtersDelft
Contributor

So far so good. The captive portal works on B11 on iOS, using a JavaScript redirect instead of a 302. I need to test a couple more times to see if it's really stable.

The hidden SSIDs are still displayed because wifiInfo[i].isHidden should be wifiInfo[indices[i]].isHidden.
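
To illustrate why the index matters: when the network list is displayed through a separately sorted index array, every field has to be read through that same array. The sketch below is a simplified stand-in in plain C++, not the firmware's code; the `WifiInfo` layout and the function name `visibleSsidsByStrength` are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Simplified stand-in for one scan result (assumed layout).
struct WifiInfo {
    std::string ssid;
    int rssi;       // dBm, less negative means stronger
    bool isHidden;
};

// Returns the SSIDs ordered strongest first, skipping hidden networks.
std::vector<std::string> visibleSsidsByStrength(const std::vector<WifiInfo>& wifiInfo) {
    // indices[i] is the position in wifiInfo of the i-th strongest network.
    std::vector<std::size_t> indices(wifiInfo.size());
    std::iota(indices.begin(), indices.end(), std::size_t{0});
    std::stable_sort(indices.begin(), indices.end(),
                     [&wifiInfo](std::size_t a, std::size_t b) {
                         return wifiInfo[a].rssi > wifiInfo[b].rssi;
                     });

    std::vector<std::string> visible;
    for (std::size_t i = 0; i < indices.size(); ++i) {
        // Look the entry up via indices[i]; wifiInfo[i] would test the
        // hidden flag of a different network than the one being printed.
        const WifiInfo& entry = wifiInfo[indices[i]];
        if (entry.isHidden) {
            continue;
        }
        visible.push_back(entry.ssid);
    }
    return visible;
}
```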

@ricki-z
Member

ricki-z commented Sep 19, 2018

As I am away this weekend I would like to push the new release tonight.
The hidden SSID should be removed in this version. It's changed on Github, but I haven't pushed a new beta image for this.

@LuchtwachtersDelft
Contributor

Let's do it!

After fixing the typo ;)

@ricki-z
Member

ricki-z commented Sep 19, 2018

Typo fixed, new release version is online.

@LuchtwachtersDelft
Contributor

LuchtwachtersDelft commented Sep 19, 2018 via email

@ricki-z
Member

ricki-z commented Sep 19, 2018

For a short time the new files are inaccessible on the server (while copying). At the moment the first sensors are getting the new version (105 sensors right now):
https://www.madavi.de/sensor/versions.php

@LuchtwachtersDelft
Contributor

LuchtwachtersDelft commented Sep 19, 2018 via email

@Informatic
Contributor Author

Well then. Everything seems to work much better now. Time to close this issue.


3 participants