Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Library gets stuck in OPMODE 888 #547

Closed
jpmeijers opened this issue Feb 27, 2020 · 22 comments
Closed

Library gets stuck in OPMODE 888 #547

jpmeijers opened this issue Feb 27, 2020 · 22 comments
Assignees
Labels

Comments

@jpmeijers
Copy link

Description

When using this fork of LMIC my board will get stuck, unable to transmit LoRa messages after either a couple of minutes, but sometimes up to an hour after startup and a successful OTAA join.

I will block for 60 seconds for LMIC to send out data, after which I will give up and continue executing the rest of my code. os_runloop_once(); is still called very frequently, but the LMIC state will remain in 888 (OP_TXRXPEND), even at the start of the next time I want to transmit a message.

setup()

    os_init();
    // LMIC_reset() doesn't affect callbacks, so we can do this first.
    if (! (LMIC_registerRxMessageCb(myRxMessageCb, /* userData */ nullptr) ) ) {
        debugSerial.println("couldn't register callbacks");
    }
    // Reset the MAC state. Session and pending data transfers will be discarded.
    LMIC_reset();
    // 1% clock error
    LMIC_setClockError(1 * MAX_CLOCK_ERROR / 100);
    // Make sure ADR is enabled, even though this is on by default?
    LMIC_setAdrMode(1);
    // Enable link check mode, even though it's on by default
    LMIC_setLinkCheckMode(1);
    // Start at a very slow DR so we can test ADR
    LMIC_setDrTxpow(DR_SF12,18); // OTAA join forces it to SF7BW125, so this has no effect

loop()

  // transmit every minute. Also tested using transmit every 10 minutes with the same issue occurring.
  if(millis()-lastLoraTX > 60000) {
    lastLoraTX = millis();
    buildPacket(); // fill tx buffer
    tx_blocking(radioPacket, 0, 1); // Periodic checkin with empty radio frame
  }

  // Process lmic enough
  os_runloop_once();
}

tx_blocking(data, dataLength, port)

void tx_blocking(uint8_t* radioPacket, uint8_t packetLength, uint8_t port)
{
  uint16_t loopCount=0;
  os_runloop_once();
  do_send(&sendjob, radioPacket, packetLength, port);
  os_runloop_once();
  print_lmic_opmode();
  uint32_t start = millis();
  uint32_t prevPrint = 0;

  // Do a OTAA joing if we haven't done it yet.
  while( LMIC.opmode & OP_JOINING ) {
    loopCount++;
    os_runloop_once();
    if(millis() - prevPrint > 10000) { // print the opmode every ten seconds while we try to join
      prevPrint = millis();
      print_lmic_opmode();
    }
  }

  // Wait for tx data to be sent out, or timeout after 60 seconds
  while( (LMIC.opmode & OP_TXDATA) && (millis() - start < TX_TIMEOUT) )
  {
    loopCount++;
    os_runloop_once();
    // print the opmode in hex every second
    if(millis() - prevPrint > 1000) {
      prevPrint = millis();
      print_lmic_opmode();
    }
  }

  // For good measure make sure we have processed everything we need.
  os_runloop_once();
  print_lmic_opmode();
  os_runloop_once();
  print_lmic_opmode();
}

Serial output

image

Environment

  • Version of LMIC being used: 3.1.0
  • Version of Arduino IDE being used: 1.8.11
  • Network provider: The Things Network
  • Region: EU868
  • Board: Arduino Zero compatible
  • CPU: ATSAM D21G18
  • Radio: HopeRF RFM95
@terrillmoore
Copy link
Member

Thanks for the report @jpmeijers

If you can get a link map, and then print out the contents of the LMIC job queues, that will help determine the state of the LMIC, which will in turn help figure out where we got stuck. The contents of osjob_t are three fields: next, deadline, and func; I need all three. If you also get me the link map of the firmware for which you reproed the failure, I can find out which function is scheduled.

It's a little awkward; you need to add a helper utility in oslmic.c, because the key structure is static and therefore hidden.

In oslmic.c:

osjob_t ** os_getJobQueue(bit_t fTimed) {
  return fTimed ?  &OS.scheduledjobs : &OS.runnablejobs;
}

Then in your sketch:

extern "C" { osjob_t ** os_getJobQueue(bit_t); }

if (failure) {
   Serial.print("now="); Serial.println(os_getTime());
   for (bit_t fType = 0; f < 2; ++fType) {
       auto pHead = os_getJobQueue(fType); 
       Serial.println(fType ? "runnable queue:" : "scheduled queue:");
       for (; pHead != NULL; pHead = pHead->next) {
              Seral.print("0x"); Serial.print(pHead, HEX);
              Serial.print(": .job=0x"); Serial.print((uintptr_t) pHead->job);
              Serial.println(", deadline="); Serial.print(pHead->deadline);
       }
   }
}

Which BSP are you using, by the way? The Arduino Zero BSP?

@jpmeijers
Copy link
Author

jpmeijers commented Feb 29, 2020

Strangely enough I get a different issue when running the device on my desk. I can't yet reproduce the stuck in mode 888 which I've seen in the field yet.

After running for between 30 and 60 minutes the link dead state was detected:

14:58:11.481 -> now=282862012
14:58:11.514 -> scheduled queue:
14:58:11.514 -> runnable queue:
14:58:11.547 -> 
14:58:11.581 -> 282867209: EV_TXSTART
14:58:11.614 -> Packet queued
14:58:11.614 -> LMIC.opmode = 888
14:58:11.647 -> TX user data (buffered in pendTxData)
14:58:11.680 -> TX/RX transaction pending
14:58:11.713 -> find a new channel
14:58:11.746 -> LMIC.opmode = 888
14:58:11.746 -> TX user data (buffered in pendTxData)
14:58:11.813 -> TX/RX transaction pending
14:58:11.813 -> find a new channel
14:58:12.742 -> LMIC.opmode = 888
14:58:12.775 -> TX user data (buffered in pendTxData)
14:58:12.808 -> TX/RX transaction pending
14:58:12.841 -> find a new channel
14:58:13.737 -> LMIC.opmode = 888
14:58:13.770 -> TX user data (buffered in pendTxData)
14:58:13.804 -> TX/RX transaction pending
14:58:13.837 -> find a new channel
14:58:13.870 -> 283010628: EV_TXCOMPLETE (includes waiting for RX windows)
14:58:13.936 -> 283014675: EV_LINK_DEAD
14:58:13.970 -> LMIC.opmode = 1900
14:58:13.970 -> prevent TX lining up after a beacon
14:58:14.036 -> find a new channel
14:58:14.036 -> link was reported as dead
14:58:14.069 -> LMIC.opmode = 1900
14:58:14.102 -> prevent TX lining up after a beacon
14:58:14.136 -> find a new channel
14:58:14.169 -> link was reported as dead

A while later I started seeing something else happening. Every time I enqueue a packet for TX, LMIC goes into the "prevent TX lining up after beacon" mode. This can take 2 minutes before it passed and the TX is performed.

16:22:55.184 -> now=600579353
16:22:55.184 -> scheduled queue:
16:22:55.217 -> runnable queue:
16:22:55.251 -> 
16:22:55.251 -> Packet queued
16:22:55.284 -> LMIC.opmode = 1108
.
.
.
17:08:34.948 -> LMIC.opmode = 1900
17:08:34.981 -> prevent TX lining up after a beacon
17:08:35.014 -> find a new channel
17:08:35.048 -> link was reported as dead
17:08:35.147 -> 
17:08:35.147 -> == Transmitting packet 37 bytes
17:08:35.180 -> 
17:08:35.180 -> now=771814932
17:08:35.214 -> scheduled queue:
17:08:35.247 -> runnable queue:
17:08:35.247 -> 
17:08:35.247 -> Packet queued
17:08:35.280 -> LMIC.opmode = 1108
17:08:35.313 -> TX user data (buffered in pendTxData)
17:08:35.347 -> prevent TX lining up after a beacon
17:08:35.380 -> link was reported as dead
.
.
.
17:09:35.691 -> now=775595585
17:09:35.691 -> scheduled queue:
17:09:35.724 -> runnable queue:
17:09:35.724 -> 0x1CA8: .job=0x7250, deadline=779793771

and only two minutes later

.
.
.
17:10:36.589 -> now=779402346
17:10:36.589 -> scheduled queue:
17:10:36.622 -> runnable queue:
17:10:36.622 -> 0x1CA8: .job=0x7250, deadline=779793771
.
.
.
17:10:40.872 -> TX user data (buffered in pendTxData)
17:10:40.905 -> prevent TX lining up after a beacon
17:10:40.938 -> link was reported as dead
17:10:41.868 -> LMIC.opmode = 1108
17:10:41.868 -> TX user data (buffered in pendTxData)
17:10:41.934 -> prevent TX lining up after a beacon
17:10:41.968 -> link was reported as dead
17:10:42.864 -> 779795139: EV_TXSTART
17:10:42.897 -> LMIC.opmode = 1888
17:10:42.897 -> TX user data (buffered in pendTxData)
17:10:42.964 -> TX/RX transaction pending
17:10:42.997 -> find a new channel
17:10:42.997 -> link was reported as dead
17:10:43.893 -> LMIC.opmode = 1888
17:10:43.926 -> TX user data (buffered in pendTxData)
17:10:43.959 -> TX/RX transaction pending
17:10:43.993 -> find a new channel
17:10:44.026 -> link was reported as dead
17:10:44.889 -> LMIC.opmode = 1888
17:10:44.922 -> TX user data (buffered in pendTxData)
17:10:44.956 -> TX/RX transaction pending
17:10:44.989 -> find a new channel
17:10:45.022 -> link was reported as dead
17:10:45.885 -> LMIC.opmode = 1888
17:10:45.919 -> TX user data (buffered in pendTxData)
17:10:45.952 -> TX/RX transaction pending
17:10:45.985 -> find a new channel
17:10:46.018 -> link was reported as dead
17:10:46.284 -> 780006492: EV_TXCOMPLETE (includes waiting for RX windows)
17:10:46.317 -> LMIC.opmode = 1900
17:10:46.350 -> prevent TX lining up after a beacon
17:10:46.383 -> find a new channel
17:10:46.417 -> link was reported as dead
17:10:46.450 -> LMIC.opmode = 1900
17:10:46.450 -> prevent TX lining up after a beacon
17:10:46.516 -> find a new channel
17:10:46.516 -> link was reported as dead
17:10:47.081 -> Uptime: 12480909
17:10:48.044 -> Uptime: 12481910
17:10:49.074 -> Uptime: 12482912
17:10:50.069 -> Uptime: 12483913
17:10:51.066 -> Uptime: 12484915
17:10:52.063 -> Uptime: 12485916
17:10:53.059 -> Uptime: 12486918
17:10:54.089 -> Uptime: 12487919
17:10:55.085 -> Uptime: 12488921
17:10:55.616 -> 
17:10:55.616 -> == Transmitting packet 37 bytes
17:10:55.650 -> 
17:10:55.650 -> now=780594111
17:10:55.683 -> scheduled queue:
17:10:55.683 -> runnable queue:
17:10:55.716 -> 
17:10:55.749 -> Packet queued
17:10:55.749 -> LMIC.opmode = 1108
17:10:55.782 -> TX user data (buffered in pendTxData)
17:10:55.815 -> prevent TX lining up after a beacon
17:10:55.849 -> link was reported as dead

@jpmeijers
Copy link
Author

Thinking about this, it could be that I'm running at SF12 now, and then being blocked from the channel for 2 minutes is correct. So this might not yet be the issue I'm looking for.

@terrillmoore
Copy link
Member

Thanks for testing.

EV_LINK_DEAD means that LMIC is in link-integrity checking mode, and the device didn't get any downlinks in the required number of uplinks. The gateway is also limited in terms of bandwidth allocation; it's not hard to run it out of bandwidth (so it can't send downlinks). That might be part of the problem. Or the gateway could be "half-open" (uplinks working but downlinks not working). The TTN console is not a reliable indication of downlinks working or not (it will say it has sent them when the gateway in fact didn't get them, or didn't get them in time).

When you say "lining up after beacon" what are you referring to? Are you running in Class B mode? In Class A, the delays between packets are inserted due to duty-cycle limitations. I think you're referring to the comment on OP_RNDTX which is misleading. OP_RNDTX is used to randomize the transmission time. It's set by txDelay(), which is called several places. It causes the next transmit to delay until the time reaches the value in LMIC.globalDutyAvail

Here are the conditions:

  1. After receiving data during RX2, to randomize uplink (randomization of 2 seconds)
  2. After a failed confirmed uplink, to give half-duplex gateways time to recover before we try again (randomization of 3 seconds).
  3. After all but the last of repeated uplinks (if LMIC.upRepeat > 0) (no randomization)
  4. After a successful downlink (some overlap with LMIC asserts after too long trying to join #1 above) (no randomization)
  5. After a missed beacon (with randomization of 4 seconds)
  6. After a received beacon (with randomization of 16 seconds)

You might want to add LMIC.globalDutyRate and LMIC.globalDutyAvail to your print out.

@jpmeijers
Copy link
Author

I let the previous code run since yesterday, and it stopped running at 20:51 (local time).

20:15:46.445 -> Uptime: 23578066
20:15:47.441 -> Uptime: 23579068
20:15:48.437 -> Uptime: 23580069
20:15:48.537 -> 
20:15:48.537 -> == Packet TX 37 bytes
20:15:48.570 -> 
20:15:48.570 -> now=1473762836
20:15:48.603 -> scheduled queue:
20:15:48.603 -> runnable queue:
20:15:48.637 -> 
20:15:48.670 -> 1473768910: EV_TXSTART
20:15:48.703 -> Packet queued
20:15:48.703 -> LMIC.opmode = 888
20:15:48.736 -> TX user data (buffered in pendTxData)
20:15:48.770 -> TX/RX transaction pending
20:15:48.803 -> find a new channel
20:15:48.836 -> LMIC.opmode = 888
20:15:48.869 -> TX user data (buffered in pendTxData)
20:15:48.902 -> TX/RX transaction pending
20:15:48.936 -> find a new channel

The log output stopped at 20:15:48, and the last packet was also received by TTN at the same time. From the code that I posted originally it seems like we are stuck in my tx_blocking function, and specifically inside:

  // Wait for tx data to be sent out, or timeout after 60 seconds
  while( (LMIC.opmode & OP_TXDATA) && (millis() - start < TX_TIMEOUT) )
  {
    loopCount++;
    os_runloop_once();
    // print the opmode in hex every second
    if(millis() - prevPrint > 1000) {
      prevPrint = millis();
      print_lmic_opmode();
    }
  }

Could it be that os_runloop_once() is blocking?

@jpmeijers
Copy link
Author

Yes, I believe I'm reproducing the issue I saw in the field. os_runloop_once() gets stuck.

After the previous lockup I reset the board, after which it ran just over 2 hours before it locked up again.

Lockup2hours.txt

@terrillmoore
Copy link
Member

terrillmoore commented Mar 1, 2020

Well, no, there's no loop in os_runloop_once() -- it's purely sequential, no iteration at all.

I suggest you add (after incrementing loopCount):

if (loopCount % 1000) { // <--- adjust the 100000 as needed 
  // consider this, too: 
  // Serial.print(millis());  -- this will show whether time is changing but limited to a small range
  Serial.print('.');   // <=== this is just to make the code path visible. Proves that we're not stuck.
}

It is possible that time is not advancing, which could happen because interrupts are disabled. Also, the fact that you're stuck in the while loop is a strong indication that time is not advancing; otherwise the second part of the && would fail, and you'd exit the loop.

@jpmeijers
Copy link
Author

So far the debugging output still points to os_runloop_once() that hangs.

My current theory is that it's brownouts that is causing either a lockup or reset of the board sometimes when the LoRa radio transmits.

On some boards I have brownout disabled, which will explain the board locking up when a brownout happens. On other boards I have seen unexplained resets, which can also be explained by brownout detection resetting the board.

I however still do not know why I only started to see this recently. Maybe all the fixes applied to LMiC also made the radio consume more power. My boards clearly do not have good enough filtering on Vcc, and it seems like my current batch Duracell AA batteries are not as good as they should be. Be weary of 12-pack "special offer" batteries.

I have now added an extra 10µF cap just after the batteries. I'll report back in a couple of hours if that helped or not.

@terrillmoore
Copy link
Member

Any news?

Certainly, the LMIC will do more back-to-back transmits, because if there's a downlink that needs an ack, it will immediately reply (2.3.2 LMIC would very likely ignore it). I can see that there's a need for power management in this area, so perhaps we can teach the LMIC to be less aggressive. However, I think we should first move the LMIC to an explicit FSM (rather than an implicit one defined by the callback sequence). (And first of all, of course, fix #524!)

@jpmeijers
Copy link
Author

It's difficult to say if it's still occurring or not. At least it's much less with the extra cap on Vcc than before.

I had a board running fine on a serial terminal logging the output until just above 813 million milliseconds. The logging however stopped due to load shedding (power outage) and the laptop battery running out. The board however did lock up at some point after that because there is an LED that should flash that just remained on. So I still can't report anything.

I was unaware of #524. My code depends a lot on interrupts, especially the sleep logic that uses the WDT ISR. Those could quite possibly interfere.

I totally agree about using an FSM. It does however sound like a lot of extra work to change.

@jpmeijers
Copy link
Author

I'm once again trying to debug this further. Even with all the extra caps (currently 220uF on the RFM95) I still get that the board locks up while running os_runloop_once(). I made sure brownout detection is switched on at 1.8V to be sure it's not a brownout causing a lockup. What I can also do is to enable the WDT which does reset the board when it gets locked up, but then I need to perform a re-join.

I also tried pushing the SPI frequency down to 100kHz without any positive effect. It maybe even locked up during an earlier transmit.

Also I noticed that the lockup happens during the 3rd call to os_runloop_once after enqueuing a message for transmit.

@altishchenko
Copy link

Hello guys, sorry to barge in like that. But I have a few questions:

  1. Do you have LMIC_USE_INTERRUPTS defined in your code?
  2. What happens, if you put a small delay of, say, 10 milliseconds in each of your loops between os_runloop_once() calls?
  3. Can it be so, (never dug so deep yet in this stack implementation) that constant OP_NEXTCHNL flag in the opmode indicates not only the requirement to find a new channel, but also an inability to find one free?

ARM Cortex-M0+ based micros() and delay() are based on a SysTick interrupt counter. If you don't allow interrupts often and for long enough you don't have correct micros. They will change (hardware register changes), but not the millis part (interrupt counter), so you can easily even happen in the past from time to time.

My "beast" is STM32L051C8x based custom made board with sx1276 as a modem. Never saw any lockups as you describe them, but my calls to os_runloop_once() are spaced apart more than 10ms away and I govern the LoRa JOIN/TX/RX/DutyCycle states from outside os_runloop, thanks to events callback.
@terrillmoore What I'd like to suggest, is to attempt piggybacking MAC replies onto application provided data frames, not generating 'empty data' ones. Say, with reasonable timeout...

@jpmeijers
Copy link
Author

Thanks for the comments @altishchenko. Definitely a few things to think about there. Funny thing is that I have this issue on both a AT SAMD21 and a ATmega 1284p board.

Roughly what I found on the ATmega1284p was the following:
WDT enabled: will restart after an uptime of between 2000s and 4000s
WDT disabled: board will lock up between 7000s and 8000s

As a sanity check I swapped out LMIC for the Matthijs Kooijman one - trying to change as little as possible to my code. I did this yesterday and so far the board has been running for 63000s.

@altishchenko
Copy link

@jpmeijers While debugging this issue can you still try delay()? Even delay(2) might do the trick. This is needed to make sure that disable irqs all too often for a long period of time may be the cause of the resulting race condition and lockup.
What about LMIC_USE_INTERRUPTS?

@jpmeijers
Copy link
Author

I've added 10ms delays between the os_runloop_once() calls. So far the board is running for 68000s without an issue.

@altishchenko
Copy link

@jpmeijers , @terrillmoore
Thank you, let's wait.
Until then: in the current library implementation a rare possibility exists that on invocation of os_runloop_once() the board CPU may be reset or get locked up, depending on the RTE. Possibility arises from the race condition where chosen to run job gets unscheduled, re-scheduled or LMIC gets reset due to interrupt handler invocation. Here is the sequence of events:

  1. os_runloop_once is entered
  2. interrupts disabled
  3. job X is chosen as a run candidate
  4. interrupts enabled
  5. interrupt handler unschedules or reschedules the job X (LMIC.osjob) or performs LMIC_reset()
  6. job X gets invoked - incorrect behavior, result depends on RTE.

In my RTE this leads to HardFault() exception and CPU reset, your mileage may vary.

As a quick measure, I suggest adding the following lines to os_runloop_once(), modifying osjob_t accordingly:

void os_runloop_once() {
    osjob_t* j = NULL;
    hal_disableIRQs();
    // check for runnable jobs
    if(OS.runnablejobs) {
        j = OS.runnablejobs;
        j->ready = true;    // XXX: Mark Job as ready to run
        OS.runnablejobs = j->next;
    } else if(OS.scheduledjobs && hal_checkTimer(OS.scheduledjobs->deadline)) { // check for expired timed jobs
        j = OS.scheduledjobs;
        j->ready = true;    // XXX: Mark Job as ready to run
        OS.scheduledjobs = j->next;
    } else { // nothing pending
        hal_unlock(); // Allow CPU going low power
    }
    hal_enableIRQs();
    if(j && j->ready) { // XXX: we have a job and it was not modified by interrupts
        j->func(j);
    }
}

Here flag j->ready must be reset to false in unlinkjob() function call.

@altishchenko
Copy link

@jpmeijers How is your board doing? Still up and running?

@jpmeijers
Copy link
Author

I believe it's positive. The board runs for hours on end without locking up. I however get a brown out detection restart sometimes. I haven't seen the brown out on the kooijman library. And considering I have 220uF on the radio's Vcc this is very strange.

@jpmeijers
Copy link
Author

jpmeijers commented Apr 15, 2020

I've enabled LMIC_USE_INTERRUPTS today. With that enabled my board does not want to join the network. I double check the pinout of the three DIOs and buzzed them through to make sure they are connected. DIO=PB40, DIO1=PB3, DIO2=PB2 on the ATmega1284p. These pins all have PCINTs associated with them.

Switching LMIC_USE_INTERRUPTS off makes the board join again.

Update:
Loading Kooijmans LMIC back on. As far as I know this one uses interrupts, and it joins fine.

@altishchenko
Copy link

@jpmeijers Good it is working somehow, bad about interrupts - can't help you there. I am running on a custom made dedicated module where every pin is at my command. Though, I have this: for some reason with interrupts enabled (and I have them always on) my module refuses to join at the default JoinDr of the library (misses the window most probably).
As I am running EU868, so my default JoinDr is in lmic_bandplan_eu868.h, line 64 (LMICbandplan_getInitialDrJoin). Had to change JoinDr to SF12 to get stable results. This change does not affect data exchange, only Joins.

BOR you are getting is on a main CPU or on a modem? Or CPU BOR caused by a modem? Could it be somewhere around TCXO wakeup times, if it is there at all? My modem is SX1276 and by my board design it usually requires around 2ms to come up to senses, so I have this in my code:

ostime_t hal_setModuleActive(bit_t val)
{
    /* Function switches TCXO on and off, returns number of osticks to wait */
    // setModuleActive() takes a c++ bool, so
    // it effectively says "val != 0". We
    // don't have to.
    /* SX1276 needs about 5ms to "wake up" */
    return val ? ms2osticksCeil(2): 0;
}

@altishchenko
Copy link

@jpmeijers The other library you use does not use interrupts.

@jpmeijers
Copy link
Author

Closing issue for now as many things changed between the OP and the current v3.2.0 version.
New issue: #642

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants