1.7 Crashes, log stops in-air #9271

Closed
RomanBapst opened this issue Apr 9, 2018 · 20 comments

Comments

@RomanBapst
Contributor

RomanBapst commented Apr 9, 2018

This issue represents all the crash reports we've seen on 1.7 where the log stops in-air.

#9108
#9139
#9260
#8263
#9185

@tops4u
Contributor

tops4u commented Apr 10, 2018

Maybe #8263?

@RomanBapst
Contributor Author

@tops4u Thanks, I was not aware of that

@jingego
Contributor

jingego commented Apr 11, 2018

also #9185.

@JohnSnowball

I hope someone can take this seriously, as I have suffered 3 crashes like this. If the flight goes well, the log is fine as well; however, if the vehicle crashes, the log may stop in mid-air. All recorded data (attitude, altitude, sensors, ...) is fine, but there is no data during the crash. I couldn't find any clue in the log that might explain the crash.
This makes me think it could be a NuttX problem, but that is just an assumption without evidence.

@RomanBapst
Contributor Author

@JohnSnowball I think we only have one of your log files, could you also share the other two?
Furthermore, could you give us, as far as possible, a detailed description of what you did and what happened? Some details might be important e.g. were the motors still running and appearing to do something prior to the crash...

@JohnSnowball

JohnSnowball commented Apr 21, 2018

First one: the main cause could be a quad motor failure that led to rotation; however, the log stops in the air and the vehicle crashed.
https://logs.px4.io/plot_app?log=08af06f9-c1f7-4ab1-b6ba-d2ac9e13f1c3
Second one: the whole flight lasted about 30 minutes, but the log stopped within the first 10 minutes. The plane kept flying until the battery ran out! I tracked the plane in the ground control station the whole time, but for the last 20 minutes it kept flying (the flight path looked like an "8") and rejected all commands (data link NOT lost; return, landing, parachute release, and manual control by virtual joystick were all tried, with no reaction). It acted as if it were "deaf". In the end, the battery ran out and the plane crashed.
https://logs.px4.io/plot_app?log=f8366570-444a-4dee-ba13-03c5f811b1c6

For these two flights, I'm 100% sure I did nothing special. Both planes had already completed more than 5 normal flights. What the two flights have in common is that it was very warm outside (about 35 degrees Celsius). Is it possible that the system went down?
Please let me know if I can help more!
@RomanBapst

@LorenzMeier
Member

@JohnSnowball I'm looking into it right now. Those last logs are helpful.

@RomanBapst
Contributor Author

@JohnSnowball Thanks! Do you also happen to have the mavlink telemetry log for the flight?

@davids5
Member

davids5 commented Apr 23, 2018

@LorenzMeier

Please keep in mind that I am working from just an elf and current master. So the symbols I am seeing may be bogus.

The hardfault log you sent me indicates a bad PC and corruption on the user stack. Looking up the stack for possible calling code, I see 0x0801d9ff in sem_timeout, possibly called by 0x080f5ef5 in Bq78350::~Bq78350().

Is that driver used on the craft? Were there changes to it?
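
For readers unfamiliar with this kind of stack inspection, here is a minimal sketch of how candidate return addresses can be picked out of a dumped stack. It assumes a Cortex-M target where code pointers saved on the stack carry the Thumb bit (bit 0 set) and an STM32-style flash region starting at 0x08000000; the 2 MiB flash size, the function name, and the filler stack words are assumptions for illustration. Only the two flash addresses are taken from the comment above.

```python
# Assumed memory map: STM32-style flash at 0x08000000, 2 MiB in size.
FLASH_START = 0x08000000
FLASH_SIZE = 2 * 1024 * 1024  # assumption; adjust for the actual MCU


def candidate_return_addresses(stack_words):
    """Return (raw_word, instruction_address) pairs that look like Thumb code pointers."""
    candidates = []
    for word in stack_words:
        in_flash = FLASH_START <= word < FLASH_START + FLASH_SIZE
        thumb_bit_set = (word & 1) == 1
        if in_flash and thumb_bit_set:
            # Clear bit 0 to get the address to look up in the ELF symbols.
            candidates.append((word, word & ~1))
    return candidates


# Only 0x0801D9FF and 0x080F5EF5 come from the thread; the rest is made-up filler.
stack_dump = [0x20001F40, 0x0801D9FF, 0x00000000, 0x080F5EF5, 0x08012344]
for raw, addr in candidate_return_addresses(stack_dump):
    print(f"{raw:#010x} -> instruction at {addr:#010x}")
```

A word that passes this filter is only a candidate: stale values left on the stack by earlier calls pass it too, which is consistent with the caveat above that the symbols may be bogus.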

@LorenzMeier
Member

I believe it is.

@philipoe This is a custom, non-contributed driver and there is no way for us to tell if that's at fault nor can we debug it. If you want support for that peripheral and debugging in situations like this one you would need to make the driver and hardware available (= commercially available or as open hardware). That is a general rule for upstream debugging.

I'm closing the issue with a tentative conclusion that the fault is due to user-changed code and not an actual PX4 issue.

@LorenzMeier
Member

Sorry, closing the right one.

@bkueng
Member

bkueng commented Apr 27, 2018

@JohnSnowball

Second one: the whole flight lasted about 30 minutes, but the log stopped within the first 10 minutes. The plane kept flying until the battery ran out! I tracked the plane in the ground control station the whole time, but for the last 20 minutes it kept flying (the flight path looked like an "8") and rejected all commands (data link NOT lost; return, landing, parachute release, and manual control by virtual joystick were all tried, with no reaction). It acted as if it were "deaf". In the end, the battery ran out and the plane crashed.

Based on this I can infer:

  • logger stops and mavlink still runs (at least the sender)
  • since mavlink runs at lower prio than logger, it cannot be that a high-prio task runs busy and blocks the logger
  • it seems to me that it is something more low-level, like a memory corruption that causes NuttX to misbehave, or that the SD card access starts to block for some reason.
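
The priority argument in the list above can be illustrated with a toy model of fixed-priority preemptive scheduling. This is a sketch, not NuttX or PX4 code: the task names, the numeric priorities, and the `pick_running_task` helper are assumptions, and the only property taken from the thread is that the logger runs at a higher priority than mavlink.

```python
def pick_running_task(tasks):
    """Fixed-priority preemptive pick: the highest-priority ready task runs."""
    ready = [t for t in tasks if t["ready"]]
    if not ready:
        return None
    return max(ready, key=lambda t: t["priority"])["name"]


def scenario(busy_task_spinning, logger_blocked):
    # Priorities are illustrative; only "logger > mavlink" is from the thread.
    return [
        {"name": "busy_task", "priority": 200, "ready": busy_task_spinning},
        {"name": "logger", "priority": 100, "ready": not logger_blocked},
        {"name": "mavlink", "priority": 50, "ready": True},
    ]


# A spinning high-priority task starves the logger AND mavlink.
print(pick_running_task(scenario(busy_task_spinning=True, logger_blocked=False)))
# A logger that is blocked (e.g. waiting on I/O) leaves the CPU to mavlink.
print(pick_running_task(scenario(busy_task_spinning=False, logger_blocked=True)))
```

In this model, the combination "logger silent, mavlink still sending" only appears when the logger itself is blocked, not when something above it is hogging the CPU.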

I tested with HIL and a VTOL model what happens when the SD card access starts to block during a mission, with similar, but not quite the same results as @JohnSnowball experienced:

  • logging obviously stops
  • navigator blocks too after reaching the next waypoint, trying to access dataman. The vehicle starts to loiter around the last waypoint (flying a circle, as opposed to an "8")
  • RTL does not work anymore
  • mavlink connection still works, including mode switching and virtual joystick. However when QGC loads the mission from the vehicle (for example by restarting QGC and switching to the plan view), the mavlink receiver starts to block, and you cannot switch modes via QGC anymore, but you still receive the mavlink stream (datalink not lost).
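
The HIL observations above can be summarized as a dependency model: anything that synchronously waits on a blocked resource stalls, directly or through another module. The sketch below is an assumed simplification of what was observed in the test, not a description of PX4 internals; the node names and the `stalled_tasks` helper are hypothetical.

```python
def stalled_tasks(tasks, depends_on, blocked):
    """Return the tasks that reach the blocked resource through their dependencies."""
    def reaches(node, seen):
        if node == blocked:
            return True
        if node in seen:
            return False
        seen.add(node)
        return any(reaches(dep, seen) for dep in depends_on.get(node, []))

    return [t for t in tasks if reaches(t, set())]


tasks = ["logger", "navigator", "mavlink_sender", "mavlink_receiver"]

# Assumed dependencies, modelled on the observations in the list above.
depends_on = {
    "logger": ["sd_card"],
    "dataman": ["sd_card"],
    "navigator": ["dataman"],
    "mavlink_sender": [],
    "mavlink_receiver": [],
}
print(stalled_tasks(tasks, depends_on, "sd_card"))

# Once QGC requests the mission, the receiver also waits on dataman.
depends_on["mavlink_receiver"] = ["dataman"]
print(stalled_tasks(tasks, depends_on, "sd_card"))
```

The sender never appears in the result, which matches the observation that the mavlink stream keeps arriving even after mode switching stops working.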

Do you have the mavlink log for this crash?

@JohnSnowball

@bkueng thanks man!!!! Thanks very much!!! I'm 100% sure that's what I got! That's exactly the same!
And before the flight, uploading the mission waypoints took longer than before and I got several upload failures.
I have the mavlink log file, tell me how and I will share with you.

Anyway, it is a serious bug; an SD card fault should not lead to this kind of behavior!
However, I couldn't solve it, could anyone find a way to fix this?

@JohnSnowball

@bkueng by the way, how do I run HIL with PX4?

@RomanBapst
Contributor Author

@JohnSnowball You can find instructions here
https://dev.px4.io/en/simulation/hitl.html

Additionally, there is a pull request which enables HIL for VTOL (standard vtol). @bkueng successfully tested it yesterday and I want to merge it soon.
#9276

@bkueng
Member

bkueng commented Apr 30, 2018

I have the mavlink log file, tell me how and I will share with you.

You can zip it and you should be able to directly upload it here to github.

However, I couldn't solve it, could anyone find a way to fix this?

We're still trying to find the root cause, and will report the progress here.

@JohnSnowball

2018-04-13 16-40-31.zip
Here is the tlog for that crash. Hope this could help.

@RomanBapst
Contributor Author

@JohnSnowball Thanks for sharing the tlog! Do you happen to have the tlogs of the entire flight? The one you sent only covers the period when the plane was no longer responding.

We have been looking at the messages that were streamed to QGC during the incident:
mav_inspector

Interestingly, we cannot find a mavlink stream config that matches the message profile seen in the mavlink inspector. E.g. we cannot see any LOCAL_POSITION_NED or ESTIMATOR_STATUS messages.

Questions:

  • are you using a custom mavlink message streaming profile?
  • what radio are you using to connect to QGC and which port are you using? (telem1, telem2)
  • do you have code changes compared to master?
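
The kind of check described above, comparing the messages seen in the inspector against known stream configurations, can be sketched as a set comparison. The profile names and their contents below are hypothetical and do not reproduce the actual PX4 stream configs; only LOCAL_POSITION_NED and ESTIMATOR_STATUS are taken from the comment above.

```python
def compare_profile(profile, observed):
    """Report which profile messages were not seen and which seen messages are not in the profile."""
    profile, observed = set(profile), set(observed)
    return {
        "missing": sorted(profile - observed),
        "unexpected": sorted(observed - profile),
    }


# Hypothetical profiles for illustration only.
profiles = {
    "profile_a": ["HEARTBEAT", "ATTITUDE", "LOCAL_POSITION_NED", "ESTIMATOR_STATUS"],
    "profile_b": ["HEARTBEAT", "ATTITUDE", "GLOBAL_POSITION_INT", "LOCAL_POSITION_NED"],
}
# Hypothetical observation: neither LOCAL_POSITION_NED nor ESTIMATOR_STATUS shows up.
observed = ["HEARTBEAT", "ATTITUDE", "GLOBAL_POSITION_INT"]

for name, messages in profiles.items():
    print(name, compare_profile(messages, observed))
```

A profile matches only when both lists come back empty; if no profile does, a custom streaming profile or local code changes are the natural next things to ask about.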

@JohnSnowball

Thanks for your response. Here are my answers:

  1. The tlog is not complete. At first we used an iPad; then, when the plane seemed "deaf", we switched to a notebook. The tlog saved on the iPad is lost (the iPad is broken).
  2. What I use for mavlink messages is shown below, and I understand now that all those messages are missing....
    image
  3. I'm using a Microhard P900 on telem1.
  4. A lot of code has changed, but mostly in other modules (self-defined modules like RTK); I barely touched the code in the main modules.

@LorenzMeier
Member

The conclusion here is that we found that all instances could be explained with systems crashing before the log could be written. We have robustified it to the extent possible. However, to be really robust a system needs to have a standby battery for the autopilot.


8 participants