Handle Lambda timeouts and PHP-FPM crashes better #862
Comments
@evoaran we are experiencing this issue as well. It fixes itself, I think, when Lambda spawns another micro container instance.
Hi! I haven't forgotten about this; I'm just short of ideas on what to do here. FYI, there's also #842, which could be related — it may be worth a look.
Right, as @matthewsuan mentioned, this might be a case of the Lambda instance that is in a "broken" state being replaced by another Lambda instance. Now the question would be:
Judging from #842 and past experience, could the initial failures be triggered by a large file upload or large HTTP response?
Hi @mnapoli, @matthewsuan. After combing through the logs again this evening, I think I've found the cause that starts it each time: a Lambda timeout:
What I'm still not sure of, though, is what caused the initial timeout. It's just a normal HTTP request, and our dependent services for this example request (Redis, Memcached and MySQL RDS) were all running perfectly at the time, with connection timeouts low enough that we'd never reach the Lambda timeout. To address your specific question: we're seeing a mix of GET and POST requests triggering this, and we don't process any file uploads via Lambda, so we can rule that out too. The HTTP responses aren't large either — 45 KB for the largest affected responses. Could it just be that the Lambda host has died, perhaps?

On a slightly different topic, would you be happy to accept a PR to change the CloudWatch log format for Bref's FPM-related messages to include a JSON payload with more context (Lambda request ID and API Gateway request ID, where applicable)? This would really help tie the various log entries together when issues like this occur.
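For illustration, such a log line could look roughly like the sketch below — the field names and IDs are made up for the example, not an existing Bref format:

```php
<?php
// Purely illustrative: roughly what an FPM-related Bref log entry with a JSON
// context payload could look like. Field names and values are made up for this
// example; they are not an existing Bref format.
$logEntry = [
    'message' => 'PHP-FPM did not return a response before the Lambda timeout',
    // Placeholders: in practice these would come from the Lambda runtime API
    // and, where applicable, the API Gateway event.
    'lambdaRequestId' => '8476a536-e9f4-11e8-9739-2dfe598c3fcd',
    'apiGatewayRequestId' => 'c6af9ac6-7b61-11e6-9a41-93e8deadbeef',
];

// Anything the runtime writes to stderr ends up in CloudWatch Logs.
fwrite(STDERR, json_encode($logEntry) . PHP_EOL);
```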
I went and checked the logs, and these same errors happen during occasional timeouts (28 seconds) as well, though I expect timeouts from time to time with my application. I think that when this happens, the FPM runtime was not able to shut down properly during
Thanks for those details! So we might need to adjust the current recovery mechanism in case of Lambda timeouts (bref/src/Event/Http/FpmHandler.php, lines 214–258 at e94df97).
Maybe we need to send a stronger "kill" signal in that case — we kill PHP-FPM here: bref/src/Event/Http/FpmHandler.php, line 273 at e94df97.

Any other ideas?

Do you mean to change the format of all the logs written by PHP applications? Or specifically the logs written by Bref when managing PHP-FPM? In short, I'm 👍 on improving Bref-related exceptions and logs, but changing the format of all application logs to JSON might be too big of a change.
Completely agree. Perhaps we could look at swapping this (bref/src/Event/Http/FpmHandler.php, line 249 at e94df97) to use KILL (9) rather than TERM (15), to ensure FPM actually terminates regardless of what might be going on behind the scenes. A rough sketch of that escalation is below.

Regarding the logs, I was purely thinking about the FPM-related logging, to help track down what's going on here more easily, so I'll take a look at a PR to add this shortly.
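This is not the actual FpmHandler code, just a minimal sketch of the TERM-then-KILL idea, assuming the PHP-FPM master PID is known and the posix/pcntl extensions are available:

```php
<?php
// Minimal sketch of the TERM-then-KILL escalation discussed above; an
// illustration, not the actual FpmHandler implementation. Assumes the PHP-FPM
// master PID is known and the posix + pcntl extensions are loaded.

function stopFpm(int $fpmPid, float $graceSeconds = 1.0): void
{
    // Ask PHP-FPM to shut down gracefully first.
    posix_kill($fpmPid, SIGTERM);

    // Give it a short grace period to exit on its own.
    $deadline = microtime(true) + $graceSeconds;
    while (microtime(true) < $deadline) {
        // Signal 0 sends nothing; it only checks that the process still exists.
        if (! posix_kill($fpmPid, 0)) {
            return; // FPM exited cleanly
        }
        usleep(10000);
    }

    // FPM is stuck (e.g. after a Lambda timeout): force-kill it so the next
    // invocation starts from a clean state.
    posix_kill($fpmPid, SIGKILL);
}
```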
Do you think it might be worthwhile stipulating a timeout on FPM that's a little lower than the Lambda timeout, in order to allow a little extra time for everything to wrap up successfully? https://github.com/brefphp/bref/blob/master/src/Event/Http/FpmHandler.php#L79
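As a rough sketch of that idea (the 1-second margin and the helper name are assumptions, not Bref's actual code), the FPM/FastCGI timeout could be derived from the time the invocation has left:

```php
<?php
// Sketch only: derive an FPM/FastCGI read timeout slightly shorter than the
// Lambda timeout, so Bref regains control before Lambda kills the sandbox.
// The 1-second safety margin and the function name are assumptions.

function fpmTimeoutInSeconds(int $remainingTimeInMillis): float
{
    // Keep 1 second in reserve for Bref to recover, flush logs and return an
    // error response before the hard Lambda timeout hits.
    return max(1.0, ($remainingTimeInMillis - 1000) / 1000);
}

// Example: with 28 seconds of invocation time left, FPM would get 27 seconds.
$timeout = fpmTimeoutInSeconds(28000);
```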
I think we are seeing this issue too. Our uptime check fails (within 2 seconds), and this seems to occur when other requests have timed out, so I figure it's the same issue: PHP-FPM is in a bad state when the runtime is reused for our uptime ping request. Has anyone been able to work around this?
We still see it following a timeout (normally after RDS Serverless fails to scale).
When Lambda times out with the PHP-FPM layer, the logs written by the PHP script are never flushed to stderr by PHP-FPM. That means they never reach CloudWatch, which makes timeouts really hard to debug.

With this change, Bref waits for the FPM response until 1 second before the actual Lambda timeout (via a connection timeout on the FastCGI connection). If Bref reaches that point, it asks PHP-FPM to gracefully restart the PHP-FPM worker, which:

- flushes the logs (logs end up in CloudWatch, which is great)
- restarts a clean FPM worker, without doing a full FPM restart (which may take longer)

Follow-up of #770, #772 and #895. May address some of #862.

Note: this does not change anything for the Function layer (it only affects FPM). It also does not show a full stack trace of the place in the code where the timeout happens (#895 did). Still, it's an improvement over the current status.
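For reference, one possible way to implement the "gracefully restart the FPM worker" step (the PR itself may do this differently) is to signal the PHP-FPM master process, which handles SIGUSR2 as a graceful reload of its workers:

```php
<?php
// Sketch of the recovery step described above, not necessarily what the PR
// implements. PHP-FPM's master process treats SIGUSR2 as a graceful reload:
// workers are replaced without a full FPM restart. Assumes the master PID is
// known and the posix/pcntl extensions are loaded.

function reloadFpmWorkers(int $fpmMasterPid): void
{
    posix_kill($fpmMasterPid, SIGUSR2);
}
```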
This PR may help to handle timeouts with FPM better, including flushing all logs (and possibly restarting the child FPM worker in a cleaner way): #1133
Since it's been about a year since the last reports, and we have merged #1133 too, I'll go ahead and close this one.
One of our Lambda functions running in eu-west-2 started getting intermittent errors logged in CloudWatch stating:
"Error: signal: terminated Extension.Crash"
When this occurs, none of our application code is executed, and the request just fails after ~2 seconds, with API Gateway then returning a 500 error for the affected requests.
Our function only runs 3 layers and has done so for months. The issues occurred for 2 hours on 2021-02-08, then again around the same time in the evening last night, 2021-02-09:
For clarity, there have been no changes to the application or environment on our side during these times, and we have no time-based logic that would account for the issues starting in the evening for a few hours two days running.
I'm not sure if this is a Bref / Bref Extra issue or an AWS one. The Lambda team have investigated and believe it is to do with the layers, so any help/guidance to track things down further would be greatly appreciated.