AutoML aggregate exception #5631

Merged
22 changes: 20 additions & 2 deletions src/Microsoft.ML.AutoML/Experiment/Experiment.cs
@@ -27,6 +27,10 @@ internal class Experiment<TRunDetail, TMetrics> where TRunDetail : RunDetail
        private readonly IRunner<TRunDetail> _runner;
        private readonly IList<SuggestedPipelineRunDetail> _history;
        private readonly IChannel _logger;
+
+       private readonly string _operationCancelledMessage = "OperationCanceledException has been caught after maximum experiment time " +
+           "was reached, and the running MLContext was stopped. Details: {0}";
+
        private Timer _maxExperimentTimeTimer;
        private Timer _mainContextCanceledTimer;
        private bool _experimentTimerExpired;
@@ -192,10 +196,24 @@ public IList<TRunDetail> Execute()
                    // This exception is thrown when the IHost/MLContext of the trainer is canceled due to
                    // reaching maximum experiment time. Simply catch this exception and return finished
                    // iteration results.
-                   _logger.Warning("OperationCanceledException has been caught after maximum experiment time" +
-                       "was reached, and the running MLContext was stopped. Details: {0}", e.Message);
+                   _logger.Warning(_operationCancelledMessage, e.Message);
                    return iterationResults;
                }
+               catch (AggregateException e)
+               {
+                   // This exception is thrown when the IHost/MLContext of the trainer is canceled due to
+                   // reaching maximum experiment time. Simply catch this exception and return finished
+                   // iteration results. For some trainers, like FastTree, training runs in parallel, so
+                   // they can throw multiple OperationCanceledExceptions. Those are returned as an
+                   // AggregateException and miss the first catch block. This block handles that case.
+                   if (e.InnerExceptions.All(exception => exception is OperationCanceledException))
+                   {
+                       _logger.Warning(_operationCancelledMessage, e.Message);
+                       return iterationResults;
+                   }
+
+                   throw;
+               }
            } while (_history.Count < _experimentSettings.MaxModels &&
                     !_experimentSettings.CancellationToken.IsCancellationRequested &&
                     !_experimentTimerExpired);
Contributor

I was not sure how to really add tests for this. It was brought up by a user, and you can repro it locally, though it's not consistent. If any of you have ideas for good tests, let me know.

To add a unit test focusing on OperationCanceledException, you could configure the AutoML settings to use only FastTree:

var experimentSettings = new RegressionExperimentSettings { MaxExperimentTimeInSeconds = 1 };
experimentSettings.Trainers.Clear();
experimentSettings.Trainers.Add(RegressionTrainer.FastTree);
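
For illustration, here is a sketch of how those settings would typically be plugged into a full experiment run. The data class, file name, and column names below are placeholders for the sketch, not part of this PR:

using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

public class HousePrice
{
    [LoadColumn(0)] public float Size { get; set; }
    [LoadColumn(1)] public float Price { get; set; }
}

public static class TimeoutRepro
{
    public static void Run()
    {
        var mlContext = new MLContext();

        // Placeholder dataset; any regression data will do for the repro attempt.
        IDataView data = mlContext.Data.LoadFromTextFile<HousePrice>(
            "house-prices.csv", separatorChar: ',', hasHeader: true);

        var experimentSettings = new RegressionExperimentSettings { MaxExperimentTimeInSeconds = 1 };
        experimentSettings.Trainers.Clear();
        experimentSettings.Trainers.Add(RegressionTrainer.FastTree);

        // With a 1-second budget, FastTree is usually canceled mid-training, which is the
        // situation in which the AggregateException described in this PR can surface.
        var result = mlContext.Auto()
            .CreateRegressionExperiment(experimentSettings)
            .Execute(data, labelColumnName: nameof(HousePrice.Price));
    }
}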

Member Author

I have tried that, but it only throws this exception in one specific method, and it wasn't a stable repro. Not stable enough to build a consistent test around it, anyway.

Contributor

Agreed. After discussing, even restricting to FastTree alone will not make for a stable unit test.

If the timeout occurs at certain steps of FastTree, an AggregateException is thrown, while at other steps an OperationCanceledException is thrown.
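
As background for why this happens, here is a minimal, self-contained sketch (plain TPL code, not ML.NET code): when several parallel operations all observe a canceled token, the individual failures are bundled into a single AggregateException whose inner exceptions are OperationCanceledExceptions, which is exactly the shape the new catch block checks for.

using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class AggregateCancellationDemo
{
    static void Main()
    {
        var cts = new CancellationTokenSource();
        cts.Cancel();

        // Each "worker" observes the already-canceled token and throws OperationCanceledException.
        var tasks = Enumerable.Range(0, 4)
            .Select(_ => Task.Run(() => cts.Token.ThrowIfCancellationRequested()))
            .ToArray();

        try
        {
            // WaitAll collects the individual failures into a single AggregateException,
            // so a catch block for OperationCanceledException alone never sees them.
            Task.WaitAll(tasks);
        }
        catch (AggregateException e)
        {
            // Prints: True
            Console.WriteLine(e.InnerExceptions.All(ex => ex is OperationCanceledException));
        }
    }
}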

Member

Could you force it with a test-specific trainer, or set of trainers?

Member Author

I tried that. The problem is that there is only one specific place that causes this issue when a timeout occurs, and I haven't found a way to get it to repro consistently without modifying the method to always throw there.

Member

I was thinking more along the lines of a test-only mock trainer. Ignore this if we don't already have a framework or pattern for it; that'd be more of a future-looking test-strategy piece of feedback.
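
In the absence of a mock-trainer framework, a hypothetical deterministic test could exercise just the classification rule the new catch block relies on. The helper below is invented for illustration and is not part of this PR; it mirrors the predicate used in Experiment.Execute:

using System;
using System.Linq;
using Xunit;

public class AggregateCancellationTests
{
    // Hypothetical helper mirroring the check in the new catch block: an AggregateException
    // is treated as a timeout only if every inner exception is an OperationCanceledException.
    private static bool IsTimeoutAggregate(AggregateException e) =>
        e.InnerExceptions.All(ex => ex is OperationCanceledException);

    [Fact]
    public void AllCancellations_AreTreatedAsTimeout()
    {
        var e = new AggregateException(
            new OperationCanceledException(),
            new OperationCanceledException());

        Assert.True(IsTimeoutAggregate(e));
    }

    [Fact]
    public void MixedExceptions_AreRethrown()
    {
        var e = new AggregateException(
            new OperationCanceledException(),
            new InvalidOperationException());

        Assert.False(IsTimeoutAggregate(e));
    }
}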

michaelgsharp marked this conversation as resolved.