Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1672951 - Implement a Metrics Ping Scheduler in glean-core #1599

Merged
merged 3 commits into from
May 5, 2021

Conversation

chutten
Copy link
Contributor

@chutten chutten commented Apr 28, 2021

The MPS has actually been written for a little while. I strove for as direct a translation of MetricsPingScheduler.{kt|swift} as was practicable (we can always refactor it later). The hard part has been tests.

I've tried making this testable in a few ways. The first way you can see in the second commit and is straightforward, but it can't actually observe most of the behaviour we hope to test.

The second way apes the Policy approach of gecko JSMs. (I had some struggles with the borrow checker on this approach, so it's very possible this is more ornate than it needs to be) I think this Policy approach is the better one because it allows us to inspect and test every operation of the MPS which, as history has taught us, is absolutely crucial for peace of mind.

(( I also attempted to automock Glean, but it would require some serious modifications to make the Glean struct's impl suit the mocking library (mockall), so I abandoned it. ))

I'm looking for some honest feedback about this, both for whether the approach itself is distasteful as well as what patterns I've missed in pursuing things in this manner. I've taken heart from the Book suggesting RefCell-based interior mutability for test mocks itself to mean I'm not completely out to lunch here, but I might be.

Copy link
Member

@badboy badboy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the general approach is sound and indeed a 1:1 mapping of the "old" implementation.
I think if we treat this implementation as "experimental" (maybe even put it behind a feature with that name?) we can consider some (breaking) changes on it as followups.

I like the testing approach, though might bikeshed on the Policy name (but don't have a better one yet).
(and of course we need to add some more tests)

I have a bunch of comments I left inline.

@@ -129,6 +134,11 @@ pub struct Configuration {
pub max_events: Option<usize>,
/// Whether Glean should delay persistence of data from metrics with ping lifetime.
pub delay_ping_lifetime_io: bool,
/// The application's build identifier. If this changes and use_core_mps is `true`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes when?

/// Must be called before draining the preinit queue.
/// (We're at the Language Bindings' mercy for that)
pub fn schedule(glean: &Glean) {
let now = Local::now();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call is what's currently causing some crashes in FOG and it's why we have this awkward workaround now: https:/mozilla/glean/blob/main/glean-core/src/util.rs#L53-L89

We should probably rely on the same workaround (probably possible to split local_now_with_offset into local_now and local_now_with_offset which calls local_now and turns DateTime<Local> into DateTime<FixedOffset>)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...but won't we have problems with any constructed value of DateTime<Local> no matter what? I mean, if Local's timezone is problematic, there's not really much we can do with it.

I guess I'll rewrite scheduler to use DateTime<FixedOffset> and use local_now_with_offset to get it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use the wrong timezone consistently then it's still fine.
But we shouldn't crash when encountering something wrong.

Using DateTime<FixedOffset> throughout should be still correct usage.

glean-core/src/scheduler.rs Outdated Show resolved Hide resolved
if let Some(last_sent_build) = last_sent_build_metric.get_value(&glean, INTERNAL_STORAGE) {
// XXX: If `app_build` is longer than StringMetric's max length, we will always
// treat it as a changed build when really it isn't.
if &last_sent_build != glean.app_build.as_ref().unwrap() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure we always have Some app_build here?
Also we use it below again, so maybe put that into a variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have app_build iff cfg.use_core_mps. But I'm not happy with how it's only enforced by logic not language. cfg.app_build is always present (often "unknown"), I was using Option to signal more than the app_build... maybe I should add use_core_mps to struct Glean?

// treat it as a changed build when really it isn't.
if &last_sent_build != glean.app_build.as_ref().unwrap() {
last_sent_build_metric.set(&glean, glean.app_build.as_ref().unwrap());
log::info!("Builds don't match. Sending 'metrics' ping");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something more like "App build changed. Sending 'metrics' ping"?

Local::today().and_hms(SCHEDULED_HOUR, 0, 0)
};

// Other MPSes cancel outstanding tasks here. I'm not sure we need to, since schedule() is only
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO or bug for a followup


const SCHEDULED_HOUR: u32 = 4;

static THREAD_GENERATION: OnceCell<Arc<AtomicU32>> = OnceCell::new();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be a Lazy and then we don't even need the explict init call later.
It also should have some docs on what it's for

Comment on lines 296 to 302
let mut policy = TestPolicy {
..Default::default()
};
policy.app_build = "a build".to_string();
policy
.last_sent_build
.replace(Some("a different build".to_string()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can probably be simplified to:

        let mut policy = TestPolicy {
			app_buid: "a build".to_string(),
			last_sent_build: Some("a different build".to_string)),
            ..Default::default()
        };

} else {
// Can only kick off the "metrics" ping scheduler after we have a global Glean.
let glean = &GLEAN.get().unwrap().lock().unwrap();
scheduler::schedule(&glean);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably swap around the if condition here:

if GLEAN.set(Mutex::new(glean)).is_ok() {
	// Can only kick off the "metrics" ping scheduler after we have a global Glean.
    let glean = &GLEAN.get().unwrap().lock().unwrap();
    scheduler::schedule(&glean);
} else {
    // ...
}

That is: handle the error case last

reason: Option<&'static str>,
) {
let thread_gen = Arc::clone(&THREAD_GENERATION.get_or_init(Default::default));
std::thread::Builder::new()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I'm a bit worried about this new thread. It's long running and sleeping all the time.
I think we had the problem before that Firefox tooling wants us to ensure threads are stopped before the main thread. As it stands we don't have a way to inform this thread.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Additionally I'd wish we would just have a way to hook into whatever-existing background-threaded timers the surrounding thing gives us :( )

@chutten
Copy link
Contributor Author

chutten commented Apr 30, 2021

Notable changes:

  • Split ping submission away from scheduling for clarity
  • Use of Condvar for scheduling, cancellation
  • Lotsa tests

I should've caught all the previous comments. Lemme know if I missed any.

....and clippy immediately fails for something make lint-rust didn't catch locally? Weird. Maybe my clippy's out of date. Anyhoo. Nitfixes incoming.

@chutten chutten force-pushed the bug1672951-CoreMPS branch 3 times, most recently from 8377191 to c189ff9 Compare April 30, 2021 21:06
@chutten
Copy link
Contributor Author

chutten commented Apr 30, 2021

Had to add a lock for scheduler tests, which makes my nose twitch. Being able to cancel the scheduler from any thread without caring if the scheduler's got work to do is a feature, not a bug, but any time I have to do this sort of workaround it makes me wonder if I'm decreasing safety.

Copy link
Member

@badboy badboy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bunch of questions inline.
Overall this code looks good though and we're close to landing this.

I wonder if you already managed to test this out in use in FOG?


// `When` is responsible for date math. Let's make sure it's correct.
#[test]
fn test_when() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test_ prefix is noise in the output IMO. Can we make the function names a bit more self-describing?

@@ -233,6 +248,9 @@ impl Glean {
max_events: cfg.max_events.unwrap_or(DEFAULT_MAX_EVENTS),
is_first_run: false,
debug: DebugOptions::new(),
// Subprocess doesn't use "metrics" pings so has no need for a scheduler.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment doesn't apply anymore I guess

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, it might, though. There shouldn't be a subprocess scheduler.

glean-core/src/scheduler.rs Show resolved Hide resolved
glean-core/src/scheduler.rs Show resolved Hide resolved
glean-core/src/scheduler.rs Show resolved Hide resolved
// Ensure that if we have a different build, we immediately submit an "upgrade" ping
// and schedule a "reschedule" ping for tomorrow.
#[test]
fn test_different_app_builds() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Rust one usually doesn't use the test_ prefix for this functions.
Otherwise all functions in the output just start the same, making it harder to parse at a glance.

glean-core/src/scheduler.rs Show resolved Hide resolved
Comment on lines 435 to 436
*cancelled_lock.lock().unwrap() = true; // Cancel the scheduler thread.
condvar.notify_all(); // Notify any/all listening schedulers to check whether they were cancelled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this just call cancel()? It's what the outside will call and does the same inside, so we don't need to repeat it over and over again in tests.

Suggested change
*cancelled_lock.lock().unwrap() = true; // Cancel the scheduler thread.
condvar.notify_all(); // Notify any/all listening schedulers to check whether they were cancelled.
super::cancel();

Comment on lines 444 to 445
*cancelled_lock.lock().unwrap() = true; // Cancel the scheduler thread.
condvar.notify_all(); // Notify any/all listening schedulers to check whether they were cancelled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above. Gonna stop calling it out.

glean-core/src/lib.rs Outdated Show resolved Hide resolved
@@ -233,6 +248,9 @@ impl Glean {
max_events: cfg.max_events.unwrap_or(DEFAULT_MAX_EVENTS),
is_first_run: false,
debug: DebugOptions::new(),
// Subprocess doesn't use "metrics" pings so has no need for a scheduler.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, it might, though. There shouldn't be a subprocess scheduler.

glean-core/src/lib.rs Outdated Show resolved Hide resolved
pub fn cancel() {
let (cancelled_lock, condvar) = &**TASK_CONDVAR; // One `*` for Lazy, the second for Arc
*cancelled_lock.lock().unwrap() = true; // Cancel the scheduler thread.
condvar.notify_all(); // Notify any/all listening schedulers to check whether they were cancelled.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other news, I fully expect to have to store the task scheduler's joinhandle somewhere and join on it here when we start getting intermittent at-shutdown problems : |

glean-core/src/scheduler.rs Show resolved Hide resolved
glean-core/src/scheduler.rs Show resolved Hide resolved
@chutten chutten requested a review from badboy May 3, 2021 17:15
Copy link
Member

@badboy badboy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good now! From the chat I take you took this to a test drive, did this require any last changes or was that purely FOG work now?

glean-core/src/scheduler.rs Show resolved Hide resolved
glean-core/src/scheduler.rs Outdated Show resolved Hide resolved
Mostly a translation from other LBs' MPSes, but with a few changes.

Most notably a split of scheduling from ping submission that makes for
easier-to-proxy operations that are easier to test.

Also notable is the use of a condvar for cancellable tasks.
Co-authored-by: Jan-Erik Rediger <[email protected]>
if !old_enabled && enabled {
glean.start_metrics_ping_scheduler();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh, interestingly we don't do this on Kotlin or Swift.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's correct though, when re-enabled we should also let the MPS do its work again

@chutten chutten merged commit 1a8ad80 into mozilla:main May 5, 2021
@chutten chutten deleted the bug1672951-CoreMPS branch May 5, 2021 15:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants