Error reporting on iOS

ettore edited this page Jan 28, 2021 · 3 revisions

On SimplyE and Open eBooks for iOS we log notable error situations to a third-party service (Crashlytics) via a collection of functions in the NYPLErrorLogger class. This document explains how to use those functions effectively.

Privacy Considerations

All data collected are anonymous. There's no way for us to connect any data points in Crashlytics to a physical user. Notably, the barcode is one-way hashed via MD5, making it impossible for us to derive the original value from the hashed value. We also remove sensitive info such as authentication tokens from the logged data.

Objectives

The reason for reporting and recording error situations is to allow developers to fix them. Analyzing user behavior is outside the scope of this system.

We use the data in Crashlytics mainly in 2 ways:

  1. by looking at the mass of reported errors proactively, to spot trends or spikes and address the more frequent/critical ones;
  2. by starting from a specific support incident filed by a patron with our customer care team, and identifying the related Crashlytics report to try to reconstruct the state that led to the error.

Platform Restrictions

Crashlytics makes only these fields searchable:

  • the domain of the reported NSError aka "Summary" (a string)
  • its code (an integer)
  • the associated userID (the hashed barcode or username)

Crashlytics also collates error reports that have the same domain AND code into a single error group.
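As a rough illustration of the grouping rule just described, here is a minimal sketch that models it locally. The `CrashReport` and `GroupKey` types are hypothetical stand-ins invented for this example, not Crashlytics or NYPLErrorLogger types, and the sample summaries and codes are made up.

```swift
// Hypothetical model of a logged report; not a Crashlytics type.
struct CrashReport {
    let domain: String   // shown as "Summary" in the console
    let code: Int
    let userID: String   // hashed barcode or username
}

// Reports fall into the same group only if BOTH fields match.
struct GroupKey: Hashable {
    let domain: String
    let code: Int
}

func groupReports(_ reports: [CrashReport]) -> [GroupKey: [CrashReport]] {
    return Dictionary(grouping: reports, by: { GroupKey(domain: $0.domain, code: $0.code) })
}

// Made-up sample data for the sketch.
let sampleReports = [
    CrashReport(domain: "Book download failed", code: 100, userID: "a1"),
    CrashReport(domain: "Book download failed", code: 100, userID: "b2"),
    CrashReport(domain: "Book download failed", code: 200, userID: "a1"),
]
let errorGroups = groupReports(sampleReports)
```

In this sketch the first two reports share a group, while the third lands in its own group because its code differs even though the domain is the same.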

Here's a snapshot of the Crashlytics console:

Snapshot of Crashlytics error groups

NYPLErrorLogger Design Principles

NYPLErrorLogger provides 2 main APIs for reporting problems.

The first is the one to use when you have an actual Error object, and is perhaps the most commonly used. The error is logged in its entirety, including NSError's userInfo. The entries in the metadata param will appear as nicely formatted key-value pairs in the Crashlytics web UI.

func logError(_ error: Error?,
              summary: String,
              metadata: [String: Any]? = nil)

If you don't have an Error object, you can still file a report by providing an error code to the following API:

func logError(withCode code: NYPLErrorCode,
              summary: String,
              metadata: [String: Any]? = nil) 

Besides these, there are a number of additional APIs that are more specific to common error situations. Refer to the class documentation for more details.
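To show how the two call shapes differ at the call site, here is a minimal, self-contained sketch. `SketchErrorLogger`, `SketchErrorCode`, `LoggedReport` and `ParseFailure` are hypothetical stand-ins written for this example: they only mirror the parameter lists shown above and record reports in memory instead of sending them to Crashlytics. The real NYPLErrorCode cases and raw values may well differ.

```swift
// Hypothetical stand-in for NYPLErrorCode; raw values are made up.
enum SketchErrorCode: Int {
    case invalidXML = 1
    case downloadFail = 2
}

// What the sketch logger records for each call.
struct LoggedReport {
    let summary: String
    let code: Int?
    let metadata: [String: Any]
}

final class SketchErrorLogger {
    private(set) var reports: [LoggedReport] = []

    // Mirrors the shape of logError(_:summary:metadata:).
    func logError(_ error: Error?, summary: String, metadata: [String: Any]? = nil) {
        var info = metadata ?? [:]
        if let error = error {
            info["error"] = String(describing: error)
        }
        reports.append(LoggedReport(summary: summary, code: nil, metadata: info))
    }

    // Mirrors the shape of logError(withCode:summary:metadata:).
    func logError(withCode code: SketchErrorCode, summary: String, metadata: [String: Any]? = nil) {
        reports.append(LoggedReport(summary: summary, code: code.rawValue, metadata: metadata ?? [:]))
    }
}

// Made-up error type for the sketch.
struct ParseFailure: Error {}

let logger = SketchErrorLogger()

// You have an Error object: pass it along with a summary and context.
logger.logError(ParseFailure(),
                summary: "Error loading authentication document",
                metadata: ["attempt": 2])

// No Error object at hand: pass an error code instead.
logger.logError(withCode: .invalidXML,
                summary: "Error loading authentication document")
```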

Summaries

The summary of an NYPLErrorLogger report is going to be the title of the error group in Crashlytics.

Error codes

NYPLErrorLogger uses the NYPLErrorCode enum to categorize errors. These codes are defined so that the log data can be searched orthogonally (or at least transversally) with respect to the summary.

Why?? What does that even mean?

Since we have only 2 open-ended ways to search our data (ignoring user-id slicing, which is unambiguous), it's useful to use the error-code search in a way that differs from the summary search. To take an extreme example, if we were to match an "Error loading authentication document" summary with an error code specific to that error scenario, we would add nothing in terms of searchability. We would just add a useless synonymous search possibility.

More useful is instead slicing the data with a different level of specificity.

To this end, for example, the invalidXML error code can be used in many different error situations: literally anything involving parsing XML. This is useful because it lets us both identify specific XML parsing errors (via the summary) and look at all the cases where our XML parsing code fails (via the error code), and therefore fix XML parsing across the whole app.
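The two ways of slicing can be sketched with a few made-up reports. Everything here is illustrative: `SlicedReport` is a hypothetical type, the summaries are invented, and the raw value chosen for the invalidXML code is an assumption.

```swift
// Assumed raw value for the invalidXML code; the real one may differ.
let invalidXMLCode = 1
// Another made-up code, standing for a network failure.
let networkFailureCode = 2

struct SlicedReport {
    let summary: String
    let code: Int
}

let loggedReports = [
    SlicedReport(summary: "Book download failed: parsing error", code: invalidXMLCode),
    SlicedReport(summary: "Catalog load failed: parsing error", code: invalidXMLCode),
    SlicedReport(summary: "Book download failed: network error", code: networkFailureCode),
]

// Slice by summary: one specific scenario.
let bookParsingErrors = loggedReports.filter {
    $0.summary == "Book download failed: parsing error"
}

// Slice by code: every XML parsing failure, whatever the scenario.
let allXMLErrors = loggedReports.filter { $0.code == invalidXMLCode }
```

Because the code is broader than any single summary, the second query surfaces reports that the first one misses.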

How to Report Errors for Great Success

1. Provide a summary that will help developers identify the error, its cause and top-level context... but not more.

For example, if you need to log a book download error, the summary should include (hopefully obviously) something about a book download failure, but also why it failed if possible (e.g. network error? parsing error? file system error?).

For book-related errors, we've found it useful to add the book distributor to the summary, because usually that's how those errors are reported to us.

Things that are extremely variable (such as book download URLs, or anything including an ID) should NOT be put into the summary, because any variation in the interpolated summary string will generate a new error group in Crashlytics.
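One way to keep summaries stable is to build them only from values that come from a small, fixed set, and to route everything variable into the metadata. The sketch below assumes a hypothetical helper, `makeDownloadFailureReport`, and a hypothetical `DownloadFailureReason` enum. Neither is part of NYPLErrorLogger, and the distributor name and URLs are placeholders.

```swift
// Hypothetical fixed set of reasons, so the summary cannot vary freely.
enum DownloadFailureReason: String {
    case network = "network error"
    case parsing = "parsing error"
    case fileSystem = "file system error"
}

// Hypothetical helper: stable values go in the summary,
// highly variable values (URLs, IDs) go in the metadata.
func makeDownloadFailureReport(distributor: String,
                               reason: DownloadFailureReason,
                               failedURL: String,
                               bookID: String) -> (summary: String, metadata: [String: Any]) {
    let summary = "Book download failed (\(distributor)): \(reason.rawValue)"
    let metadata: [String: Any] = ["failedURL": failedURL, "bookID": bookID]
    return (summary: summary, metadata: metadata)
}

// Two failures for different books end up with the same summary,
// and therefore in the same error group.
let firstFailure = makeDownloadFailureReport(distributor: "ExampleDistributor",
                                             reason: .network,
                                             failedURL: "https://example.org/books/1.epub",
                                             bookID: "1")
let secondFailure = makeDownloadFailureReport(distributor: "ExampleDistributor",
                                              reason: .network,
                                              failedURL: "https://example.org/books/2.epub",
                                              bookID: "2")
```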

2. Add any context you deem useful to the metadata param.

Context is anything that can help pinpoint an error situation. This can vary greatly. Imagine yourself browsing an error report group in Crashlytics and ask questions like:

  • can you identify the area of code where the error happened?
  • can you recreate and discern the app state?
  • can you recreate the stack state, i.e. local variable values of interest, recursion state, etc?
  • will you be able to understand which if-then-else branch of interest the app took? Even going upstream in the code flow (if relevant)?

If not... help your future self by filling up the metadata dictionary!

Going back to the book download error example, useful context includes the book object (of course), the URL that failed (be it a file or HTTP URL), and any flag or variable that helps us identify the decision path the app took.

Make sure to never add a key with a nil value! This is especially important if the call comes from ObjC. Add nil-coalescing if needed.
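A minimal sketch of two ways to keep nil out of a metadata dictionary in Swift. The variable names, the "N/A" and -1 placeholders, and the `compactMetadata` helper are all invented for this example.

```swift
// Made-up optional values that might be nil at the time of logging.
let failedURL: String? = nil
let statusCode: Int? = 404

// Option 1: nil-coalesce each optional to an explicit placeholder.
let coalescedMetadata: [String: Any] = [
    "failedURL": failedURL ?? "N/A",
    "statusCode": statusCode ?? -1,
]

// Option 2 (hypothetical helper): drop the keys whose value is nil.
func compactMetadata(_ raw: [String: Any?]) -> [String: Any] {
    return raw.compactMapValues { $0 }
}

let droppedNilMetadata = compactMetadata([
    "failedURL": failedURL,
    "statusCode": statusCode,
])
```

Coalescing keeps the key visible in the report, which makes it clear the value was missing. Dropping the key keeps the report shorter. Which one fits better depends on the error being logged.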

To some classes we have added extensions to facilitate logging. These are typically called loggable[Something]. For example, NYPLBook::loggableDictionary.
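As a sketch of what such an extension could look like, here is a hypothetical `SketchBook` type with a `loggableDictionary()` method. The fields, key names, and method signature are assumptions made for illustration, and the real NYPLBook extension may be shaped differently.

```swift
// Hypothetical book model; the real NYPLBook has its own fields.
struct SketchBook {
    let identifier: String
    let title: String
    let distributor: String?
}

extension SketchBook {
    // Returns only values that are safe and useful to log,
    // with optionals coalesced so no nil ends up in the dictionary.
    func loggableDictionary() -> [String: Any] {
        return [
            "bookID": identifier,
            "bookTitle": title,
            "bookDistributor": distributor ?? "N/A",
        ]
    }
}

let sketchBook = SketchBook(identifier: "urn:example:1",
                            title: "Example Title",
                            distributor: nil)
let bookContext: [String: Any] = ["book": sketchBook.loggableDictionary()]
```

Keeping this logic in one place means every call site logs a book the same way, and sensitive or bulky fields can be left out once rather than at each call.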

Future Developments

This class is a continuous work in progress. Here are a few known weaknesses:

  • The error codes are not easily searchable. Crashlytics requires at least 3 characters for any search. Our codes are all 3 or 4 characters long, making it impossible to search for error categories. For example, all sign-in-out-up errors are in the 3xx class, and it would be good to do a search to list ALL such errors, but we can't. The solution would be to define such codes as 300xx instead.
  • Error codes are not orthogonal enough. This is mainly for historic reasons.
  • Some summaries are too vague. For example, "RemoteViewController" in the picture above doesn't say anything about the actual error that happened.
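The first weakness in the list above can be illustrated with a small local model. It assumes that a search matches codes by prefix and rejects queries shorter than 3 characters. The prefix matching is an assumption made for this sketch, not documented Crashlytics behavior, and the codes are made up.

```swift
// Local model of a search with a 3-character minimum.
// Returns nil when the query is too short to be accepted.
func searchCodes(_ query: String, in codes: [Int]) -> [Int]? {
    guard query.count >= 3 else { return nil }
    return codes.filter { String($0).hasPrefix(query) }
}

// Current style: the 3xx class can only be addressed with the prefix "3",
// which is too short to be a valid query.
let currentStyleCodes = [301, 302, 310, 401]
let currentStyleResult = searchCodes("3", in: currentStyleCodes)

// Proposed style: the class prefix "300" is itself a valid 3-character query.
let proposedStyleCodes = [30001, 30002, 30010, 40001]
let proposedStyleResult = searchCodes("300", in: proposedStyleCodes)
```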