
Smarter search radius formula for map matching #3184

Merged
merged 5 commits into from
Nov 17, 2016

Conversation

kerrick-lyft
Copy link
Contributor

@kerrick-lyft kerrick-lyft commented Oct 26, 2016

This is a long commit message, sorry! The gist is that I found a better formula for the map match search radius that gives good results but is still small enough to be performant.

OSRM’s /match endpoint can take a long time to finish a request, even on a powerful, modern server, which has led to a lot of additional latency as we deploy map matching in more places.

The running time of the map-matching algorithm is approximately quartic in the candidate point search radius r (roughly O(r^4)): a 2x increase in the search radius for each input point will tend to lead to a 4x increase in the number of candidate points (the search area grows with the square of the radius), and a 4x increase in candidate points (states) will lead to a 16x increase in the number of operations the Viterbi algorithm performs.
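To make the scaling argument concrete, here is a toy cost model (an illustration of the reasoning above, not OSRM code):

```python
def relative_cost(radius, base_radius=1.0):
    """Toy cost model: candidate points grow with the search *area*
    (~ radius^2), and each Viterbi transition step is quadratic in the
    number of candidates, so total work scales ~ radius^4."""
    candidates = (radius / base_radius) ** 2
    return candidates ** 2

# Doubling the radius -> 4x candidates -> 16x Viterbi work.
print(relative_cost(2.0))  # 16.0
```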

So decreasing the search radius can dramatically improve the running time. However, the formula OSRM v5 uses, which is search_radius = 3 * gps_accuracy, doesn’t find the correct point in many cases. We internally patched OSRM to use the same formula as OSRM v4, search_radius = 10 * gps_accuracy, which gives good results.

I suspected the optimal search radius isn’t directly proportional to the gps_accuracy, and so I pulled ~1 million data points from Lyft drivers and compared the distance from the raw point to the map-matched point output by our current system (data file here). I bucketed the points by rounding their phone-reported accuracy (Location.getAccuracy() on Android or CLLocation.horizontalAccuracy on iOS) down to the nearest integer. To handle sparsity and to smooth out the data I also added each point to the two neighboring buckets in each direction. For example, a point with accuracy 3.7 would be included in the buckets 1, 2, 3, 4, and 5. Finally, I computed the 99.9th percentile raw->map-match distance for each bucket. See this script for the specific computation: https://drive.google.com/file/d/0B30B6-L__QYKbXZwUl9DbkVsbDA/view
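The bucketing and smoothing described above can be sketched roughly as follows (the linked script is the authoritative version; this is a reconstruction from the prose, with illustrative names):

```python
import math
from collections import defaultdict

def bucketed_p999(points):
    """Reconstruction of the bucketing described in the PR text.

    `points` is an iterable of (accuracy_m, match_distance_m) pairs.
    Each point lands in its floor(accuracy) bucket plus the two
    neighboring buckets on each side, to handle sparsity and smooth
    the data. Returns the P99.9 raw->map-match distance per bucket.
    """
    buckets = defaultdict(list)
    for accuracy, dist in points:
        b = math.floor(accuracy)
        for neighbor in range(b - 2, b + 3):
            buckets[neighbor].append(dist)
    result = {}
    for b, dists in buckets.items():
        dists.sort()
        idx = min(len(dists) - 1, int(0.999 * len(dists)))
        result[b] = dists[idx]
    return result

# A point with accuracy 3.7 contributes to buckets 1..5, as in the text.
```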

Here’s a graph of the results: https://drive.google.com/file/d/0B30B6-L__QYKU0RmZjI0ZGxYZ1E/view

The upward trend stops at around bucket = 47. Removing buckets >= 48, we see a clear linear trend: https://drive.google.com/file/d/0B30B6-L__QYKNTF5bm1YWGNaOGc/view?usp=sharing

We can fit a trendline to this data: https://drive.google.com/file/d/0B30B6-L__QYKeF8yb3ZiTkhjV1E/view?usp=sharing

This gives the formula search_radius = 3.45 * gps_accuracy + 44.4. Since P99.9 radius was essentially never more than 200 meters, we cap the search radius at 200 meters and round up the coefficients, giving the final formula search_radius = min(3.5 * gps_accuracy + 45, 200). This formula should yield a search radius that contains the correct point 99.9% of the time.
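In code, the final formula is just a capped linear function (parameter names here are illustrative, not the actual OSRM option names):

```python
def search_radius(gps_accuracy, slope=3.5, intercept=45.0, cap=200.0):
    """Final formula from the analysis above:
    min(3.5 * gps_accuracy + 45, 200)."""
    return min(slope * gps_accuracy + intercept, cap)

print(search_radius(10))   # 80.0
print(search_radius(100))  # 200.0 (capped)
```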

We allow the caller to configure these parameters so they can tweak the performance / accuracy tradeoff. The 200 meter cap might be a quirk of the data I processed, but the caller can change it if they want.

As a result of this change, the latency of our map match calls dropped significantly (vs our patched OSRM that used 10 * gps_accuracy) without any degradation in accuracy on our test dataset. OSRM currently uses the formula 3 * gps_accuracy, so for mainline OSRM this change means a modest increase in latency, but the map match results should be more accurate. The new formula provides a good tradeoff between latency and accuracy.

@TheMarex
Copy link
Member

Thanks for this great analysis! 👍

I need some time to look at this in more detail, but I think we can use this as a basis to not only fix the search radius, but actually provide a real empirical correction over the original measurements done in the Newson and Krumm paper.

Right off the bat I'm skeptical about exposing this as a query-time parameter. Basically what we are defining here are some properties of an empirical distribution. Hard-coding that would be fine with me, if we can ensure that the measurements are good. The 200m limit is also something that was enforced in the Newson and Krumm paper.

My main concern is about documenting this behavior. A user might expect that increasing the value of the radius parameter will also continuously increase the search radius.

BTW. If you are looking at some more performance improvements, there are a few things you can do. One would be to incorporate a bearing filter (or, the more advanced version, modify the emission probability in the hidden Markov model to use the bearing value). Speeding up the Viterbi algorithm by using a less naive version that limits the amount of memory used would also be possible. Hit me up by email if you have questions.
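One way the bearing idea could be folded into the emission probability (a hedged sketch: the Gaussian-over-distance emission follows Newson and Krumm, while the bearing term and the sigma values are assumptions for illustration):

```python
def emission_log_prob(dist_m, bearing_diff_deg=None,
                      sigma_z=4.07, sigma_bearing=30.0):
    """Sketch: log emission probability as a zero-mean Gaussian over the
    point-to-road distance (Newson & Krumm style), optionally multiplied
    by a second Gaussian over the difference between the device-reported
    bearing and the street bearing. Sigma values are illustrative."""
    logp = -0.5 * (dist_m / sigma_z) ** 2
    if bearing_diff_deg is not None:
        # Wrap the bearing difference to [-180, 180) before penalizing.
        wrapped = (bearing_diff_deg + 180.0) % 360.0 - 180.0
        logp += -0.5 * (wrapped / sigma_bearing) ** 2
    return logp
```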

@MoKob MoKob added the Review label Oct 27, 2016
@kerrick-lyft
Copy link
Contributor Author

I'm not sure why the Appveyor build is failing, could you help me debug that?

Thanks for your response :) I'm working on getting the accuracy_to_distance.csv file released so you all can take a look and maybe improve the emission probability function. Our Legal team wanted me to run it by them but I don't think there will be anything objectionable about releasing it.

I'd strongly prefer to keep the search_radius_* values as parameters because it makes it really easy to debug what's happening and answer questions like "would this case have behaved differently if the parameters were set higher?". I have noticed cases where the values of these parameters matter, it's just not different enough to materially affect our overall metric. Otherwise you need to do a compile (and in our case, a deploy, since the United States data file doesn't fit in RAM on our development machines) in order to test a new combination of the parameters.

I'll update the docs to indicate the radius param doesn't directly correspond to the search radius, and also add the search_radius_* params to the docs.

We've actually experimented with using the phone-reported bearing to modify the emission probability! The distribution of street_bearing - reported_bearing looked pretty tight but when we tried incorporating it into our model we didn't really see gains. We were pretty time-crunched and didn't have much time to spend on it though, so it may be worth revisiting. Will let you know if we make progress in this direction.

We're definitely interested in any performance gains we can get. I'll start an email thread so we can discuss there.

@TheMarex
Copy link
Member

TheMarex commented Nov 4, 2016

I'd strongly prefer to keep the search_radius_* values as parameters because it makes it really easy to debug what's happening and answer questions like "would this case have behaved differently if the parameters were set higher?".

I understand this is great for experimenting and debugging. But when it comes to including something in our API my primary concern is always two factors:

  1. Limit surface area of API to minimize the chance of needing to do breaking changes
  2. Offer an API that is as simple as possible

From experience, once you include something in your API that is not completely obvious people will misunderstand and misuse it (:wave: bearings and radiuses). Since this is sadly not an internal API I can't add these parameters without fear that we will need to change/remove/extend them soon.

I know this is unsatisfying, but let's keep production and prototyping code separate for now.

Thanks for your response :) I'm working on getting the accuracy_to_distance.csv file released so you all can take a look and maybe improve the emission probability function.

That would be great.

I'm not sure why the Appveyor build is failing, could you help me debug that?

There were a number of problems with Windows unrelated to your PR, there is a good chance that a rebase will fix it.

We've actually experimented with using the phone-reported bearing to modify the emission probability!

Great! I'm currently trying to find a test data set that includes device bearings for some experiments of my own.

Proposal:

  • Remove API parameters
  • Use current calibration value as constant
  • Extend docs to capture the delicate matter of the radius <-> GPS precision relationship

@kerrick-lyft
Copy link
Contributor Author

kerrick-lyft commented Nov 8, 2016

Here is the accuracy_to_distance.csv data file. Keep in mind that this is just based off our best estimate of the driver's position, using OSRM 5.2.6 with the OSRM v4 radius formula.

Sounds good, I'll make those changes and then ping this PR.

@kerrick-lyft
Copy link
Contributor Author

I've implemented the changes you suggested and rebased, PR is ready for review again.

@TheMarex
Copy link
Member

@kerrick-lyft looks good to me. 👍 Merging.

@masoudkazemi
Copy link

Hi,
I just read the part of the documentation quoted below, and this ticket is referenced for further reading. But I checked OSRM's map-matching code and it seems that the changes to the suggested radius formula have been reverted.
Would you please clarify why?
Does the documentation need to be updated too?
Thanks.

The radius for each point should be the standard error of the location measured in meters from the true location. Use Location.getAccuracy() on Android or CLLocation.horizontalAccuracy on iOS. This value is used to determine which points should be considered as candidates (larger radius means more candidates) and how likely each candidate is (larger radius means far-away candidates are penalized less). The area to search is chosen such that the correct candidate should be considered 99.9% of the time (for more details see this ticket).

@danpat
Copy link
Member

danpat commented May 31, 2020

I dug back through my emails, and discovered why this was reverted:

We've just made a small bugfix release and tagged OSRM v5.5.2. This release reverts a change to the map-matching code that caused some fairly major slowdowns if map-matching against the foot or bike profiles (#3184). This was an unintended side-effect of the original PR, so we'll take it back to the drawing board to see if we can re-integrate it without breaking map-matching on dense networks.

The bad news is that we never got around to revisiting that.

The core problem is that the foot/bike networks are quite a bit denser - I don't remember the exact details, but I bet it was causing slowdowns in urban areas where there were sidewalks present. In those cases, you really want to use the minimum radius you can get away with, so as not to include too many candidate points and slow things down. The PR does greatly increase the default radius, which overall makes things slower.

@masoudkazemi
Copy link

Thanks @danpat for the reply.
OK. What do you think about having a separate radius formula per profile?
The data and calculations provided in this ticket were based on the driving profile. One could collect comparable data for the foot profile and suggest a better formula for it.

@kerrick-lyft
Copy link
Contributor Author

kerrick-lyft commented Jun 1, 2020

+1 to Masoud's comment, I think it would be great if there were a way to customize the search radius based on the profile. I think the formula I gave in the PR description works well for the driving use-case (and we still use that same formula internally at Lyft). But it may make sense to use the current formula for biking/walking (or come up with a smart formula for those cases too, but that's probably further down the road).

@kerrick-lyft
Copy link
Contributor Author

I'm unfortunately not familiar enough with OSRM to know whether this is easy to implement in a robust way or not. Is it possible to tell from the profile whether it's considered a driving, walking, or biking profile? What happens down the road when other modalities are added?

@danpat
Copy link
Member

danpat commented Jun 1, 2020

If the linear relationship you found holds for other modalities, but has a different gradient, then we could allow setting of the factors in the .lua profile....

@kerrick-lyft
Copy link
Contributor Author

I think the current heuristic is (3 * gps_accuracy), unless it changed since I last looked (that was 4 years ago, so there's a good chance). This is a linear relationship so it should be fine. We could have default values for the slope and y-intercept (3 and 0, respectively) and optionally override them in the .lua file (e.g. driving profile would set slope = 3.45 and y-intercept = 45).
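A minimal sketch of that idea, with hypothetical profile names and field names (the defaults fall back to the current 3 * gps_accuracy heuristic, and the driving profile overrides them):

```python
# Hypothetical per-profile overrides, standing in for values a .lua
# profile might export. Names here are illustrative, not OSRM's.
PROFILE_RADIUS_PARAMS = {
    "car": {"slope": 3.45, "intercept": 45.0},
    # foot/bike fall back to the generic defaults below
}

def search_radius_for(profile, gps_accuracy, cap=200.0):
    """Pick slope/intercept per profile, defaulting to slope=3,
    intercept=0 (the current heuristic), with a 200 m cap."""
    params = PROFILE_RADIUS_PARAMS.get(profile,
                                       {"slope": 3.0, "intercept": 0.0})
    return min(params["slope"] * gps_accuracy + params["intercept"], cap)

print(search_radius_for("car", 10))   # 79.5
print(search_radius_for("foot", 10))  # 30.0
```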
