A Mathematician Walks into a Restaurant: Trying Simple Models to Pick Roadtrip Stops

I develop two statistical approaches for ranking restaurants based only on their average rating and number of reviews to use on an upcoming roadtrip.

Mathematics, Statistics

Starting the Journey

My partner and I recently traveled back home to spend some time with my family for the 4th of July. The two-day drive means we’re regularly panning through Google Maps trying to pick out spots in unfamiliar towns. I select restaurants based on their review descriptions, a few photos, review count, and average rating. Typically, we end up deliberating between two roughly even restaurants: one with 4.6 stars and 500 ratings vs. one with 4.7 stars with 75 ratings, for example.

I thought for our trip back, it would be fun to develop some statistical approaches for ranking restaurants based on the sparse data you get from Google Maps.

Outlining the Framework

I want a simple app that takes in a collection of restaurant ratings and rating counts - like you’d see on Maps - and computes two candidates: a Feeling Lucky pick and a Safe pick.

This allows us to have options based on our current risk tolerance, since a restaurant with 4.8 stars from 40 reviews could either be an undiscovered gem or have sample bias from the owner’s supportive friends.

For each restaurant, we’ll create Feeling Lucky and Safe strength estimates based on its average rating and review count. A Feeling Lucky estimate should favor a restaurant with the highest average rating but fewer ratings while the Safe estimate should favor restaurants with many ratings but not necessarily the highest average rating.

We can then use these strength metrics in the Bradley-Terry model to calculate the likelihood one location is better than the other:

$$\Pr(i>j)={\frac {\pi_{i}}{\pi_{i}+\pi_{j}}}$$

where π_i is the estimated strength of restaurant i.
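The formula above translates directly into code. Here is a minimal Python sketch; the function name `bradley_terry` and the example strengths are my own placeholders, not values from the trip.

```python
def bradley_terry(strength_i: float, strength_j: float) -> float:
    """Probability that item i beats item j under the Bradley-Terry model."""
    if strength_i <= 0 or strength_j <= 0:
        raise ValueError("strengths must be positive")
    return strength_i / (strength_i + strength_j)


# Hypothetical strengths, chosen only to illustrate the formula.
print(bradley_terry(3.0, 1.0))  # 0.75
print(bradley_terry(2.0, 2.0))  # 0.5
```

Equal strengths always give 50%, so the model can only separate two restaurants when their strength estimates differ in relative terms.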

Modeling Restaurant Strength

We only have access to summary statistics about the restaurants, so we’ll need to keep our approach simple. We’ll use these restaurants as examples while testing our model:

| Restaurant | Average Rating | Number of Ratings |
|---|---|---|
| A | 4.6 stars | 500 |
| B | 4.9 stars | 75 |
| C | 3.7 stars | 1000 |

Approach 1: Scaled Counts and Ratings

Let’s start by normalizing our review counts to the sum of all ratings across all restaurants (N) and the rating to the maximum rating of 5:

$$\pi_i = \frac{reviews_i}{N} * \frac{rating_i}{5}$$

This approach suggests strength is based on the fraction of total observations made, which has the effect of strongly favoring the most-rated restaurant without much regard for its average rating:

| Restaurant | Strength Estimate | Average Rating | Number of Ratings |
|---|---|---|---|
| A | 0.29 | 4.6 stars | 500 |
| B | 0.05 | 4.9 stars | 75 |
| C | 0.47 | 3.7 stars | 1000 |
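These strengths can be reproduced with a few lines of Python. This is only a sketch of the formula as written; the names `scaled_strengths` and `MAX_RATING` are my own.

```python
# Example restaurants from the text: (average rating, number of ratings).
restaurants = {"A": (4.6, 500), "B": (4.9, 75), "C": (3.7, 1000)}

MAX_RATING = 5.0


def scaled_strengths(data):
    """Strength = share of all reviews times rating as a fraction of 5 stars."""
    total_reviews = sum(count for _, count in data.values())
    return {
        name: (count / total_reviews) * (rating / MAX_RATING)
        for name, (rating, count) in data.items()
    }


for name, strength in scaled_strengths(restaurants).items():
    print(f"{name}: {strength:.2f}")
# A: 0.29
# B: 0.05
# C: 0.47
```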

Looking at the table above, the model fails because the rating term is too weak to offset the review-count term: nobody would prefer C to either A or B. Let’s add a corrective rate k that scales based on N to “boost” the rating term, since it’ll typically be several orders of magnitude less than the number of ratings:

$$\pi_i = \frac{reviews_i}{N} * \bigg(\frac{rating_i}{5}\bigg)^{k(N)}$$

Making k(N) scale with log10(N) lets the rating boost grow with the order of magnitude of the review counts, which helps keep either term from dominating across different restaurant popularity levels. We can play with the coefficient M to find values we can associate with the Feeling Lucky and Safe options:

| k(N) = M * log10(N) | M = 1 | M = 5 | M = 10 | M = 15 |
|---|---|---|---|---|
| A Strength | 0.24 | 0.084 | 0.022 | 0.006 |
| B Strength | 0.04 | 0.034 | 0.025 | 0.018 |
| C Strength | 0.24 | 0.005 | 0.000 | 0.000 |

By M = 5, we see the strength of C has been driven below the others, resulting in a reasonable ordering: A > B > C. By M = 15, B has overtaken A due to the heavy weight on the rating (the crossover is already visible at M = 10). As a result, we’ll move forward with the following strength estimators:

$$\pi_{safe_i} = \frac{reviews_i}{N} * \bigg(\frac{rating_i}{5}\bigg)^{5 * log_{10}(N)}$$

$$\pi_{lucky_i} = \frac{reviews_i}{N} * \bigg(\frac{rating_i}{5}\bigg)^{15 * log_{10}(N)}$$
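Here is a minimal Python sketch of both estimators, assuming the same three example restaurants. The names `bote_strengths`, `SAFE_M`, and `LUCKY_M` are my own labels for the formulas above.

```python
import math

MAX_RATING = 5.0
SAFE_M = 5    # coefficient chosen above for the Safe option
LUCKY_M = 15  # coefficient chosen above for the Feeling Lucky option


def bote_strengths(data, m):
    """Strength = review share times (rating / 5) ** (m * log10(N))."""
    total_reviews = sum(count for _, count in data.values())
    k = m * math.log10(total_reviews)
    return {
        name: (count / total_reviews) * (rating / MAX_RATING) ** k
        for name, (rating, count) in data.items()
    }


restaurants = {"A": (4.6, 500), "B": (4.9, 75), "C": (3.7, 1000)}
safe = bote_strengths(restaurants, SAFE_M)
lucky = bote_strengths(restaurants, LUCKY_M)
print(max(safe, key=safe.get))    # A
print(max(lucky, key=lucky.get))  # B
```

Because k depends on N, the strengths are only comparable within one set of candidates; adding or removing a restaurant changes every value.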

Going forward, we’ll refer to this method as BOTE, short for “back of the envelope”.

Approach 2: Empirical Bayes + Shrinkage

Our technique for k(N) was based on intuition and fitting to a small sample set. I want to compare our simple heuristic from Approach 1 to something with more rigor. Our biggest challenge here is not having access to the raw ratings data (how many 5s, 4s, etc.): this prevents us from using direct Bayesian methods to fit a distribution (e.g. the Beta distribution) to rating data and compute a confidence interval for the expected rating (the lower bound would be the Safe option and the upper bound the Feeling Lucky option). We can, however, adapt Empirical Bayes to our scenario where we only have summary statistics.

Empirical Bayes involves making assumptions about the global population - average restaurant ratings, in our case - and moving our specific observations towards that global mean, a process called shrinkage. The magnitude of shrinkage depends on sample size: restaurants with fewer reviews get pulled more heavily towards the global mean, while those with many reviews stay closer to their observed ratings.

We’ll use heavy shrinkage when calculating our Safe option in order to favor average ratings with more reviews - we want to have more information before trusting the rating. The Feeling Lucky calculation will favor the average rating and will have light shrinkage - we’re willing to trust quality signals of less-rated places.

To demonstrate Empirical Bayes on realistic data where restaurants have similar quality levels, I’ll update restaurant C to 4.7 stars with 25 reviews.

The first step in Empirical Bayes is to compute the global weighted mean and weighted variance of the ratings. For us, “global” means the restaurants we’re looking at, so with only a handful of them these estimates can be sensitive to a single outlier:

$$n = \text{number of restaurants}$$

$$N = \text{number of total reviews} = \sum_i^n reviews_i$$

$$\mu = \sum_i^n \frac{reviews_i}{N} * rating_i = 4.642$$

$$\sigma^2 = \sum_i^n \frac{reviews_i}{N}\big(rating_i - \mu\big)^2 = 0.010$$
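A short Python sketch of this step, using A and B as before and the updated C. The function name `weighted_mean_and_variance` is my own.

```python
def weighted_mean_and_variance(data):
    """Review-weighted mean and variance of the average ratings."""
    total_reviews = sum(count for _, count in data.values())
    mean = sum(count * rating for rating, count in data.values()) / total_reviews
    variance = (
        sum(count * (rating - mean) ** 2 for rating, count in data.values())
        / total_reviews
    )
    return mean, variance


# (average rating, number of ratings), with C updated to 4.7 stars and 25 reviews.
restaurants = {"A": (4.6, 500), "B": (4.9, 75), "C": (4.7, 25)}
mu, sigma_sq = weighted_mean_and_variance(restaurants)
print(f"{mu:.3f} {sigma_sq:.3f}")  # 4.642 0.010
```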

Next, we need to estimate the variance within each restaurant’s ratings. Without access to individual ratings, we’ll estimate each restaurant’s internal variance using the assumption that uncertainty is inversely proportional to the order of magnitude of its review count:

$$\sigma^2_i = \frac{M}{log_{10}(reviews_i)}$$

where the parameter M represents our risk tolerance.

We’ll “shrink” the original estimate towards the global mean to produce our strength estimate:

$$B_i = \frac{\sigma^2}{\sigma^2 + \sigma^2_i} = \frac{0.010}{0.010 + M/\log_{10}(reviews_i)}$$

$$\pi_i = B_i * rating_i + (1-B_i) * \mu$$

The behavior of Empirical Bayes is apparent in the last equation: the adjusted rating - our strength estimate π - is a weighted average of the original rating and the global mean. To help see this balance, note that the weights add up to 1:

$$w_{rating_i} + w_{\mu} = B_i + (1-B_i) = 1$$

When we have high confidence in the individual rating (large sample), the shrinkage factor approaches 1 and we trust the original rating. When we have low confidence (small sample), the shrinkage factor approaches 0 and we pull toward the global mean.
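The whole procedure fits in one small function. This is a sketch under the heuristic variance assumption above, not a general Empirical Bayes implementation; the name `ebs_strengths` and the two example values of M are my own choices for illustration.

```python
import math


def ebs_strengths(data, m):
    """Shrink each rating toward the review-weighted mean of the group."""
    total_reviews = sum(count for _, count in data.values())
    mu = sum(count * rating for rating, count in data.values()) / total_reviews
    sigma_sq = (
        sum(count * (rating - mu) ** 2 for rating, count in data.values())
        / total_reviews
    )

    strengths = {}
    for name, (rating, count) in data.items():
        # Heuristic within-restaurant variance; needs count > 1 so log10 > 0.
        sigma_i_sq = m / math.log10(count)
        shrink = sigma_sq / (sigma_sq + sigma_i_sq)  # B_i in the text
        strengths[name] = shrink * rating + (1 - shrink) * mu
    return strengths


restaurants = {"A": (4.6, 500), "B": (4.9, 75), "C": (4.7, 25)}
low_m = ebs_strengths(restaurants, m=0.1)   # smaller M: lighter shrinkage
high_m = ebs_strengths(restaurants, m=1.0)  # larger M: heavier shrinkage
for name in restaurants:
    print(f"{name}: {low_m[name]:.3f} {high_m[name]:.3f}")
```

A restaurant with a single review would make log10 return 0 and break the division, so a real app would need to handle that case separately.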

The shrinkage factor assumes a Normal prior distribution, which is inaccurate but the best we can do given the lack of access to the underlying rating distribution.

With the algorithm in hand, let’s test different values of M:

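| M | 0.1 | 0.5 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 |
|---|---|---|---|---|---|---|---|---|---|
| A (4.6, 500) | 4.633 | 4.640 | 4.641 | 4.641 | 4.641 | 4.641 | 4.641 | 4.641 | 4.642 |
| B (4.9, 75) | 4.682 | 4.651 | 4.646 | 4.644 | 4.643 | 4.643 | 4.643 | 4.642 | 4.642 |
| C (4.7, 25) | 4.649 | 4.643 | 4.642 | 4.642 | 4.642 | 4.642 | 4.642 | 4.642 | 4.642 |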

The table shows that for sufficiently large M - which corresponds to a large sample variance - all restaurants converge to the population mean 4.642. This means that setting M too high will make it impossible to differentiate between our options. Looking at restaurants B and C, we see their small sample sizes are quickly penalized relative to the more frequently reviewed restaurant A.

For our implementation, we’ll strike a balance using the following values of M:

| M | Option | Notes |
|---|---|---|
| 0.1 | Feeling Lucky | light shrinkage (trusts individual ratings) |
| 1.0 | Safe | heavy shrinkage (pulls ratings towards the mean) |

Going forward, we’ll refer to this method as EBS, short for Empirical Bayes with Shrinkage.

Summarizing the Models

We now have two approaches to test on the road:

  1. BOTE, which is based on our intuition. It combines normalized review counts with ratings, using exponential scaling to balance between trusting high-rated vs. highly-reviewed restaurants.
  2. EBS, which is based on Empirical Bayes. It balances the individual rating with the weighted average across all restaurants based on information about the specific restaurant and the population of restaurants we’re considering. With more ratings, we trust the individual rating over the population mean; with fewer ratings, we pull toward the population mean.

Both approaches give us the strength estimates we need for the Bradley-Terry model, though both required substituting heuristics for the missing rating distribution data.

Day 1: Starting Strong with Coffee

As we crossed our first state line, we wanted some caffeine. I found four coffee shops with similar ratings, all of which looked good:

| Coffee Shop | Rating | Reviews |
|---|---|---|
| Awaken | 4.9 | 78 |
| Arrow | 4.8 | 43 |
| Rock | 4.9 | 90 |
| Mallard | 4.9 | 58 |

Analysis

With N = 269 total reviews, our log scaling factor is log10(269) = 2.43 for Approach 1. For Approach 2, the global mean μ = 4.884 with variance σ² = 0.001, which is quite low.

| Coffee Shop | BOTE - Safe | BOTE - Lucky | EBS - Safe | EBS - Lucky |
|---|---|---|---|---|
| Awaken | 0.227 (2nd) | 0.139 (2nd) | 4.884 (1st) | 4.884 (1st) |
| Arrow | 0.097 | 0.036 | 4.884 (1st) | 4.882 |
| Rock | 0.262 (1st) | 0.160 (1st) | 4.884 (1st) | 4.884 (1st) |
| Mallard | 0.169 | 0.103 | 4.884 (1st) | 4.884 (1st) |

Within each approach, we use Bradley-Terry to determine the likelihood the 1st place shop is better than the 2nd:

| Approach | Comparison | P(1st > 2nd) |
|---|---|---|
| BOTE - Safe | Rock > Awaken | 53.6% |
| BOTE - Lucky | Rock > Awaken | 53.6% |
| EBS - Safe | Rock > Awaken | 50.0% |
| EBS - Lucky | Rock > Awaken | 50.0% |
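As a worked check of the two BOTE rows: Rock and Awaken share a 4.9 rating, so the rating term cancels in the Bradley-Terry ratio for any exponent k, and only the review counts remain. That is why the Safe and Lucky probabilities match:

$$\Pr(\text{Rock} > \text{Awaken}) = \frac{\frac{90}{269}\big(\frac{4.9}{5}\big)^{k}}{\frac{90}{269}\big(\frac{4.9}{5}\big)^{k} + \frac{78}{269}\big(\frac{4.9}{5}\big)^{k}} = \frac{90}{90 + 78} \approx 0.536$$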

Across all methods, this is a toss-up. With no statistical winner, we fell back on a common roadtrip heuristic: pick the closer place. This brought us to Awaken, which was 10 minutes closer. We enjoyed their coffee and outdoor seating while I scribbled down a few notes for this post.

Observations

Both models came to the same conclusion I did: the four shops were likely the same quality based on their ratings. The EBS approach had the most trouble separating the group because the population was already tightly grouped around the mean, so the shrinkage had little effect. The BOTE approach produced a spread, but it also didn’t identify a clear winner. Given how close these shops were in average rating and review count, this result makes sense: any difference between them would be negligible.

Sometimes the answer really is “They’re all good. Pick what’s convenient.”

Day 2: A Clean Lunch to Push Through

The cumulative effect of the coffee and junk food made us desperate for some vegetables. The pickings were slim on our stretch of highway:

| Restaurant | Rating | Reviews |
|---|---|---|
| Lettuce | 4.4 | 109 |
| Greener | 4.6 | 15 |
| Care | 3.8 | 93 |

Analysis

With N = 217 total reviews, our log scaling factor is log10(217) = 2.34 for Approach 1. For Approach 2, the global mean μ = 4.157 with variance σ² = 0.098 (roughly two orders of magnitude larger than yesterday’s):

| Restaurant | BOTE - Safe | BOTE - Lucky | EBS - Safe | EBS - Lucky |
|---|---|---|---|---|
| Lettuce | 0.113 (1st) | 0.0057 (1st) | 4.197 (2nd) | 4.319 (2nd) |
| Greener | 0.026 (2nd) | 0.0037 | 4.202 (1st) | 4.394 (1st) |
| Care | 0.017 | 0.0000 | 4.099 | 3.922 |
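As a worked check of the global mean and variance behind the EBS columns, plugging the three lunch options into the Approach 2 formulas gives:

$$\mu = \frac{109(4.4) + 15(4.6) + 93(3.8)}{217} = \frac{902}{217} \approx 4.157$$

$$\sigma^2 = \frac{109(4.4 - 4.157)^2 + 15(4.6 - 4.157)^2 + 93(3.8 - 4.157)^2}{217} \approx 0.098$$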

This time, we had different match-ups for Bradley-Terry across the two models:

| Approach | Comparison | P(1st > 2nd) |
|---|---|---|
| BOTE - Safe | Lettuce > Greener | 81.2% |
| BOTE - Lucky | Lettuce > Greener | 60.5% |
| EBS - Safe | Greener > Lettuce | 50.0% |
| EBS - Lucky | Greener > Lettuce | 50.4% |
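Writing out the two EBS rows shows where the near-50% values come from. Both strengths sit close to 4 on the star scale, so their ratio lands near one half even when the adjusted ratings differ:

$$\Pr(\text{Greener} > \text{Lettuce})_{safe} = \frac{4.202}{4.202 + 4.197} \approx 0.500$$

$$\Pr(\text{Greener} > \text{Lettuce})_{lucky} = \frac{4.394}{4.394 + 4.319} \approx 0.504$$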

EBS didn’t offer a clean winner based on Bradley-Terry. Since both BOTE options favored Lettuce and we weren’t feeling lucky, we went there for lunch. There was a gas station across the road, so we were able to fill multiple tanks in one stop. No complaints!

Observations

EBS continues to yield statistically equivalent results. I thought the higher variance would produce a clear leader, but with our current values of M, Greener’s low review count received a penalty large enough to diminish the impact of its higher rating.

The BOTE model produced the most decisive Bradley-Terry result from the trip, with an 81.2% likelihood that Lettuce was better than Greener. Note that the likelihood Lettuce is better than Greener is lower with higher risk tolerance. This aligns with how we constructed the model: we’re more apt to risk disappointment for a potential hidden gem when we’re feeling lucky.

Conclusion: No Free Lunch

After a long trip home, we ended up with a guaranteed win - my favorite Thai restaurant. No model consulted there!

Reflecting on the performance of both models, I found the small sample size was as much of an issue as expected. I wish I had access to individual rating distributions rather than just summary statistics. With the raw data, I could fit a proper Beta distribution to each restaurant’s ratings and compute confidence intervals directly. The lower bound would be the Safe option, the upper bound the Feeling Lucky option - no heuristics or parameter tuning required. EBS, in particular, was not effective in producing a clear winner because of the small population we considered. Typically when Empirical Bayes is applied, the population is truly the entire population (e.g. the MLB batting average across all players and all teams). The three or four points we gave it weren’t enough for the model to be useful.

That said, coming up with these models was a fun way to pass time on the drive up. There’s something satisfying about formalizing the mental calculations you’re already doing.

The most interesting result was seeing how well the BOTE held up. It seemed to make the same selection I would without the algorithm, which isn’t too surprising given that I came up with the model using my intuition. This makes me think that our BOTE is a fair approximation of whatever latent model I use naturally.

Neither model improved the quality of the typical highway restrooms, though.
