In this latest #UberData installment, we bring you the data science details of how we use classic Bayesian statistics to solve a uniquely Uber problem.

One of the projects the #UberData Team worked on earlier this summer was determining which businesses Uber riders like to patronize. What kind of food? Which airports, which hotels? At first glance, this seemed simple enough: reverse geocode the dropoff coordinates, or use a publicly available database containing numerous urban addresses to match up to nearby businesses.

But soon we realized that the “nearby businesses” part was precisely the problem. In certain parts of San Francisco, and other dense urban areas like New York City, you can’t swing your handbag without hitting a restaurant. And just because you get dropped off in front of one doesn’t mean you’re going to it. During rush hour or at a busy intersection, we’re often willing to walk a few blocks to arrive at our final destination if this means avoiding a traffic jam or an extra red light.

As we introduce UberPool, the carpool option of our app, where drivers make multiple pickups and dropoffs, these occurrences will only happen with increasing frequency… which means that any Uber rider destination popularity analysis is susceptible to be inaccurate due to noise.

What we need is information on where people ultimately want to go… rather than where people may specify where they want to be dropped off, in order to get there. Let’s build a probabilistic model that given dropoff latitude and longitude coordinates for a trip, predicts the address of the rider’s final destination.

In this post, we show you how Uber can use Bayesian statistics and where you get dropped off, to predict where you’re going 3 out of 4 times.

 

Our Uber Data

We took the riding patterns of over 3000 unique riders in San Francisco earlier in 2014 (anonymizing the data to protect privacy.) Each of these trips had been “tagged” by the rider: when requesting an Uber, the rider had filled in the destination field. We assumed that this represented the true destination the rider wanted to go, creating a gold standard against which we can compare the predictions of our model.

 

Probabilistic Prediction

Each address is a discrete indexed unit i = 1 \dots N.  The goal of our model is to accurately predict the address i that the trip ends at given a previous point along the trip (dropoff), the time of the trip, and other characteristics of the rider and landscape.

Since the model is probabilistic, this involves calculating the probability of each address i being the trip’s final destination, P(D = i | X = x), where D is a random variable representing the final destination and X is a random variable representing observed features of the trip.

Separating this probability into a likelihood component and a prior component and applying Bayes’ Theorem gives

P(D = i|X=x) \frac{P(X=x|D=i) P(D=i)}{\sum_{j=1}^N P(D=j|X=x)P(D=j)}

Here P(D=i) is the prior probability of the final destination being i, calculated as a weighted sum of assumptions, which we will lay out in the next section. P(D=i | X=x) is the likelihood of the final destination being i, having seen observed features X. This is computed from the distribution of distances between drop-off and final destination, as well the implications of time of trip.

 

Constructing the Prior

Rider Prior

The rider prior P_{\text{Rider}}(D=i) incorporates useful information about a rider’s personal destinations into the model. The intuition is that individual riders are likely to go to certain addresses that other people are unlikely to go to (their house, their work place, etc.)

Thus what P_{\text{Rider}}(D=i) is really saying, is P(D=i | C = c), where C is a random variable representing the identity of the rider. This distribution can be visualized as a histogram: each bar corresponds to a normalized number of times that rider c has been to address i.

Note that the Rider Prior assumes a closed world model: only the addresses that a rider has previously gone to are assigned any probability. (All other addresses have a probability of zero.)

This assumption, which is widely used in location-related algorithms, greatly simplifies things, but is obviously naive. In real life, people sometimes go to places they’ve never been before. More worryingly, the Rider Prior breaks down in the case of a new user. Locations must not only have been destined to; they must also have been observed. (So under this naive assumption, a new rider has zero probability of going anywhere!)

To address this conundrum we add 2 additional components to the prior: the Uber Prior and the Popular Place Prior.

Uber Prior

The Uber prior P_{\text{Uber}}(D=i) takes a wider lens and exploits the fact that Uber riders, all together, are likely to go to certain places.

P_{\text{Uber}}(D=i) = P(D=i| \text{is Uber user}) is the normalized number of times that any Uber Rider in the dataset has gone to address i.

Figure 1: The Uber Prior and a sample Rider Prior are mapped. Larger circle radii indicate that more trips in the training data were taken to that location.

Popular Place Prior

With the Rider Prior and Uber Prior, we assume the only addresses in SF are the addresses frequented by riders in our dataset. This is clearly inaccurate. The Popular Place Prior is our “cover everything” prior, and is constructed with data using 1000 businesses that fit each of the following verticals in San Francisco well:

  • restaurants
  • nightlife
  • hotels
  • shopping
  • museums
  • health

P_{\text{Popular Place}}(D=i) is the normalized number of reviews left for a business establishment on the site.

Combining the Priors

Putting the 3 priors together allows for good coverage. The Rider Prior covers presumably the places you often go to. The Uber Prior covers the places your friends often go to (and in turn where you might go to, i.e., your friend’s house). The Popular Place Prior covers any other location of note.

Certain locations will be tracked in more than one prior and of course, there are locations that fall outside all 3 delineations. (The hope is that these are few and far between.)

P(D=i) = {\alpha} P_{\text{Popular Place}}(D=i) +{\beta} P_{\text{Uber}}(D=i) + (1-{\alpha}-{\beta})P_{\text{Rider}}(D=i)

Currently we take {\alpha} = .3 and {\beta} = .3, we want to get a grasp on how each of the priors affects accuracy.

 

Constructing the Likelihood

Riders tend to be unwilling to be dropped off very far from their final destinations (see Figure 2 below). Intuitively, the farther away an address is from the drop-off, the less likely it is.

Figure 2: The distribution of distance between dropoff and final destination. The red vertical line indicates the 80th percentile.

We formalize this intuition with the likelihood: P(Y = y| D=i). Y is the observations of Haversine distance between dropoffs and the final destination (essentially, the distance as the crow flies).

Now what does this likelihood look like? We model it using a Gaussian distribution \mathcal{N}(\mu, \sigma), taking as \mu and \sigma^2 the maximum likelihood estimates \hat{\mu}_{MLE} and \hat{\sigma^2}_{MLE}.  The MLE estimates for parameters of a gaussian are just the sample mean and sample variance of the data. Then P(Y=y|D=i) = \mathcal{N}(Y=y|\mu, \sigma^2).

We also need to account for the type of neighborhood the dropoff is in. In a sketchy neighborhood, a rider might not want to walk even 50 meters, but in a busy downtown, dropoffs might be an approximate corner, in the general vicinity of where a rider is trying to ultimately go.

To capture this variance, we fit individual parameters for each zip code. That is,

\hat{{\mu}_{Z = z}}=\frac{1}{\sum_{k=1}^n \mathbb{1} (Z=z)} \sum_{k=1}^n {x_k} \mathbb{1} (Z=z)

\hat{{\sigma}^2_{Z = z}}=\frac{1}{\sum_{k=1}^n \mathbb{1} (Z=z)} \sum_{k=1}^n ({x_k}-\hat{{\mu}_{Z = z}} )^2

Then a trip with observed dropoff-destination distance difference y in zipcode z will have the likelihood P(Y=y | D= i)=\frac{1}{2 {\pi} {\sigma}} exp[-\frac{(x-\hat{{\mu}_{Z = z}})^2}{2\hat{{\sigma}^2_{Z = z}}}]

The second part of the likelihood is temporal. Certain locations are more likely than others depending on the time of day and day of week. Commuting patterns imply that people will go to office buildings in the financial district of San Francisco in the morning and leave in the evening; night club trips are unlikely Monday morning, but restaurant trips are likely from 5-8 pm.

To model these tendencies, we use T as the random variable representing time of trip. P(T = t | D = i) is the probability of the final destination being i given that the time at dropoff is t. We specify P(T=t|D=i) as a categorical distribution with event probabilities calculated as a normalized count of trips taken to location i at time t. (The time increment we use is one hour.)

 

Inferring the Posterior

In the previous sections, we introduced the prior and likelihood components. Multiplying these together gives us a number that is proportional to the posterior probability P(D = i | X = x). Note in particular that we are making an independence assumption between the 2 components of the likelihood , that P(X = x | D = i) = P(Y=y, T=t | D=i) = P(Y=y|D=i)P(T=t|D=i).

This assumption is simplistic, since it is certainly possible that there is some interaction between Y and T. Basically, time could affect how far someone is willing to walk from their Uber to their final destination (but for now we assume this effect is minimal).

Figure 3: An example posterior distribution for a correctly predicted test instance: 1348 Sacramento Street, San Francisco

 

Results

We evaluated our model using classic machine learning techniques, splitting the data into a test set and a training set to ensure our model was not tuned to just a specific set of trips in our dataset.

We iterated through each test trip, first outputting a candidate list of locations within 100 m of the dropoff. Then, we calculated the posterior probability of each of the candidates. We used the maximum a posteriori estimate (abbreviated as MAP): that is, we chose the address with the maximum posterior probability. We checked whether this address string matched the true address.

We found that 74% of the time, our model could correctly predict the exact address.

Think about the sheer number of possible businesses on a typical city street. About 3 out of every 4 times, we could correctly identify which of the numerous possibilities riders are headed — all with no additional information or context to go on. 

We compared our model results against 2 baselines: the naive baseline, and the smart baseline. The naive baseline made a random choice among the candidate locations and achieved 40% accuracy. The smart baseline took the closest candidate location and achieved 44% accuracy. Thus, in the context of our alternatives of how we elect a final destination, our model results are a good start at determining where we help people get to in aggregate, but aren’t good enough yet for us to stop thinking about this type of problem.

 

Where are we going next?

Our rider destination model is one way the #UberData Team is working on improving the Uber ride experience. Extensions of this project involve building more complex priors and likelihoods. For instance, one intuition is that people are likely to go to locations that are close to locations they frequent. (For instance, restaurants near their home, or subway stop near their workplace.) We could represent this with an additional prior specified as a bivariate gaussian.

In the interim, follow our #UberData Twitter stream or read more #UberData blog posts to learn more about what we’re up to.