Contributing Author/Data Scientist: Brandon Segal
As many news outlets have been reporting the past few weeks there is a new push to begin leveraging the wealth of location data from citizens that are captured from cellphones every day. Not discussing the ethics of whether that is what should be done, it does bring up an interesting question. What is the best way to find out where devices are dwelling?
GPS Data Overview
Knowing where devices are dwelling could help inform public services if social distancing efforts are working by showing locations with high densities of devices or by showing whether certain businesses are not observing given guidelines. To answer these kinds of questions, local governments could turn to data brokers who provide high precision GPS data from millions of citizens who have apps on their phones with location permissions. This data is what is commonly referred to as spatiotemporal data. Spatiotemporal data is, as the name suggests, data that has both a time and location associated with it and is used to describe moving objects or geometries that change over time.
Let’s look at an example of the spatiotemporal data your phones produce and send to these different data brokers.
Spatiotemporal Data Sample
On a map, this data would be a single point and would be pretty uninteresting by itself. However, when records are taken over a longer period we can begin seeing patterns emerge. Below is an example of a GPS trace from a University student in Seoul over two months. Notice how from these traces we can likely determine their common travel patterns, common dwell points, and outliers on the map caused by infrequent trips or device inaccuracies.
Student at Yonsei University GPS traces over 2 months taken from an Android Device (link)
Extracting Stay Points
Plotting these traces on a map makes drawing general qualitative trends easy but it is still difficult to measure where a device spends time whether it is in a specific coffee shop, mall, or even a census block. Grouping these unlabeled data points into logical groups calls for a clustering algorithm that can group these based on the density of points. These kinds of clustering algorithms are called Density-Based Clustering Algorithms with some examples being:
- DBSCAN: A density-based algorithm based on the distance between points and a minimum amount of points necessary to define a cluster
- OPTICS: A variant of DBSCAN but with a hierarchical component to find clusters of various densities
- DJ-Cluster: Standing for Density-Join Clustering Algorithm, it is a memory sensitive alternative to DBSCAN
Each of the algorithms above has its own benefits but they all depend on the following:
- Having a defined distance metric to measure how far the points are from each other
- A method for defining whether a group of points can be considered a cluster.
Each of the algorithms can be used with spatial data to find groups of points but the main challenge is when the temporal component is introduced. This is because the distance components can no longer be defined just by the euclidian distance between the x and y components but now you have to take into account the distance points are in time. While working with spatiotemporal data I have used several clustering methods but the one that worked best with GPS Data derived from phones was one called D-StaR.
D-StaR stands for the Duration based Stay Region extraction algorithm which was developed by Kyosuke Nishida, Hiroyuki Toda, and Yoshimasa Koike for extracting arbitrarily shaped stay regions while ignoring outliers. The benefits this algorithm has over others include:
- It allows for the points to be sampled more irregularly than other algorithms that depend on a minimum amount of points to define a cluster
- It has computational and memory benefits over algorithms that require computing the distance one point is from every other point
- It has the potential to handle streaming data and emit stay regions in a more event-driven application
D-StaR expects the input data set to be sorted by the time from beginning to end. After sorting the data the algorithm can be has a few input variables including the following:
- ϵ: A spatial distance that a point has to be within to be considered a spatial neighbor to a core point (measured in meters)
- q: A window size that determined the number of points before and after a core-point to find spatial neighbors (unitless)
- t: A temporal threshold to consider a merged cluster a stay region (measured in seconds)
Example of cluster determined by D-Star for a distance threshold ϵ, sliding window size q = 3, and duration threshold for core point t= 15. Point pi is described as (i, ti) in this figure
Using this algorithm I found improved F1 scores of finding actual stay regions by 20% from data derived from phone data. I invite you to read this paper linked above that goes over D-StaR as that helps support this kind of academic research. In the next post, I will go over how to implement this algorithm using Python and how we can implement it in a data processing pipeline.
About the author:
Brandon Segal is a Software Engineer currently working on building enterprise machine learning infrastructure with a strong focus on data engineering and real time inference. Previously Brandon has worked in data visualization and as a geospatial data scientist where he learned the interesting challenges with building data intensive applications. Located in the DMV area and originally from the Philly area he loves spending time with his wife, volunteering in Arlington, and continued learning.