Rvoterdistance calculates the geographic distance between voters and polling locations (or vote-by-mail drop boxes) using the Haversine great-circle formula, implemented in C++ for speed. The package supports:
sf POINT geometries directly

The package ships with two example datasets:
king_dbox: King County, WA ballot drop box locations and a sample of voters
meck_ev: Mecklenburg County, NC early voting locations and a sample of voters

library(Rvoterdistance)
data(meck_ev)
str(voter_meck)
#> 'data.frame': 4552 obs. of 3 variables:
#> $ county: chr "MECKLENBURG" "MECKLENBURG" "MECKLENBURG" "MECKLENBURG" ...
#> $ long : num -80.9 -81 -80.8 -80.8 -80.9 ...
#> $ lat : num 35.2 35.1 35.2 35.3 35 ...
str(early_meck)
#> 'data.frame': 21 obs. of 4 variables:
#> $ county : chr "MECKLENBURG" "MECKLENBURG" "MECKLENBURG" "MECKLENBURG" ...
#> $ office_addr: chr "BEATTIES FORD LIBRARY 2412 BEATTIES FORD RD" "BETTE RAE THOMAS RECREATION CENTER 2921 TUCKASEEGEE RD" "CORNELIUS TOWN HALL 21445 CATAWBA AVE" "DELTA CENTER 5408 BEATTIES FORD RD" ...
#> $ long : num -80.9 -80.9 -80.9 -80.9 -80.8 ...
#> $ lat : num 35.3 35.2 35.5 35.3 35.2 ...

The main function is nearest_location(). With the default k = 1, it returns one row per voter with the distance to the nearest polling location:
result <- nearest_location(
  voters = voter_meck,
  locations = early_meck,
  voter_coords = c("lat", "long"),
  location_coords = c("lat", "long")
)
head(result)
#> county long lat county
#> 1 MECKLENBURG -80.92800 35.20503 MECKLENBURG
#> 2 MECKLENBURG -80.99874 35.11030 MECKLENBURG
#> 3 MECKLENBURG -80.81264 35.22413 MECKLENBURG
#> 4 MECKLENBURG -80.79422 35.26237 MECKLENBURG
#> 5 MECKLENBURG -80.87096 35.04406 MECKLENBURG
#> 6 MECKLENBURG -80.78188 35.46774 MECKLENBURG
#> office_addr long lat distance_m
#> 1 WEST BOULEVARD LIBRARY 2157 WEST BLVD -80.89657 35.21157 2950.3663
#> 2 STEELE CREEK AREA 11130 SOUTH TRYON ST -80.96072 35.11588 3517.6541
#> 3 MIDWOOD CULTURAL CENTER 1817 CENTRAL AVE -80.80855 35.22020 574.0296
#> 4 SUGAR CREEK LIBRARY 4045 N TRYON ST -80.79749 35.25692 676.0503
#> 5 SOUTH COUNTY REGIONAL LIBRARY 5801 REA RD -80.81186 35.08731 7223.9220
#> 6 CORNELIUS TOWN HALL 21445 CATAWBA AVE -80.85924 35.48172 7184.1009
#> distance_km distance_miles
#> 1 2.9503663 1.8332772
#> 2 3.5176541 2.1857743
#> 3 0.5740296 0.3566863
#> 4 0.6760503 0.4200792
#> 5 7.2239220 4.4887482
#> 6 7.1841009 4.4640044

The output includes the voter data, the matched location data, and three distance columns: distance_m (meters), distance_km (kilometers), and distance_miles (miles).
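The three distance columns are the same great-circle distance expressed in different units. As a rough illustration, the Haversine formula can be sketched in base R as below. This is not the package's C++ code: the function name haversine_sketch() and the Earth radius of 6371.0088 km are assumptions made for this example, so values may differ slightly from the package's output.

```r
# Illustrative base R sketch of the Haversine formula (not the package's C++ code).
# Assumption: mean Earth radius of 6371.0088 km.
earth_radius_m <- 6371008.8

haversine_sketch <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  phi1 <- lat1 * to_rad
  phi2 <- lat2 * to_rad
  dphi <- (lat2 - lat1) * to_rad
  dlambda <- (lon2 - lon1) * to_rad
  a <- sin(dphi / 2)^2 + cos(phi1) * cos(phi2) * sin(dlambda / 2)^2
  # pmin() guards against rounding pushing sqrt(a) just above 1
  meters <- 2 * earth_radius_m * asin(pmin(1, sqrt(a)))
  data.frame(
    distance_m = meters,
    distance_km = meters / 1000,
    distance_miles = meters / 1609.344
  )
}

# Charlotte to Raleigh (approximate city-centre coordinates)
haversine_sketch(35.2271, -80.8431, 35.7796, -78.6382)
```

The function is vectorised over its arguments, so it can also be applied to whole coordinate columns of equal length.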
To find the 3 closest early voting sites for each voter:
result_k3 <- nearest_location(
  voter_meck, early_meck,
  voter_coords = c("lat", "long"),
  location_coords = c("lat", "long"),
  k = 3,
  append_data = FALSE
)
head(result_k3, 9)
#> voter_id rank location_id distance_m distance_km distance_miles
#> 1 1 1 21 2950.3663 2.9503663 1.8332772
#> 2 1 2 2 6166.9263 6.1669263 3.8319598
#> 3 1 3 9 7871.9153 7.8719153 4.8913935
#> 4 2 1 17 3517.6541 3.5176541 2.1857743
#> 5 2 2 9 13708.7253 13.7087253 8.5182281
#> 6 2 3 21 14613.4911 14.6134911 9.0804250
#> 7 3 1 11 574.0296 0.5740296 0.3566863
#> 8 3 2 5 2261.2717 2.2612717 1.4050926
#> 9 3 3 8 2600.9012 2.6009012 1.6161291

The output is in long format with a rank column (1 = nearest).

Find all early voting locations within 5 miles of each voter:
result_5mi <- nearest_location(
  voter_meck[1:20, ], early_meck,
  voter_coords = c("lat", "long"),
  location_coords = c("lat", "long"),
  max_dist = 5,
  units = "miles",
  append_data = FALSE
)
head(result_5mi, 10)
#> voter_id rank location_id distance_m distance_km distance_miles
#> 1 1 1 21 2950.3663 2.9503663 1.8332772
#> 2 1 2 2 6166.9263 6.1669263 3.8319598
#> 3 1 3 9 7871.9153 7.8719153 4.8913935
#> 4 2 1 17 3517.6541 3.5176541 2.1857743
#> 5 3 1 11 574.0296 0.5740296 0.3566863
#> 6 3 2 5 2261.2717 2.2612717 1.4050926
#> 7 3 3 8 2600.9012 2.6009012 1.6161291
#> 8 3 4 18 3901.7540 3.9017540 2.4244436
#> 9 3 5 1 6109.9996 6.1099996 3.7965871
#> 10 3 6 2 6177.6741 6.1776741 3.8386383
# How many locations within 5 miles per voter?
table(result_5mi$voter_id)
#>
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#> 3 1 9 10 1 1 5 2 1 1 2 9 1 3 2 7 2 6 1 2

If your data are already sf POINT objects, pass them directly; there is no need to specify coordinate column names:
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.4.0; sf_use_s2() is TRUE
voters_sf <- st_as_sf(voter_meck, coords = c("long", "lat"), crs = 4326)
locs_sf <- st_as_sf(early_meck, coords = c("long", "lat"), crs = 4326)
result_sf <- nearest_location(voters_sf, locs_sf, append_data = FALSE)
head(result_sf)
#> voter_id distance_m distance_km distance_miles
#> 1 1 2950.3663 2.9503663 1.8332772
#> 2 2 3517.6541 3.5176541 2.1857743
#> 3 3 574.0296 0.5740296 0.3566863
#> 4 4 676.0503 0.6760503 0.4200792
#> 5 5 7223.9220 7.2239220 4.4887482
#> 6 6 7184.1009 7.1841009 4.4640044

If the CRS is not WGS-84 (EPSG:4326), the package automatically transforms the coordinates to WGS-84 and prints a message.

For quick calculations without the full nearest_location() interface, use the lower-level helpers dist_km(), dist_mile(), and haversine():
# Minimum distance in km for each voter
km <- dist_km(
  voter_meck$lat, voter_meck$long,
  early_meck$lat, early_meck$long
)
summary(km)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 3.519e-04 1.874e+00 3.170e+00 3.496e+00 4.627e+00 1.040e+01
# Minimum distance in miles
mi <- dist_mile(
  voter_meck$lat, voter_meck$long,
  early_meck$lat, early_meck$long
)
summary(mi)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.0002187 1.1641727 1.9700480 2.1724660 2.8752326 6.4626295
# Single-pair distance (e.g., Charlotte to Raleigh)
haversine(35.2271, -80.8431, 35.7796, -78.6382, units = "miles")
#> [1] 129.9045

The dist_to_boundary() function computes the minimum distance from each voter to a geographic boundary such as a state border, river, or district line. The boundary is provided as an sf geometry object (LINESTRING, MULTILINESTRING, POLYGON, or MULTIPOLYGON). The computation uses the spherical cross-track distance formula in C++ with bounding-box pruning, making it practical for large voter files.
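To give a feel for the cross-track idea, here is a minimal base R sketch of the distance from one point to one great-circle segment. It is an illustration under stated assumptions, not the package's C++ implementation: the helper names (ang_dist(), bearing(), point_to_segment_km()) are hypothetical, the Earth radius of 6371.0088 km is assumed, inputs are scalars, and no bounding-box pruning is attempted. The point is projected onto the great circle through the segment; if the projection falls outside the segment, the distance to the nearer endpoint is used instead.

```r
# Illustrative sketch of point-to-segment distance on a sphere (scalar inputs).
# Assumptions: mean Earth radius 6371.0088 km; helper names are hypothetical.
R_km <- 6371.0088
rad <- function(deg) deg * pi / 180

# Angular great-circle distance (radians) via the Haversine formula
ang_dist <- function(lat1, lon1, lat2, lon2) {
  a <- sin(rad(lat2 - lat1) / 2)^2 +
    cos(rad(lat1)) * cos(rad(lat2)) * sin(rad(lon2 - lon1) / 2)^2
  2 * asin(min(1, sqrt(a)))
}

# Initial bearing (radians) from point 1 toward point 2
bearing <- function(lat1, lon1, lat2, lon2) {
  dl <- rad(lon2 - lon1)
  atan2(
    sin(dl) * cos(rad(lat2)),
    cos(rad(lat1)) * sin(rad(lat2)) - sin(rad(lat1)) * cos(rad(lat2)) * cos(dl)
  )
}

point_to_segment_km <- function(lat, lon, lat_a, lon_a, lat_b, lon_b) {
  d_ap <- ang_dist(lat_a, lon_a, lat, lon)      # A to point
  d_ab <- ang_dist(lat_a, lon_a, lat_b, lon_b)  # A to B
  dtheta <- bearing(lat_a, lon_a, lat, lon) - bearing(lat_a, lon_a, lat_b, lon_b)
  cross <- asin(sin(d_ap) * sin(dtheta))              # signed cross-track angle
  along <- atan2(sin(d_ap) * cos(dtheta), cos(d_ap))  # signed along-track angle
  if (along < 0) {
    R_km * d_ap                               # projection falls before A
  } else if (along > d_ab) {
    R_km * ang_dist(lat_b, lon_b, lat, lon)   # projection falls beyond B
  } else {
    R_km * abs(cross)                         # perpendicular distance
  }
}

# A point just east of a north-south segment along longitude -109.05
point_to_segment_km(35.0, -109.0, 31.33, -109.05, 37.00, -109.05)
```

A boundary with many vertices would apply this to every consecutive pair of vertices and keep the minimum, which is where pruning distant segments pays off.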
library(sf)
# Simplified AZ-NM border: a vertical line at longitude -109.05
border <- st_sf(
  geometry = st_sfc(
    st_linestring(matrix(
      c(
        -109.05, 31.33,
        -109.05, 37.00
      ),
      ncol = 2, byrow = TRUE
    )),
    crs = 4326
  )
)
# Two voters: one in Albuquerque, one near the border
voters <- data.frame(
name = c("Albuquerque voter", "Border voter"),
lat = c(35.08, 35.0),
lon = c(-106.65, -109.0)
)
d <- dist_to_boundary(voters, border,
voter_coords = c("lat", "lon"),
units = "km", progress = FALSE
)
data.frame(voter = voters$name, dist_km = round(d, 1))
#> voter dist_km
#> 1 Albuquerque voter 218.6
#> 2 Border voter 4.6

When the boundary is a polygon, dist_to_boundary() measures the distance to the polygon’s perimeter (nearest edge), not to its interior. A point inside the polygon therefore returns the positive distance to the nearest edge rather than zero.
# A rectangular district
district <- st_sf(
geometry = st_sfc(
st_polygon(list(matrix(c(
-110, 35,
-108, 35,
-108, 37,
-110, 37,
-110, 35
), ncol = 2, byrow = TRUE))),
crs = 4326
)
)
# One voter inside, one outside
voters2 <- data.frame(
name = c("Inside district", "Outside district"),
lat = c(36.0, 36.0),
lon = c(-109.0, -107.0)
)
d2 <- dist_to_boundary(voters2, district,
voter_coords = c("lat", "lon"),
units = "miles", progress = FALSE
)
data.frame(voter = voters2$name, dist_miles = round(d2, 1))
#> voter dist_miles
#> 1 Inside district 56
#> 2 Outside district 56

If your voter data is already an sf object with POINT geometry, pass it directly; voter_coords is not needed:
voters_sf <- st_sf(
name = c("Voter A", "Voter B"),
geometry = st_sfc(
st_point(c(-106.65, 35.08)),
st_point(c(-109.00, 35.00)),
crs = 4326
)
)
d3 <- dist_to_boundary(voters_sf, border,
units = "miles", progress = FALSE
)
data.frame(voter = voters_sf$name, dist_miles = round(d3, 1))
#> voter dist_miles
#> 1 Voter A 135.8
#> 2 Voter B 2.8

dist_to_boundary() supports three units: "km" (default), "miles", and "meters":
voters_u <- data.frame(lat = 36.0, lon = -108.0)
d_km <- dist_to_boundary(voters_u, border,
voter_coords = c("lat", "lon"), units = "km", progress = FALSE
)
d_mi <- dist_to_boundary(voters_u, border,
voter_coords = c("lat", "lon"), units = "miles", progress = FALSE
)
d_m <- dist_to_boundary(voters_u, border,
voter_coords = c("lat", "lon"), units = "meters", progress = FALSE
)
data.frame(km = round(d_km, 2), miles = round(d_mi, 2), meters = round(d_m, 1))
#> km miles meters
#> 1 94.56 58.76 94560.5

The Haversine computation runs in C++ and uses partial sorting (std::nth_element) for k-nearest queries, giving average O(n) time per voter in the number of locations n, rather than the O(n log n) of a full sort. The dist_to_boundary() function uses bounding-box pruning to skip distant boundary segments, avoiding unnecessary cross-track distance calculations. For large voter files, progress reporting can be switched on through the progress argument of dist_to_boundary(), which the examples above set to FALSE.

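Returning to the partial-sorting point above: the same idea can be sketched in base R, where sort() accepts a partial argument. This is only an analogy to std::nth_element, not the package's code; k_nearest_idx() is a hypothetical helper and the distances are made-up numbers.

```r
# Illustrative sketch: indices of the k smallest distances via partial sorting.
# Assumption: k_nearest_idx() is a hypothetical helper, not a package function.
k_nearest_idx <- function(distances, k) {
  k <- min(k, length(distances))
  # sort(partial = k) only guarantees that position k holds the k-th smallest
  # value with nothing larger before it; the rest is left unsorted.
  kth <- sort(distances, partial = k)[k]
  candidates <- which(distances <= kth)
  # Only the few candidates are fully ordered, then trimmed to k in case of ties
  head(candidates[order(distances[candidates])], k)
}

set.seed(1)
d <- runif(21, 0, 15)  # made-up distances from one voter to 21 locations
k_nearest_idx(d, 3)
```

Only the k candidates are ever fully ordered, so the cost of ranking stays small even when the number of locations is large.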
A natural application of dist_to_boundary() is a
geographic regression discontinuity design (RDD). The
distance to a boundary serves as the running variable (score), with the
boundary itself as the cutoff. Voters on one side receive a “treatment”
(e.g., different jurisdiction, policy environment, or services) and
voters on the other side serve as controls.
This example simulates 50,000 voters around the Sandia Pueblo
reservation in New Mexico and estimates the effect of reservation
residence on voter turnout using the rdrobust package.
We use a simplified polygon approximating the Sandia Pueblo reservation, which sits north of Albuquerque between the Rio Grande and the Sandia Mountains.
library(sf)
library(ggplot2)
library(rdrobust)
# Simplified Sandia Pueblo reservation boundary
sandia_coords <- matrix(c(
-106.6140, 35.1850,
-106.5400, 35.1800,
-106.4800, 35.2000,
-106.4500, 35.2350,
-106.4450, 35.2700,
-106.4600, 35.3050,
-106.4900, 35.3200,
-106.5500, 35.3250,
-106.5900, 35.3100,
-106.6100, 35.2750,
-106.6200, 35.2350,
-106.6140, 35.1850 # close the ring
), ncol = 2, byrow = TRUE)
sandia <- st_sf(
name = "Sandia Pueblo",
geometry = st_sfc(st_polygon(list(sandia_coords)), crs = 4326)
)

Voters are drawn from a mixture of clusters reflecting actual population centers around the reservation: northeast Albuquerque (largest cluster), Bernalillo, Corrales, Rio Rancho, Placitas, and a sparser set on the reservation itself.

set.seed(2024)
n <- 50000
# Population centers: lat, lon, mixture weight, spatial spread
# 1. NE Albuquerque / Sandia Heights (south of reservation, dense)
# 2. Bernalillo (northwest, medium)
# 3. On/near reservation (sparse)
# 4. Rio Rancho (west, medium)
# 5. Placitas (northeast, small)
# 6. Corrales (west along the river)
centers <- data.frame(
lat = c(35.160, 35.310, 35.250, 35.275, 35.340, 35.235),
lon = c(-106.520, -106.560, -106.530, -106.680, -106.440, -106.630),
weight = c(0.35, 0.18, 0.12, 0.17, 0.06, 0.12),
sd_lat = c(0.035, 0.025, 0.035, 0.025, 0.018, 0.018),
sd_lon = c(0.035, 0.025, 0.040, 0.025, 0.018, 0.018)
)
cluster <- sample(nrow(centers), n, replace = TRUE, prob = centers$weight)
voters <- data.frame(
voter_id = seq_len(n),
lat = rnorm(n, mean = centers$lat[cluster], sd = centers$sd_lat[cluster]),
lon = rnorm(n, mean = centers$lon[cluster], sd = centers$sd_lon[cluster])
)

The score is the signed distance (in miles) from each voter to the reservation boundary: positive for voters inside the reservation, negative for voters outside, with zero at the boundary.

# Distance to the reservation boundary (unsigned, in miles)
dist_miles <- dist_to_boundary(
voters, sandia,
voter_coords = c("lat", "lon"),
units = "miles",
progress = FALSE
)
# Determine which voters fall inside the reservation
voters_sf <- st_as_sf(voters, coords = c("lon", "lat"), crs = 4326)
inside <- lengths(st_intersects(voters_sf, sandia)) > 0
# Signed score: positive inside, negative outside
voters$score <- ifelse(inside, dist_miles, -dist_miles)
voters$inside <- inside
cat("Voters inside reservation:", sum(inside), "\n")
#> Voters inside reservation: 15869
cat("Voters outside reservation:", sum(!inside), "\n")
#> Voters outside reservation: 34131
summary(voters$score)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -10.3429 -3.0934 -1.2068 -1.2683 0.5282 4.5446

We simulate a binary turnout variable where the probability of voting is higher outside the reservation than inside, with a discontinuous jump at the boundary. A mild gradient in distance adds realism.

# Turnout probability: ~62% just outside the boundary, dropping to ~47% just inside
turnout_prob <- 0.62 + 0.005 * voters$score # gentle gradient
turnout_prob[inside] <- turnout_prob[inside] - 0.15 # discontinuity
turnout_prob <- pmin(pmax(turnout_prob, 0.10), 0.90) # clamp
voters$voted <- rbinom(n, 1, turnout_prob)
cat("Overall turnout rate:", round(mean(voters$voted), 3), "\n")
#> Overall turnout rate: 0.565
cat("Turnout inside: ", round(mean(voters$voted[inside]), 3), "\n")
#> Turnout inside: 0.481
cat("Turnout outside: ", round(mean(voters$voted[!inside]), 3), "\n")
#> Turnout outside: 0.604

The score distribution shows voter density on each side of the boundary. The reservation interior (positive scores) is sparsely populated relative to the surrounding communities.

ggplot(voters, aes(x = score)) +
geom_histogram(
aes(fill = inside),
bins = 100, alpha = 0.8, boundary = 0
) +
geom_vline(xintercept = 0, linetype = "dashed", linewidth = 0.8) +
scale_fill_manual(
values = c("FALSE" = "steelblue", "TRUE" = "firebrick"),
labels = c("Outside reservation", "Inside reservation"),
name = NULL
) +
labs(
x = "Distance to reservation boundary (miles)",
y = "Number of voters",
title = "Score variable: signed distance to Sandia Pueblo boundary"
) +
theme_minimal() +
theme(legend.position = "top")

# Plot a random sample of 5,000 voters for readability
set.seed(99)
voter_sample <- voters[sample(n, 5000), ]
ggplot() +
geom_point(
data = voter_sample,
aes(x = lon, y = lat, color = score),
size = 0.4, alpha = 0.6
) +
geom_sf(
data = sandia,
fill = NA, color = "black", linewidth = 1
) +
scale_color_gradient2(
low = "steelblue", mid = "grey90", high = "firebrick",
midpoint = 0, name = "Score\n(miles)"
) +
labs(
x = "Longitude", y = "Latitude",
title = "Simulated voters around Sandia Pueblo reservation",
subtitle = "Red = inside reservation, Blue = outside"
) +
coord_sf(
xlim = c(-106.78, -106.35),
ylim = c(35.08, 35.42)
) +
theme_minimal()

We estimate the local average treatment effect at the boundary using rdrobust(). The running variable is the signed distance score and the cutoff is zero.

rd <- rdrobust(y = voters$voted, x = voters$score, c = 0)
summary(rd)
#> Call: rdrobust
#>
#> Sharp RD estimates using local polynomial regression.
#>
#> Number of Obs. 50000
#> BW type mserd
#> Kernel Triangular
#> VCE method NN
#>
#> Left Right
#> Number of Obs. 34131 15869
#> Eff. Number of Obs. 10951 8250
#> Order est. (p) 1 1
#> Order bias (q) 2 2
#> BW est. (h) 1.457 1.457
#> BW bias (b) 2.275 2.275
#> rho (h/b) 0.641 0.641
#> Unique Obs. 34131 15869
#>
#> =====================================================================
#> Point Robust Inference
#> Estimate z P>|z| [ 95% C.I. ]
#> ---------------------------------------------------------------------
#> RD Effect -0.140 -7.732 0.000 [-0.178 , -0.106]
#> =====================================================================

The estimated coefficient represents the discontinuous change in turnout probability at the reservation boundary. A negative estimate indicates lower turnout just inside the reservation relative to just outside. Here the point estimate of -0.140 is close to the -0.15 discontinuity built into the simulation.

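To make the meaning of that coefficient concrete, the sketch below computes a sharp RD estimate by hand: fit a separate straight line on each side of the cutoff within a fixed window and take the difference of the two fitted values at zero. It is a simplified stand-in, not what rdrobust() does internally (the output above reports a triangular kernel and a data-driven "mserd" bandwidth). The simulated data, the bandwidth h = 1, and all object names are assumptions for illustration.

```r
# Illustrative sketch: sharp RD by local linear regression with a uniform kernel.
# Assumptions: simulated data, hand-picked bandwidth h = 1, hypothetical names.
set.seed(123)
n_sim <- 50000
score <- runif(n_sim, -5, 5)       # running variable, cutoff at 0
treated <- score >= 0
p <- 0.62 + 0.005 * score - 0.15 * treated
sim <- data.frame(score = score, y = rbinom(n_sim, 1, p))

h <- 1                             # fixed bandwidth (not data-driven)
left  <- lm(y ~ score, data = sim[sim$score < 0 & sim$score > -h, ])
right <- lm(y ~ score, data = sim[sim$score >= 0 & sim$score < h, ])

# Each intercept is the fitted outcome at score = 0 on that side;
# their difference is the estimated jump at the cutoff.
rd_est <- unname(coef(right)[1] - coef(left)[1])
rd_est
```

With a true jump of -0.15 in this simulation, the estimate should land in its neighbourhood, up to sampling noise. In applied work, a data-driven bandwidth and robust inference, as provided by rdrobust(), are preferable to a hand-picked window.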
The rdplot() function visualizes the local polynomial
fit on each side of the cutoff, with binned means showing the underlying
data pattern.
rdplot(
y = voters$voted,
x = voters$score,
c = 0,
title = "Geographic RD: Voter turnout at Sandia Pueblo boundary",
x.label = "Distance to reservation boundary (miles)",
y.label = "Voter turnout"
)