gregers kjerulf dubrow - Tidy Tuesday, November 24, 2020

Trail-head lake in the Cascade Mountains.

Diving deep into the Tidy Tuesday pool…

Every week I’m so impressed by the amount of beautiful, creative and informative analyses and data viz efforts that spring from #TidyTuesday, the weekly data project that evolved out of the R4DS Learning Community. And every week I say to myself, “I should really give it a go, grab the data and post some results”. Then life intervenes, work gets crazy, and before I know it…

Well, around Thanksgiving, with some slack time and some inspiration, I finally got around to it. A resulting benefit was it got me motivated to finally get around to getting this blog set up. My github repo has a folder for any tidytuesday projects I get around to.

Here’s the code and resulting charts & tables.

# load packages
library(tidytuesdayR) # to load tidytuesday data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(patchwork) # stitch plots together
library(gt) # lets make tables
library(RColorBrewer) # colors!
library(scales) # format chart output

First let’s read in the file using the tidytuesdayR package. We’ll also look at the raw data

tt_watrail <- tt_load("2020-11-24")
#> 
#>  Downloading file 1 of 1: `hike_data.rds`

glimpse(tt_watrail$hike_data)
#> Rows: 1,958
#> Columns: 8
#> $ name        <chr> "Lake Hills Greenbelt", "Snow Lake", "Skookum Flats", "Ten…
#> $ location    <chr> "Puget Sound and Islands -- Seattle-Tacoma Area", "Snoqual…
#> $ length      <chr> "2.3 miles, roundtrip", "7.2 miles, roundtrip", "7.8 miles…
#> $ gain        <chr> "50", "1800", "300", "1585", "500", "500", "425", "450", "…
#> $ highpoint   <chr> "330.0", "4400.0", "2550.0", "2370.0", "1000.0", "2200.0",…
#> $ rating      <chr> "3.67", "4.16", "3.68", "3.92", "4.14", "3.14", "5.00", "2…
#> $ features    <list> <"Dogs allowed on leash", "Wildlife", "Good for kids", "L…
#> $ description <chr> "Hike through a pastoral area first settled and farmed in …

There are a few things we want to do with the data for the working dataframe:

create columns for miles, direction, type from length
create specific location columns frolm location
change rating, gain and highpoint to numeric
create a rating group
change features to character vector, also unnest; makes the resulting df long. we’ll use distinct when we only need 1 obs per trail

tt_watraildf <- tt_watrail$hike_data %>%
  mutate(length_miles = parse_number(length)) %>%
  mutate(across(gain:rating, as.numeric)) %>%
  mutate(rating_grp = case_when(rating == 0 ~ "0",
                                rating >0 & rating < 2 ~ "1",
                                rating >=2 & rating < 3 ~ "2",
                                rating >=3 & rating < 4 ~ "3",
                                rating >=4 & rating < 5 ~ "4",
                                rating == 5 ~ "5")) %>%
  mutate(trail_type = case_when(grepl("roundtrip", length) ~ "Round trip",
                          grepl("one-way", length) ~ "One Way",
                          grepl("of trails", length) ~ "Trails")) %>% 
  mutate(location_split = location) %>%
  separate(location_split, c("location_region","location_specific"), sep = ' -- ') %>%
  mutate(features = lapply(features, sort, na.last = TRUE)) %>%
  mutate(feature_v = sapply(features,FUN = function(x) if (all(is.na(x))) NA else paste(x,collapse = ", "))) %>%
  mutate(feature_v = str_trim(feature_v)) %>%
  mutate(features_unnest = features) %>%
  unnest(cols = c(features_unnest), keep_empty = TRUE) %>% 
  mutate(feature_v = ifelse(is.na(feature_v), "none", feature_v)) %>%
  mutate(features_unnest = ifelse(is.na(features_unnest), "none", features_unnest)) %>%
  mutate(feature_init = case_when(features_unnest == "Dogs allowed on leash" ~ "DA",
                                  features_unnest == "Dogs not allowed" ~ "DN",
                                  features_unnest == "Wildlife" ~ "Wl",
                                  features_unnest == "Good for kids" ~ "GK",
                                  features_unnest == "Lakes" ~ "Lk",
                                  features_unnest == "Fall foliage" ~ "FF",
                                  features_unnest == "Ridges/passes" ~ "RP",
                                  features_unnest == "Established campsites" ~ "EC",
                                  features_unnest == "Mountain views" ~ "MV",
                                  features_unnest == "Old growth" ~ "OG",
                                  features_unnest == "Waterfalls" ~ "Wf",
                                  features_unnest == "Wildflowers/Meadows" ~ "WM",
                                  features_unnest == "Rivers" ~ "Ri",
                                  features_unnest == "Coast" ~ "Co",
                                  features_unnest == "Summits" ~ "Su")) %>%
  mutate(feature_init = ifelse(is.na(feature_init), "none", feature_init)) %>%
  mutate(feature_type = if_else(feature_init %in% c("DA","DN","GK"), "Companion", "Feature")) %>%
  mutate(feature_type = ifelse(feature_init == "none", "none", feature_type)) %>%
  group_by(name) %>%
  mutate(feature_n = n()) %>%
  ungroup() %>%
  mutate(feature_n = ifelse(feature_init == "none", 0, feature_n)) %>%
  select(name, location_region, location_specific, trail_type, length_miles, 
         gain, highpoint, rating, rating_grp, features, feature_v, features_unnest, 
         feature_init, feature_type, feature_n, description, location, length)

To get a sense of what the data look like, I’ll run some historgrams and scatterplots to see how things cluster, if there are outliers or anything else especially noticable.

Using log10 for the length scale to even out the spread. The patchwork package stitches the plots together in a neat panel.

hist_length <-
tt_watraildf %>%
  distinct(name, .keep_all = TRUE) %>%
  ggplot(aes(length_miles)) +
  geom_histogram(alpha = 0.8) +
  scale_x_log10() +
  labs(x = "Length (miles), log10")

hist_gain <-
  tt_watraildf %>%
  distinct(name, .keep_all = TRUE) %>%
  ggplot(aes(gain)) +
  geom_histogram(alpha = 0.8) +
  scale_x_log10()

hist_high <-
  tt_watraildf %>%
  distinct(name, .keep_all = TRUE) %>%
  ggplot(aes(highpoint)) +
  geom_histogram(alpha = 0.8) 

hist_rate <-
  tt_watraildf %>%
  distinct(name, .keep_all = TRUE) %>%
  ggplot(aes(rating)) +
  geom_histogram(alpha = 0.8) 

(hist_length | hist_gain) /
  (hist_high | hist_rate)

For the scatterplots, I plotted length by gain, faceting by ratings groups and then by region. We do have to be careful with ratings, as they are user-generated and some trails have very few votes. Log10 used again for length.

tt_watraildf %>%
  distinct(name, .keep_all = TRUE) %>%
  ggplot(aes(length_miles, gain)) +
  geom_point() +
  geom_smooth() +
  scale_x_log10() +
  labs(x = "Length (miles) log10", y = "Total Gain",
       title = "Length v Gain, by Rating Group") +
  facet_wrap(vars(rating_grp))


tt_watraildf %>%
  distinct(name, .keep_all = TRUE) %>%
  ggplot(aes(length_miles, gain)) +
  geom_point() +
  geom_smooth() +
  scale_x_log10() +
  labs(x = "Length (miles) log 10", y = "Total Gain",
       title = "Length v Gain, by Region") +
  facet_wrap(vars(location_region))

The outliers in terms of gain & length clustered in a few regions, so I wanted to see which they were. Not a surprise they clustered in the Cascades & Rainier.

tt_watraildf %>%
  distinct(name, .keep_all = TRUE) %>%
  filter(gain > 15000) %>%
  filter(length_miles > 90) %>%
  select(location_region, name, length_miles, gain) %>%
  arrange(name) 
#> # A tibble: 5 × 4
#>   location_region      name                                   length_miles  gain
#>   <chr>                <chr>                                         <dbl> <dbl>
#> 1 Southwest Washington Pacific Crest Trail (PCT) Section H -…         148. 27996
#> 2 South Cascades       Pacific Crest Trail (PCT) Section I -…          99  17771
#> 3 Central Cascades     Pacific Crest Trail (PCT) Section K -…         117  26351
#> 4 North Cascades       Pacific Northwest Trail - Pasayten Tr…         119  21071
#> 5 Mount Rainier Area   Wonderland Trail                                93  22000

Now that we see how the length, gain, highpoint & ratings spread out, I want build a table to see the averages by region.

I’ve been wanting to take a deeper dive into gt & reactable. I’ve got some basic gt calls down, but for this excercise I wanted to learn how to conditionally format columns based on value. So inspired by Thomas Mock’s gt primer, a basic table with heatmap-like formatting for some columns. See his explainer for details on the code, and for more features than I’m including.

# create by region averages df
byregion <-  tt_watraildf %>%
  distinct(name, .keep_all = TRUE) %>%
  group_by(location_region) %>%
  summarise(n_region = n(),
            avglength = mean(length_miles),
            avgrating = mean(rating),
            avggain = mean(gain),
            avghigh = mean(highpoint),
            minhigh = min(highpoint),
            maxhigh = max(highpoint)) %>%
  mutate_at(vars(avglength:avgrating), round, 2) %>%
  mutate_at(vars(avggain:avghigh), round, 0) 

byregion %>%
  gt() %>%
  fmt_number(columns = vars(avggain, avghigh, minhigh, maxhigh), decimals = 0, use_seps = TRUE) %>%
  # sets the columns and palette to format cell color by value range
  data_color(
    columns = vars(avglength, avgrating, avggain, avghigh, minhigh, maxhigh),
    colors = scales::col_numeric(
      palette = c("#ffffff", "#f2fbd2", "#c9ecb4", "#93d3ab", "#35b0ab"),
      domain = NULL)) %>%
#  tab_style calls add border boxes first to column labels, then body cells
  tab_style(
    style = list(
      cell_borders(
        sides = "all", color = "grey", weight = px(1))),
    locations = list(
      cells_column_labels(
        columns = gt::everything()
        ))) %>%
  tab_style(
    style = list(
      cell_borders(
        sides = "all", color = "grey", weight = px(1))),
    locations = list(
      cells_body(
        rows = gt::everything()
      ))) %>%
    cols_align(columns = TRUE, align = "center") %>%
  cols_align(columns = "location_region", align = "left") %>%
  cols_width(vars(location_region) ~ px(150),
             vars(n_region) ~ px(60),
             starts_with("avg") ~ px(80),
             ends_with("high") ~ px(90)
             ) %>%
  tab_header(title = "Regional Averages",
             subtitle = md("_North Cascades have longest trails,
                           all mountain areas have lots of gain and highest points_")) %>%
  cols_label(location_region = "Region", n_region = "N", avglength = "Avg Length (miles)",
            avgrating = "Avg Rating", avggain = "Avg Gain (ft)",avghigh = "Avg high point",
             minhigh = "Lowest high point", maxhigh = "Max high point")

Region	N	Avg Length (miles)	Avg Rating	Avg Gain (ft)	Avg high point	Lowest high point	Max high point
Regional Averages
North Cascades have longest trails, all mountain areas have lots of gain and highest points
Central Cascades	219	9.53	3.04	2,276	4,752	600	9,511
Central Washington	79	5.71	2.78	823	2,260	240	6,876
Eastern Washington	142	9.33	2.15	1,592	4,410	300	7,310
Issaquah Alps	76	5.03	2.53	984	1,518	250	3,004
Mount Rainier Area	193	8.19	3.35	1,881	5,222	800	10,080
North Cascades	292	11.24	3.08	2,535	5,111	125	9,200
Olympic Peninsula	209	8.13	3.32	1,572	2,821	20	6,988
Puget Sound and Islands	190	4.25	2.80	452	573	10	3,750
Snoqualmie Region	216	8.71	3.15	2,198	4,467	450	9,416
South Cascades	188	8.44	3.07	1,641	4,732	922	12,276
Southwest Washington	120	6.58	2.69	1,171	1,774	20	7,800

Now let’s look at the effect of trail features on rating.

First we’ll look at average rating by feature, then fit a model. First, a scatter-plot of number of features listed for a trail with user rating. Looks like at a certain point, it’s diminshing returns on trail features in terms of effect on rating.

tt_watraildf %>%
  distinct(name, .keep_all = TRUE) %>%
  ggplot(aes(feature_n, rating)) +
  geom_point() +
  geom_smooth() +
  labs(x = "# of features on a trail", y = "User rating",
       title = "Features and Rating by Trail Region") +
  facet_wrap(vars(location_region))

Here’s a table similar to the one for averages by region. I used the unnested features, so trails will be represented more than once. Dog-free trails do get the highest ratings, but it’s likely because they also tend to have highest high points, so offer views, are challenging, and so get good ratings.

byfeature <- 
tt_watraildf %>%
  group_by(features_unnest) %>%
  summarise(n_feature = n(),
            avgrating = mean(rating),
            avglength = mean(length_miles),
            avggain = mean(gain),
            avghigh = mean(highpoint),
            minhigh = min(highpoint),
            maxhigh = max(highpoint)) %>%
  mutate_at(vars(avglength:avgrating), round, 2) %>%
  mutate_at(vars(avggain:avghigh), round, 0) %>%
  arrange(desc(avgrating))

## create table
byfeature %>%
  gt() %>%
  fmt_number(columns = vars(n_feature, avggain, avghigh, minhigh, maxhigh), decimals = 0, use_seps = TRUE) %>%
  # sets the columns and palette to format cell color by value range
  data_color(
    columns = vars(avglength, avgrating, avggain, avghigh, minhigh, maxhigh),
    colors = scales::col_numeric(
      palette = c("#ffffff", "#f2fbd2", "#c9ecb4", "#93d3ab", "#35b0ab"),
      domain = NULL)) %>%
  # tab_style calls add border boxes first to column labels, then body cells
  tab_style(
    style = list(
      cell_borders(
        sides = "all", color = "grey", weight = px(1))),
    locations = list(
      cells_column_labels(
        columns = gt::everything()
      ))) %>%
  tab_style(
    style = list(
      cell_borders(
        sides = "all", color = "grey", weight = px(1))),
    locations = list(
      cells_body(
        rows = gt::everything()
      ))) %>%
  tab_header(title = "Averages by Feature",
             subtitle = md("_Dog-free trails with waterfalls & high peaks earn high ratings_")) %>%
  cols_align(columns = TRUE, align = "center") %>%
  cols_align(columns = "features_unnest", align = "left") %>%
  cols_width(vars(features_unnest) ~ px(150),
             vars(n_feature) ~ px(60),
             starts_with("avg") ~ px(80),
             ends_with("high") ~ px(90)
             ) %>%
  cols_label(features_unnest = "Feature", n_feature = "N", avglength = "Avg Length (miles)",
             avgrating = "Avg Rating", avggain = "Avg Gain (ft)",avghigh = "Avg high point",
             minhigh = "Lowest high point", maxhigh = "Max high point")

Feature	N	Avg Rating	Avg Length (miles)	Avg Gain (ft)	Avg high point	Lowest high point	Max high point
Averages by Feature
Dog-free trails with waterfalls & high peaks earn high ratings
Dogs not allowed	255	3.52	9.28	1,921	4,173	10	10,080
Waterfalls	282	3.46	9.64	1,938	3,648	150	10,080
Established campsites	396	3.40	12.80	2,380	4,487	25	12,276
Ridges/passes	496	3.25	12.24	2,864	5,575	400	9,511
Lakes	583	3.19	9.92	1,988	4,231	20	9,511
Old growth	534	3.16	9.00	1,746	3,364	25	8,096
Mountain views	1,175	3.13	9.72	2,201	4,621	20	12,276
Summits	454	3.11	10.40	2,854	5,250	200	12,276
Wildflowers/Meadows	952	3.10	9.26	1,967	4,243	10	9,511
Rivers	547	3.05	9.76	1,731	3,205	10	12,276
Good for kids	694	3.01	4.63	569	2,080	10	8,245
Wildlife	747	3.00	8.84	1,541	3,241	10	10,080
Dogs allowed on leash	1,045	2.94	7.04	1,379	3,183	20	12,276
Fall foliage	508	2.94	8.28	1,618	3,334	20	9,249
Coast	106	2.89	4.12	351	433	10	6,454
none	68	2.38	7.26	1,758	3,849	60	8,970

And finally a quick model to see what might affect a trail rating.

It’s a simple linear model using length, gain, highpoint, & number of features to predict rating. The elevation of the highest point and number of features are both significant. I’d need to do more digging to see what the power of the estimate is on the rating. It’s also slightly counter-intuitive given that we saw in the charts that length, elevation and gain seem to positively affect rating. But then the model only accounts for 4% of varaince, so it’s not telling us much.

# creat df with distinct observations for each trail 
tt_watraildf_dist <- tt_watraildf %>%
  distinct(name, .keep_all = TRUE) 

wtmodel1 <- lm(rating ~ length_miles + gain + highpoint + feature_n, data = tt_watraildf_dist)
summary(wtmodel1)
#> 
#> Call:
#> lm(formula = rating ~ length_miles + gain + highpoint + feature_n, 
#>     data = tt_watraildf_dist)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.6984 -0.3776  0.3716  0.9284  2.4565 
#> 
#> Coefficients:
#>                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   2.205e+00  8.942e-02  24.663  < 2e-16 ***
#> length_miles -6.565e-03  5.678e-03  -1.156    0.248    
#> gain         -3.590e-05  3.100e-05  -1.158    0.247    
#> highpoint     8.318e-05  1.742e-05   4.775 1.93e-06 ***
#> feature_n     1.272e-01  1.484e-02   8.576  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.398 on 1919 degrees of freedom
#> Multiple R-squared:  0.0488, Adjusted R-squared:  0.04682 
#> F-statistic: 24.61 on 4 and 1919 DF,  p-value: < 2.2e-16

There’s plenty more to do with the set, and some responses I’ve seen on Twitter have been creative…network graphs, better models…but I was able to brush up on gt, learned how to unnest and keep obs where the list was empty. So a successful #tudytuesday.

This post was last updated on 2023-05-18