30 Day Chart Challenge 2025

Let’s make some charts!
post
rstats
ggplot
dataviz
chartchallenge
denmark
education
higher education
Author

gregers kjerulf dubrow

Published

April 21, 2025

image listing the prompts for the 30 day chart challenge 2025.

The Plan for the 2025 Challenge

Quick note - if you’re reading this before April 1 it’s because due to something I need to do on April 2 that might keep me away from the computer for a day or two, I’m getting ahead of the challenge by doing some charts in late March. I’ll have at least 3 done, if not more. To at least play within the spirit of the challenge I won’t post them to social media until April 1 and I’m putting the first publish date on blog posts as April 1.

It’s time again for the 30 Day Chart Challenge, a social-media-driven data viz challenge to create and share a data visualization on any topic, but somehow corresponding to the daily prompt as suggested by the organizers.

I did participate last year, though started late and did not get to every prompt. Like last year I am going to keep all the prompt posts in one mega-post, adding to it regularly. Perhaps not each day, but I will try to get to every prompt. Like last year:

  • I will post contributions to social media via my Bluesky, and LinkedIn accounts.
  • I will add contributions to this post, meaning the posting date will refresh to the date I’ve added the chart. So If I get to a lot or even all of them, this will be a long post.
  • Use the table of contents to the right to navigate to a specific chart.
  • Images are “lightboxed”, so click on them to enlarge. Click esc or outside the image to restore the page.

One major change is that this year I am going to only pull data from Danmarks Statistik (Statistics Danmark, the national statistics agency) via the danstat package for r. To the extent possible I’ll be focusing on education data. These constraints will help me by not having to figure out which data source to use each day, plus I’ll hopefully be telling a story of the coming weeks.

It may not be a straight linear narrative and not every post will necessarily connect to the whole. I may not even know exactly what I have until I’m done. I have sketched out ideas for many of the prompts, but things may change as I go along. If the daily prompt just doesn’t lend itself to Danish education data, or Danish data at all, the subject for that chart will be something else.

As with all of my posts, I’m offering a tutorial/how-to as much as I am trying to convey insights from the data. If a topic or approach interests you, hopefully there’s enough explanation in the writing or the code so that you can do something similar or use it to hone your r skills.

Let’s get to the charts.

Prompt #1 - Fractions

The first week of prompts is devoted to comparisons, and the first day is fractions. I cheated a bit and replicated last year’s 1st prompt “Part-to-Whole”. But where last year I looked at educational attainment in the UK, this year it’s Denmark.

I won’t do too much explaining about education levels in Denmark, you can read up on them on the ministry’s page.

We’ll do a horizontal stacked bar chart displaying the highest education attained by Danes aged 25-69 as of 2023, with this population divided into four age groups.

To pull data from danstat, which accesses the Danmarks Statistik API, you have to know at least the table name and number. So it’s necessary to poke around Danmarks Statistik’s StatBank page and figure out which table has the data you need. The value added of the API and package is not having to download and keep flat files.

The table_meta command produces a nested tibble with values for each field in the table. To get the data you first create a list of fields and values (see the varaibles_ed object) then use the get_data command to actually fetch the data.

Show code for getting and cleaning data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(janitor) # tools for data cleaning
library(danstat) # package to get Danish statistics via api
library(ggtext) # enhancements for text in ggplot

# some custom functions
source("~/Data/r/basic functions.R")


# metadata for table variables, click thru nested tables to find variables and ids for filters
table_meta <- danstat::get_table_metadata(table_id = "hfudd11", variables_only = TRUE)

# create variable list using the ID value in the variable
variables_ed <- list(
    list(code = "bopomr", values = c("000", "081", "082", "083", "084", "085")),
    list(code = "hfudd", values = c("H10", "H20", "H30", "H35",
                                        "H40", "H50", "H60", "H70", "H80", "H90")),
    list(code = "køn", values = c("TOT")),
    list(code = "alder", values = c("25-29", "30-34", "35-39", "40-44", "45-49",
                                            "50-54", "55-59", "60-64", "65-69")),
    list(code = "tid", values = 2023))

# past variable list along with table name.
# note that in package, table name is lower case, though upper case on statbank page.
edattain1 <- get_data("hfudd11", variables_ed, language = "da") %>%
    as_tibble() %>%
    select(region = BOPOMR, age = ALDER, edlevel = HFUDD, n = INDHOLD)

# arrange data for chart
# collapse age and education levels into groups
edattain <- edattain1 %>%
    mutate(age_group = case_when(age %in% c("25-29 år", "30-34 år", "35-39 år") ~ "25-39",
                age %in% c("40-44 år","45-49 år") ~ "40-49",
                age %in% c("50-54 år","55-59 år") ~ "50-59",
                age %in% c("60-64 år","65-69 år") ~ "60-69")) %>%
    mutate(ed_group = case_when(edlevel == "H10 Grundskole" ~ "Grundskole/Primary",
                edlevel %in% c("H20 Gymnasiale uddannelser",
                "H30 Erhvervsfaglige uddannelser",
                "H35 Adgangsgivende uddannelsesforløb") ~ "Secondary",
                edlevel == "H40 Korte videregående uddannelser, KVU" ~ "Tertiary - 2yr",
                edlevel %in% c("H50 Mellemlange videregående uddannelser, MVU",
                "H60 Bacheloruddannelser, BACH") ~ "Tertiary - Bachelor",
                edlevel == "H70 Lange videregående uddannelser, LVU" ~ "Tertiary - Masters",
                edlevel == "H80 Ph.d. og forskeruddannelser" ~ "Tertiary - PhD",
                edlevel == "H90 Uoplyst mv." ~ "Not stated")) %>%
    group_by(region, age_group, ed_group) %>%
    mutate(n2 = sum(n)) %>%
    ungroup() %>%
    select(-n, -age) %>%
    distinct(region, age_group, ed_group, .keep_all = T) %>%
    rename(n = n2) %>%
    mutate(ed_group =
                factor(ed_group,
                levels = c("Grundskole/Primary", "Secondary", "Tertiary - 2yr",
                "Tertiary - Bachelor", "Tertiary - Masters", 
                "Tertiary - PhD", "Not stated")))

Now that we have a nice tidy tibble, let’s make a chart. The approach is fairly straight-forward ggplot. I used tricks like fct_rev to order the age groups on the chart, and the scales and ggtext packages to make things look nicer.

Show code for making the chart

## chart - all DK, horizontal bar age groups, percent stacks percent by deg level
edattain %>%
    filter(region == "Hele landet") %>%
    group_by(age_group) %>%
    mutate(age_total = sum(n)) %>%
    mutate(age_pct = round(n/age_total, 2)) %>%
    select(age_group, ed_group, age_pct) %>%
    # the line below passes temporary changes to the data object through to the chart commands
    {. ->> tmp} %>%
    ggplot(aes(age_pct, fct_rev(age_group), fill = fct_rev(ed_group))) +
    geom_bar(stat = "identity") +
    scale_x_continuous(expand = c(0,0),
                breaks = c(0, 0.25, 0.50, 0.75, 1),
                labels = c("0", "25%", "50%", "75%", "100%")) +
    geom_text(data = subset(tmp, age_pct >0.025),
                aes(label = scales::percent(round(age_pct , 2))),
                position = position_stack(vjust = 0.5),
                color= "white", vjust = 0.5, size = 12) +
    scale_fill_brewer(palette = "Set3") +
    labs(title = "Danes under 40 have a higher rate of post-Bachelor educational attainment than other age groups",
             subtitle = "Highest education level attained by age groups",
             caption = "*Data from Danmarks Statistik via danstat package*",
             x = "", y = "Age Group") +
    theme_minimal() +
    theme(legend.position = "bottom", legend.spacing.x = unit(0, 'cm'),
                legend.key.width = unit(4, 'cm'), legend.margin=margin(-10, 0, 0, 0),
                legend.text = element_text(size = 12), legend.title = element_text(size = 16),
                plot.title = element_text(hjust = .5, size = 20),
                plot.subtitle = element_text(size = 16),
                plot.caption = element_markdown(size = 12, face = "italic"),
                axis.text.x = element_text(size = 14),
                axis.text.y = element_text(size = 14),
                panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
    guides(fill = guide_legend(label.position = "bottom", reverse = TRUE, 
                direction = "horizontal", nrow = 1, 
                title = "Highest Educational Attainment", title.position = "top")) 
rm(tmp)

horizontal bar chart showing educational attainment by age group

As you might expect, younger Danes tend to have higher levels of educational attainment than Danes over 60. That’s not uncommmon in countries with post-secondary education policies designed to increase access and attainment. As in the US this is a post-WWII development, though accelerated in the 1960s and after.

It’s been successful public policy in Denmark for many decades to increase access to schooling beyond the Grundskole (primary years…to about grade 9 in a US context). More than 45% of Danes between 25-40 have at least a Bachelor’s degree, and 20% have a Master’s. There would of course be more nuance in disaggregated age groups, but I wanted to keep this first story a bit more simple.

created and posted March 24, 2025

Prompt #2 - Slope

Building on what we saw in Prompt 1 for educational attainment by age group in 2023, let’s use the slope prompt to compare 2023 to 2005, the furthest back we can get educational attainment by age statistics via Danmarks Statistik’s StatBank.

The process for getting data is more or less the same as in Prompt 1, so I won’t repeat the code here. The only thing I did differently was collapse the 2-year and Bachelor’s degrees into one “college” category and collapsed Master’s and PhD into one post-bacc group.

For this chart I want to see if there are differences in highest educational attainment level by age group in two distinct years. Instead of one chart with too many lines, essentially a spaghetti graph, I made four charts and used the patchwork package (created by a Dane!) to facet out the individual charts into a (hopefully) coherent whole.

Keep in mind that while of course many people will be represented in both 2005 and 2023, this is not a longitudinal study. It’s a snapshot of the Danish population by age groups at two distinct points in time.

We have the data, and it’s time to build the charts. But instead of copy-pasting the ggplot code to build each charts, let’s create a function and use that four times.

Show code for the chart function
# plotdf is the placeholder name in the function for the data you pass to run it
slope_graph <- function(plotdf) {
    plotdf %>%
    ggplot(aes(x = year, y = age_ed_pct, group = age_group, color = age_group)) +
        geom_line(size = 1) +
        geom_point(size = .2) +
        scale_x_continuous(
            breaks = c(2005, 2023),
            labels = c("2005", "2023")) +
        scale_y_continuous(limits = c(0, .5),
            labels = scales::label_percent()) +
        scale_color_brewer(palette = "Set2") +
        labs(x = "", y = "") +
        theme_minimal() +
        theme(legend.position = "bottom", legend.spacing.x = unit(0, 'cm'),
            legend.key.width = unit(3, 'cm'), legend.margin=margin(-10, 0, 0, 0),
            legend.text = element_text(size = 10), legend.title = element_text(size = 12),
            plot.title = element_text(hjust = .5, size = 16),
            plot.subtitle = element_markdown(size = 14, vjust = -.5),
            plot.caption = element_markdown(size = 12, face = "italic"),
            axis.text.x = element_text(size = 11),
            axis.text.y = element_text(size = 11),
            panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
            strip.background = element_blank()) +
        guides(color = guide_legend(label.position = "bottom", reverse = FALSE, 
            direction = "horizontal", nrow = 1,
            title = "Age Group", title.position = "top"))
}

Now let’s put the function to use. We’ll create four plots, then stitch them together. Each plot will have its own descriptive title and subtitle, and the final chart will have one as well. Notice that we’re using the titles to tell a quick story or share an insight, not to dryly name what the chart is doing. The ggtext package gives us markdown enhancements for chart elements like the title and subtitle.

Show code for making the final chart
# create individual plots with titles and annotations
plot_grundsk <-
    edattain2 %>%
    filter(ed_group2 == "Grundskole/Primary") %>%
    slope_graph() +
    labs(title = "Primary school (thru grade 10)",
             subtitle = "*In 2023 Danes of all ages were much less likely to have stopped their education at<br>primary school
             than they were in 2005.<br>*")

plot_hs <-
    edattain2 %>%
    filter(ed_group2 == "Secondary") %>%
    slope_graph() +
    labs(title = "Gymnasium & Vocational (High School)",
             subtitle = "*From 2005 to 2023, there was a steep decline in the percentage of Danes<br>
             aged 25-49 who were finished with education at the high school level, especially<br>
             Danes under 40. For Danes older than 50 there was a very slight increase.*")

plot_colldegs <-
    edattain2 %>%
    filter(ed_group2 == "Tertiary - 2yr/Bach") %>%
    slope_graph() +
    labs(title = "2-year & Bachelor's Degrees",
             subtitle = "*For Danes of all age groups, but especially those under 50, there was a<br>
             noticable increase in the percentage earning 2 or 4 year degrees.*")

plot_masters <-
    edattain2 %>%
    filter(ed_group2 == "Tertiary - Masters+") %>%
    slope_graph() +
    labs(title = "Master's & PhD Degrees",
             subtitle = "*The percentage of Danes earning a Master's or PhD increased across all age groups between 2005 and 2025;<br>
             the increase was strongest among Danes under 50.*")

plot_grundsk + plot_hs + plot_colldegs + plot_masters +
    plot_annotation(
        title = "Danes of all ages have become more likely to continue their education beyond primary level. Danes under 50 have over time become likely to earn a Master's.",
        subtitle = "*Highest level of education earned by age groups, in 2005 and 2023.*",
        caption = "*Data from Danmarks Statistik via danstat package. Groups are not longitudinal - age is for the person in the year of data collection.*") &
    theme(plot.title = element_text(size = 16), plot.subtitle = element_markdown(),
        plot.caption = element_markdown(),
        # plot.background add the grey lines between plots
        plot.background = element_rect(colour = "grey", fill=NA))

four slope graphs displaying educational attainment by age group

The story here is hopefully clear. In 2005, more Danes, especially Danes 60+ years of age, had ended their education at primary level. By 2023 many more Danes were likely to have progressed beyond the primary level, and younger Danes especially were finishing Bachelor’s degrees and earning Master’s degrees.

It’s been government policy to support furthering education levels via free tuition and living stipends. This is made possible by a democratic-socialism approach to the economy, with higher tax rates on some earners than you might see in the US or UK, but for those taxes you received and your kids get at least 4 years of college paid for, and a small living stipend to boot.

In the US public higher education affordability models vary from state to state. Some states try to keep cost-of-attendance low via more direct state support. But not all students get full rides, even with need-based aid. I am much more a fan of the socialized approach to college affordability. the data shows it works in terms of increasing overall educational attainment.

created and posted March 24, 2025

Prompt #3 - Circular

Went wild with the circles here, sixty of them altogether. I took inspiration from this Albert Rapp post on using a bubble chart in place of a heat map.

We are still pulling data from from Danmarks Statistik via the danstat package. This time we’ll use the “Labour & income” domain to look at earnings by employment sector and highest level of education attained, the LONS11 table; løn is the Danish word for wage or salary. Of the twenty-five available earnings indicators, we’re using the MDRSNIT field, which is standardized monthly earnings. I couldn’t easily find the documentation on exactly how it was imputed, but given the analysis is across sectors, I thought it made the most sense. We’ll also pull in whether the employee was salaried or hourly (spoiler alert…it matters).

Let’s first get and clean the data.

Show code for getting and cleaning data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(janitor) # tools for data cleaning
library(danstat) # package to get Danish statistics via api
library(ggtext) # enhancements for text in ggplot

# some custom functions
source("~/Data/r/basic functions.R")

# lons11
table_meta <- danstat::get_table_metadata(table_id = "lons11", variables_only = TRUE)

# create variable list using the ID value in the variable
variables_ed <- list(
    list(code = "uddannelse", values = c("H10", "H20", "H30", "H40", "H50", "H60", "H70", "H80")),
    list(code = "sektor", values = c(1016, 1020, 1025, 1046)),
    list(code = "lønmål", values = "MDRSNIT"), # avg monthly
#   list(code = "køn", values = c("M", "K")),
  list(code = "afloen", values = c("TIME", "FAST")),
    list(code = "tid", values = 2023))

sal1 <- get_data("lons11", variables_ed, language = "en") %>%
    as_tibble() %>%
    clean_names()

# earnings came in character form, so changing to numeric and rounding up to whole numbers
# group the ed levels and change the names, and change names on sector. make them factors with orders.
sal2 <- sal1 %>%
    mutate(income = as.numeric(indhold)) %>%
    mutate(income = round(income, 2)) %>%
    mutate(sector = case_when(
        sektor == "Corporations and organizations" ~ "Private sector",
        sektor == "Government including social security funds" ~ "Gov't - National",
        sektor == "Municipal government" ~ "Gov't - Municipal",
        sektor == "Regional government" ~ "Gov't - Regional")) %>%
    mutate(sector =
        factor(sector,
        levels = c("Gov't - Municipal", "Gov't - Regional",
        "Gov't - National", "Private sector"))) %>%
    mutate(ed_level =
        case_when(uddannelse == "H10 Primary education" ~ "Primary",
        uddannelse == "H20 Upper secondary education" ~ "HS-Academic",
        uddannelse == "H30 Vocational Education and Training (VET)" ~
        "HS-Vocational",
        uddannelse == "H40 Short cycle higher education" ~
        "Short-cycle college",
        uddannelse == "H50 Vocational bachelors educations" ~
        "Bachelor-Vocational",
        uddannelse == "H60 Bachelors programs" ~ "Bachelor-Academic",
        uddannelse == "H70 Masters programs" ~ "Masters",
        uddannelse == "H80 PhD programs" ~ "PhD")) %>%
    mutate(ed_level =
        factor(ed_level,
        levels = c("Primary", "HS-Academic", "HS-Vocational",
        "Short-cycle college", "Bachelor-Vocational", 
        "Bachelor-Academic", "Masters", "PhD")))

Ok, we have the data, onto making the circles. Again, we’re doing a bubble chart that functions as a heat map by pegging the bubble size to the value of the earnings. I also added text labels next to the bubbles.

Show code for making the chart
sal2 %>%
    ggplot(aes(x = sector, y = ed_level)) +
    geom_point(aes(col = income, fill = income, size = income), shape = 21) +
    theme_minimal() +
    scale_size_area(max_size = 15) +
    labs(x = "", y = "",
             title = "More education = higher monthly earnings across all sectors. Larger effect for salaried vs hourly workers",
             subtitle = "Monthly standardized earnings by education level & earner category",
             caption = "*Data from Danmarks Statistik via danstat package*") +
    facet_wrap(~ afloen) +
    theme(legend.position = 'top',
                legend.justification = c(.95,0),
                plot.title = element_text(size = 20),
                plot.subtitle = element_text(size = 16),
                plot.caption = element_markdown(size = 12, face = "italic")) +
    guides(
        col = guide_none(),
        size = guide_none(),
        fill = guide_colorbar(
            barheight = unit(0.5, 'cm'),
            barwidth = unit(10, 'cm'),
            title.position = 'top',
            title = "Standardized Monthly Earnings (Danish Kroner)")) +
    scale_fill_continuous(limit = c(25000, 80000),
        breaks = c(30000, 40000, 50000, 60000, 70000, 80000)) +
    geom_text(data = subset(sal2, !is.na(income)),
        aes(label =
                    paste0(round(income, 0), " DKK")), nudge_x = 0.35, size = 3)

I see the story here being that education level matters when it comes to how much Danish workers earn in a month. If you work in the private sector, having a Masters 8,000 to 10,000 more DKK per month than Bachelor’s degrees. The difference is even greater working for regional government agencies. Also, hourly workers with the same education earn less per month than salaried workers. But without the component parts of the total earnings I can’t explain exactly why.

The bottom line is, as we’re seeing so far in the charts, there’s both an increase in the number of people earning college and post-bacc degrees, and the payoff seems to be there for the extra year of school it takes for most programs.

But again, given that education is socialized and subsidized through taxes, the opportunity cost for that extra year is nothing, because the basic higher education coverage is for five years, for tuition and living stipend. And higher wage earners pay more taxes, helping to boost everyone’s opportunity to earn the degrees they desire, without being worried about cost.

It’s worth mentioning here, that there’s discussion ongoing in the Danish government to squeeze education spending and spots available in the gymnasiums (essentially high school) that feed university programs. I need to read up on that a bit more as I tell this Danish education story while doing the chart challenge.

Oh, the exchange rate is usually around 6.5 or so DKK to USD. So while it may seem amazing to earn 50000 per month, and while that’s not bad in this market, adjust your bearings if you’re thinking in USD or EURO.

bubble chart displaying monthly earnings by educational attainment and employment sector

created and posted April 2, 2025

Prompt #4 - Big or Small

For this post I’ll stay with a circle theme and make a simple circle packing chart, sort of a heat map or tree map but with circles, the size of the circles corresponding to the variable value. I’m using the code-through from the R Graph Gallery to get going, as I’ve never done this type of chart before. The packcircles package does the computational work to create the circle objects.

Keeping with the thematic constraint I’ve set, I’ll once again be using Danish education data from Danmarks Statistik via the danstat package.

This chart will display Bachelor’s degrees awarded in 2023, by top-line degree field (Science, Humanities, Education, etc). For instance, Law is a sub-field in Social Sciences, one of many. To make it a bit easier I’ll limit this chart to academic Bachelor’s, not the vocational degrees. There’s definitely a more focused project worth doing that displays the sub-fields within the top-line areas, shown as nested smaller circles within larger circles, or something interactive. for now this will suffice.

To start, let’s get the data. It’s from a different table than the last chart, so I’ll show the work.

Show code for getting and cleaning data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(janitor) # tools for data cleaning
library(danstat) # package to get Danish statistics via api
library(packcircles) # makes the circles!
library(ggtext) # enhancements for text in ggplot
library(ggrepel) # helps with label offsets

# some custom functions
source("~/Data/r/basic functions.R")

# this table is UDDAKT60, set to all lower-case in the table_id call
table_meta <- danstat::get_table_metadata(table_id = "uddakt60", variables_only = TRUE)

# create variable list using the ID value in the variable
# the uddanelse codes are the top-line degree fields.
variables_ed <- list(
    list(code = "uddannelse", values = c("H6020", "H6025", "H6030", "H6035",
                 "H6039", "H6059", "H6075", "H6080", "H6090")),
    list(code = "fstatus", values = c("F")),
    list(code = "tid", values = 2023))

# start with a larger dataset in case we want to do more analysis or viz with it
degs1 <- get_data("uddakt60", variables_ed, language = "en") %>%
    as_tibble() %>%
    # shorten the given degree field names
    mutate(deg_field = case_when(UDDANNELSE == "H6020 Educational, BACH" ~ "Educ.",
            UDDANNELSE == "H6025 Humanities and theological, BACH" ~ "Humanities",
            UDDANNELSE == "H6030 Arts, BACH" ~ "Arts",
            UDDANNELSE == "H6035 Science, BACH" ~ "Science",
            UDDANNELSE == "H6039 Social Sciences, BACH" ~ "Social Science",
            UDDANNELSE == "H6059 Technical sciences, BACH" ~ "Tech Science",
            UDDANNELSE == "H6075 Food, biotechnology and laboratory technology, BACH" 
                        ~ "Food/Biotech/LabTech",
            UDDANNELSE == "H6080 Agriculture, nature and environment, BACH"
                     ~ "Agricultural Science",
            UDDANNELSE == "H6090 Health science, BACH" ~ "Health Sciences")) 

# keep only what we ween to do the circles
degs2 <- degs1 %>%
    select(group = deg_field, value = INDHOLD, text)

# this creates x & y coordinates and circle radius values.
degs_packing <- circleProgressiveLayout(degs2$value, sizetype='area')
# new tibble putting it together
degs3 <- cbind(degs2, degs_packing)
# not sure exactly what this does, but it's an important step
degs.gg <- circleLayoutVertices(degs_packing, npoints=50)

Ok, we now have a nice clean tibble, time for the plot. What took the most time here was working on the labels…getting them to the correct size, adjusting for circle size. You’ll notice one label set off to the side thanks to the ggrepel package.

Show code for making the chart
ggplot() +
    geom_polygon_interactive(
        data = degs.gg,
        aes(x, y, group = id, fill=id, tooltip = data$text[id], data_id = id),
        colour = "black", alpha = 0.6) +
    scale_fill_viridis() +
    geom_text(data = degs3 %>% filter(value > 6000),
            aes(x, y, label = paste0(group, "\n", format(value, big.mark=","))),
            size=7, color="black") +
    geom_text(data = degs3 %>% filter(between(value, 2410, 3350)),
            aes(x, y, label = paste0(group, "\n", format(value, big.mark=","))),
            size=7, color="black") +
    geom_text(data = degs3 %>% filter(between(value, 600, 2410)),
            aes(x, y, label = paste0(group, "\n", format(value, big.mark=","))),
            size=6, color="black") +
    geom_text_repel(data = degs3 %>% filter(between(value, 200, 400)),
            aes(x, y, label = paste0(group, "\n", format(value, big.mark=","))),
            size=4, color="black",
            max.overlaps = Inf, nudge_x = -110, nudge_y = 50,
            segment.curvature = 0,
            segment.ncp = 8,
            segment.angle = 30) +
    labs(x = "", y = "",
             title = "Social Sciences were the most popular Bachelor's degrees awarded by Danish universities in 2023",
             subtitle = "*Labels not diplayed: Education = 134, Food Science = 61*",
             caption = "*Data from Danmarks Statistik via danstat package*") +
    theme_void() +
    theme(legend.position="none", plot.margin=unit(c(0,0,0,0),"cm"),
            plot.title = element_markdown(size = 16),
            plot.subtitle = element_markdown(size = 12),
            plot.caption = element_markdown(size = 8)) +
    coord_equal()

packed circle graph displaying Bachelors degrees awarded by field

There you have it, academic Bachelor’s degrees awarded by Danish universities in 2023. The Social Sciences category accounts for more than twice as many degrees as the next field. Again, the next interesting step would be to visualize the disaggregation of the main fields, and see what subfields are the most popular. It would also be interesting to track the numbers over time, see which majors rose or fell.

created and posted April 2, 2025

Prompt #5 - Ranking

For this prompt I desperately wanted to do something relating to music from the band The English Beat because of Ranking Roger. But I’d already committed to the Danish education bit, and Spotify shut down all the interesting API endpoints and don’t seem to be reanimating them any time soon. Something about “safety”, but it’s likely AI crawlers, the same ones increasing hosting costs for Wikipedia and other websites as they steal content for their models. And funny for Spotify to complain about someone else using their work to monetize a product.

But anyway, let’s talk ranks. Specifically let’s look at the rankings of Bachelor degrees awarded in Denmark in 2023. I’ll focus on the academic (4-year) degrees because I wanted to look at majors within broader disciplines and look at differences by gender.

Again, data comes from Danmarks Statistik via the danstat package.

So first, let’s get the data. It’s from the same table used in prompt 4 but retreived slightly differently so I’ll show it again. This time we’re getting all the majors within the broader disciplines so in the list call we use list(code = "uddannelse", values = "*").

Show code for getting and cleaning data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(janitor) # tools for data cleaning
library(danstat) # package to get Danish statistics via api
library(packcircles) # makes the circles!
library(ggtext) # enhancements for text in ggplot
library(ggrepel) # helps with label offsets

# some custom functions
source("~/Data/r/basic functions.R")

# this table is UDDAKT60, set to all lower-case in the table_id call
table_meta <- danstat::get_table_metadata(table_id = "uddakt60", variables_only = TRUE)

# create variable list using the ID value in the variable.
variables_ed <- list(
    list(code = "uddannelse", values = "*"),
    list(code = "fstatus", values = c("F")),
    list(code = "køn", values = c("M", "K")),
    #list(code = "alder", values = c("TOT")),
    list(code = "tid", values = 2023))

bacdegs1 <- get_data("uddakt60", variables_ed, language = "en") %>%
    as_tibble()

bacdegs <- bacdegs1 %>%
    # clean up the degree text and code field
    mutate(deg_code = str_extract(UDDANNELSE, "^[^ ]+")) %>%
    mutate(deg_name = sub("^\\S+\\s+", '', UDDANNELSE)) %>%
    mutate(deg_name = str_remove(deg_name, ", BACH")) %>%
    mutate(deg_name = str_remove(deg_name, " BACH")) %>%
    mutate(deg_group = case_when(
        deg_code %in% c("H6020", "H6025", "H6030", "H6035", "H6039",
                                        "H6059", "H6075", "H6080", "H6090") ~ "Main",
        deg_code == "H60" ~ "All",
        TRUE ~ "Sub")) %>%
    # create discipline category for making individual plots.
    mutate(deg_field =  case_when(
        str_detect(deg_code, "H6020") ~ "Education",
        str_detect(deg_code, "H6025") ~ "Humanities",
        str_detect(deg_code, "H6030") ~ "Arts",
        str_detect(deg_code, "H6035") ~ "Science",
        str_detect(deg_code, "H6039") ~ "Social Sciences",
        str_detect(deg_code, "H6059") ~ "Technical sciences",
        str_detect(deg_code, "H6075") ~ "Technical sciences",
        str_detect(deg_code, "H6080") ~ "Agriculture & Veterinary",
        str_detect(deg_code, "H6090") ~ "Health science",
        TRUE ~ "no")) %>%
    rename(degs_n = INDHOLD, sex = KØN)

Now that we have tidy data, let’s make the charts. This time out it’s horizontal bar charts, ordered by number of degrees awarded. Note that because I’m faceting by gender, and because I want to show the top majors in descending order for each gender, I use the scales = "free_y" option in the facet function.

A few things before we see the work. First, I had hoped to write a function and output all the by-discipline charts but it was too complicated to get each chart to look exactly as I wanted, and I realized I’d have to do lots of individual formatting anyway. So sadly it’s “copy-paste-amend” for all the charts. I’ll only show the code for the chart showing all degrees and for one of the discipline charts, as the rest of the discipline charts were essentially the same, just minor edits on the geom_text() code.

Second, I had to make a design choice regarding the x axes. I decided to keep them standard to the maximum number of degrees by major within each discipline, so the x axes for men & women is the same. To me, having different x axes maximums in the same chart was distorting the scope of the bars, making different numbers look the same.

Third, I am only displaying degree counts, not percentages. I went back and forth on that as well, and decided to go with the counts and save the percentages for the “Diverging” prompt on Day 9.

Show code for making the charts

bacdegs %>%
    filter(deg_group == "Main") %>%
    {. ->> tmp} %>%
    ggplot(aes(x = degs_n, y = reorder_within(deg_name, degs_n, sex), fill = sex)) +
    geom_bar(stat = "identity") +
    scale_fill_manual(values = c("#E66100", "#5D3A9B")) +
    scale_y_reordered() +
    scale_x_continuous(labels = comma) +
    labs(x = "", y = "",
             title = "All Bachelor degrees by field & sex, 2023",
             subtitle = glue::glue("*Total degrees earned = 19,199: Men = 8,694, Women = 10,505*"),
             caption = "*Data from Danmarks Statistik via danstat package*") +
    facet_wrap(~ sex, scales = "free_y") +
    geom_text(data = subset(tmp, degs_n > 3000),
                        aes(label = comma(degs_n)), hjust = 1.5, color = "white") +
    geom_text(data = subset(tmp, degs_n < 3000),
                        aes(label = comma(degs_n)), hjust = -.5, color = "black") +
    theme_minimal() +
    theme(panel.grid = element_blank(),
                legend.position = "none",
                plot.subtitle = element_markdown(),
                plot.caption = element_markdown())


# agriculture and veterinary sciences
bacdegs %>%
    filter(deg_group == "Sub") %>%
    filter(deg_field == "Agriculture & Veterinary") %>%
    {. ->> tmp} %>%
    ggplot(aes(x = degs_n, y = reorder_within(deg_name, degs_n, sex), fill = sex)) +
    geom_bar(stat = "identity") +
    scale_fill_manual(values = c("#E66100", "#5D3A9B")) +
    scale_y_reordered() +
    scale_x_continuous(labels = comma) +
    labs(x = "", y = "",
             title = "Agriculture & Veterinary Science Bachelor degrees by field & sex, 2023",
             subtitle = glue::glue("*Total degrees earned = 308: Men = 55, Women = 253*")) +
    facet_wrap(~ sex, scales = "free_y") +
    geom_text(data = subset(tmp, degs_n > 100),
                        aes(label = comma(degs_n)), hjust = 1.5, color = "white") +
    geom_text(data = subset(tmp, degs_n < 100),
                        aes(label = degs_n), hjust = -.5, color = "black") +
    theme_minimal() +
    theme(panel.grid = element_blank(),
                legend.position = "none",
                plot.subtitle = element_markdown())

horizontal bar graph displaying Bachelors degrees awarded by field & sex

This first chart shows all degrees awarded by discipline and sex. The most immediate thing to notice is that more Bachelor degrees were awarded to women than to men. But that tracks with a chart I did last year, showing rates of educational attainment by sex and it’s similar to other countries, including the US.

Social science degrees were the most popular for both men and women. After that it changes a bit. Humanities were 2nd most popular for women, Technical Sciences next for men. And so forth. To be honest, it’s kind of a shock to see degree fields so starkly gendered now. I would have though some disciplines would have been a bit more evenly split. Again, we’ll dive deeper into the percentage differences on Day 9 for the “Diverging” prompt.

In the tab set below are the rest of the charts, one for each major discipline. If you’re interested you can comb through and see which majors were most popular in each discipline. They are lightboxed, so click on the image to enlarge it, and click esc or outside the image to restore it to the page.

The only ones I did not make were for Education and the Food-Biotech-Lab Tech degrees as each only had one major with degrees awarded. In Education there were 134 degrees awarded, 117 to women, 17 to men. For Food-Bio-Lab Tech there were 61 degrees total, 20 to Men, 41 to Women all in a general Food & Nutrition Science major.

This makes me think I could have done this in Tableau and let the user filter by discipline. Or I could have done parameterized charts in Quarto. Either way I could have added other degree levels. Lots of data prep, but it would be a useful chart. Maybe after the 30 Day Challenge is over.

created and posted April 4, 2025

Prompt #6 - Florence Nightingale

The obvious association with Florence Nightingale is that of nursing, so today let’s look at health-care degrees, and who gets them. I’m especially interested in degrees for a job called in shorthand here SOSU, or “social- og sundhedsassistent”. Essentially a nurses aid/nursing assistant. When I was in the hospital after my bicycle accident in 2024 I noticed almost all SOSUs were women and many first or second-generation immigrants.

Which makes sense when you think of recent immigrants to any country and the types of jobs the first wave immigrants, mainly those who don’t come in with college degrees or professional credentials, tend to get…merchants, housekeeping, health care assistants. They establish a stable life and their kids tend to get more education and higher paying jobs. It’s the immigration story told often over the decades and in many countries.

I feel like I need to make clear here that this is not intended to “other” immigrants to Denmark. I want to help explain SOSU degrees in the conext of their paths to economic and social stability through education and decently paying stable jobs.

You can read more about the path to a SOSU degree and job in this explainer and more about vocational secondary education in particular on pages 10-11 here in this PDF report.

Again, data comes from Danmarks Statistik via the danstat package. This time we’ll look at table UDDAKT35, which has degrees completed and includes immigration category.

First we’ll get the data.

Show code for getting and cleaning data
# Nursing and SOSU degrees by gender and immigrant status

library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(janitor) # tools for data cleaning
library(danstat) # package to get Danish statistics via api
library(ggtext) # enhancements for text in ggplot
library(scales)
library(tidytext)

# some custom functions
source("~/Data/r/basic functions.R")

# UDDAKT35 - get english and danish versions of metadata to make sure we get the right degree codes.
table_meta <- danstat::get_table_metadata(table_id = "uddakt35", variables_only = TRUE)
table_meta_dk <- danstat::get_table_metadata(table_id = "uddakt35", variables_only = TRUE, language = "da")

# create variable list using the ID value in the variable
variables_ed <- list(
    list(code = "uddannelse", values = c("H30", "H3010", "H301020")),
    list(code = "fstatus", values = c("F")),
    list(code = "køn", values = "*"),
    list(code = "herkomst", values = "*"),
    list(code = "tid", values = c(2019, 2020, 2021, 2022, 2023, 2024)))

# fetch the data using the variable list
sosu1 <- get_data("uddakt35", variables_ed, language = "en") %>%
    as_tibble() %>%
    clean_names()

# set up the main dataset...create a main degree field, clean up the national origin field
sosu_all <- sosu1 %>%
    rename(sex = kon, national_origin = herkomst, year = tid, n = indhold) %>%
    select(-fstatus) %>%
    mutate(deg_code = str_extract(uddannelse, "^[^ ]+")) %>%
    mutate(deg_name = case_when(
        deg_code == "H30" ~ "VET All",
        deg_code == "H3010" ~ "Health care/Educ All",
        deg_code == "H301020" ~ "SOSU")) %>%
    mutate(national_origin = ifelse(
        national_origin == "Persons of Danish origin", "Danish origin", national_origin)) %>%
    mutate(national_origin =
        factor(national_origin,
        levels = c("Total", "Danish origin", "Descendant",
        "Immigrants", "Unknown origin"))) %>%
    mutate(sex = ifelse(sex == "Sex, total", "Total", sex))

We have the base frame now from which to make a few charts. All of them will cover the years 2019 to 2024. The chart code isn’t all that complicated so I won’t include them in this post. If you’re interested you can find them in the github repo for this year’s challenge.

We’ll first look at total SOSU degrees, then SOSU as a percent of all the vocational degrees in it’s main group. We’ll also look at sex and national origin.

bar graph displaying SOSU degrees awarded 2019 to 2024

The key insight that jumps out here is the more than 200% increase in SOSU degrees awarded during this period. The one thing I’m not sure of here if that has to do with any policy change or something else that helped the numbers increase so much over the last five years.

How about SOSU degrees about relative to all H30 vocational programs?

bar graph displaying SOSU degrees awarded 2019 to 2024 as a pct of all vocational degrees

SOSU degrees were 2% of the H30 vocational category in 2019 and 4% in 2024. The H30 group grew by almost 1,900 (6%) during that period, meaning SOSU degrees were responsible for more than the growth of all degrees in this group.

So who gets SOSU degrees in terms of gender and national origin status? Unsurprisingly, given the very gendered nature of nursing health care assistant jobs, about 90% of SOSU degrees go to women.

bar graph displaying SOSU degrees awarded 2019 to 2024

What about national origin? First it helps to explain how that’s defined in Denmark for these statistics (page is in Danish). Someone of Danish origin has at least 1 parent born in and a citizen of Denmark, regardless of where the person was born. A Descendant is considered someone to born in Denmark but neither parent is a citizen or born in Denmark. However if at least one of a Descendant’s parents acquire Danish citizenship, then that person is now of Danish origin. Finally, an Immingrant is considered as someone born abroad and neither parent is a citizen of or born in Denmark. Immigrants can of course gain citizenship, but I’m not sure if they then become considered of Danish origin for these statistics.

bar graph displaying SOSU degrees awarded 2019 to 2024

For the past few years, immigrants have comprised about 30% or so of all who earned a SOSU degree.

So how do these gender and national origin percentages compare to all H30 vocational programs?

bar graph displaying H30 degrees awarded 2019 to 2024

Well, the H30 degrees are not only much more balanced, a slight majority are awarded to men.

As for national origin…

bar graph displaying H30 degrees awarded 2019 to 2024

…immigrants are growing as a percent of all who earn H30 vocational degrees, at 14% in 2024 up from 8% in 2019. but that’s less than half the 34% percent of SOSU degrees awarded to immigrants.

So to sum up, its as expected; a vast majority of SOSU degrees go to women, and the percentage of SOSU degrees going to immigrants has been rising.

There’s plenty more to examine here, for instance:

  • How do SOSU degree demographics compare to nursing degrees and other medical fields?
  • What’s the demographic picture in other vocational fields? Or in academic secondary, or academic higher education?

created and posted April 5, 2025

Prompts #7 & 8 - Outliers & Histogram

These next six prompts are under a general theme of Distributions, or how data is arrayed or spread out by one of more factors.

Taking another look at vocational degrees, it turned out to be best to combine these first two prompts, outliers and histogram, into one post. So we’ll use the same table UDDAKT35, from Danmarks Statistik via the danstat package.

This time we’ll look at all degrees awarded and see the overall distribution (histogram), and degrees which had the most and fewest awarded (outliers). So reversing the prompts I guess, but keeping with the spirit of the challenge.

As always, we’ll fetch and clean the data first. Some slight changes to the previous prompt where we used this table. This time we’ll get all degree programs and just for 2024. We won’t worry about demographics this time.

Show code for getting and cleaning data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(janitor) # tools for data cleaning
library(danstat) # package to get Danish statistics via api
library(ggtext) # enhancements for text in ggplot
library(scales)
library(tidytext)

# some custom functions
source("~/Data/r/basic functions.R")

# UDDAKT35 - get english and danish versions of metadata to make sure we get the right degree codes.
table_meta <- danstat::get_table_metadata(table_id = "uddakt35", variables_only = TRUE)
table_meta_dk <- danstat::get_table_metadata(table_id = "uddakt35", variables_only = TRUE, language = "da")

# create variable list using the ID value in the variable
# getting all education programs, pare down
variables_ed <- list(
    list(code = "uddannelse", values = "*"),
    list(code = "fstatus", values = c("F")),
    list(code = "tid", values = 2024))

# fetch all the data
vet1 <- get_data("uddakt35", variables_ed, language = "en") %>%
    as_tibble() %>%
    clean_names()

# clean up a bit, create category fields to help with charts
vet_all <- vet1 %>%
    mutate(deg_code = str_extract(uddannelse, "^[^ ]+")) %>%
    filter(!grepl("H29", deg_code)) %>%
    filter(!grepl("Total", deg_code)) %>%
    mutate(deg_level = if_else(str_length(deg_code) == 5, "Main", "Sub")) %>%
    mutate(deg_level = if_else(deg_code == "H30", "All", deg_level)) %>%
    mutate(deg_name = sub("^\\S+\\s+", '', uddannelse)) 

# create outlier categories. only majors with +0 degrees
vet_majors <- vet_all %>%
    filter(deg_level == "Sub") %>%
    filter(indhold > 0) %>%
    mutate(degs_mean = mean(indhold),
                 degs_std = sd(indhold),
                 degs_pctl10 = quantile(indhold, 0.10),
                 degs_pctl90 = quantile(indhold, 0.90)) %>%
    mutate(outlier = case_when(
        indhold >= degs_pctl90 ~ "outlier+ > 90thpctl",
        indhold <= degs_pctl10 ~ "outlier- < 10thpctl",
        TRUE ~ "not outlier")) %>%
    mutate(outlier = factor(outlier,
        levels = c("outlier+ > 90thpctl", "not outlier", "outlier- < 10thpctl")))

The vet_majors dataframe created at the end is what we’ll use for outlier analysis – it keeps only the majors (not the top-line category) and only for degrees awarded greater-than 0. As we’ll see in the histogram, the distributions were so skewed I set outlier status at equal-or-above the 90th percentile and equal-or-below the 10th percentiles.

Ok, let’s build the histogram. I added vertical lines at the median and mean to highlight the skew.

Show code for making the histogram
vet_majors %>%
    ggplot(aes(indhold)) +
    geom_histogram(fill = "#CC79A7", bins = 100) +
    geom_vline(aes(xintercept = mean(indhold)), color="black") +
    geom_vline(aes(xintercept = median(indhold)), color="#A9A9A9") +
    labs(x = "", y = "",
            title = "Eight programs with +1,000 degrees awarded accounted for 43% of the H30 Vocational degrees awarded in 2024",
        subtitle = "*Total H30 degrees awarded in 2024 = 30,570*",
        caption = "*Data from Danmarks Statistik table UDDAKT35 via danstat package*") +
    scale_y_continuous(expand = c(.001, 0), limits = c(0, 20),
            breaks = c(0, 1, 3, 6, 9, 12, 15, 18, 20)) +
    scale_x_continuous(breaks = c(0, 500, 1000, 2000, 3000, 4000, 5000, 6000),
        labels = comma) +
    theme_minimal() +
    theme(plot.title = element_markdown(size = 16, hjust = 0),
        plot.subtitle = element_markdown(size = 14, hjust = 0),
        plot.caption = element_markdown(),
        legend.title = element_blank(),
        legend.position = c(.98, .05),
        legend.justification = c("right", "bottom"),
        legend.direction = "horizontal",
        legend.box.just = "right",
        legend.margin = margin(6, 6, 6, 6),
        panel.grid = element_blank(),
        panel.border = element_blank())

histogram displaying H30 degrees awarded 2024

Most of the fields awarded fewer than 500 degrees. Only 8 awarded more than 1,000. So a heavy left skew to the distribution.

So which fields had the most and least relative to the others? The code is below, we’ll talk about the chart below the image. I used the white-space resulting from the array of data as space for the legend and explanatory notes. For those annotations I created a couple of text vectors to plug in, one with coloring to use as richtext in the annotate call.

Show code for making the outlier chart
# horizontal bar with fills based on outlier status, bars for mean and median
# for annotation,
caption1 <- paste(strwrap("Health, education, business (office, retail, trading (mercantile)), and
        craft trades (electrician, carpentry, mechanic) are by far the most popular vocational degree choices.", 75), collapse = "\n")

caption2 <- glue::glue(
    "<span style = 'color:#A9A9A9;'>Median = 97   </span><span style = 'color:#000000;'>
    Mean = 392</span>"
)

vet_majors %>%
    ggplot(aes(x = indhold, y = reorder(deg_name, indhold), fill = outlier)) +
    geom_bar(stat = "identity") +
    scale_fill_manual(values = c("#0072B2", "#CC79A7", "#D55E00")) +
    geom_vline(aes(xintercept = mean(indhold)), color="black") +
    geom_vline(aes(xintercept = median(indhold)), color="#A9A9A9") +
    scale_x_continuous(limits = ~ c(0, max(.x) + 0.1), breaks = pretty_breaks(),
        expand = c(0, 0), labels = comma) +
    labs(x = "", y = "",
        title = "Total H30 degrees awarded in 2024 = 30,570",
        caption = "*Data from Danmarks Statistik table UDDAKT35 via danstat package*") +
    annotate("text", x = 2000, y = "Storage, port and terminal education",
        label = caption1, size = 5, hjust = 0) +
    annotate("richtext", x = 4700, y = "Train preparation educations",
        label = caption2,
        size = 4.5, label.color = NA) +
    theme_minimal() +
    theme(plot.title = element_markdown(size = 16),
        plot.caption = element_markdown(),
        axis.text.y = element_text(size = 7),
        legend.title = element_blank(),
        legend.position = c(.98, .05),
        legend.justification = c("right", "bottom"),
        legend.direction = "horizontal",
        legend.box.just = "right",
        legend.margin = margin(6, 6, 6, 6),
        panel.grid = element_blank(),
        panel.border = element_blank())

horizontal bar displaying H30 degrees awarded by field 2024

What we see is there are outliers, if we define an outlier as outside the 10th or 90th percentiles. We knew from prompt 6 that SOSU degrees (social health and education) would account for the greatest proportion of vocational degrees, so no surprise that’s tops. We also see business & commerce related fields as well as craft & trades also being very popular. Glazier and fitness instructor are among the low outliers, below the 10th percentile in the distribution.

A quick note on what might look like odd degree names. Obviously the original names were in Danish. But the translation to English isn’t always perfect. So for example, gartneruddannelsen translates to “Gartner Education” but what it should be is “Gardening”, or training to be a florist or work in landscaping. “Trading Education” comes from handelsuddannelsen which is essentially training to be a merchant - sourcing, logistics, etc.

created and posted April 5, 2025

Prompt #9 - Diverging

For the Diverging prompt last year I looked at educational attainment by gender in Denmark. This year I’ll do a similar chart, and we’ll take a different look at the same data used in the Day 5 Ranking prompt. Here though, instead of arraying degrees and majors in terms of raw counts, we’ll look at the percentage differences between gender by discipline and major.

We will go back to the UDDAKT60 table from Danmarks Statistik via the danstat package.

It’s a bit of a long post because I’m walking though the process of building the nine discipline charts via a function. If you’re most interested in the charts, scroll through the how-to parts.

Show code for getting and cleaning data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(janitor) # tools for data cleaning
library(danstat) # package to get Danish statistics via api
library(packcircles) # makes the circles!
library(ggtext) # enhancements for text in ggplot
library(ggrepel) # helps with label offsets

# some custom functions
source("~/Data/r/basic functions.R")

# this table is UDDAKT60, set to all lower-case in the table_id call
table_meta <- danstat::get_table_metadata(table_id = "uddakt60", variables_only = TRUE)

# create variable list using the ID value in the variable
variables_ed <- list(
    list(code = "uddannelse", values = "*"),
    list(code = "fstatus", values = c("F")),
    list(code = "køn", values = c("M", "K")),
    #list(code = "alder", values = c("TOT")),
    list(code = "tid", values = 2023))

# fetch the data, use the janitor package to clean the names
bacdegs1 <- get_data("uddakt60", variables_ed, language = "en") %>%
    as_tibble() %>%
    clean_names()

# build the main dataset for charting
bacdegs <- bacdegs1 %>%
    select(-fstatus) %>%
    # create categories for filtering - deg_group is category for all, discipline, or major 
    # deg_name is major, deg_field is discipline, 
    mutate(deg_code = str_extract(uddannelse, "^[^ ]+")) %>%
    mutate(deg_name = sub("^\\S+\\s+", '', uddannelse)) %>%
    mutate(deg_name = str_remove(deg_name, ", BACH")) %>%
    mutate(deg_name = str_remove(deg_name, " BACH")) %>%
    mutate(deg_group = case_when(
        deg_code %in% c("H6020", "H6025", "H6030", "H6035", "H6039",
                "H6059", "H6075", "H6080", "H6090") ~ "Main",
        deg_code == "H60" ~ "All",
        TRUE ~ "Sub")) %>%
    mutate(deg_field =  case_when(
        str_detect(deg_code, "H6020") ~ "Education",
        str_detect(deg_code, "H6025") ~ "Humanities",
        str_detect(deg_code, "H6030") ~ "Arts",
        str_detect(deg_code, "H6035") ~ "Science",
        str_detect(deg_code, "H6039") ~ "Social Sciences",
        str_detect(deg_code, "H6059") ~ "Technical sciences",
        str_detect(deg_code, "H6075") ~ "Food/Biotech/Lab Tech",
        str_detect(deg_code, "H6080") ~ "Agriculture Nature Environment",
        str_detect(deg_code, "H6090") ~ "Health science",
        TRUE ~ "All")) %>%
    mutate(deg_field = factor(deg_field,
        levels = c("All", "Agriculture Nature Environment", "Arts", "Education", "Food/Biotech/Lab Tech",
        "Health science", "Humanities", "Science", "Social Sciences", "Technical sciences"))) %>%
    rename(degs_n = indhold, sex = kon)

## dataframe to create faint highlight lines in chart
vlines_df <- data.frame(xintercept = seq(-100, 100, 20))

Putting this together I first built a chart at the discipline level, showing the percentage differences by gender on a diverging graph. It’s a bit of a remix of last year’s chart for the same prompt. The code for that is below as is the chart.

Show code for making the first chart
bacdegs %>%
    filter(deg_group == "Main") %>%
    select(deg_field, sex, degs_n) %>%
    group_by(deg_field) %>%
    mutate(deg_field_tot = sum(degs_n)) %>%
    ungroup() %>%
    group_by(deg_field, sex) %>%
    mutate(deg_sex_pct = round(degs_n /deg_field_tot, 3)) %>%
    mutate(deg_sex_pct = ifelse(sex == "Men", deg_sex_pct *-1, deg_sex_pct)) %>%
    mutate(deg_sex_pct2 = round(deg_sex_pct * 100, 1)) %>%
    ungroup() %>%
    mutate(deg_field = fct_reorder(deg_field, desc(deg_field))) %>%
    {. ->> tmp} %>%
    ggplot() +
    geom_col(aes(x = -50, y = deg_field), width = 0.75, fill = "#e0e0e0") +
    geom_col(aes(x = 50, y = deg_field), width = 0.75, fill = "#e0e0e0") +
    geom_col(aes(x = deg_sex_pct2, y = deg_field, fill = sex, color = sex), width = 0.75) +
    scale_x_continuous(expand = c(0, 0),
        labels = function(x) abs(x), breaks = seq(-100, 100, 20)) +
    geom_vline(data = vlines_df, aes(xintercept = xintercept), color = "#FFFFFF", size = 0.1, alpha = 0.5) +
    coord_cartesian(clip = "off") +
    scale_fill_manual(values = c("#E66100", "#5D3A9B")) +
    scale_color_manual(values = c("white", "white")) +
    geom_text(data = subset(tmp, sex == "Men"),
        aes(x = deg_sex_pct2, y = deg_field, label = paste0(abs(deg_sex_pct2), "%")),
        size = 5, color = "white", position = position_dodge(width = .4), hjust = -0.09) +
    geom_text(data = subset(tmp, sex == "Women"),
        aes(x = deg_sex_pct2, y = deg_field, label = paste0(abs(deg_sex_pct2), "%")),
        size = 5, color = "white", position = position_dodge(width = .4), hjust = 1.1) +
    geom_text(aes(x = -100, y = deg_field, label = paste0("Total awarded = ", comma(deg_field_tot))),
        size = 4, color = "#3b3b3b", hjust = -.1) +
    labs(x = "", y = "",
        title = "<span style = 'color: #5D3A9B;'>Women</span> earn more Bachelor's degrees than
        <span style = 'color: #E66100;'>men</span> in all disciplines except Science and Technical sciences",
        subtitle = glue::glue("*Total degrees awarded = 19,199:
        <span style = 'color: #E66100;'>Men = 8,694 (45%)</span>,
        <span style = 'color: #5D3A9B;'>Women = 10,505 (55%)</span>*"),
        caption = "*Data from Danmarks Statistik table UDDAKT60 via danstat package*") +
    theme_minimal() +
    theme(panel.grid = element_blank(),
        plot.title = element_markdown(size = 16),
        plot.subtitle = element_markdown(size = 12),
        plot.caption = element_markdown(),
        legend.position = "none", legend.title = element_blank(),
        axis.text.y = element_text(size = 10))
rm(tmp)

diverging horizontal bar displaying bachelor degrees by sex 2023

The next step is to repeat this chart for each discipline, showing the top 10 majors by overall degrees awarded. Rather than copy-paste-amend like in prompt 5, this time I set about doing it via a plot function. As you can see from the code for the first chart though, I wanted to add descriptive and informational text in the title and subtitle. It was a bit complicated to do this functionally and I’m sure it took me much longer to sort out the data structures and function mapping, but it worked. I’ll walk through the steps here before displaying the charts. After the challenge I’ll do a walk-through as a stand-alone post.

First up is building dataframes for each discipline, each frame with consistent naming. I went about it this way to avoid errors in filtering when running the plot function across disciplines. I took a similar approach in my October 2024 post on attendance at football (soccer) matches around the world.

So first a function to build the dataframes.

Show code for making the data frames
## function for individual discipline tables
disc_top10 <- function(degfield, dfname = "NA") {
dfout <- bacdegs %>%
    filter(deg_group == "Sub") %>%
    filter(deg_field == degfield) %>%
    select(deg_field, deg_name, sex, degs_n) %>%
    mutate(degs_all_tot = sum(degs_n)) %>%
# creates fields to use in plot dynamic subtitle
    group_by(sex) %>%
    mutate(degs_all_sex_tot = sum(degs_n)) %>%
    mutate(degs_all_sex_pct = round(degs_all_sex_tot / degs_all_tot, 3) * 100) %>%
    mutate(degs_all_sex_pct = paste0(degs_all_sex_pct, "%")) %>%
    ungroup() %>%
    group_by(deg_name) %>%
    mutate(deg_name_tot = sum(degs_n)) %>%
    ungroup() %>%
    arrange(desc(deg_name_tot), deg_name) %>%
    mutate(rank=row_number()) %>%
    group_by(deg_name, sex) %>%
    mutate(deg_sex_pct = round(degs_n /deg_name_tot, 3)) %>%
    mutate(deg_sex_pct = ifelse(sex == "Men", deg_sex_pct *-1, deg_sex_pct)) %>%
    mutate(deg_sex_pct2 = round(deg_sex_pct * 100, 1)) %>%
    ungroup() %>%
    # more fields to use in plot dynamic subtitle 
    mutate(deg_all_men_tot = ifelse(sex == "Men", degs_all_sex_tot, NA)) %>%
    mutate(deg_all_women_tot = ifelse(sex == "Women", degs_all_sex_tot, NA)) %>%
    mutate(deg_all_men_pct = ifelse(sex == "Men", degs_all_sex_pct, NA)) %>%
    mutate(deg_all_women_pct = ifelse(sex == "Women", degs_all_sex_pct, NA)) %>%
    mutate(deg_field_men_pct = ifelse(sex == "Men", paste0(abs(deg_sex_pct2), "%"), NA)) %>%
    mutate(deg_field_women_pct = ifelse(sex == "Women", paste0(deg_sex_pct2, "%"), NA))  %>%
    group_by(deg_name) %>%
    fill(deg_all_men_tot) %>%
    fill(deg_all_women_tot, .direction = "up") %>%
    fill(deg_all_men_pct) %>%
    fill(deg_all_women_pct, .direction = "up") %>%
    fill(deg_field_men_pct) %>%
    fill(deg_field_women_pct, .direction = "up") %>%
    ungroup()
# uses the second input to name the dataframe and put it in the global environment
assign(str_c(dfname, "_pcts"), dfout, envir=.GlobalEnv)
}

To run the function once it’s disc_top10("Science", "Science"). The first input is the discipline name, the 2nd is for the dataframe name.

But how do we map it across the disciplines? Well, we need a list of disciplines, and we do that by creating a list of fields.

degfields <- unique(bacdegs$deg_field)

Now we map it over the function, passing each mapping component to the .x sign.

map(degfields, ~disc_top10(degfield = .x, dfname = .x))

Ok, let’s build the plot function.

Show code for the plot function
## plot function
majorplot <- function(plotdf, degfield, plotname = "NA") {

    ## for faint highlight lines in chart
    vlines_df <- data.frame(xintercept = seq(-100, 100, 20))

    #create plot
    plotout <- plotdf %>%
        filter(deg_field == degfield) %>%
        filter(rank < 21) %>%
        # to get top 10 for plot sorted in order of total degrees in major
        arrange(desc(deg_name_tot)) %>%
        mutate(deg_name = fct_reorder(deg_name, deg_name_tot)) %>%
        # not used, but this orders top 10 for plot in degree name order
    # mutate(deg_name = fct_reorder(deg_name, desc(deg_name))) %>%
        ggplot() +
    # makes the grey bars
    geom_col(aes(x = -50, y = deg_name), width = 0.75, fill = "#e0e0e0") +
    geom_col(aes(x = 50, y = deg_name), width = 0.75, fill = "#e0e0e0") +
        # makes the bars for each gender, fill colors in the scale_fill_manual below
    geom_col(aes(x = deg_sex_pct2, y = deg_name, fill = sex, color = sex), width = 0.75) +
    scale_x_continuous(expand = c(0, 0),
                                         labels = function(x) abs(x), breaks = seq(-100, 100, 20)) +
    scale_fill_manual(values = c("#E66100", "#5D3A9B")) +
    # adds faint guide lines for the bars
    geom_vline(data = vlines_df, aes(xintercept = xintercept), color = "#FFFFFF", size = 0.1, alpha = 0.5) +
    coord_cartesian(clip = "off") +
    scale_color_manual(values = c("white", "white")) +
    # adds text labels for each major's percent of men & women
    geom_text(data = subset(plotdf, sex == "Men" & rank < 21),
        aes(x = deg_sex_pct2, y = deg_name, label = paste0(abs(deg_sex_pct2), "%")),
        size = 5, color = "white", position = position_dodge(width = .4), hjust = -0.09) +
    geom_text(data = subset(plotdf, sex == "Women" & rank < 21),
        aes(x = deg_sex_pct2, y = deg_name, label = paste0(abs(deg_sex_pct2), "%")),
        size = 5, color = "white", position = position_dodge(width = .4), hjust = 1.1) +
    # inserts the total number of degrees for each discipline into the bars to the 
        # left or right of the percentage bars
    geom_text(data = subset(plotdf, sex == "Men" & rank < 21 & deg_sex_pct > -0.85),
        aes(x = -100, y = deg_name, label = paste0("Total awarded = ", comma(deg_name_tot))),
        size = 3.5, color = "#3b3b3b", hjust = -.1) +
    geom_text(data = subset(plotdf, sex == "Men" & rank < 21 & deg_sex_pct < -0.85),
        aes(x = 80, y = deg_name, label = paste0("Total awarded = ", comma(deg_name_tot))),
        size = 3.5, color = "#3b3b3b", hjust = .5) +
    labs(x = "", y = "",
        caption = "*Data from Danmarks Statistik table UDDAKT60 via danstat package*",
        # uses html tags for color and fields from the input df to make the title
          # specific to the discipline
        subtitle = glue::glue("*Total degrees awarded in {degfield} = {comma(plotdf$degs_all_tot)}:
            <span style = 'color: #E66100;'>Men =
            {comma(plotdf$deg_all_men_tot)} ({plotdf$deg_all_men_pct})</span>,
            <span style = 'color: #5D3A9B;'>Women =
            {comma(plotdf$deg_all_women_tot)} ({plotdf$deg_all_women_pct})</span>.
            Only top 10 majors displayed.*")) +
    theme_minimal() +
    theme(panel.grid = element_blank(),
            plot.title = element_markdown(size = 16, hjust = 0),
            plot.subtitle = element_markdown(size = 12, hjust = 0),
            plot.caption = element_markdown(),
            legend.position = "none", legend.title = element_blank(),
            axis.text.y = element_text(size = 10))
    # to create individual plot names and place plots in the global environment
    assign(str_c(plotname, "_plot"), plotout, envir=.GlobalEnv)
    # outputs plots 
    return(plotout)
}

The function takes three inputs - the dataframe to run it on, the filter for the discipline in the plot, and the output lot name. The 2nd & 3rd input are the same. To map it over the nine disciplines we need the list of dataframes and the list of disciplines. The code below shows how to build those

# create list of data frames to mutate function over
df_list <- list(mget(ls(pattern = "local")))[[1]]
df_list = mget(ls(pattern = "_pcts"))

# for degree fields to loop over in function, need a character vector of just major names
# first a df of just the field names
bacdegs_maj <- bacdegs %>%
    filter(deg_group == "Sub") %>%
    mutate(deg_fieldc = as.character(deg_field)) %>%
    select(deg_fieldc)

# then output names to character vector, making sure to sort so that the filter matches to 
 # the 
degfieldsc <- (unique(bacdegs_maj$deg_fieldc)) %>%
    str_sort()

Now we’re ready to build the charts. Because of the way the input dataframes were built we’ll have title and subtitle text specific to each discipline. Here’s the mapping call:

map2(df_list, degfieldsc, ~majorplot(plotdf = .x, degfield = .y, plotname = .y))

That outputs nine plots that only need minor customization for the explanatory title. Everything else was taken care of in the functions. Because Education and the Food Science disciplines only have one major each, no additional charts for them. The gender difference is in the main chart.

I’ll display the seven annotated charts in the tabset below, showing the customization for the Humanities discipline. They are lightboxed, so click on the image to enlarge it, and click esc or outside the image to restore it to the page.

horizontal bar graph displaying Bachelors degrees awarded by field & sex

horizontal bar graph displaying Bachelors degrees awarded by field & sex

horizontal bar graph displaying Bachelors degrees awarded by field & sex

horizontal bar graph displaying Bachelors degrees awarded by field & sex

Humanities_plot +
    labs(title = "Humanities degrees mostly went to <span style = 'color: #5D3A9B;'>women</span>.
             By major, <span style = 'color: #E66100;'>men</span> only earned more degrees in History")

horizontal bar graph displaying Bachelors degrees awarded by field & sex

horizontal bar graph displaying Bachelors degrees awarded by field & sex

horizontal bar graph displaying Bachelors degrees awarded by field & sex

The overarching take-away is one of some noticeable difference or divergence by gender in many disciplines and majors. To be honest, I was surprised at the extent and size of some differences. I remember seeing it in some colleges and majors at UC Berkeley, but even in the so-called “male” majors women were becoming more and more evenly represented.

The next step for this subject would be to look at some of the more gendered majors and examine whether or not the differences have held steady over time, and if anything explains the change or stasis. Maybe after the challenge or for one of the challenge time-series prompts.

created and posted April 6, 2025

Prompt #13 - Clusters

For this “back to the future” post, a prompt coming between 9 (posted on April 6) and 14 (posted on April 18) but posted after April 18, let’s take a look at R&D expenses by source and research field in Denmark. The prompt is “clusters”, so I’m clustering research activity by the general field and the sector doing the work.

As always for this year’s challenge, the data is from Danmarks Statistik via the danstat package. This time we’re using table RDCE05, which is Expenses to R&D by sector and field of research and development. I’ll mention it more than once, but all expense figures are in millions of Danish kroner. So 1,500 DKK here means 1.5 billion. As of the date of posting, one US dollar gets you about 6.57 kroner. That’s about half a kroner lower than this time last year.

It’s also worth mentioning at least once that the unit of analysis is research expenses, not funding. Though you could argue that expenses are a proxy for funding, because the money has to come from somewhere. Expenses as reported here means money spent on any and all R&D activity and the people who do it. The funding sources are irrelevant.

I would love to dig in a bit deeper and connect funding sources, the sectors that get funding, and for which fields. But that means at the very least connecting a number of tables or having micro-data access.

I won’t spend any time explaining the research fields and sectors. You can get a detailed look at how Danmarks Statistik describes the data is here. The research fields follow the rubric laid out in the OECD’s Frascati Manual, with the fields themselves enumerated here.

Do keep in mind that science is interdisciplinary. So data science work, for example, might be in the Engineering and Tech field per the Frascati Manual. But it’s quite likely that data science work is happening in medical or social sciences. It’s hard to know how every activity is being classified without access to microdata and then being able to tease out interdisciplinary overlaps.

For the plot itself, I took inspiration for the scatterplot from Nicola Rennie’s cluster chart, modifying the colors and other aesthetics a bit. But the code structure starts with Nicola’s work, and I learned a bit about some good coding practice and tips. Which is the point of an exercise like this, right?

Let’s start by getting and cleaning the data.

Show code for getting and cleaning data
#package and process for getting & cleaning data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(janitor) # tools for data cleaning
library(danstat) # package to get Danish statistics via api
library(gt) # for tables
library(ggiraph) # interactive plot with tooltip
library(ggtext) # add markdown to plot labels
library(scales) # better number formatting

source("~/Data/r/basic functions.R")

# RDCE05
table_meta <- danstat::get_table_metadata(table_id = "rdce05", variables_only = TRUE)
table_meta_dk <- danstat::get_table_metadata(table_id = "rdce05", variables_only = TRUE, language = "da")

# create variable list using the ID value in the variable
variables_ed <- list(
    list(code = "sektor", values = "*"),
    list(code = "videnhoved", values = "*"),
    list(code = "tid", values = c(2017, 2023)))

r_and_d1 <- get_data("rdce05", variables_ed, language = "en") %>%
    as_tibble() %>%
    clean_names()

r_and_d_main <- r_and_d1 %>%
    mutate(sektor = str_remove(sektor, " sector")) %>%
    mutate(sektor = case_when(
        sektor == "Total source of funding" ~ "Total funding",
        sektor == "Business enterprise" ~ "Business",
        TRUE ~ sektor)) %>%
    mutate(videnhoved = case_when(
        videnhoved == "Agricultural and veterinary sciences" ~ "Agri & veterinary sci",
        videnhoved == "Engineering and technology" ~ "Eng & tech",
        videnhoved == "Humanities and the arts" ~ "Humanities & arts",
        videnhoved == "Medical and health sciences" ~ "Medical & health sci",
        TRUE ~ videnhoved)) %>%
    select(sector = sektor, field = videnhoved, year = tid, dkk_million = indhold)

Before the main plot, let’s get a lay of the land for R&D expenses. Note the pctchange() function. It’s one I have among a bunch of basic functions I created and load into every project using source("~/Data/r/basic functions.R") just after loading packages. The function itself is simple, just (x - lag(x)) / lag(x). But having it as a function saves me from having to redo the formula every time I use it.

The first table shows expenses for R&D by field of research in Denmark in 2017 & 2023. Below that we see expenses by sector. All values in millions of Danish kroner (DKK).

Show code for making the 1st table

# arrange the data
r_and_d_main %>%
    filter(sector == "Total funding") %>%
    group_by(field) %>%
    mutate(pct_chg_funding = pctchange(dkk_million)) %>%
    ungroup() %>%
    pivot_wider(names_from = year, values_from = c(dkk_million, pct_chg_funding)) %>%
    mutate(pct_chg_2023 = round(pct_chg_funding_2023, 3)) %>%
    select(field, dkk_million_2017, dkk_million_2023, pct_chg_2023) %>%
    arrange(field) %>%
    # build the table
    gt() %>%
    cols_label(
        field = "Research area",
        dkk_million_2017 = "2017",
        dkk_million_2023 = "2023",
        pct_chg_2023 = "% change 2017-2023") %>%
    tab_style(
        style = cell_text(align = "center"),
        locations = cells_column_labels(
            columns = c(dkk_million_2017, dkk_million_2023, pct_chg_2023))) %>%
    fmt_number(columns = c(dkk_million_2017, dkk_million_2023),
        sep_mark = ",", decimals = 0) %>%
    fmt_percent(columns = pct_chg_2023, decimals = 1) %>%
    opt_align_table_header(align = "left") %>%
    opt_stylize(style = 5)
Research area 2017 2023 % change 2017-2023
Agri & veterinary sci 1,747 2,968 69.9%
Eng & tech 25,651 27,440 7.0%
Humanities & arts 2,039 1,985 −2.6%
Medical & health sci 20,416 34,213 67.6%
Natural sciences 11,033 14,473 31.2%
Social sciences 3,397 4,561 34.3%
Sector 2017 2023 % change 2017 to 2023
Business 40,757 53,338 30.9%
Government 1,973 2,157 9.3%
Higher education 21,355 29,821 39.6%
Private non-profit 198 325 64.1%

Agriculture/Veterinary & Medical/Health sciences had biggest increases in R&D expenses from 2017-2023. Engineering/Tech & Medical had highest R&D expenses in 2017, but was overtaken by Medical & Health Sciences in 2023. Again, however, given how prevalent data science technologies are in medical sciences and other fields, it’s entirely possible that there’s lots of data science going on in the medical fields that might have been classified as Engineering & Information Technology in prior years.

The Private non-profit sector had biggest increase in R&D expenses 2017-2023, though Business sector by far has the highest R&D expenses.

My initial reaction to this was surprise, but that’s due to my skewed perspective of R&D in the US; I am so steeped in the story of federally-sponsored research and how it grew out of WWII research funding. Billions of dollars in basic and applied research is indeed funded by the federal government (and some from foundations) and carried out via contracts with universities. I assumed the federal government provided far more than the business sector. But surprise, per this Brookings Institute report since the mid-1980s the business sector has been funding more R&D in the US than the federal government.

The story here in Denmark is not that different.

So let’s look at percent change by share of expenses in sector and the fields of research in which the sectors spend funds.

First we’ll do some data prep to get a dataframe to use for plotting.

Show code for making the scatterplot data
## first make colors for geom_points
bg_col <- "grey97"
text_col <- "black"
highlight_col <- "#0072B2"
highlight_col2 <- "#CC79A7"

# create the dataframe
r_and_d_plotdf <-
r_and_d_main %>%
    filter(!sector == "Total funding") %>%
    group_by(sector, field) %>%
    mutate(pct_chg_funding = pctchange(dkk_million)) %>%
    ungroup() %>%
    pivot_wider(names_from = year, values_from = c(dkk_million, pct_chg_funding)) %>%
    select(-pct_chg_funding_2017) %>%
    mutate(dkk_million_2023_c = formatC(dkk_million_2023, format = "d", big.mark = ",")) %>%
    mutate(dkk_million_2023_c = paste0(dkk_million_2023_c, " mill DKK")) %>%
    group_by(sector) %>%
    mutate(sector_funding  = sum(dkk_million_2023)) %>%
    mutate(sector_field_funding_pct  = dkk_million_2023 / sum(sector_funding)) %>%
    ungroup() %>%
    mutate(sector_funding_pct  = sector_funding / sum(dkk_million_2023)) %>%
    mutate(sector_short = case_when(
        sector == "Government" ~ "Gov't",
        sector == "Higher education" ~ "Higher ed",
        sector == "Private non-profit" ~ "NGO",
        TRUE ~ sector)) %>%
    mutate(sector_field = paste0(sector_short, "/", field)) %>%
    filter(!is.na(pct_chg_funding_2023)) %>%
    filter(dkk_million_2017 > 0) %>%
    mutate(col = case_when(
        pct_chg_funding_2023 > 0 ~ highlight_col2,
        pct_chg_funding_2023 == 0 ~ "grey60",
        pct_chg_funding_2023 < 0 ~ highlight_col))

Now we build the plot. I wanted to add hover-over tool-tips so I’m using the ggiraph package. The only real coding difference from a static scatterplot is using the geom_point_interactive function. After we create the plot as an object, we call back to it using the ggiraph::girafe function to render it.

When you hover over the dots you’ll see the sector/field combination and the amount of expenses they had in 2023, in millions of Danish kroner (DKK). Also note that hovering over say “Gov’t / Eng & tech” not only brings up the tool-tip with 2023 expenses, it also highlights all Government dots, so you can put the Engineering & Tech expenses in contrast with and comparison to other government R&D expenses. Note again, that expenses is essentially a proxy for funding, regardless of funding source.

If you start with some of the outliers along the x axis (sector/field combo percentage of all expenses) you notice that most of the outliers are in the medical sciences field, across sectors. Which makes sense from the tables that showed us how much funding medical sciences gets overall relative to other fields. So voila, a cluster!

As you hover over the dots, what other patterns or clusters do you see?

created and posted April 21, 2025

Prompt #14 - Kinship

Full disclosure…depending on when you read this, you may or may not see any posts between this one and prompt 9. As you can see by the “created and posted” date, if/when I’ve backfilled some prompts between 9 and here, this was the next post I completed after 9. Life gets in the way sometimes.

Anyway, I knew when the prompts came out that for kinship that I’d want to make a sankey diagram to look at educational pathways depending on parent education. A sankey is a form of network graph, and I wanted to see the connection, or network effect, of parent education on their children’s educational attainment.

StatBank’s education section does have some tables for parent -> child education, but they aren’t as disaggregated as I’d like. So to see educational outcomes by parent educational attainment you have to choose whether to look at secondary outcomes or higher ed outcomes. We’re going here with higher ed, so it’s table STATUSV2: 25-45 year-olds, by status for higher education, age and parents attainment.

Slight aside…there is micro-data that can answer the question as detailed as I would like. The EduQuant project, located in the Economic Department at Københavns Universitet and funded by the Danish Education Ministry, has access to that data and has produced some work on social mobility and education.

There are a few ways to do sankeys in r. After some trial-and-error, I settled on using the networkD3 package. A big reason being that the data comes aggregated, and some other packages like ggsankey and ggalluvial weren’t handling the data as well as networkD3.

To get the final result I was helped by this post from Louise Sinks and this video from the UK’s NHS R community.

Let’s get started…first loading some packages and fetching and cleaning the data. The data cleaning is mainly to make the parent education levels easier to read, and to create age groups. I didn’t use the age groups for this chart, but I think I will for the “complicated” prompt.

Show code for getting and cleaning data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(janitor) # tools for data cleaning
library(danstat) # package to get Danish statistics via api
library(infer) # tidy statistical inference
library(scales)
library(networkD3) # for Sankey plots
library(htmlwidgets) # html widgets helps to handle the networkD3 objects
library(htmltools) # for formatting html code
library(gt)


# some custom functions
source("~/Data/r/basic functions.R")

table_meta <- danstat::get_table_metadata(table_id = "statusv2", variables_only = TRUE)
table_meta_dk <- danstat::get_table_metadata(table_id = "statusv2", variables_only = TRUE, language = "da")

# create variable list using the ID value in the variable
variables_ed <- list(
    list(code = "statusvid", values = "*"),
    list(code = "alder1", values = "*"),
    list(code = "forudd1", values = "*"),
    list(code = "tid", values = 2023))

# fetch the data
ed_attain1 <- get_data("statusv2", variables_ed, language = "en") %>%
    as_tibble() %>%
    clean_names()

# create a main data frame with cleaned up educational levels and renaming the variables
ed_attain_main <- ed_attain1 %>%
    # let's make the Total values easier to read
    mutate(statusvid =
            ifelse(statusvid == "Educational graduation statement, total", "Total", statusvid)) %>%
    mutate(alder1 =
            ifelse(alder1 == "Age, total", "Total", alder1)) %>%
    # let's make the parent ed values a bit easier to read and understand
    mutate(forudd1 =
            case_when(forudd1 == "H10 Primary education" ~ "Primary",
                forudd1 == "H20 Upper secondary education" ~
                    "HS-Academic",
                forudd1 == "H30 Vocational Education and Training (VET)" ~
                    "HS-Vocational",
                forudd1 == "H40 Short cycle higher education" ~
                    "Short-cycle college",
                forudd1 == "H50 Vocational bachelors educations" ~
                    "Bachelor-Vocational",
                forudd1 == "H60 Bachelors programs" ~
                    "Bachelor-Academic",
                forudd1 == "H70 Masters programs" ~ "Masters",
                forudd1 == "H80 PhD programs" ~ "PhD",
                forudd1 == "H90 Not stated" ~ "Not stated",
                TRUE ~ forudd1)) %>%
    mutate(statusvid = ifelse(
        statusvid == "No registrated education", "No registered education", statusvid)) %>%
    # create age groups in case I want to disaggregate that way
    mutate(child_age_group =
            case_when(
                alder1 %in% c("25 years", "26 years", "27 years", "28 years", "29 years") ~ "25-29",
                alder1 %in% c("30 years", "31 years", "32 years", "33 years", "34 years") ~ "30-34",
                alder1 %in% c("35 years", "36 years", "37 years", "38 years", "39 years") ~ "35-39",
                alder1 %in% c("40 years", "41 years", "42 years",
                    "43 years", "44 years", "45 years") ~ "40-45",
                TRUE ~ alder1)) %>%
    select(parent_ed = forudd1, child_age = alder1, child_age_group, child_ed = statusvid, N = indhold)

Before we get to the sankey diagram, first a quick table so we can see the breakdown of parent educational attainment and children’s higher education status, independent of each other.

Show code for making the descriptive tables
ed_attain_main %>%
    filter(!parent_ed == "Total") %>%
    filter(child_age == "Total") %>%
    filter(child_ed == "Total") %>%
    select(parent_ed, N) %>%
    mutate(pct = N/ sum(N)) %>%
    gt() %>%
    cols_label(parent_ed = "Parent education level") %>%
    fmt_number(columns = N, sep_mark = ",", decimals = 0) %>%
    fmt_percent(columns = pct, decimals = 1) %>%
    opt_stylize(style = 5)
Parent education level N pct
Primary 177,985 11.3%
HS-Academic 32,585 2.1%
HS-Vocational 562,666 35.6%
Qualifying educ paths 587 0.0%
Short-cycle coll 75,482 4.8%
Bach-Vocational 282,343 17.9%
Bach-Academic 12,126 0.8%
Masters 124,083 7.9%
PhD 10,408 0.7%
Not stated 302,241 19.1%
Higher ed status N pct
Completed education 826,652 52.3%
Undergoing education 51,141 3.2%
Discontinued education 78,037 4.9%
No registered education 624,676 39.5%

Of the 1,580,506 people in this data sample, more than half have completed some form of higher education. About half of the parents of the entire pool completed a vocational high school or college degree. There is no education status for almost 20% of parents.

To make the plot you need to data frames, one for the nodes and one with the data. The nodes frame contains all the unique values of parent and child educational attainment, or for this chart, the nodes. The sankey_plot_id frame has the frequencies and parent-to-child paths we need for the sankey band widths. Make sure to render them each as a data.frame, as that’s the format needed for the sankey functions to work.

Show code for making the sankey plot dfs
# make the nodes data frame
nodes_df <- as.data.frame(ed_attain_allage %>%
    select(parent_ed, child_ed) %>%
    pivot_longer(c("parent_ed", "child_ed"),
    names_to = "col_name", values_to = "name_match") %>%
    # this removes dupes keeps only name match
    # may need to clean names for labels to look nice
    select("name_match") %>%
    distinct())

# create ID df for order of nodes from ed_path_freqs & nodes; -1 ensures 0 index
sankey_plot_id <- as.data.frame(ed_attain_allage %>%
  mutate(IDIn = match(parent_ed, nodes_df$name_match) -1,
             IDOut = match(child_ed, nodes_df$name_match)-1))

Now we make the plot. We call the links from the df with the frequencies and the nodes fro the nodes df. Font size and family can be changed, as can the node widths and other aesthetic features. The sinksRight call places the data labels on the right side node, the end stage, outside the graph. There is no way to put the left-side labels outside the graph. Changing the size accommodates a title and caption added using htmlwidgets. Louise Sinks explained in her post that leaving the plot in native size from the sankeyNetwork call could truncate parts of the plot if you add a title.

One important thing it took some searching to solve was the order of the nodes. You can see in the data cleaning that I set them up as factors with the levels I wanted them ordered. It worked in the gt tables but initially not in the sankey. It turns out you need to add iterations = 0 to keep the auto-positioning from running.

If you hover over the node bands of the chart, you can see the total number for that path.

Show code for making the sankey plot
# create the plot as an object
edattain_sankey <-
sankeyNetwork(
    Links = sankey_plot_id, Nodes = nodes_df, 
    Source = "IDIn", Target = "IDOut",
    # "Freq" from freq df, "name" from nodes_df 
    Value = "N", NodeID = "name_match", 
    sinksRight = FALSE, iterations = 0, 
    width = 600,  height = 400,
    fontSize = 16, fontFamily = "Arial", nodeWidth = 20)

## add title and caption to plot object
edattain_sankey <-
  htmlwidgets::prependContent(
    edattain_sankey, htmltools::tags$h4("Higher education pathways for people depending on parent's educational attainment"))
edattain_sankey <-
  htmlwidgets::appendContent(
    edattain_sankey, htmltools::tags$em("Data from Danmarks Statistik table STATUSV2 via danstat package"))

edattain_sankey

Higher education pathways for people depending on parent's educational attainment

Data from Danmarks Statistik table STATUSV2 via danstat package

So what do we see? Well for parents who stopped out at the primary level, most of their kids are not recorded as having registered for higher education. Same for parents who stopped out at a vocational secondary diploma. We also see a lot of parents without any educational attainment recorded. But most of their kids have either completed higher education or started but then discontinued.

Some of this could be an artifact of the Danish personal registry system not having complete parental education records for the upper end of the 25-45 age pool here, or for parents of immigrant or 1st generation born in Denmark. Some of it could be a product of the older students here being less likely to complete higher education, something we’ve seen in earlier posts.

I’ll break down by age groups in the “complicated” prompt. Sadly we don’t have a table that shows both parent education and parent or child national origin status. I’d love to have access to micro-data to answer these more detailed questions; that would mean working for StatBank, the Education Ministry, or the EduQuant project (if it gets new funding). A man can dream…

created and posted April 18, 2025

Prompt #15 - Complicated

I took complicated to mean trying to visualize multiple dimensions. As I mentioned in prompt 14 I wanted to look at higher education completion by parent educational attainment, but broken down a bit more by age groups.

The chart for this prompt will be a waffle, which I did last year for prompts 3 & 4. We’ll use the waffle package which has a built-in geom to make waffles. Well, not make actual waffles, but a waffle plot. Sorry if this is making you hungry for waffles.

We’ll use the same data as prompt 14, so no need to redo the code to get it. We can get straight to making the charts. There was a bit of complication in making the charts, as I’ll explain.

We do need to create a dataframe to make the plot, and that’s where the first complication came in. Because of rounding, some plots went to 101, some stopped at 99. So I needed to do a bit of cleaning to make each plot even out at 100. I suppose I could have done a more nuanced rounding with an ifelse but I likely still would have had to examine the plot and check the data.

Show code for making the waffle plot data
library(tidyverse) # to do tidyverse things
library(tidylog) # to get a log of what's happening to the data
library(waffle) # to make the waffle plots
library(patchwork) # stitching together the final plots

# using the ed_attain_main df from prompt 14
ed_attain_waffle <- ed_attain_main %>%
    filter(!parent_ed == "Total") %>%
    filter(!parent_ed == "Qualifying educ paths") %>%
    filter(!child_ed == "Total") %>%
    filter(!child_age_group == "Total") %>%
    group_by(child_age_group, parent_ed) %>%
    mutate(age_grp_par_ed_sum = sum(N)) %>%
    ungroup() %>%
    group_by(child_age_group, parent_ed, child_ed) %>%
    mutate(age_par_ed_child_ed_sum = sum(N)) %>%
    ungroup() %>%
    mutate(age_par_ed_child_ed_pct = age_par_ed_child_ed_sum / age_grp_par_ed_sum) %>%
    mutate(age_par_ed_child_ed_pct2 = round(age_par_ed_child_ed_pct, 2) * 100) %>%
    distinct(parent_ed, child_age_group, child_ed, .keep_all = TRUE) %>%
    select(-child_age, -N) %>%
    mutate(parent_ed = factor(parent_ed,
        levels = c("Primary", "HS-Vocational", "HS-Academic", "Short-cycle coll",
            "Bach-Vocational", "Bach-Academic", "Masters", "PhD", "Not stated"))) %>%
    mutate(child_ed = factor(child_ed,
        levels = c("Completed education", "Undergoing education", "Discontinued education",
            "No registered education"))) %>%
    # clean up pct vals to make charts even out
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "HS-Vocational" & child_ed == "Completed education" & child_age_group == "25-29",
                40, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "Short-cycle coll" & child_ed == "No registered education"
                & child_age_group == "25-29", 31, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "Masters" & child_ed == "Completed education" & child_age_group == "25-29",
        69, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "HS-Academic" & child_ed == "Discontinued education"
                & child_age_group == "30-34", 9, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "Short-cycle coll" & child_ed == "Completed education"
                & child_age_group == "30-34", 60, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "PhD" & child_ed == "Discontinued education"
                & child_age_group == "30-34", 4, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "Primary" & child_ed == "No registered education"
                & child_age_group == "35-39", 68, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "HS-Academic" & child_ed == "No registered education"
                & child_age_group == "35-39", 39, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "Short-cycle coll" & child_ed == "Discontinued education"
                & child_age_group == "35-39", 7, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "Bach-Vocational" & child_ed == "Discontinued education"
                & child_age_group == "35-39", 5, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "Primary" & child_ed == "No registered education"
                & child_age_group == "40-45", 67, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "HS-Academic" & child_ed == "No registered education"
                & child_age_group == "40-45", 39, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "Bach-Vocational" & child_ed == "Undergoing education"
                & child_age_group == "40-45", 1, age_par_ed_child_ed_pct2)) %>%
    mutate(age_par_ed_child_ed_pct2 =
            ifelse(parent_ed == "Not stated" & child_ed == "Undergoing education"
                & child_age_group == "40-45", 1, age_par_ed_child_ed_pct2))

If we are making more than one plot with the same variables, do we copy-paste-amend the code?

Not if we can help it.

Let’s make the basic plot into a function to limit the amount of overall code and chance for error. What we’ll do here is add a filter expression to the function call.

Show code for making the plot function
waffleplot <- function(plotdf, filter_expr) {
    
    # Convert the string expression to an actual R expression
    filter_expr <- rlang::parse_expr(filter_expr)
    # Filter the data
    filtered_df <- plotdf %>% filter(!!filter_expr)
    
    # Create the plot
    ggplot(filtered_df, (aes(fill = child_ed, values = age_par_ed_child_ed_pct2))) +
        geom_waffle(na.rm=TRUE, n_rows=10, flip=TRUE, size = 0.33, colour = "white") +
        facet_wrap(~parent_ed, nrow=1,strip.position = "bottom", scales = "free_x") +
        scale_x_discrete(labels = ) +
        scale_y_continuous(labels = function(x) x * 10, # make this multipler the same as n_rows
            expand = c(0,0)) +
        scale_fill_brewer(palette = "Set2") +
        theme_minimal() +
        theme(legend.position = "bottom", legend.justification = "left",
            legend.spacing.x = unit(0, 'cm'),
            legend.key.width = unit(1, 'cm'), legend.margin=margin(-10, 0, 0, 0),
            legend.title = element_text(size = 8), legend.text = element_text(size = 8),
            panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
        guides(fill = guide_legend(label.position = "bottom",
            title = "Higher education completion status", title.position = "top"))
}

Now that we have the function, it’s easy to call. In this case we pass the data frame and age group filter. I suppose I could have added the dataframe name in the function, but since I may want to reuse the function I kept some things as generic as possible.

Calling the function is simple: waffleplot(ed_attain_waffle, "child_age_group == '25-29'"). As you can see in the code below, we’ll change the age group for each call, creating a plot object each time that we can edit before stitching together for the final plot.

Show code for making the plots
# make each individual plot
plot_2529 <- waffleplot(ed_attain_waffle, "child_age_group == '25-29'")
plot_3034 <- waffleplot(ed_attain_waffle, "child_age_group == '30-34'")
plot_3539 <- waffleplot(ed_attain_waffle, "child_age_group == '35-39'")
plot_4045 <- waffleplot(ed_attain_waffle, "child_age_group == '40-45'")

### add subtitles to the plots with age-group displayed
plot_2529 <-
plot_2529 +
    labs(subtitle = "Age group 25-29")

plot_3034 <-
    plot_3034 +
    labs(subtitle = "Age group 30-34")

plot_3539 <-
plot_3539 +
    labs(subtitle = "Age group 35-39")

plot_4045 <-
plot_4045 +
    labs(subtitle = "Age group 40-45")

# usepatchwork to stitch the plots together and add general annotations.
plot_2529 + plot_3034 +  plot_3539 + plot_4045 +
    plot_annotation(
        title = "In Denmark, regardless of age, the likelihood of completing higher education increases as level of parent education increases.",
        subtitle = "Higher education completion status for people ages 25-45, by age group and parent educational attainment, 2023. Each block = 1 %",
        caption = "Data from Danmarks Statistik table STATUSV2 via danstat package")

waffle plot displaying higher ed completion by age group and parent education

One complication I could not adequately solve is the size of the x-axis labels. There’s something in the facet_wrap call to put together the parent ed levels that does not allow me to change the size the way I could in a stand-alone plot. Not sure why, so we’ll have to live with the truncated axis labels. Otherwise I’m happy with the output.

Also, for the legend, there was no way I know of to include it in only one plot that didn’t change the proportions of that plot relative to the others, throwing off the symmetry in the stitched-together plots. So the legend is in every plot. I don’t think it’s possible to include it as an annotation or extra object in the stitched plot, but maybe there’s a hack I don’t know of in patchwork.

A few things jump out here.

  • First, the pattern is the same regardless of age. If your parents have higher levels of education, you are more likely to complete higher education. I know from my career doing higher ed policy analysis in the US, that the pattern is the same there.
  • Not surprisingly the younger age group, 25-29, includes more people who are still in the proces of finishing their college education (the orange blocks).
  • The discontinued rates (blue blocks) seemed similar-ish across age-groups, so the likelihood of stopping out does not look to be affected by parental education level.

Of course if this were a more in-depth report there’d be some descriptive tables to set the stage. And again, if I had access to the micro-data I’d be running regression models to explain factors leading to higher completion rates.

created and posted April 20, 2025