Data 304: Visualizing Data and Models
Ingredients:
Improvements?
# Histograms of body mass, faceted by species (columns) and sex (rows)
vl_chart(width = 150, height = 60) |>
  vl_add_data(penguins |> filter(!is.na(sex))) |>
  vl_encode_x("body_mass_g:Q", bin = list(maxbins = 30)) |>
  vl_encode_y(aggregate = "count") |>
  vl_axis_x(title = NA, values = (2:6) * 1000) |>
  vl_scale_x(nice = TRUE) |>
  vl_axis_y(title = NA) |>
  vl_encode_column("species:N", title = NA) |>
  vl_encode_row("sex:N", title = NA) |>
  vl_mark_bar() |>
  # A text-only chart stacked underneath acts as a single shared x-axis label
  vl_vconcat(
    vl_chart(height = 15, width = 500) |>
      vl_add_data(tibble(x = 1, y = 1)) |>
      vl_mark_text(size = 14) |>
      vl_encode_text(value = "body mass (g)") |>
      vl_encode_x("x:N", axis = FALSE) |>
      vl_encode_y("y:N", axis = FALSE) |>
      vl_config_view(stroke = "transparent")
  )
Ingredients:
Usual method: kernel density estimation (kde)
set.seed(123)
# A small skewed sample: 10 draws from a gamma distribution, rounded to 1 decimal
S <- tibble(x = round(rgamma(10, shape = 2, rate = 0.1), 1))
kde_demo <- function(bw = 6) {
  # Shared settings: data, title, and a common density axis
  base <-
    vl_chart(width = 300, height = 100) |>
    vl_add_properties(title = paste0("bandwidth: ", bw)) |>
    vl_add_data(S) |>
    vl_encode_y("density:Q") |>
    vl_scale_y(domain = c(0, 0.5))
  # Overall density estimate (red), using all observations together
  kde <- base |>
    vl_density("x", as = list("value", "density"),
               bandwidth = bw, counts = TRUE, extent = c(0, 60)) |>
    vl_mark_line(opacity = 1) |>   # opacity must lie between 0 and 1
    vl_encode_color(value = "red") |>
    vl_encode_x("value:Q")
  # One kernel per observation: grouping by x gives each value its own curve
  kernels <- base |>
    vl_density(
      "x", as = list("value", "density"),
      bandwidth = bw, groupby = list("x"),
      extent = c(-3, 60), counts = TRUE) |>
    vl_mark_line(opacity = 0.5) |>
    vl_encode_detail("x:N") |>
    vl_encode_x("value:Q")
  # Tick marks along the baseline show the raw observations
  ticks <-
    base |>
    vl_mark_tick() |>
    vl_encode(x = "x") |>
    vl_encode_y(datum = 0)
  kernels + kde + ticks
}
# Compare four bandwidths side by side
((kde_demo(2) | kde_demo(3)) &
  (kde_demo(6) | kde_demo(9)))
Ingredients:
bandwidth
determines the amount of smoothing
groupby
to compute separate densities for each group
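The role of the bandwidth can be made concrete with a by-hand version of the estimate. The following is a minimal sketch in base R, assuming a Gaussian kernel whose standard deviation equals the bandwidth; the helper `kde_by_hand()` is a hypothetical name introduced here, not a function from any package. Averaging one kernel per observation gives a probability density; summing instead of averaging gives smoothed counts, which is what `counts = TRUE` asks for in the charts above.

```r
# Same 10-point sample as in the demo above
set.seed(123)
x <- round(rgamma(10, shape = 2, rate = 0.1), 1)

# Hypothetical helper: Gaussian KDE evaluated at each grid point.
# Every observation contributes a normal curve centered on it with sd = bw;
# the estimate is the average of those curves.
kde_by_hand <- function(grid, data, bw) {
  sapply(grid, function(g) mean(dnorm(g, mean = data, sd = bw)))
}

bw   <- 6
grid <- seq(min(x) - 4 * bw, max(x) + 4 * bw, length.out = 512)
est  <- kde_by_hand(grid, x, bw)

# Sanity checks: the estimate should integrate to about 1
# and should agree closely with stats::density() on the same grid
area <- sum(est) * (grid[2] - grid[1])
ref  <- density(x, bw = bw, kernel = "gaussian",
                from = min(grid), to = max(grid), n = 512)
max_diff <- max(abs(est - ref$y))
c(area = area, max_diff = max_diff)
```

Rerunning with a smaller `bw` makes the individual bumps visible in `est`; a larger `bw` blends them into one smooth hill.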
We have focused mainly on visualizing data so far.
Examples: Two-sample t procedures, ANOVA
Method: aggregate or precompute
Warning
Beware the tendency to summarize everything with means (and to use bars to display them). This is a common approach to visualization, but it hides variation, which is the other side of the coin.
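One way to "aggregate or precompute" while heeding the warning is to compute more than the means. The sketch below uses base R and the built-in `PlantGrowth` data (a stand-in chosen here because it ships with R; it does not appear elsewhere in these slides) to build a small table holding each group's size, mean, standard deviation, and a t-based 95% interval. A table like this can be drawn as points with error bars layered over the raw data, so the variation stays visible.

```r
# Stand-in data: PlantGrowth ships with R (plant weights in three groups)
groups <- split(PlantGrowth$weight, PlantGrowth$group)

# One row of summary statistics per group
summ <- do.call(rbind, lapply(names(groups), function(g) {
  w    <- groups[[g]]
  n    <- length(w)
  se   <- sd(w) / sqrt(n)               # standard error of the mean
  half <- qt(0.975, df = n - 1) * se    # half-width of a 95% t interval
  data.frame(group = g, n = n, mean = mean(w), sd = sd(w),
             lower = mean(w) - half, upper = mean(w) + half)
}))
summ
```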
Vega-Lite can handle some basic regression models with the regression transform.
base <-
  vl_chart(width = 500) |>
  vl_add_data(penguins) |>
  vl_encode_x("body_mass_g:Q", scale = list(zero = FALSE)) |>
  vl_encode_y("flipper_length_mm:Q", scale = list(zero = FALSE))
points <- base |> vl_mark_circle(opacity = 0.3, size = 15)
# Regress flipper length (the y variable) on body mass (the x variable)
line <- base |>
  vl_regression("flipper_length_mm", on = "body_mass_g") |>
  vl_mark_line()
points + line
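To see what kind of data a regression layer needs, here is a minimal base R sketch. It uses the built-in `mtcars` data as a stand-in for `penguins` (an assumption made so the example runs without extra packages). A straight line is determined by two points, so the data for the line can be as small as the fitted values at the two ends of the x range.

```r
# Stand-in data: mtcars ships with R; wt plays the role of x, mpg of y
fit <- lm(mpg ~ wt, data = mtcars)

# Two rows are enough to draw the fitted line:
# predictions at the smallest and largest observed x
line_data <- data.frame(wt = range(mtcars$wt))
line_data$mpg <- unname(predict(fit, newdata = line_data))
line_data

# The slope between those two points equals the fitted slope
slope <- diff(line_data$mpg) / diff(line_data$wt)
all.equal(slope, unname(coef(fit)["wt"]))
```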
LOESS = LOcally Estimated Scatterplot Smoothing
Big ideas:
base <-
  vl_chart() |>
  vl_add_data(penguins) |>
  vl_encode_x("body_mass_g:Q", scale = list(zero = FALSE)) |>
  vl_encode_y("flipper_length_mm:Q", scale = list(zero = FALSE))
points <- base |> vl_mark_circle(opacity = 0.3, size = 15)
# LOESS curve for flipper length (y) as a function of body mass (x)
line <- function(bw = 0.2, color = "steelblue", opacity = 0.7) {
  base |>
    vl_loess("flipper_length_mm", on = "body_mass_g", bandwidth = bw) |>
    vl_mark_line(color = color, opacity = opacity) |>
    vl_add_properties(title = paste0("bandwidth = ", bw))
}
# Same scatter plot with four amounts of smoothing
vl_concat(
  points + line(0.1, "red"),
  points + line(0.2, "forestgreen"),
  points + line(0.5, "navy"),
  points + line(0.8, "black")
)
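The effect of the bandwidth can also be checked numerically. The following is a rough sketch using base R's `loess()` on simulated data; both the data and the helper name `fit_span()` are made up for illustration. R's `span` argument plays a role similar to `bandwidth` above (the fraction of the data used in each local fit), though the two implementations need not agree in every detail. The equivalent number of parameters stored in the fitted object is one measure of how flexible the curve is.

```r
# Simulated stand-in data: a smooth curve plus noise
set.seed(304)
d <- data.frame(x = seq(0, 10, length.out = 200))
d$y <- sin(d$x) + rnorm(200, sd = 0.4)

# Hypothetical helper: LOESS fit using a given fraction of the data per local fit
fit_span <- function(span) loess(y ~ x, data = d, span = span)

spans <- c(0.1, 0.2, 0.5, 0.8)
fits  <- lapply(spans, fit_span)

# Equivalent number of parameters: larger means a wigglier, more flexible curve
enp <- sapply(fits, function(f) f$enp)
data.frame(span = spans, enp = enp)
```

Small spans chase local noise; large spans give a curve that behaves more like a low-order polynomial.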
Consider showing both the data and the model.
Include a representation of model uncertainty.
Perform model diagnostics by visualizing data related to the model.
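As a small illustration of "data related to the model", the sketch below builds a residuals-versus-fitted table in base R, with the built-in `mtcars` data standing in for `penguins` (an assumption for portability). Each row pairs an observation with its fitted value and residual; plotting `resid` against `fitted` as points, with a horizontal reference line at 0, is a common first diagnostic.

```r
# Stand-in model: one quantitative and one categorical predictor
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

diag_data <- data.frame(
  fitted    = fitted(fit),     # model predictions for the observed rows
  resid     = resid(fit),      # observed minus fitted
  std_resid = rstandard(fit)   # residuals rescaled to roughly unit variance
)
head(diag_data, 3)

# Least-squares residuals average to zero and are uncorrelated with the fitted
# values, so a visible curve or funnel shape in the plot suggests a model problem
c(mean_resid    = mean(diag_data$resid),
  cor_fit_resid = cor(diag_data$fitted, diag_data$resid))
```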
Surprise and Scale
Visualization can surprise you, but it doesn’t scale well. Modeling scales well, but it can’t surprise you. (Hadley Wickham, paraphrased)
library(broom)
model <-
  lm(flipper_length_mm ~ body_mass_g + sex + species,
     data = penguins)
tidy(model)
# A tibble: 5 × 5
  term              estimate std.error statistic   p.value
  <chr>                <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)      165.       3.18         51.7  1.01e-159
2 body_mass_g        0.00655  0.000931      7.04 1.15e- 11
3 sexmale            2.48     0.854         2.90 3.97e-  3
4 speciesChinstrap   5.54     0.785         7.06 9.92e- 12
5 speciesGentoo     18.0      1.44         12.5  1.46e- 29
augment(model) |> head(2)
# A tibble: 2 × 11
  .rownames flipper_length_mm body_mass_g sex    species .fitted .resid   .hat
  <chr>                 <int>       <int> <fct>  <fct>     <dbl>  <dbl>  <dbl>
1 1                       181        3750 male   Adelie     192. -10.6  0.0124
2 2                       186        3800 female Adelie     189.  -3.48 0.0154
# ℹ 3 more variables: .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
base <-
  vl_chart() |>
  vl_encode_x("body_mass_g:Q", scale = list(zero = FALSE)) |>
  vl_encode_y("flipper_length_mm:Q", scale = list(zero = FALSE))
# Observed data
points <-
  base |>
  vl_mark_point()
# Fitted values from the model
line <-
  base |>
  vl_encode_y(".fitted:Q") |>
  vl_mark_line()
# Band of one residual standard deviation (.sigma) on either side of the fit
area <- base |>
  vl_mark_area(opacity = 0.5) |>
  vl_encode_y2("upper:Q") |>
  vl_encode_y("lower:Q")
# Layer the three marks, facet by sex and species, then attach the augmented data
(points + area + line) |>
  vl_facet_row("sex:N") |>
  vl_facet_column("species:N") |>
  vl_add_properties(width = 150, height = 50) |>
  vl_add_data(
    augment(model) |>
      mutate(upper = .fitted + .sigma) |>
      mutate(lower = .fitted - .sigma)
  )
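The band above uses `.fitted ± .sigma`, roughly one residual standard deviation on either side of the fit. If an interval with a stated confidence level is wanted instead, base R's `predict()` can supply the bounds. This is a sketch of an alternative, not the method used in the chart above, and `mtcars` is again a stand-in for `penguins`.

```r
# Stand-in model: one quantitative and one categorical predictor
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# 95% confidence interval for the mean response at each observed row
ci <- predict(fit, newdata = mtcars, interval = "confidence", level = 0.95)
# 95% prediction interval for a new observation at the same predictor values
pred <- predict(fit, newdata = mtcars, interval = "prediction", level = 0.95)

# One table with everything needed for a line plus two nested bands
band <- data.frame(
  wt = mtcars$wt, cyl = mtcars$cyl,
  fitted   = ci[, "fit"],
  ci_lower = ci[, "lwr"],   ci_upper = ci[, "upr"],
  pi_lower = pred[, "lwr"], pi_upper = pred[, "upr"]
)
head(band, 3)
```

The confidence band describes uncertainty about the fitted mean; the wider prediction band describes where individual observations are expected to fall. Which one to draw depends on the question the graphic is meant to answer.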
Exercise 1 The anscombe data set (in Vega Data Sets, also in R) contains 4 x-y pairs.
Make a graphic showing scatter plots with regression lines for all four pairs. What is the point of this (artificial) data set?
Exercise 2 Make a list of visualizations “of models” that you have seen (in other classes, in papers, etc.). Describe them using the grammar of graphics and the data required to make them. (You may find it handy to draw a sketch for each.)
Exercise 3 Open one (or more) of the graphics in these slides that uses a transform and inspect the data in the data viewer to see what the data look like post transformation.
Exercise 4 Return to the penguins data and create a scatter plot with multiple regression or loess lines. What groups of penguins should you use?
Exercise 5 Describe a qq-plot in terms of the grammar of graphics and the data you need to make one. (You might start by reminding yourself – or asking someone – what a qq-plot is.)
Create a qq-plot (you may choose the data; perhaps use one of the features of penguins) using the quantile transform.