country | gold | silver | bronze |
---|---|---|---|
Norway | 14 | 14 | 11 |
Germany | 14 | 10 | 7 |
Canada | 11 | 8 | 10 |
Data 304
We’ll focus on 3 and 4 since we have already done 1 and 2 in Vega-Lite.
Similar concepts, but some annoying inconsistency in naming things.
We’ll focus on R/tidyverse rather than Python here because most of you have been doing your work in R rather than Python. But…
Python/pandas is mimicking R/tidyverse, which mimicked SQL
Python has begun mimicking R/tidyverse recently.
group_by() |> summarise() \(\to\) .groupby().agg()
The basic ideas (and often the names) go back to SQL and its operators.
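To make the SQL lineage concrete, here is a small sketch using Python's built-in sqlite3 module and the medal counts from the tables in these notes. The table name `medals` and the alias `total` are made up for this illustration. The GROUP BY query plays the same role as group_by() |> summarise() in R or .groupby().agg() in pandas.

```python
import sqlite3

# Medal counts in long form (values taken from the tables in these notes).
rows = [
    ("Norway", "gold", 14), ("Norway", "silver", 14), ("Norway", "bronze", 11),
    ("Germany", "gold", 14), ("Germany", "silver", 10), ("Germany", "bronze", 7),
    ("Canada", "gold", 11), ("Canada", "silver", 8), ("Canada", "bronze", 10),
]

con = sqlite3.connect(":memory:")
# "medals" and "total" are hypothetical names chosen for this sketch.
con.execute('CREATE TABLE medals (country TEXT, medal TEXT, "count" INTEGER)')
con.executemany("INSERT INTO medals VALUES (?, ?, ?)", rows)

# One output row per country: the SQL ancestor of group_by() |> summarise().
totals = con.execute(
    """
    SELECT country, SUM("count") AS total
    FROM medals
    GROUP BY country
    ORDER BY total DESC
    """
).fetchall()
con.close()

for country, total in totals:
    print(country, total)
```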
A nice comparison of R vs Python code for similar tasks can be found here.
If we transform in R/Python, we need to either include the data in our JSON (making it potentially large) or save the data and tell Vega-Lite where to find it. No need for either if we transform in Vega-Lite.
We may want to use composition (concat, layers, repeat) with different transformations.
Easier to inspect the data to make sure transformation is working.
Some complicated data transformations are simpler to implement.
Can use the same toolkit whether transforming for a graphic or for some other reason.
dplyr::filter()
.query()
"transform": [{"filter": ...}, ...]
vl_filter()
Chart.transform_filter()
Chart$transform_filter()
dplyr::select()
.loc[]
dplyr::mutate()
.assign()
"transform": [{"calculate": ...}, ...]
vl_calculate()
Chart.transform_calculate()
Chart$transform_calculate()
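A rough sketch of how filter and calculate fit together, written as a Vega-Lite spec built from a plain Python dict. The new field name `total`, the cutoff of 30, and the bar chart are assumptions chosen only for illustration. The second half repeats the same two steps in plain Python so we can see which rows the chart would receive.

```python
import json

# Wide-format medal data (from the table in these notes).
medals_wide = [
    {"country": "Norway", "gold": 14, "silver": 14, "bronze": 11},
    {"country": "Germany", "gold": 14, "silver": 10, "bronze": 7},
    {"country": "Canada", "gold": 11, "silver": 8, "bronze": 10},
]

# Vega-Lite version: transforms run in the order they are listed,
# so the filter can refer to the field created by calculate.
spec = {
    "data": {"values": medals_wide},
    "transform": [
        {"calculate": "datum.gold + datum.silver + datum.bronze", "as": "total"},
        {"filter": "datum.total > 30"},  # hypothetical cutoff
    ],
    "mark": "bar",
    "encoding": {
        "x": {"field": "total", "type": "quantitative"},
        "y": {"field": "country", "type": "nominal"},
    },
}
print(json.dumps(spec["transform"], indent=2))

# The same two steps in plain Python.
with_total = [
    {**row, "total": row["gold"] + row["silver"] + row["bronze"]}
    for row in medals_wide
]  # like mutate() / .assign() / calculate
kept = [row for row in with_total if row["total"] > 30]  # like filter() / .query()
print([(row["country"], row["total"]) for row in kept])
```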
Converting between wide and long forms of the same data is a common wrangling step for both analysis and visualization.
Exercise 1 For what sorts of visualization would each of these formats be better?
URL: https://calvin-data304.netlify.app/data/medals-wide.csv
country | gold | silver | bronze |
---|---|---|---|
Norway | 14 | 14 | 11 |
Germany | 14 | 10 | 7 |
Canada | 11 | 8 | 10 |
Exercise 2 What must we specify to convert this to long format?
Solution 1.
tidyr::pivot_longer()
pd.melt() or pd.wide_to_long()
{"transform": [{"fold": ...}, ...]}
vl_fold()
Chart.transform_fold()
Chart$transform_fold()
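To see what a fold does to the rows, here is a minimal plain-Python sketch. The helper name `fold` and its signature are made up for this illustration and are not part of any library. It assumes, as Vega-Lite's fold does by default, that the new columns are called key and value unless other names are given, and that the original columns are kept alongside them.

```python
def fold(rows, fields, as_=("key", "value")):
    """Return one output row per (input row, folded field) pair."""
    key_name, value_name = as_
    out = []
    for row in rows:
        for field in fields:
            # Keep the original columns and add the key/value pair.
            out.append({**row, key_name: field, value_name: row[field]})
    return out


# Wide-format medal data (from the table in these notes).
medals_wide = [
    {"country": "Norway", "gold": 14, "silver": 14, "bronze": 11},
    {"country": "Germany", "gold": 14, "silver": 10, "bronze": 7},
    {"country": "Canada", "gold": 11, "silver": 8, "bronze": 10},
]

medals_long = fold(medals_wide, ["gold", "silver", "bronze"], as_=("medal", "count"))
for row in medals_long:
    print(row["country"], row["medal"], row["count"])
```

Three rows and three folded columns give nine rows: we must say which columns to fold and what to call the two new columns.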
library(vegabrite)
vb1 <- vl_chart() |>
vl_fold(
c("bronze", "silver", "gold"),
as = c("medal", "count")) |>
vl_mark_line(point = TRUE) |> # Note: shortcut for line + point!
vl_encode_x("medal:O") |>
vl_encode_y("count:Q") |>
vl_encode_color("country:N") |>
vl_add_data(medals_wide) |>
vl_add_properties(width = 600, height = 150)
vb1
library(altair)
chart1 <- alt$Chart(medals_wide)$
transform_fold(
c("bronze", "silver", "gold"),
as_ = c("medal", "count") # Note: underscore
)$
mark_line(point = TRUE)$ # Note: shortcut for line + point!
encode(
x = "medal:O",
y = "count:Q",
color = "country:N")
chart1$properties(width = 600, height = 150)
import altair as alt
chart1 = alt.Chart(medals_wide).transform_fold(
["bronze", "silver", "gold"],
as_ = ["medal", "count"] # Note: underscore
).mark_line(point = True).encode( # Note: shortcut for line + point!
alt.X(field = "medal", type = "ordinal"),
alt.Y(field = "count", type = "quantitative"),
alt.Color(field = "country", type = "nominal")
)
chart1.properties(width = 600, height = 150)
Exercise 3 What is bad about this graphic?
Exercise 4 How do we change the display order to bronze, silver, gold?
Solution 2. Set the domain of the scale.
chart2 = alt.Chart(medals_wide).transform_fold(
["bronze", "silver", "gold"], as_ = ["medal", "count"]
).mark_line(point = True).encode(
alt.X(field = "medal", type = "ordinal").scale(
domain = ["bronze", "silver", "gold"]),
alt.Y(field = "count", type = "quantitative"),
alt.Color(field = "country", type = "nominal")
)
chart2.properties(width = 600, height = 150)
Exercise 5 How else could we achieve this?
Solution 3. We could use sort instead. Sort also allows us to compute an order rather than specify it explicitly.
chart3 = alt.Chart(medals_wide).transform_fold(
["gold", "silver", "bronze"], as_ = ["medal", "count"]
).mark_line(point = True).encode(
alt.X("medal:O", sort = ["gold", "silver", "bronze"]), # sort!
alt.Y("count:Q"),
alt.Color(
"country:N",
sort = alt.EncodingSortField(
field = "count", op = "sum", order = "descending") )
)
chart3.properties(width = 600, height = 150)
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.20.1.json",
"config": {
"view": {
"continuousHeight": 300,
"continuousWidth": 300
}
},
"data": {
"name": "data-fe97747cf726de8a07e6c8ec7fd7c1de"
},
"datasets": {
"data-fe97747cf726de8a07e6c8ec7fd7c1de": [
{
"bronze": 11,
"country": "Norway",
"gold": 14,
"silver": 14
},
{
"bronze": 7,
"country": "Germany",
"gold": 14,
"silver": 10
},
{
"bronze": 10,
"country": "Canada",
"gold": 11,
"silver": 8
}
]
},
"encoding": {
"color": {
"field": "country",
"sort": {
"field": "count",
"op": "sum",
"order": "descending"
},
"type": "nominal"
},
"x": {
"field": "medal",
"sort": [
"gold",
"silver",
"bronze"
],
"type": "ordinal"
},
"y": {
"field": "count",
"type": "quantitative"
}
},
"mark": {
"point": true,
"type": "line"
},
"transform": [
{
"as": [
"medal",
"count"
],
"fold": [
"gold",
"silver",
"bronze"
]
}
]
}
{
...
"encoding": {
"x": { "field": "medal", "type": "ordinal" },
"y": { "field": "count", "type": "quantitative" },
"color": { "field": "country", "type": "nominal" }
},
"mark": { "type": "line" },
"transform": [
{
"as": [ "medal", "count" ],
"fold": [ "gold", "silver", "bronze" ]
}
]
}
{ ...
"encoding": {
"x": {
"field": "medal",
"type": "ordinal",
"scale": { "domain": [ "gold", "silver", "bronze" ]},
...
}, ...
}
Big ideas:
Fancier tooltips are possible with more customization of what is displayed.
URL: https://calvin-data304.netlify.app/data/medals-long.csv
country | medal | count |
---|---|---|
Norway | gold | 14 |
Norway | silver | 14 |
Norway | bronze | 11 |
Germany | gold | 14 |
Germany | silver | 10 |
Germany | bronze | 7 |
Canada | gold | 11 |
Canada | silver | 8 |
Canada | bronze | 10 |
tidyr::pivot_wider()
pd.pivot() or pd.pivot_table()
{"transform": [{"pivot": ...}]}
vl_pivot()
Chart.transform_pivot()
Chart$transform_pivot()
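Going the other way, a pivot spreads the values of one column into new columns. The sketch below is plain Python, not a library call: the helper name `pivot_rows` is made up, and its arguments are named after the pivot, value, and groupby properties of the Vega-Lite transform. Repeated cells are summed, which I believe matches the default aggregation in Vega-Lite.

```python
def pivot_rows(rows, pivot, value, groupby):
    """Spread `pivot` values into columns, one output row per `groupby` key."""
    wide = {}  # dicts keep insertion order, so groups appear in first-seen order
    for row in rows:
        key = tuple(row[g] for g in groupby)
        out = wide.setdefault(key, {g: row[g] for g in groupby})
        column = row[pivot]
        # Sum repeated cells (assumed default aggregation).
        out[column] = out.get(column, 0) + row[value]
    return list(wide.values())


# Long-format medal data (from the table in these notes).
medals_long = [
    {"country": "Norway", "medal": "gold", "count": 14},
    {"country": "Norway", "medal": "silver", "count": 14},
    {"country": "Norway", "medal": "bronze", "count": 11},
    {"country": "Germany", "medal": "gold", "count": 14},
    {"country": "Germany", "medal": "silver", "count": 10},
    {"country": "Germany", "medal": "bronze", "count": 7},
    {"country": "Canada", "medal": "gold", "count": 11},
    {"country": "Canada", "medal": "silver", "count": 8},
    {"country": "Canada", "medal": "bronze", "count": 10},
]

medals_wide = pivot_rows(medals_long, pivot="medal", value="count", groupby=["country"])
for row in medals_wide:
    print(row)
```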
Exercise 6 How else could we do something similar?
We could also do this with filter and facets.
Data URL: https://cdn.jsdelivr.net/npm/vega-datasets@2.8.0/data/jobs.json
Exercise 7 Use these data to create some visualizations
Scatterplot showing the percent of men and women in various occupations in 1950 and 2000.
Time series plot showing the percent of men and women in different occupations over time.
Create additional graphics with these data.
Each of the following data sets contains two columns. The first column is country and the second is one of the following.
The data come from the CIA World Factbook and are 10-20 years old.
Suppose you want to make a scatter plot for two of these measures.
Problem: The variables are in different data sets.
Solution: Join the data sets together so we have a data set with 3 columns: country and two of these measures, like this:
country | obesity | GDP |
---|---|---|
American Samoa | 74.6 | 575300000 |
Nauru | 71.1 | 6e+07 |
Cook Islands | 63.7 | 183200000 |
Let’s call them left and right because one will be written first (left) and the other second (right) when we join them in code.
Once we have selected the data sets, we still have two decisions to make:
What column(s) will be used to identify rows that “match”?
What will we do with rows in one data set that don’t have any matches in the other data set?
The most important joins for us are:
left join: keep every row of left, fill in with NA when there is no match in right
right join: keep every row of right, fill in with NA when there is no match in left
full join: keep every row of both, fill in with NA when things don’t match
For data visualization purposes, a left join is very common.
This page has some nice animations of join operations.
This shiny app includes a nice visualization of how joins work.
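A minimal plain-Python sketch of the two decisions above: which column identifies a match, and what happens to rows without one. The helper names `left_join` and `inner_join` are made up for this illustration. The numbers come from the table in these notes, but Cook Islands is deliberately left out of the right-hand data so that one row has no match. None stands in for NA.

```python
def left_join(left, right, by):
    """Keep every row of `left`; fill with None where `right` has no match."""
    index = {row[by]: row for row in right}  # assumes `by` is unique in `right`
    right_cols = sorted({c for row in right for c in row if c != by})
    joined = []
    for row in left:
        match = index.get(row[by], {})
        joined.append({**row, **{c: match.get(c) for c in right_cols}})
    return joined


def inner_join(left, right, by):
    """Keep only the rows of `left` that have a match in `right`."""
    index = {row[by]: row for row in right}
    return [{**row, **index[row[by]]} for row in left if row[by] in index]


obesity = [
    {"country": "American Samoa", "obesity": 74.6},
    {"country": "Nauru", "obesity": 71.1},
    {"country": "Cook Islands", "obesity": 63.7},
]
# Cook Islands omitted on purpose to show a non-match.
gdp = [
    {"country": "American Samoa", "GDP": 575300000},
    {"country": "Nauru", "GDP": 60000000},
]

left_joined = left_join(obesity, gdp, by="country")
inner_joined = inner_join(obesity, gdp, by="country")
print(left_joined)   # three rows; Cook Islands has GDP None
print(inner_joined)  # two rows; Cook Islands is dropped
```

For a scatter plot the unmatched row cannot be drawn either way, but the left join keeps it visible in the data so we can notice that something is missing.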
country | obesity | GDP |
---|---|---|
American Samoa | 74.6 | 575300000 |
Nauru | 71.1 | 6e+07 |
Cook Islands | 63.7 | 183200000 |
By the way:
by can be a vector of column names.
If the column names differ, use by = c("left name" = "right name").
If you omit by, then all columns with matching names are used.
See also <
vb_countries <-
vl_chart() |>
vl_lookup(
lookup = "country", # by-variable in primary data
from = list(
data = list(url = "https://calvin-data304.netlify.app/data/GDP.csv"),
key = "country",
fields = list("GDP")
)) |>
vl_mark_point() |>
vl_encode_x("GDP:Q") |>
vl_encode_y("obesity:Q") |>
vl_add_properties(width = 600, height = 150) |>
vl_add_data_url("https://calvin-data304.netlify.app/data/obesity.csv")
vb_countries
data_url <- "https://calvin-data304.netlify.app/data/obesity.csv"
lookup_url <- "https://calvin-data304.netlify.app/data/GDP.csv"
alt$Chart(data_url)$
transform_lookup(
lookup = "country", # by-variable in primary data
from_ = alt$LookupData(
data = list(url = lookup_url),
key = "country",
fields = list("GDP")
))$
mark_point(color = "purple")$
encode(
x = "GDP:Q",
y = "obesity:Q"
)$
properties(width = 600, height = 150)
See also https://altair-viz.github.io/user_guide/transform/lookup.html
data_url = "https://calvin-data304.netlify.app/data/obesity.csv"
lookup_url = "https://calvin-data304.netlify.app/data/GDP.csv"
(
alt.Chart(data_url)
.transform_lookup(
lookup = "country", # by-variable in primary data
from_ = alt.LookupData(
data = {"url": lookup_url},
key = "country",
fields = ["GDP"]
))
.mark_point(color = "red")
.encode(
x = "GDP:Q",
y = "obesity:Q"
)
.properties(width = 600, height = 150)
)
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"transform": [
{
"lookup": "country",
"from": {
"data": {
"url": "https://calvin-data304.netlify.app/data/GDP.csv"
},
"key": "country",
"fields": [
"GDP"
]
}
}
],
"mark": {
"type": "point"
},
"encoding": {
"x": {
"field": "GDP",
"type": "quantitative"
},
"y": {
"field": "obesity",
"type": "quantitative"
}
},
"height": 150,
"width": 600,
"data": {
"url": "https://calvin-data304.netlify.app/data/obesity.csv"
}
}
Exercise 8 Make the following modifications to the obesity vs GDP plot.
Add a tooltip so you can tell which country is which.
Use a logarithmic scale where appropriate.
Format the GDP axis using “.0s” as the format string.
Color or facet the points by continent. (You will need to use this data or something similar to get the region information.)
You might notice something unexpected in this plot. What is it? Why is it happening? What could be done to fix it?
Exercise 9 Create a graphic using a different pair of variables.