Common
Data Transformations
for Graphing

Data 304

Common Data Transformations

Subsetting rows and/or columns
Computing new variables, possibly within groups (aggregation)
Wide vs long format
Merging multiple data sources

We’ll focus on 3 and 4 since we have already done 1 and 2 in Vega-Lite.

Options

R/tidyverse
Python/pandas
Vega-lite
vegabrite
altair/Altair
SQL
etc, etc.

Similar concepts, but some annoying inconsistency in naming things.

R vs Python vs SQL

We’ll focus on R/tidyverse rather than Python here because most of you have been doing your work in R rather than Ptyhon. But…

Python/pandas is mimicking R/tidyverse, which mimicked SQL

Python has begun mimicking R/tidyverse recently.

Example: group_by() |> summarise() $\to$ .groupby().agg()

The basic ideas (and often the names) go back to SQL and its operators.

A nice comparison of R vs Python code for similar tasks can be found here.

Where to do transformations

Main advantages to data transformation inside Veag-Lite:

If we transform in R/Python, we need to either include the data in our JSON (making it potentially large) or save the data and tell Vega-Lite where to find it. No need for either if we transform in Vega-Lite.
We may want to use composition (concat, layers, repeat) with different transforamtions.

Main advantages to transforming in R or Python:

Easier to inspect the data to make sure transformation is working.
Some complicated data transformations are simpler to implement.
Can use the same toolkit whether transforming for a graphic or for some other reason.

Subsetting Rows

R tidyverse: dplyr::filter()
Python/pandas: .query()
Vega-Lite: "transform": [{"filter": ...}, ...]
vegabrite: vl_filter()
Altair: Chart.transform_filter()
altair: Chart$transform_filter()

Subsetting Columns

R/tidyverse: dplyr::select()
Python/pandas: .loc()

Creating new variables (columns)

R/tidyverse: dplyr::mutate()
Pyhon/pandas: .assign()
Vega-Lite: "transform": [{"calculate": ...}, ...]
vegabrite: vl_calculate()
Altair: Chart.transform_calculate()
altair: Chart$transform_calculate()

Wide vs Long

Converting between wide and long forms of the same data is a commmon wrangling step for both analysis and visualization.

Exercise 1 For what sorts of visualization would each of these formats be better?

Wide $\to$ Long

Some wide data

URL: https://calvin-data304.netlify.app/data/medals-wide.csv

Code

library(dplyr)
library(tidyr)
medals_wide <- readr::read_csv("https://calvin-data304.netlify.app/data/medals-wide.csv")
medals_wide |> pander::pander()

country	gold	silver	bronze
Norway	14	14	11
Germany	14	10	7
Canada	11	8	10

Exercise 2 What must we specify to convert this to long format?

Solution 1.

Which variables (and their values) are being “tipped”
Which are to be repeated (“id” variables)
How to name things in the result

Converting: wide $\to$ long

R: dplyr::pivot_longer()
Python: pd.wide_to_long()
Vega-Lite: {"transform": [{"fold": ... }, ...]
vegabrite: vl_fold()
Altair: Chart.transform_fold()
altair: Chart$transform_fold()

library(tidyr)
medals_wide |>
  pivot_longer(gold:bronze, names_to = "medal", values_to = "count") |>
  pander::pander()

country	medal	count
Norway	gold	14
Norway	silver	14
Norway	bronze	11
Germany	gold	14
Germany	silver	10
Germany	bronze	7
Canada	gold	11
Canada	silver	8
Canada	bronze	10

library(vegabrite)

vb1 <- vl_chart() |>
  vl_fold(
    c("bronze", "silver", "gold"),
    as = c("medal", "count"))  |>   
  vl_mark_line(point = TRUE) |>      # Note: shortcut for line + point!
  vl_encode_x("medal:O") |>
  vl_encode_y("count:Q") |>
  vl_encode_color("country:N") |>
  vl_add_data(medals_wide) |> 
  vl_add_properties(width = 600, height = 150)
vb1

library(altair)
chart1 <- alt$Chart(medals_wide)$
  transform_fold(
    c("bronze", "silver", "gold"),
    as_ = c("medal", "count")             # Note: underscore
  )$
  mark_line(point = TRUE)$
  encode(                        # Note: shortcut for line + point!
    x = "medal:O",
    y = "count:Q",
    color = "country:N")

chart1$properties(width = 600, height = 150)

import altair as alt
chart1 = alt.Chart(medals_wide).transform_fold(
  ["bronze", "silver", "gold"],
  as_ = ["medal", "count"]             # Note: underscore
  ).mark_line(point = True).encode(    # Note: shortcut for line + point!
    alt.X(field = "medal", type = "ordinal"),
    alt.Y(field = "count", type = "quantitative"),
    alt.Color(field = "country", type = "nominal")
    )

chart1.properties(width = 600, height = 150)

Hmm…

Exercise 3 What is bad about this graphic?

Repairing the order of the medals

Exercise 4 How do we change the display order to bronze, silver, gold?

Solution 2. Set the domain of the scale.

vegabrite
Altair

vb1 |>
  vl_scale_x(domain = c("bronze", "silver", "gold")) |>
  vl_scale_color(domain = c("Norway", "Germany", "Canada"))

chart2 = alt.Chart(medals_wide).transform_fold(
  ["bronze", "silver", "gold"], as_ = ["medal", "count"] 
  ).mark_line(point = True).encode(
    alt.X(field = "medal", type = "ordinal").scale(
      domain = ["bronze", "silver", "gold"]),
    alt.Y(field = "count", type = "quantitative"),
    alt.Color(field = "country", type = "nominal")
    )

chart2.properties(width = 600, height = 150)

Let’s try again

Exercise 5 How else could we achieve this?

Solution 3. We could use sort instead. Sort also allows us to compute an order rather than specify it explicitly.

vegabrite
Altair
Vega-Lite

vb1 |>
  vl_sort_x(c("bronze", "silver", "gold")) |>
  vl_sort_color_by_field("count", op = "sum", order = "descending")

chart3 = alt.Chart(medals_wide).transform_fold(
  ["gold", "silver", "bronze"], as_ = ["medal", "count"]            
  ).mark_line(point = True).encode(
    alt.X("medal:O", sort = ["gold", "silver", "bronze"]),  # sort!
    alt.Y("count:Q"),
    alt.Color(
      "country:N", 
      sort = alt.EncodingSortField(
        field = "count", op = "sum", order = "descending") )
    )

chart3.properties(width = 600, height = 150)

print(chart3.to_json())

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.20.1.json",
  "config": {
    "view": {
      "continuousHeight": 300,
      "continuousWidth": 300
    }
  },
  "data": {
    "name": "data-fe97747cf726de8a07e6c8ec7fd7c1de"
  },
  "datasets": {
    "data-fe97747cf726de8a07e6c8ec7fd7c1de": [
      {
        "bronze": 11,
        "country": "Norway",
        "gold": 14,
        "silver": 14
      },
      {
        "bronze": 7,
        "country": "Germany",
        "gold": 14,
        "silver": 10
      },
      {
        "bronze": 10,
        "country": "Canada",
        "gold": 11,
        "silver": 8
      }
    ]
  },
  "encoding": {
    "color": {
      "field": "country",
      "sort": {
        "field": "count",
        "op": "sum",
        "order": "descending"
      },
      "type": "nominal"
    },
    "x": {
      "field": "medal",
      "sort": [
        "gold",
        "silver",
        "bronze"
      ],
      "type": "ordinal"
    },
    "y": {
      "field": "count",
      "type": "quantitative"
    }
  },
  "mark": {
    "point": true,
    "type": "line"
  },
  "transform": [
    {
      "as": [
        "medal",
        "count"
      ],
      "fold": [
        "gold",
        "silver",
        "bronze"
      ]
    }
  ]
}

Native Vega-Lite for previous graphic

{
  ...
  "encoding": { 
    "x": { "field": "medal", "type": "ordinal" },
    "y": { "field": "count", "type": "quantitative" },
    "color": { "field": "country", "type": "nominal" } 
  },
  "mark": { "type": "line" },
  "transform": [
    {
      "as": [ "medal", "count" ],
      "fold": [ "gold", "silver", "bronze" ]
    }
  ]
}

Native Vega-Lite for setting scale’s domain

{ ...
  "encoding": {
    "x": {
      "field": "medal",
      "type": "ordinal"
      "scale": { "domain": [ "gold", "silver", "bronze" ]},
    ...
  }, ...
}

Native Vega-Lite for sort

{ ...,
  "encoding": { ..., 
    "x": {
      "field": "medal", "type": "ordinal",
      "sort": [ "gold", "silver", "bronze" ]
    }, ...
}

Simple Tooltips

Big ideas:

Tooltips are encodings, just like x, y, color, etc.
Provide array of fields to display in the tooltip.

Fancier tooltips are possible with more customization of what is displayed.

vegabrite
Altair
altair

vb1 |>
  vl_encode_tooltip_array(c("count:Q", "medal:N", "country:N"))

chart1.encode(
  alt.Tooltip(["medal:N", "country:N", "count:Q"])
)

chart1$encode(
  alt$Tooltip(c("medal:N", "country:N", "count:Q"))
)

Long $\to$ Wide

Long data

URL: https://calvin-data304/data/medals-long.csv

country	medal	count
Norway	gold	14
Norway	silver	14
Norway	bronze	11
Germany	gold	14
Germany	silver	10
Germany	bronze	7
Canada	gold	11
Canada	silver	8
Canada	bronze	10

Converting Long $\to$ Wide

R: dplyr::pivot_wider()
Python: pd.pivot() or pd.pivot_table()
Vega-Lite: {"transform": [{"pivot": ...}]}
vegabrite: vl_pivot()
Altair: Chart.transform_pivot()
altair: Chart$transform_pivot()

Converting Long $\to$ Wide

R/tidyverse
vegabrite
Altair
Vega-Lite

medals_long |>
  pivot_wider(names_from = "medal", values_from = "count") |>
  pander::pander()

country	gold	silver	bronze
Norway	14	14	11
Germany	14	10	7
Canada	11	8	10

vl_chart() |>
  # note required use of list() here!
  vl_pivot(pivot = "medal", groupby = list("country"), value = "count") |>
  vl_mark_bar(fill = "gold") |>
  vl_encode_x("gold:Q") |>
  vl_encode_y("country:N") |>
  vl_add_data(medals_long)

alt.Chart(medals_long).transform_pivot(
    "medal",
    groupby=["country"],
    value="count"
).mark_bar(fill = "gold").encode(
    x="gold:Q",
    y="country:N",
)

{ ...,
  "transform": [
    {
      "groupby": [ "country" ],
      "pivot": "medal",
      "value": "count"
    }
  ], ...
}

Another example: Faceting via repeat

vegabrite
Altair

vl_chart() |>
  vl_pivot("medal", groupby = list('country'), value = "count") |>
  vl_repeat_col("gold", "silver") |>
  vl_mark_bar() |>
  vl_encode_x(field = list(`repeat` = "column"), type = "quantitative") |>
  vl_encode_y("country:N") |>
  vl_add_data(medals_long)

alt.Chart(medals_long).transform_pivot(
    "medal",
    groupby=["country"],
    value="count"
).mark_bar().encode(
    alt.X(field = alt.repeat("column"), type = "quantitative"),
    y="country:N",
).repeat(
  column = ["gold", "silver"]
)

Filter and facets

We could also do this with filter and facets.

vegabrite
Altair

vl_chart() |>
  vl_mark_bar() |>
  vl_filter("datum.medal == 'gold' | datum.medal == 'silver'") |>
  vl_encode_column("medal") |>
  vl_encode_x("count:Q") |>
  vl_encode_y("country:N") |>
  vl_add_data(medals_long)

Your turn

Data URL: <“https://cdn.jsdelivr.net/npm/vega-datasets@2.8.0/data/jobs.json”>

Exercise 7 Use these data to create some visualizations

Scatterplot showing the percent of men and women in various occupations 1950 and 2000.
- Add a tooltip
- Bonus: Add interaction to select the two years.
Time series plot showing the percent of men and women in different occupations over time.
- Start with all the occupations, then decide if you want to restrict your attention to just some occupations.
Create additional graphics with these data.

Examples

Joins: Using data from multiple sources

Some data sets

Each of the following data sets contains two columns. The first column is country and the second is one of the following.

The data come from the CIA World Factbook and are 10-20 years old.

Example use case: a scatter plot

Suppose you want to make a scatter plot for two of these measures.

Problem: The variables are in different data sets.
Solution: Join the data sets together so we have a data set with 3 columns: country and two of these measures, like this:

country obesity GDP

American Samoa 74.6 575300000

Nauru 71.1 6e+07

Cook Islands 63.7 183200000

Joining two data sets

Let’s call them left and right because one will be written first (left) and the other second (right) when we join them in code.

Once we have selected the data sets, we still have two decisions to make:

What column(s) will be used to identify rows that “match”?
What will we do with rows in one data set that don’t have any matches in the other data set?

Types of joins

The most important joins for us are:

inner join: keep only rows that have a match
left (outer) join: keep all rows from left, fill in with NA when there is no match in right
right (outer) join: keep all rows from right, fill in with NA when there is no match in left
full (outer) join: include all rows from both, filling in NA when things don’t match

For data visualization purpposes, a left join is very common.

only one implemented in Vega-Lite, where it is called a “lookup” transformation.
can be used to “lookup” additional information to add to your graphic.

Visualizing Joins

This page has some nice annimations of join operations.

This shiny app includes a nice visualization of how joins work.

Implementing a join

R/tidyverse
vegabrite
altair
Altair
Vega-Lite

country	obesity	GDP
American Samoa	74.6	575300000
Nauru	71.1	6e+07
Cook Islands	63.7	183200000

By the way:

by can be a vector of column names
If the names don’t match, we can use by = c("left name" = "right name")
If you don’t specify by, then all columns with matching names are used.

Your turn

Exercise 8 Make the following modifications to the obesity vs GDP plot.

Add a tooltip so you can tell which contry is which.
Use a logrithmic scale where appropriate.
Format the GDP axis using “.0s” as the format string.
Color or facet the points by continent. (You will need to use this data or something similar to get the region information.

You might notice something unexpected in this plot. What is it? Why is it happening? What could be done to fix it?

Exercise 9 Create a graphic using a different pair of variables.

Common
Data Transformations
for Graphing