Common
Data Transformations
for Graphing

Data 304

Common Data Transformations

  1. Subsetting rows and/or columns
  2. Computing new variables, possibly within groups (aggregation)
  3. Wide vs long format
  4. Merging multiple data sources

We’ll focus on 3 and 4 since we have already done 1 and 2 in Vega-Lite.

Options

  1. R/tidyverse
  2. Python/pandas
  3. Vega-lite
  4. vegabrite
  5. altair/Altair
  6. SQL
  7. etc, etc.

Similar concepts, but some annoying inconsistency in naming things.

R vs Python vs SQL

We’ll focus on R/tidyverse rather than Python here because most of you have been doing your work in R rather than Ptyhon. But…

Python/pandas is mimicking R/tidyverse, which mimicked SQL

Python has begun mimicking R/tidyverse recently.

  • Example: group_by() |> summarise() \(\to\) .groupby().agg()

The basic ideas (and often the names) go back to SQL and its operators.

A nice comparison of R vs Python code for similar tasks can be found here.

Where to do transformations

Main advantages to data transformation inside Veag-Lite:

  1. If we transform in R/Python, we need to either include the data in our JSON (making it potentially large) or save the data and tell Vega-Lite where to find it. No need for either if we transform in Vega-Lite.

  2. We may want to use composition (concat, layers, repeat) with different transforamtions.

Main advantages to transforming in R or Python:

  1. Easier to inspect the data to make sure transformation is working.

  2. Some complicated data transformations are simpler to implement.

  3. Can use the same toolkit whether transforming for a graphic or for some other reason.

Subsetting Rows

image/svg+xml

  • R tidyverse: dplyr::filter()
  • Python/pandas: .query()
  • Vega-Lite: "transform": [{"filter": ...}, ...]
  • vegabrite: vl_filter()
  • Altair: Chart.transform_filter()
  • altair: Chart$transform_filter()

Subsetting Columns

image/svg+xml

  • R/tidyverse: dplyr::select()
  • Python/pandas: .loc()

Creating new variables (columns)

  • R/tidyverse: dplyr::mutate()
  • Pyhon/pandas: .assign()
  • Vega-Lite: "transform": [{"calculate": ...}, ...]
  • vegabrite: vl_calculate()
  • Altair: Chart.transform_calculate()
  • altair: Chart$transform_calculate()

Wide vs Long

Converting between wide and long forms of the same data is a commmon wrangling step for both analysis and visualization.

Exercise 1 For what sorts of visualization would each of these formats be better?

Wide \(\to\) Long

Some wide data

URL: https://calvin-data304.netlify.app/data/medals-wide.csv

Code
library(dplyr)
library(tidyr)
medals_wide <- readr::read_csv("https://calvin-data304.netlify.app/data/medals-wide.csv")
medals_wide |> pander::pander()
country gold silver bronze
Norway 14 14 11
Germany 14 10 7
Canada 11 8 10

Exercise 2 What must we specify to convert this to long format?

Solution 1.

  • Which variables (and their values) are being “tipped”
  • Which are to be repeated (“id” variables)
  • How to name things in the result

Converting: wide \(\to\) long

  • R: dplyr::pivot_longer()
  • Python: pd.wide_to_long()
  • Vega-Lite: {"transform": [{"fold": ... }, ...]
  • vegabrite: vl_fold()
  • Altair: Chart.transform_fold()
  • altair: Chart$transform_fold()

Converting: wide \(\to\) long

library(tidyr)
medals_wide |>
  pivot_longer(gold:bronze, names_to = "medal", values_to = "count") |>
  pander::pander()
country medal count
Norway gold 14
Norway silver 14
Norway bronze 11
Germany gold 14
Germany silver 10
Germany bronze 7
Canada gold 11
Canada silver 8
Canada bronze 10
library(vegabrite)

vb1 <- vl_chart() |>
  vl_fold(
    c("bronze", "silver", "gold"),
    as = c("medal", "count"))  |>   
  vl_mark_line(point = TRUE) |>      # Note: shortcut for line + point!
  vl_encode_x("medal:O") |>
  vl_encode_y("count:Q") |>
  vl_encode_color("country:N") |>
  vl_add_data(medals_wide) |> 
  vl_add_properties(width = 600, height = 150)
vb1 
library(altair)
chart1 <- alt$Chart(medals_wide)$
  transform_fold(
    c("bronze", "silver", "gold"),
    as_ = c("medal", "count")             # Note: underscore
  )$
  mark_line(point = TRUE)$
  encode(                        # Note: shortcut for line + point!
    x = "medal:O",
    y = "count:Q",
    color = "country:N")

chart1$properties(width = 600, height = 150)
import altair as alt
chart1 = alt.Chart(medals_wide).transform_fold(
  ["bronze", "silver", "gold"],
  as_ = ["medal", "count"]             # Note: underscore
  ).mark_line(point = True).encode(    # Note: shortcut for line + point!
    alt.X(field = "medal", type = "ordinal"),
    alt.Y(field = "count", type = "quantitative"),
    alt.Color(field = "country", type = "nominal")
    )

chart1.properties(width = 600, height = 150)

Hmm…

Exercise 3 What is bad about this graphic?

Repairing the order of the medals

Exercise 4 How do we change the display order to bronze, silver, gold?

Solution 2. Set the domain of the scale.

vb1 |>
  vl_scale_x(domain = c("bronze", "silver", "gold")) |>
  vl_scale_color(domain = c("Norway", "Germany", "Canada"))
chart2 = alt.Chart(medals_wide).transform_fold(
  ["bronze", "silver", "gold"], as_ = ["medal", "count"] 
  ).mark_line(point = True).encode(
    alt.X(field = "medal", type = "ordinal").scale(
      domain = ["bronze", "silver", "gold"]),
    alt.Y(field = "count", type = "quantitative"),
    alt.Color(field = "country", type = "nominal")
    )

chart2.properties(width = 600, height = 150)

Let’s try again

Exercise 5 How else could we achieve this?

Solution 3. We could use sort instead. Sort also allows us to compute an order rather than specify it explicitly.

vb1 |>
  vl_sort_x(c("bronze", "silver", "gold")) |>
  vl_sort_color_by_field("count", op = "sum", order = "descending")
chart3 = alt.Chart(medals_wide).transform_fold(
  ["gold", "silver", "bronze"], as_ = ["medal", "count"]            
  ).mark_line(point = True).encode(
    alt.X("medal:O", sort = ["gold", "silver", "bronze"]),  # sort!
    alt.Y("count:Q"),
    alt.Color(
      "country:N", 
      sort = alt.EncodingSortField(
        field = "count", op = "sum", order = "descending") )
    )

chart3.properties(width = 600, height = 150)
print(chart3.to_json())
{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.20.1.json",
  "config": {
    "view": {
      "continuousHeight": 300,
      "continuousWidth": 300
    }
  },
  "data": {
    "name": "data-fe97747cf726de8a07e6c8ec7fd7c1de"
  },
  "datasets": {
    "data-fe97747cf726de8a07e6c8ec7fd7c1de": [
      {
        "bronze": 11,
        "country": "Norway",
        "gold": 14,
        "silver": 14
      },
      {
        "bronze": 7,
        "country": "Germany",
        "gold": 14,
        "silver": 10
      },
      {
        "bronze": 10,
        "country": "Canada",
        "gold": 11,
        "silver": 8
      }
    ]
  },
  "encoding": {
    "color": {
      "field": "country",
      "sort": {
        "field": "count",
        "op": "sum",
        "order": "descending"
      },
      "type": "nominal"
    },
    "x": {
      "field": "medal",
      "sort": [
        "gold",
        "silver",
        "bronze"
      ],
      "type": "ordinal"
    },
    "y": {
      "field": "count",
      "type": "quantitative"
    }
  },
  "mark": {
    "point": true,
    "type": "line"
  },
  "transform": [
    {
      "as": [
        "medal",
        "count"
      ],
      "fold": [
        "gold",
        "silver",
        "bronze"
      ]
    }
  ]
}

Native Vega-Lite for previous graphic

{
  ...
  "encoding": { 
    "x": { "field": "medal", "type": "ordinal" },
    "y": { "field": "count", "type": "quantitative" },
    "color": { "field": "country", "type": "nominal" } 
  },
  "mark": { "type": "line" },
  "transform": [
    {
      "as": [ "medal", "count" ],
      "fold": [ "gold", "silver", "bronze" ]
    }
  ]
}

Native Vega-Lite for setting scale’s domain

{ ...
  "encoding": {
    "x": {
      "field": "medal",
      "type": "ordinal"
      "scale": { "domain": [ "gold", "silver", "bronze" ]},
    ...
  }, ...
}

Native Vega-Lite for sort

{ ...,
  "encoding": { ..., 
    "x": {
      "field": "medal", "type": "ordinal",
      "sort": [ "gold", "silver", "bronze" ]
    }, ...
}

Simple Tooltips

Big ideas:

  • Tooltips are encodings, just like x, y, color, etc.
  • Provide array of fields to display in the tooltip.

Fancier tooltips are possible with more customization of what is displayed.

vb1 |>
  vl_encode_tooltip_array(c("count:Q", "medal:N", "country:N")) 
chart1.encode(
  alt.Tooltip(["medal:N", "country:N", "count:Q"])
)
chart1$encode(
  alt$Tooltip(c("medal:N", "country:N", "count:Q"))
)

Long \(\to\) Wide

Long data

URL: https://calvin-data304/data/medals-long.csv

country medal count
Norway gold 14
Norway silver 14
Norway bronze 11
Germany gold 14
Germany silver 10
Germany bronze 7
Canada gold 11
Canada silver 8
Canada bronze 10

Converting Long \(\to\) Wide

  • R: dplyr::pivot_wider()
  • Python: pd.pivot() or pd.pivot_table()
  • Vega-Lite: {"transform": [{"pivot": ...}]}
  • vegabrite: vl_pivot()
  • Altair: Chart.transform_pivot()
  • altair: Chart$transform_pivot()

Converting Long \(\to\) Wide

medals_long |>
  pivot_wider(names_from = "medal", values_from = "count") |>
  pander::pander()
country gold silver bronze
Norway 14 14 11
Germany 14 10 7
Canada 11 8 10
vl_chart() |>
  # note required use of list() here!
  vl_pivot(pivot = "medal", groupby = list("country"), value = "count") |>
  vl_mark_bar(fill = "gold") |>
  vl_encode_x("gold:Q") |>
  vl_encode_y("country:N") |>
  vl_add_data(medals_long) 
alt.Chart(medals_long).transform_pivot(
    "medal",
    groupby=["country"],
    value="count"
).mark_bar(fill = "gold").encode(
    x="gold:Q",
    y="country:N",
)
{ ...,
  "transform": [
    {
      "groupby": [ "country" ],
      "pivot": "medal",
      "value": "count"
    }
  ], ...
}

Another example: Faceting via repeat

vl_chart() |>
  vl_pivot("medal", groupby = list('country'), value = "count") |>
  vl_repeat_col("gold", "silver") |>
  vl_mark_bar() |>
  vl_encode_x(field = list(`repeat` = "column"), type = "quantitative") |>
  vl_encode_y("country:N") |>
  vl_add_data(medals_long) 
alt.Chart(medals_long).transform_pivot(
    "medal",
    groupby=["country"],
    value="count"
).mark_bar().encode(
    alt.X(field = alt.repeat("column"), type = "quantitative"),
    y="country:N",
).repeat(
  column = ["gold", "silver"]
)

Exercise 6 How else could we do something similar?

Filter and facets

We could also do this with filter and facets.

vl_chart() |>
  vl_mark_bar() |>
  vl_filter("datum.medal == 'gold' | datum.medal == 'silver'") |>
  vl_encode_column("medal") |>
  vl_encode_x("count:Q") |>
  vl_encode_y("country:N") |>
  vl_add_data(medals_long)

Your turn

Data URL: <“https://cdn.jsdelivr.net/npm/vega-datasets@2.8.0/data/jobs.json”>

Exercise 7 Use these data to create some visualizations

  • Scatterplot showing the percent of men and women in various occupations 1950 and 2000.

    • Add a tooltip
    • Bonus: Add interaction to select the two years.
  • Time series plot showing the percent of men and women in different occupations over time.

    • Start with all the occupations, then decide if you want to restrict your attention to just some occupations.
  • Create additional graphics with these data.

Examples

Joins: Using data from multiple sources

Some data sets

Each of the following data sets contains two columns. The first column is country and the second is one of the following.

The data come from the CIA World Factbook and are 10-20 years old.

Example use case: a scatter plot

Suppose you want to make a scatter plot for two of these measures.

  • Problem: The variables are in different data sets.

  • Solution: Join the data sets together so we have a data set with 3 columns: country and two of these measures, like this:

    country obesity GDP
    American Samoa 74.6 575300000
    Nauru 71.1 6e+07
    Cook Islands 63.7 183200000

Joining two data sets

Let’s call them left and right because one will be written first (left) and the other second (right) when we join them in code.

Once we have selected the data sets, we still have two decisions to make:

  1. What column(s) will be used to identify rows that “match”?

  2. What will we do with rows in one data set that don’t have any matches in the other data set?

Types of joins

The most important joins for us are:

  1. inner join: keep only rows that have a match
  2. left (outer) join: keep all rows from left, fill in with NA when there is no match in right
  3. right (outer) join: keep all rows from right, fill in with NA when there is no match in left
  4. full (outer) join: include all rows from both, filling in NA when things don’t match

For data visualization purpposes, a left join is very common.

  • only one implemented in Vega-Lite, where it is called a “lookup” transformation.
  • can be used to “lookup” additional information to add to your graphic.

Visualizing Joins

This page has some nice annimations of join operations.

This shiny app includes a nice visualization of how joins work.

Implementing a join

country obesity GDP
American Samoa 74.6 575300000
Nauru 71.1 6e+07
Cook Islands 63.7 183200000

By the way:

  • by can be a vector of column names
  • If the names don’t match, we can use by = c("left name" = "right name")
  • If you don’t specify by, then all columns with matching names are used.

See also <

vb_countries <-
  vl_chart() |>
  vl_lookup(
    lookup = "country",    # by-variable in primary data
    from = list(
      data = list(url = "https://calvin-data304.netlify.app/data/GDP.csv"),
      key = "country",
      fields = list("GDP")
    )) |>
  vl_mark_point() |>
  vl_encode_x("GDP:Q") |>
  vl_encode_y("obesity:Q") |>
  vl_add_properties(width = 600, height = 150) |>
  vl_add_data_url("https://calvin-data304.netlify.app/data/obesity.csv")
  
vb_countries
data_url <- "https://calvin-data304.netlify.app/data/obesity.csv"
lookup_url <- "https://calvin-data304.netlify.app/data/GDP.csv"

alt$Chart(data_url)$
  transform_lookup(
    lookup = "country",    # by-variable in primary data
    from_ = alt$LookupData(
      data = list(url = lookup_url),
      key = "country",
      fields = list("GDP")
    ))$
  mark_point(color = "purple")$
  encode(
    x = "GDP:Q",
    y = "obesity:Q"
    )$
  properties(width = 600, height = 150) 

See also https://altair-viz.github.io/user_guide/transform/lookup.html

data_url = "https://calvin-data304.netlify.app/data/obesity.csv"
lookup_url = "https://calvin-data304.netlify.app/data/GDP.csv"
(
alt.Chart(data_url)
  .transform_lookup(
    lookup = "country",    # by-variable in primary data
    from_ = alt.LookupData(
      data = {"url": lookup_url},
      key = "country",
      fields = ["GDP"]
    ))
  .mark_point(color = "red") 
  .encode(
    x = "GDP:Q",
    y = "obesity:Q"
    )
  .properties(width = 600, height = 150) 
)
vb_countries |> format()
{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "transform": [
    {
      "lookup": "country",
      "from": {
        "data": {
          "url": "https://calvin-data304.netlify.app/data/GDP.csv"
        },
        "key": "country",
        "fields": [
          "GDP"
        ]
      }
    }
  ],
  "mark": {
    "type": "point"
  },
  "encoding": {
    "x": {
      "field": "GDP",
      "type": "quantitative"
    },
    "y": {
      "field": "obesity",
      "type": "quantitative"
    }
  },
  "height": 150,
  "width": 600,
  "data": {
    "url": "https://calvin-data304.netlify.app/data/obesity.csv"
  }
} 

Your turn

Exercise 8 Make the following modifications to the obesity vs GDP plot.

  1. Add a tooltip so you can tell which contry is which.

  2. Use a logrithmic scale where appropriate.

  3. Format the GDP axis using “.0s” as the format string.

  4. Color or facet the points by continent. (You will need to use this data or something similar to get the region information.

    You might notice something unexpected in this plot. What is it? Why is it happening? What could be done to fix it?

Exercise 9 Create a graphic using a different pair of variables.