R and Python interfaces to Vega-Lite

Data 304: Visualizing Data and Models

Advantages to working in R or Python

  1. You probably already know it and have a workflow.

  2. Many more tools for data wrangling, modeling, etc.

  3. Syntax advantages, easier to reuse code.

  4. Some parts of the JSON creation can be automated (less typing).

  5. You are probably going to be using R or Python for other parts of your work anyway.

Some packages used in these slides

R

library(reticulate)  # to use R and Python in the same Quarto file
library(vegawidget)
library(vegabrite)
library(altair)

Python

import altair as alt
from vega_datasets import data

Three options for Vega-Lite in R

  1. vegawidget::as_vegaspec()

    Two ways to create the Vega-Lite specification:

    1. Write JSON as string

    2. Write “list-of-lists” version of JSON spec

  2. vegabrite

    • Most “R-like”, but unknown support level going forward
  3. altair

    • Built on top of Python package – code looks like Python translated to R.
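
The two routes under option 1 have a direct Python analogue, which may make the idea concrete: a Vega-Lite spec is either a JSON string or a nested structure that serializes to the same thing. This is only a sketch and is not vegawidget code. The field names come from the cars data used later in these slides, the data block is omitted for brevity, and the `$schema` URL assumes Vega-Lite version 5.

```python
import json

# Route 1: write the spec directly as a JSON string.
spec_json = """
{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "mark": "point",
  "encoding": {
    "x": {"field": "Horsepower", "type": "quantitative"},
    "y": {"field": "Miles_per_Gallon", "type": "quantitative"},
    "color": {"field": "Origin", "type": "nominal"}
  }
}
"""

# Route 2: write the same spec as nested containers
# (the Python counterpart of an R "list of lists").
spec_nested = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "mark": "point",
    "encoding": {
        "x": {"field": "Horsepower", "type": "quantitative"},
        "y": {"field": "Miles_per_Gallon", "type": "quantitative"},
        "color": {"field": "Origin", "type": "nominal"},
    },
}

# Both routes describe the same specification.
same_spec = json.loads(spec_json) == spec_nested
```

The nested form is usually easier to build and modify programmatically, since pieces of it can be stored in variables and reused.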

What is vegabrite?

From the vegabrite GitHub site:

The goal of vegabrite is to provide an R api for building up vega-lite specs… This package is still experimental but has a mostly complete interface for building out Vega-Lite specs and charts. There is still lots of room for improvement in terms of better error handling and warnings when making invalid specs… Much of the public API is auto-generated

An example

vega_data <- import_vega_data()

vl_chart(width = 300, height = 300) |>
  vl_mark_point() |>
  vl_encode_x("Horsepower:Q") |>
  vl_encode_y("Miles_per_Gallon:Q") |>
  vl_encode_color("Origin:N") |>
  vl_add_data(vega_data$cars())    # <--- Note the parens after cars!

Basic structure

Full list of functions

vl_ functions create/modify parts of Vega-Lite specification.

  • groups of functions with similar 3-part names add/modify components of spec.

    • vl_mark_<marktype>()
    • vl_encode_<channel>()
    • vl_sort_<channel>_by_encoding(), vl_sort_<channel>_by_field()
    • vl_scale_<channel>(), vl_legend_<channel>()
    • vl_axis_<x|y>(), vl_remove_axis_<x|y>()
    • vl_facet_<|row|col>()
    • vl_repeat_<layer|col|row|wrap>()
    • vl_config_<element to configure>() etc.
  • transform functions don’t use the word “transform”

    • vl_calculate(), vl_fold(), vl_lookup(), vl_aggregate_<channel>(), etc.
  • layering: +, vl_layer()

  • concatenation: |, &, vl_hconcat(), vl_vconcat(), vl_concat()

  • facets: vl_facet(), vl_facet_row(), vl_facet_column()

  • repeat: vl_repeat_layer(), vl_repeat_row(), vl_repeat_col(), vl_repeat_wrap()

  • config: vl_config_<thing to configure>() – lots of these

What is Vega-Altair (Python)?

From documentation:

Vega-Altair is a declarative statistical visualization library for Python, based on Vega and Vega-Lite.

It offers a powerful and concise grammar that enables you to quickly build a wide range of statistical visualizations.

From me:

It is a Pythonification of the Vega-Lite JSON specification.

  • Starting point: alt.Chart()
  • Uses method chaining (.) to add elements to the specification.
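
To see why method chaining maps so naturally onto a JSON specification, here is a toy sketch. `ToyChart` is a hypothetical class written for this illustration and is not how Altair is implemented; the data path is also made up. Each method returns a new object whose spec has one more piece filled in, so calls can be strung together with dots.

```python
import copy
import json


class ToyChart:
    """A toy stand-in for a chart object (hypothetical, NOT Altair's code)."""

    def __init__(self, data_url):
        # Start with a spec that only knows where the data lives.
        self.spec = {"data": {"url": data_url}}

    def _updated(self, **parts):
        # Return a modified copy, so every step in a chain is independent.
        new_chart = copy.deepcopy(self)
        new_chart.spec.update(parts)
        return new_chart

    def mark_point(self):
        return self._updated(mark="point")

    def encode(self, **channels):
        # Each keyword is a channel name, each value a field name.
        encoding = {ch: {"field": field} for ch, field in channels.items()}
        return self._updated(encoding=encoding)


base = ToyChart("data/cars.json")  # hypothetical path
chart = base.mark_point().encode(x="Horsepower", y="Miles_per_Gallon")

spec_text = json.dumps(chart.spec, indent=2)
```

Because every method returns a copy, `base` is left untouched and can be reused as the starting point for other charts.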

Simple Vega-Altair example

cars = data.cars()

# make the chart
alt.Chart(cars).mark_point().encode(
    x = 'Horsepower',
    y = 'Miles_per_Gallon',
    color = 'Origin')

What does interactive() do?

alt.Chart(cars).mark_point().encode(
    x = 'Horsepower',
    y = 'Miles_per_Gallon',
    color = 'Origin',
).interactive()

interactive([name, bind_x, bind_y]) | Make chart axes scales interactive.

We can inspect the JSON and see that it inserts

"params": [
  {
    "name": "param_1",
    "select": {"type": "interval", "encodings": ["x", "y"]},
    "bind": "scales"
  }
]

We can get the same effect in vegabrite by coding up this binding ourselves.

vl_chart(width = 300, height = 300) |>
  vl_mark_point() |>
  vl_encode_x("Horsepower:Q") |>
  vl_encode_y("Miles_per_Gallon:Q") |>
  vl_encode_color("Origin:N") |>
  vl_add_data(vega_data$cars()) |>
  vl_add_interval_selection(
    name = "param_1", select = "x,y", bind = "scales"
  )
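
The same "do it ourselves" idea can be sketched in plain Python, working directly on a spec stored as a dict. The helper `with_scale_binding()` is hypothetical, written only for this illustration; it adds the same `params` entry shown in the JSON above and does not claim to reproduce everything that `interactive()` does.

```python
import copy


def with_scale_binding(spec, name="param_1"):
    """Return a copy of spec with an interval selection bound to the scales.

    Hypothetical helper: it mimics only the params entry shown above.
    """
    new_spec = copy.deepcopy(spec)
    new_spec.setdefault("params", []).append(
        {
            "name": name,
            "select": {"type": "interval", "encodings": ["x", "y"]},
            "bind": "scales",
        }
    )
    return new_spec


static_spec = {
    "mark": "point",
    "encoding": {
        "x": {"field": "Horsepower", "type": "quantitative"},
        "y": {"field": "Miles_per_Gallon", "type": "quantitative"},
        "color": {"field": "Origin", "type": "nominal"},
    },
}

interactive_spec = with_scale_binding(static_spec)
```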

Python syntax and line breaks

The Python style guide (PEP 8) recommends using implicit line continuation. An implicit line continuation happens whenever Python gets to the end of a line of code and sees that there’s more to come because a parenthesis ((), square bracket ([) or curly brace ({) has been left open.

This is sometimes clunky for altair code (and for method chaining in general).

One trick: Enclose the whole thing in parens; break lines at methods.

(
    alt.Chart(cars)
    .mark_point()
    .encode(x = 'Horsepower', y = 'Miles_per_Gallon', color = 'Origin')
    .interactive()
)
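
The same two ideas work for any Python code, not just Altair. Here is a small sketch using only built-in string methods; the example string is made up from the column names used in these slides.

```python
text = "  Horsepower, Miles_per_Gallon, Origin  "

# Implicit continuation: the open square bracket tells Python that
# the expression is not finished at the end of the line.
fields = [
    name.strip()
    for name in text.split(",")
]

# The parenthesis trick: wrap the whole chain in parens,
# then start each new line with the next method call.
slug = (
    text
    .strip()
    .lower()
    .replace(", ", "-")
)
```

Without the outer parentheses, the line beginning with `.strip()` would be a syntax error, because Python would treat `text` as a complete statement.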

What is altair (R)?

This package uses reticulate to provide an interface to the Altair Python package, and the vegawidget package to render charts as htmlwidgets.

In other words: An R wrapper around the Python package.

Example in altair

Python

cars = data.cars()

# make the chart
alt.Chart(cars).mark_point().encode(
    x = 'Horsepower',
    y = 'Miles_per_Gallon',
    color = 'Origin')

R

alt$Chart(vega_data$cars())$
  mark_point()$
  encode(
    x = "Horsepower:Q",
    y = "Miles_per_Gallon:Q",
    color = "Origin:N"
  )$
  interactive()

Learning more

Which system should I use?

  • vegabrite
    • nicest R syntax
    • unclear how well it is supported
  • Altair/altair
    • very similar, so you learn a Python way and get an R way for free
    • Altair seems to be actively supported.
    • altair is derivative, so it needs less maintenance, but its support seems to lag a bit (e.g., the CRAN version uses an outdated version of Altair).

More Examples

Weather data

library(dplyr)
Weather <- mosaicData::Weather |> mutate (
  year = lubridate::year(date),
  month = lubridate::month(date),
  day = lubridate::day(date)
)
Weather |> head(3) |> pander::pander()
Table continues below
city      date        year  month  day  high_temp  avg_temp  low_temp
Auckland  2016-01-01  2016  1      1    68         65        62
Auckland  2016-01-02  2016  1      2    68         66        64
Auckland  2016-01-03  2016  1      3    77         72        66

Table continues below
high_dewpt  avg_dewpt  low_dewpt  high_humidity  avg_humidity
64          60         55         100            82
64          63         61         100            94
70          67         64         100            91

Table continues below
low_humidity  high_hg  avg_hg  low_hg  high_vis  avg_vis  low_vis
68            30.15    30.09   30.01   6         6        4
88            30.04    29.9    29.8    6         5        1
74            29.8     29.73   29.68   6         6        1

high_wind  avg_wind  low_wind  precip  events
21         15        28        0       Rain
33         21        46        0       Rain
18         12        NA        0       Rain

Weather = r.Weather    # import data from R session
Weather.head(3)
       city        date    year  month  ...  avg_wind  low_wind  precip  events
0  Auckland  2016-01-01  2016.0    1.0  ...      15.0      28.0       0    Rain
1  Auckland  2016-01-02  2016.0    1.0  ...      21.0      46.0       0    Rain
2  Auckland  2016-01-03  2016.0    1.0  ...      12.0       NaN       0    Rain

[3 rows x 25 columns]

It seems that vegabrite uses a different method to pass dates along to JSON.

In vegabrite, we can use Weather$date as is.

In altair, we need to remove that column, after extracting the year, month, and day, and then use a calculate transform to recreate the date from those. If we don’t, we get errors about JSON serialization of date/datetime objects.

From the Altair documentation:

Working with dates, times, and timezones is often one of the more challenging aspects of data analysis. In Altair, the difficulties are compounded by the fact that users are writing Python code, which outputs JSON-serialized timestamps, which are interpreted by Javascript, and then rendered by your browser. At each of these steps, there are things that can go wrong, but Altair and Vega-Lite do their best to ensure that dates are interpreted and visualized in a consistent way.
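
The kind of serialization error mentioned above is easy to reproduce with Python's standard `json` module. The sketch below is only an illustration of that class of error and of two common workarounds. It does not claim to show what reticulate or Altair do internally. The two rows are taken from the Weather table above, and `split_date()` is a hypothetical helper.

```python
import json
from datetime import date

rows = [
    {"city": "Auckland", "date": date(2016, 1, 1), "high_temp": 68},
    {"city": "Auckland", "date": date(2016, 1, 2), "high_temp": 68},
]

# The json module does not know how to serialize date objects.
try:
    json.dumps(rows)
    error_message = None
except TypeError as err:
    error_message = str(err)

# Workaround 1: convert each date to an ISO 8601 string first.
iso_rows = [{**row, "date": row["date"].isoformat()} for row in rows]
iso_json = json.dumps(iso_rows)


# Workaround 2 (the approach taken in these slides): drop the date
# and keep plain numbers for year, month, and day.
def split_date(row):
    d = row["date"]
    rest = {key: value for key, value in row.items() if key != "date"}
    return {**rest, "year": d.year, "month": d.month, "day": d.day}


part_rows = [split_date(row) for row in rows]
part_json = json.dumps(part_rows)
```

With the second workaround the date has to be rebuilt inside the chart specification, which is what the calculate transform is for.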

Weather graphic

Python

# Vega's datetime() expects a 0-based month (0 = January), so subtract 1.
Weather = r.Weather.drop('date', axis = 1)
(alt.Chart(Weather, width = 800, height = 55)
    .mark_area()
    .transform_calculate(date = "datetime(datum.year, datum.month - 1, datum.day)")
    .encode(
        x = alt.X("date:T", title = ""),
        y = alt.Y("high_temp:Q", title = "temperature"),
        y2 = "low_temp:Q",
        row = "city:N")
)

R (altair)

# Vega's datetime() expects a 0-based month (0 = January), so subtract 1.
alt$Chart(Weather |> select(-date), width = 800, height = 55)$
    mark_area()$
    transform_calculate(date = "datetime(datum.year, datum.month - 1, datum.day)")$
    encode(
        x = alt$X("date:T", title = ""),
        y = alt$Y("high_temp:Q", title = "temperature"),
        y2 = "low_temp:Q",
        row = "city:N")

R (vegabrite)

vl_chart(width = 800, height = 55) |>
  vl_mark_area() |>
  vl_encode_x("date:T", title = "") |>
  vl_encode_y("high_temp:Q", title = "temperature") |>
  vl_encode_y2("low_temp:Q") |>
  vl_facet_row("city:N", title = "") |>
  vl_add_data(Weather) |>
  vl_add_properties(title = "High and low temperatures in several cities")

Another example