HW 4: Reading, Drawing, and Vega-Lite

Submit your work for this assignment using this form

Exercise 1 Read pages 51-67 of Knaflic (2020). Then pause.

Exercise 2 Do Exercise 2.3 on page 68. You will need some blank paper (or an e-tablet) and your phone to use as a 10-minute timer. Follow the directions and save your work: you will be turning it in, and you will need it for something else.

Exercise 3 Compare your sketches to the sketches on page 69. Did you come up with any similar graphics? Where do your ideas differ? Which do you like best out of the full group (yours plus the ones in the book)?

Exercise 4 Which of these graphics (yours and the ones from the book) would you know how to make with Vega-Lite right now? For the ones you wouldn’t be able to make yet, identify the features that we still need to learn.

Exercise 5 Pick one graphic that you can make in Vega-Lite (your choice) and make it. Access the data using one of these links: CSV or JSON.

Exercise 6 In 2024 I received an email from a colleague at another institution:

Students in the biology department have been running a study on identical twins, testing the different DNA kits that are on the market today. In theory, the DNA kits of identical twins should yield identical results, but often don’t. The data they sent me is in the screenshot below.

I’ve copied the data to two CSV files (same data in two different formats) and two JSON files. You may use whichever makes your graphic easier to make.

Create two visualizations of this data, (a) one that draws attention to similarities/differences between twins and (b) one that draws attention to similarities/differences between DNA kits.

Data notes:

There are 12 people in this data set: 1A, 1B, 2A, 2B, …, 6A, 6B. 1A and 1B are twins, 2A and 2B are twins, etc. Three different genetics kits were used, and each reports a percentage associated with various world regions. Roughly, this is an estimate of what percentage of a person’s ancestors many generations back came from various parts of the world. Differences between kits could occur because one lab is more accurate than another, but could also occur if the labs based their estimates on different subsets of a person’s DNA. Some people are a mix of more regions, so they have more rows in the data sets. You can assume that any missing regions contribute 0.
The main difference between the two data formats is how kit is handled. In the long format there is a column named kit that contains the name of the kit, and a column named genetic share that contains the percentages described in the first note above. In the wide format, there is a column for each kit containing the relevant percentage. The long format has 3 times as many rows because there are 3 kits.
Converting between “wide” and “long” formats is a common step in preparing data for graphics because some graphics are easier to produce using one format, and some using the other. Since we haven’t yet discussed how to do these transformations, I’ve provided both formats. Use whichever works best for your graphics, possibly one format for one graphic and the other format for the other, depending on the graphics you choose to make.
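
For the curious, the relationship between the two formats can be sketched in a few lines of Python. This is only an illustration and is not needed for the assignment, since both formats are provided. The column names person and region, the kit labels "Kit 1" through "Kit 3", and all of the numbers are invented for the example; only the long-format column names kit and genetic share come from the notes above.

```python
# Hypothetical wide-format rows: one row per (person, region), one column per kit.
# Column names, kit labels, and numbers are invented for illustration; they are
# not taken from the course data files.
wide_rows = [
    {"person": "1A", "region": "Region X", "Kit 1": 60, "Kit 2": 55, "Kit 3": 58},
    {"person": "1A", "region": "Region Y", "Kit 1": 40, "Kit 2": 45, "Kit 3": 42},
]

KIT_COLUMNS = ["Kit 1", "Kit 2", "Kit 3"]


def wide_to_long(rows, kit_columns):
    """Turn each wide row into one long row per kit."""
    long_rows = []
    for row in rows:
        # Columns that identify the observation are copied unchanged.
        id_part = {k: v for k, v in row.items() if k not in kit_columns}
        for kit in kit_columns:
            # The kit's column name becomes a value in the "kit" column,
            # and the cell value moves to the "genetic share" column.
            long_rows.append({**id_part, "kit": kit, "genetic share": row[kit]})
    return long_rows


long_rows = wide_to_long(wide_rows, KIT_COLUMNS)
for r in long_rows:
    print(r)
```

Each wide row produces three long rows, one per kit, which is why the long file has 3 times as many rows as the wide one.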

Exercise 7 Do Exercise 2.12 from Knaflic (2020).

References

Knaflic, C. N. 2020. Storytelling with Data: Let’s Practice! Wiley. https://github.com/Saurav6789/Books-/blob/master/Storytelling%20with%20Data%20Let%E2%80%99s%20Practice%20by%20Cole%20Nussbaumer%20Knaflic%20(z-lib.org).pdf.