SAS and R

Monday, August 27, 2018

Project MOSAIC migrates to ggformula

guest entry by Randall Pruim

In 2017, Project MOSAIC announced ggformula, a new package that provides a formula interface to ggplot2 graphics in R. (See, for example, ggformula: another option for teaching graphics in R to beginners.) This package provides a happy medium between lattice and ggplot2 that allows beginners to “do powerful things quickly” by adopting the formula interface of lattice and R’s statistical modeling functions as a means to produce ggplot2 graphics.

Over the past year, our experience with ggformula in our classes and in faculty development workshops, together with the feedback we have received from other users, has demonstrated ggformula to be flexible yet easy to learn. As part of an ecosystem that emphasizes the formula interface of lattice and the core R statistical modeling functions early on and adds tidyverse concepts later, ggformula fits better with the rest of our toolkit than does either lattice or ggplot2, providing opportunities for more creativity with less volume.

The recent releases of several Project MOSAIC R packages (mosaic, mosaicData, mosaicCore, and ggformula) and the related fastR2 package mark the official migration of Project MOSAIC from lattice to ggformula as its primary graphics system. Future development includes plans to release an updated version of mosaicModel, which will interoperate with ggformula, and a new package called ggformulaExtra (currently only available via GitHub), which adds functionality but relies on additional packages beyond ggplot2.

Many of the recent changes to the Project MOSAIC suite of packages will go largely unnoticed by most users but were necessary to allow ggformula to interoperate with the newest version of ggplot2. Among the small number of more noticeable changes are a change in gf_smooth() so that it no longer displays confidence bands by default (use se = TRUE to turn them on), expanded support for “rugs”, support for horizontal versions of histograms, boxplots, and violin plots (using the ggstance package), and the addition of gf_sf() for improved support for choropleth maps (based on the new geom_sf() in ggplot2). Along the way, we also did some light housekeeping (improving documentation, etc.) and migrated most of our package examples from lattice to ggformula.
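The gf_smooth() change described above can be seen in a small sketch. It assumes a current installation of mosaic (which attaches ggformula and the KidsFeet data), and the object names base, no_band, and with_band are chosen here only for illustration.

```r
library(mosaic)   # attaches ggformula and the KidsFeet data (via mosaicData)

# A scatter plot to which a smoother is added in two different ways
base <- gf_point(length ~ width, data = KidsFeet)

# Default behavior described in the post: smooth curve, no confidence band
no_band <- base %>% gf_smooth()

# Confidence band requested explicitly with se = TRUE
with_band <- base %>% gf_smooth(se = TRUE)

no_band
with_band
```

Printing the two objects one after the other makes it easy to check which behavior an installed version of ggformula uses.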

The basic form of the formula interface is

goal(y ~ x, data = myData)

which corresponds to SAS code like

PROC GOAL DATA = MYDATA; MODEL Y = X; RUN;

goal() can be replaced by a graphing function (e.g., gf_point()) or a modeling function (e.g., lm()), with the number of variables in the formula varying with the complexity of the plot or model desired.

library(mosaic)              # load the mosaic package (and ggformula)
gf_point(length ~ width, data = KidsFeet)                  # scatter plot 
      lm(length ~ width, data = KidsFeet) %>% msummary()   # linear model

##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.8172     2.9381   3.341  0.00192 ** 
## width         1.6576     0.3262   5.081  1.1e-05 ***
## 
## Residual standard error: 1.025 on 37 degrees of freedom
## Multiple R-squared:  0.411,  Adjusted R-squared:  0.3951 
## F-statistic: 25.82 on 1 and 37 DF,  p-value: 1.097e-05
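As a further sketch of how the formula grows with the complexity of the plot or model, the example below adds the sex variable from KidsFeet to both calls. The choice of sex as the extra variable and the names p and fit are illustrative, not taken from the original post.

```r
library(mosaic)   # attaches ggformula and the KidsFeet data (via mosaicData)

# Plot: a condition after | produces one panel (facet) per level of sex
p <- gf_point(length ~ width | sex, data = KidsFeet)

# Model: the same variable enters the formula as a second predictor
fit <- lm(length ~ width + sex, data = KidsFeet)

p
msummary(fit)
```

In both calls the response stays on the left of the tilde and the data argument is unchanged; only the right-hand side of the formula differs.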

Users of lattice-based Project MOSAIC materials should have little trouble migrating to ggformula since the types of plots that were easiest to construct with lattice can be created very similarly using ggformula. For example, the following two commands are essentially equivalent (although the resulting plots have a different appearance).

    histogram( ~ age | sex, data = HELPrct,    width = 2, col  = "navy")
gf_dhistogram( ~ age | sex, data = HELPrct, binwidth = 2, fill = "navy")

It is much simpler, however, to create complex plots using ggformula because multiple layers can be stacked using the magrittr pipe (%>%, which we often read as “then”) familiar to users of the tidyverse suite of packages (and many others as well).

gf_jitter(Sepal.Length ~ Sepal.Width, data = iris, color = ~ Species) %>%
  gf_density2d(alpha = 0.4) %>%
  gf_jitter(geom = "rug", alpha = 0.7) %>%
  gf_lm(linetype = "dashed") %>%
  gf_refine(scale_color_brewer(type = "qual"))

As part of the migration to ggformula, a number of related resources have been or are being converted from lattice to ggformula as well. These include companion volumes for several popular statistics text books, our series of “Little Books”, the Minimal R Vignette, and a side-by-side comparison of lattice and ggformula. In addition, the second edition of Foundations and Applications of Statistics (Pruim, 2018) uses ggformula throughout.

An eventual migration from ggformula to native ggplot2, while not strictly necessary (since the same plots can be made in either system), is easier than the migration from lattice since the underlying grammar and much of the nomenclature of ggformula are borrowed from ggplot2. In the meantime, equivalent ggformula code is generally less verbose and simpler for novices to understand and produce. And the use of %>% for layering avoids the errors that creep in when moving between tidyverse, which also uses %>%, and ggplot2, which uses +. Indeed, data flows can be directed seamlessly into ggformula plotting commands. This can be useful as a debugging step when creating data pipelines or as a way to create a plot for which there is no need to save the pre-processed data.

Galton %>%
  filter(sex == "M") %>%      # keep only male adult children
  group_by(family) %>%        # within each family ...
  sample_n(1) %>%             # ... choose one male at random
  ungroup() %>%               # drop the grouping before standardizing
  mutate(                     # compute z-scores for parents' heights
    zfather = round(mosaic::zscore(father), 2),
    zmother = round(mosaic::zscore(mother), 2)
  ) %>% 
  gf_jitter(zfather ~ zmother, alpha = 0.5, 
            title = "Standardized heights of parents",
            caption = "Source: Galton") %>%
  gf_lm()
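To make the comparison with native ggplot2 concrete, here is a minimal sketch of one plot written both ways. The plot itself (KidsFeet, colored by sex, with a least-squares line per group) and the names p_gf and p_gg are chosen for illustration. The two versions are intended to be roughly equivalent, not guaranteed to be identical in every default.

```r
library(mosaic)   # attaches ggformula, ggplot2, and the KidsFeet data

# ggformula: variables given as a formula, layers chained with %>%
p_gf <- gf_point(length ~ width, data = KidsFeet, color = ~ sex) %>%
  gf_smooth(method = "lm")

# native ggplot2: variables mapped with aes(), layers added with +
p_gg <- ggplot(KidsFeet, aes(x = width, y = length, color = sex)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

p_gf
p_gg
```

Because both objects are ggplot2 plots, the geoms, scales, and themes learned in ggformula carry over directly when a student later moves to aes() and +.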

It has been over a year since I have used either lattice or ggplot2 for anything other than comparison examples. My co-authors and I have found the switch from lattice to ggformula to be both straightforward (for us) and advantageous (for our students). We encourage you to give it a try in your own work and with your students.

