I use a lot of R functions that don’t each warrant their own package. I’m also not confident that I’m implementing these in the best possible way, or in a way that hasn’t already been implemented in some package that I just don’t know about.
But since I don’t package them, I end up just forgetting about them until the next time I need them, at which point I rewrite them but with subtle variations. This is confusing and bad, so I’m now putting them together into a package.
I can’t recommend doing this highly enough. Having a set of tools that implements recurring operations has really sped up my ability to do quick data analysis. Different people working in different environments will have different recurring problems, and so the most popular packages probably won’t have stuff that’s perfectly tailored for your workflow. It’s also just fun, if you like to program.
Here are the few things I’ve contributed to this package thus far.
Over the years I’ve decided that there is in fact a best ggplot2 theme, and that is theme_few from ggthemes.
library(dplyr)    # for %>%
library(ggplot2)
library(scales)   # for label_number() and cut_si()

penguins <- palmerpenguins::penguins

p <- penguins %>%
  ggplot(aes(x = bill_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(title = 'Penguin body mass by bill length',
       subtitle = 'Separated by species',
       caption = "Dataset: https://github.com/allisonhorst/palmerpenguins.",
       x = 'Bill Length',
       y = 'Body Mass',
       color = 'Species') +
  scale_x_continuous(labels = label_number(scale_cut = cut_si('mm'))) +
  scale_y_continuous(labels = label_number(scale_cut = cut_si('g')))

p + ggthemes::theme_few()
However, there are a few recurring changes that I tend to make to this. First, I like a bold title. Second, although I like the appearance of no grid lines, sometimes it’s not appropriate. Depending on the visualization, you might want a full grid, or you might want only horizontal or vertical grid lines. So I have a theme, theme1, which implements this, with the first argument specifying which grid direction to apply.
p + theme1() # default, prints the whole grid
p + theme1(grid_type = 'horizontal') # I find this mainly useful for time series
p + theme1(grid_type = 'vertical') # I don't really ever use this, but included for completeness
p + theme1(grid_type = 'none')
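For anyone who wants to build something similar, here is a minimal sketch of how a theme like this could be put together on top of theme_few. It is not the actual theme1 source: the name theme1_sketch, the 'full' default value, and the grid colour are assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical re-creation of a theme1-like helper (not the real theme1).
theme1_sketch <- function(grid_type = c("full", "horizontal", "vertical", "none")) {
  # match.arg() validates the choice and allows abbreviations such as "h"
  grid_type <- match.arg(grid_type)

  grid_line <- element_line(colour = "grey85")  # assumed grid colour
  blank <- element_blank()

  ggthemes::theme_few() +
    theme(
      # bold title on top of theme_few
      plot.title = element_text(face = "bold"),
      # vertical grid lines sit at the x breaks, horizontal ones at the y breaks
      panel.grid.major.x = if (grid_type %in% c("full", "vertical")) grid_line else blank,
      panel.grid.major.y = if (grid_type %in% c("full", "horizontal")) grid_line else blank,
      panel.grid.minor = blank
    )
}
```

It would be used the same way as above, for example p + theme1_sketch('horizontal').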
Annotating key points directly on a plot is underrated. I’ve often found that labeling an important point on a plot is so much more effective than trying to explain to people what they’re supposed to be looking for. For that reason, I started developing some functions to make these annotations quickly.
These are in a set of functions called stat_annotate_*. They’re very straightforward to use, and I’ve already gotten a lot of mileage out of them in the short time since I actually sat down and coded them.
p + theme1() +
  stat_annotate_max() +
  labs(subtitle = 'Separated by species. Highest weights by group are indicated.')
All of the usual text attributes can be passed to these functions, and an argument called labeler provides a labeling function to the annotation.
p + theme1() +
  stat_annotate_max(vjust = .5,
                    size = 5,
                    nudge_x = 1,
                    family = 'serif',
                    fontface = 'bold.italic',
                    labeler = label_number(scale_cut = cut_si("g"), accuracy = .01),
                    geom = 'label') +
  labs(subtitle = 'Separated by species. Highest weights by group are indicated.')
Since it’s implemented as a stat_ function, you can also use any other geom, including ones which aren’t text based, which is sometimes useful as well.
p + theme1() +
  stat_annotate_max(geom = "point", size = 4) +
  labs(subtitle = 'Separated by species. Highest weight by species is marked by a larger point.')
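To show why the stat_ approach makes the geom swappable, here is a rough sketch of how a stat like this could be written with ggproto. This is an illustration, not the real stat_annotate_max: the names StatMaxSketch and stat_max_sketch are made up, and the sketch skips conveniences such as nudge_x.

```r
library(ggplot2)

# Hypothetical stat: reduces each group to the single row with the largest y.
StatMaxSketch <- ggproto("StatMaxSketch", Stat,
  required_aes = c("x", "y"),
  compute_group = function(data, scales, labeler = identity) {
    top <- data[which.max(data$y), , drop = FALSE]
    # The computed label column is what text-based geoms pick up;
    # geoms that do not use a label (e.g. "point") simply ignore it.
    top$label <- labeler(top$y)
    top
  }
)

# Hypothetical layer constructor wrapping the stat.
stat_max_sketch <- function(mapping = NULL, data = NULL, geom = "text",
                            position = "identity", ..., labeler = identity,
                            na.rm = FALSE, show.legend = FALSE,
                            inherit.aes = TRUE) {
  layer(
    stat = StatMaxSketch, data = data, mapping = mapping, geom = geom,
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(labeler = labeler, na.rm = na.rm, ...)
  )
}
```

Because the stat only decides which rows survive and what their label is, the drawing is left entirely to whichever geom is passed in, so p + stat_max_sketch(geom = "point", size = 4) works just as well as the default text layer.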
Right now there are 4 stat_annotate_ functions implemented. The last two are mostly useful for time series.
labeler <- label_number(scale_cut = cut_short_scale(), accuracy = .1)

ggplot2::economics %>%
  ggplot(aes(x = date, y = unemploy)) +
  geom_line() +
  stat_annotate_first(vjust = -.1, hjust = 1, labeler = labeler) +
  stat_annotate_last(hjust = 0, vjust = 1, labeler = labeler) +
  stat_annotate_max(labeler = labeler, hjust = 1.1) +
  theme1('h')
Eventually I want to build in some smarter logic for placing the labels so that they don’t overlap with the chart too much, but for now it’s not too hard to just mess with the nudge_y parameters to place them where you want.
Slicing and Dicing
Something I find myself frequently needing to do is to take a dataset with categorical variables and filter it to include only the categories that occur the most frequently. For example, maybe I want to look at only the island with the most observations from the penguins dataset. Enter slice_top_categories.
penguins %>% slice_top_categories(1, island)
## # A tibble: 168 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Biscoe           37.8          18.3               174        3400
##  2 Adelie  Biscoe           37.7          18.7               180        3600
##  3 Adelie  Biscoe           35.9          19.2               189        3800
##  4 Adelie  Biscoe           38.2          18.1               185        3950
##  5 Adelie  Biscoe           38.8          17.2               180        3800
##  6 Adelie  Biscoe           35.3          18.9               187        3800
##  7 Adelie  Biscoe           40.6          18.6               183        3550
##  8 Adelie  Biscoe           40.5          17.9               187        3200
##  9 Adelie  Biscoe           37.9          18.6               172        3150
## 10 Adelie  Biscoe           40.5          18.9               180        3950
## # … with 158 more rows, and 2 more variables: sex <fct>, year <int>
There is an argument for specifying a weight variable to use in the counts as well. So if I wanted the “heaviest” species, I could go like this.
penguins %>% slice_top_categories(1, species, .wt = body_mass_g)
## # A tibble: 124 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Gentoo  Biscoe           46.1          13.2               211        4500
##  2 Gentoo  Biscoe           50            16.3               230        5700
##  3 Gentoo  Biscoe           48.7          14.1               210        4450
##  4 Gentoo  Biscoe           50            15.2               218        5700
##  5 Gentoo  Biscoe           47.6          14.5               215        5400
##  6 Gentoo  Biscoe           46.5          13.5               210        4550
##  7 Gentoo  Biscoe           45.4          14.6               211        4800
##  8 Gentoo  Biscoe           46.7          15.3               219        5200
##  9 Gentoo  Biscoe           43.3          13.4               209        4400
## 10 Gentoo  Biscoe           46.8          15.4               215        5150
## # … with 114 more rows, and 2 more variables: sex <fct>, year <int>
I can use this for arbitrarily many columns as well, so if I wanted the 2 most frequently occurring (species, sex) pairs weighted by body mass, I could go like this.
penguins %>% slice_top_categories(2, species, sex, .wt = body_mass_g)
## # A tibble: 134 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.3          20.6               190        3650
##  3 Adelie  Torgersen           39.2          19.6               195        4675
##  4 Adelie  Torgersen           38.6          21.2               191        3800
##  5 Adelie  Torgersen           34.6          21.1               198        4400
##  6 Adelie  Torgersen           42.5          20.7               197        4500
##  7 Adelie  Torgersen           46            21.5               194        4200
##  8 Adelie  Biscoe              37.7          18.7               180        3600
##  9 Adelie  Biscoe              38.2          18.1               185        3950
## 10 Adelie  Biscoe              38.8          17.2               180        3800
## # … with 124 more rows, and 2 more variables: sex <fct>, year <int>
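The idea behind this kind of filter is simple enough to sketch with plain dplyr: count the category combinations (optionally weighted), keep the top n, and semi-join back to the original rows. The function below is a hypothetical stand-in named slice_top_sketch, written only to illustrate the approach; the real slice_top_categories may well be implemented differently, and details such as tie handling are assumptions.

```r
library(dplyr)

# Hypothetical sketch of a slice_top_categories-style helper.
slice_top_sketch <- function(.data, n, ..., .wt = NULL) {
  top <- .data %>%
    # row counts per category combination, or sums of .wt when it is given
    count(..., wt = {{ .wt }}, name = ".total") %>%
    # keep exactly n combinations (ties broken arbitrarily; an assumption)
    slice_max(.total, n = n, with_ties = FALSE) %>%
    select(-.total)

  # keep only the original rows whose categories are among the top n
  semi_join(.data, top, by = names(top))
}

# Small made-up data frame to show the difference the weight makes
df <- data.frame(g = c("a", "a", "a", "b", "b", "c"),
                 w = c(1, 1, 1, 10, 10, 2))

slice_top_sketch(df, 1, g)           # keeps the three "a" rows
slice_top_sketch(df, 1, g, .wt = w)  # keeps the two "b" rows
```

Using semi_join here means the original rows and columns come back untouched, which matches the way the outputs above still show all 8 columns.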
That’s all I’ve got for now, but this handful of functions has already gotten a ton of mileage in my day-to-day work.