colinlib - My own personal R package

I have a lot of R functions that I use regularly but that don’t each warrant their own package. I’m also not confident that I’ve implemented them in the best possible way, or that they don’t already exist in some package I just don’t know about.

But since I don’t package them, I end up just forgetting about them until the next time I need them, at which point I rewrite them but with subtle variations. This is confusing and bad, so I’m now putting them together into a package.

I can’t recommend doing this highly enough. Having a set of tools that implements recurring operations has really sped up my ability to do quick data analysis. Different people working in different environments will have different recurring problems, and so the most popular packages probably won’t have stuff that’s perfectly tailored for your workflow. It’s also just fun, if you like to program.

Here are the few things I’ve contributed to this package thus far.

Plotting

Theming

Over the years I’ve decided that there is in fact a best ggplot2 theme, and that is theme_few from ggthemes.

library(dplyr)    # provides the %>% pipe
library(ggplot2)
library(scales)   # label_number(), cut_si()
library(colinlib) # theme1(), stat_annotate_*(), slice_top_categories()

penguins <- palmerpenguins::penguins

# Base scatter plot that the rest of the post builds on
p <- penguins %>%
  ggplot(aes(x = bill_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(title = 'Penguin body mass by bill length',
       subtitle = 'Separated by species',
       caption = 'Dataset: https://github.com/allisonhorst/palmerpenguins.',
       x = 'Bill Length', y = 'Body Mass', color = 'Species') +
  scale_x_continuous(labels = label_number(scale_cut = cut_si('mm'))) +
  scale_y_continuous(labels = label_number(scale_cut = cut_si('g')))
p + ggthemes::theme_few()

However, there are a few recurring changes that I tend to make to it. First, I like a bold title. Second, although I like the look of no grid lines, it’s not always appropriate: depending on the visualization, you might want a full grid, or only horizontal or only vertical grid lines. So I have a theme, theme1, which implements this. Its first argument, grid_type, specifies which grid lines to draw.

p + theme1() # default, prints the whole grid

p + theme1(grid_type = 'horizontal') # I find this mainly useful for time series

p + theme1(grid_type = 'vertical') # I don't really ever use this, but included for completeness

p + theme1(grid_type = 'none')
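For a rough idea of what a theme like this involves, here is a minimal sketch. It is not the actual theme1 source: the name theme1_sketch, the 'both' default value, and the grey grid colour are assumptions made purely for illustration. It starts from theme_few, bolds the title, and switches the major grid lines back on per direction.

```r
library(ggplot2)

# Hypothetical re-creation of a theme1()-style theme; the real function may differ.
theme1_sketch <- function(grid_type = c("both", "horizontal", "vertical", "none")) {
  # match.arg() validates the choice and allows partial matches such as "h"
  grid_type <- match.arg(grid_type)

  grid_line <- element_line(colour = "grey85")  # assumed colour
  blank <- element_blank()

  ggthemes::theme_few() +
    theme(
      plot.title = element_text(face = "bold"),
      # vertical grid lines are drawn at the x-axis breaks
      panel.grid.major.x = if (grid_type %in% c("both", "vertical")) grid_line else blank,
      # horizontal grid lines are drawn at the y-axis breaks
      panel.grid.major.y = if (grid_type %in% c("both", "horizontal")) grid_line else blank
    )
}
```

It would be used the same way as the calls above, for example p + theme1_sketch('horizontal').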

Annotations

Annotating key points directly on a plot is underrated. I’ve often found that labeling an important point on a plot is so much more effective than trying to explain to people what they’re supposed to be looking for. For that reason, I started developing some functions to make these annotations quickly.

These are in a set of functions called stat_annotate_*. They’re very straightforward to use, and I’ve already gotten a lot of mileage out of them in the short time since I actually sat down and coded them.

p +
  theme1() + 
  stat_annotate_max() + 
  labs(subtitle = 'Separated by species. Highest weights by group are indicated.')

All of the usual text attributes can be passed to these functions, and the labeler argument takes a function that formats the annotated value into the label text.

p + theme1() + 
  stat_annotate_max(vjust = .5, size = 5, nudge_x = 1, 
                    family = 'serif', fontface = 'bold.italic',
                    labeler = label_number(scale_cut = cut_si("g"), accuracy = .01),
                    geom = 'label') + 
  labs(subtitle = 'Separated by species. Highest weights by group are indicated.')

Since it’s implemented as a stat_ function, you can also use any other geom, including ones that aren’t text based, which is sometimes useful as well.

p + theme1() + 
  stat_annotate_max(geom = "point", size = 4) +
  labs(subtitle = 'Separated by species. Highest weight by species is marked by a larger point.')
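To show why a stat makes this geom-agnostic, here is a minimal sketch of how such a function could be built with ggproto. It is an assumption about the general approach, not the package's actual code: StatAnnotateMaxSketch and stat_annotate_max_sketch are hypothetical names, and the sketch leaves out extras such as nudge_x handling. The stat keeps only the row with the largest y in each group and attaches a label column, and whichever geom is chosen then draws that single row.

```r
library(ggplot2)

# Hypothetical stat: reduce each group to its maximum-y row and add a label.
StatAnnotateMaxSketch <- ggproto(
  "StatAnnotateMaxSketch", Stat,
  required_aes = c("x", "y"),
  compute_group = function(data, scales, labeler = identity) {
    top <- data[which.max(data$y), , drop = FALSE]  # which.max() skips NAs
    top$label <- labeler(top$y)                     # used by text-like geoms
    top
  }
)

# Hypothetical constructor; `...` is forwarded to the geom (size, vjust, ...).
stat_annotate_max_sketch <- function(mapping = NULL, data = NULL, geom = "text",
                                     position = "identity", ...,
                                     labeler = identity, na.rm = FALSE,
                                     show.legend = FALSE, inherit.aes = TRUE) {
  layer(
    stat = StatAnnotateMaxSketch, geom = geom, data = data, mapping = mapping,
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(labeler = labeler, na.rm = na.rm, ...)
  )
}
```

Because grouping is handled by ggplot2 before compute_group runs, mapping color to species yields one annotation per species without any extra code.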

Right now there are 4 stat_annotate_ functions implemented.

  • stat_annotate_max
  • stat_annotate_min
  • stat_annotate_first
  • stat_annotate_last

The last two are mostly useful for time series.

labeler <- label_number(scale_cut = cut_short_scale(), accuracy = .1)
ggplot2::economics %>% 
  ggplot(aes(x = date, y = unemploy)) +
  geom_line() +
  stat_annotate_first(vjust = -.1, hjust = 1, labeler = labeler) +
  stat_annotate_last(hjust = 0, vjust = 1, labeler = labeler) +
  stat_annotate_max(labeler = labeler, hjust = 1.1) +
  theme1('h')
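The four variants presumably differ only in which row of each group they keep. The plain-R sketch below illustrates that selection step under two assumptions: that "first" and "last" mean the smallest and largest x value (which fits the time series case), and that the data has x and y columns as ggplot2 passes to a stat. The helper name pick_annotation_row is hypothetical.

```r
# Hypothetical helper: return the single row of `data` that would be annotated.
pick_annotation_row <- function(data, type = c("max", "min", "first", "last")) {
  type <- match.arg(type)
  i <- switch(type,
    max   = which.max(data$y),  # highest value
    min   = which.min(data$y),  # lowest value
    first = which.min(data$x),  # earliest point along the x axis
    last  = which.max(data$x)   # latest point along the x axis
  )
  data[i, , drop = FALSE]
}
```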

Eventually I want to build in some smarter logic for placing the labels so that they don’t overlap with the chart too much, but for now it’s not too hard to just mess with the vjust and hjust or nudge_x and nudge_y parameters to place them where you want.

Slicing and Dicing

Something I find myself frequently needing to do is to take a dataset with categorical variables and filter it to include only the categories that occur the most frequently. For example, maybe I want to look at only the island with the most observations from the penguins dataset. Enter slice_top_categories.

penguins %>% 
  slice_top_categories(1, island)
## # A tibble: 168 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Biscoe           37.8          18.3               174        3400
##  2 Adelie  Biscoe           37.7          18.7               180        3600
##  3 Adelie  Biscoe           35.9          19.2               189        3800
##  4 Adelie  Biscoe           38.2          18.1               185        3950
##  5 Adelie  Biscoe           38.8          17.2               180        3800
##  6 Adelie  Biscoe           35.3          18.9               187        3800
##  7 Adelie  Biscoe           40.6          18.6               183        3550
##  8 Adelie  Biscoe           40.5          17.9               187        3200
##  9 Adelie  Biscoe           37.9          18.6               172        3150
## 10 Adelie  Biscoe           40.5          18.9               180        3950
## # … with 158 more rows, and 2 more variables: sex <fct>, year <int>

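There is also a .wt argument for specifying a weight variable, in which case categories are ranked by the total of that variable rather than by their number of rows. So if I wanted the “heaviest” species, meaning the one with the most total body mass, I could go like this.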

penguins %>% 
  slice_top_categories(1, species, .wt = body_mass_g)
## # A tibble: 124 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Gentoo  Biscoe           46.1          13.2               211        4500
##  2 Gentoo  Biscoe           50            16.3               230        5700
##  3 Gentoo  Biscoe           48.7          14.1               210        4450
##  4 Gentoo  Biscoe           50            15.2               218        5700
##  5 Gentoo  Biscoe           47.6          14.5               215        5400
##  6 Gentoo  Biscoe           46.5          13.5               210        4550
##  7 Gentoo  Biscoe           45.4          14.6               211        4800
##  8 Gentoo  Biscoe           46.7          15.3               219        5200
##  9 Gentoo  Biscoe           43.3          13.4               209        4400
## 10 Gentoo  Biscoe           46.8          15.4               215        5150
## # … with 114 more rows, and 2 more variables: sex <fct>, year <int>

I can use this with arbitrarily many columns as well, so if I wanted the 2 most frequently occurring (species, sex) pairs, weighted by body mass, I could go like this.

penguins %>% 
  slice_top_categories(2, species, sex, .wt = body_mass_g)
## # A tibble: 134 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.3          20.6               190        3650
##  3 Adelie  Torgersen           39.2          19.6               195        4675
##  4 Adelie  Torgersen           38.6          21.2               191        3800
##  5 Adelie  Torgersen           34.6          21.1               198        4400
##  6 Adelie  Torgersen           42.5          20.7               197        4500
##  7 Adelie  Torgersen           46            21.5               194        4200
##  8 Adelie  Biscoe              37.7          18.7               180        3600
##  9 Adelie  Biscoe              38.2          18.1               185        3950
## 10 Adelie  Biscoe              38.8          17.2               180        3800
## # … with 124 more rows, and 2 more variables: sex <fct>, year <int>
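The behaviour in these examples can be approximated with a few dplyr verbs. The sketch below is an assumption about how such a function might work, not the package's actual implementation: slice_top_categories_sketch is a hypothetical name, and ties are broken arbitrarily here. It counts the category combinations (optionally weighted), keeps the top n, and then filters the original rows with a semi join so that row order and all columns are preserved.

```r
library(dplyr)

# Hypothetical re-creation of a slice_top_categories()-style helper.
slice_top_categories_sketch <- function(.data, n, ..., .wt = NULL) {
  top <- .data %>%
    # with .wt = NULL this counts rows; otherwise it sums .wt per category
    count(..., wt = {{ .wt }}, name = ".n") %>%
    slice_max(.n, n = n, with_ties = FALSE) %>%
    select(-.n)

  # keep only rows of the original data whose categories are in `top`
  semi_join(.data, top, by = names(top))
}
```

With the penguins data, slice_top_categories_sketch(penguins, 2, species, sex, .wt = body_mass_g) would be called just like the example above.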

That’s all I’ve got for now, but this handful of functions has already gotten a ton of mileage in my day-to-day work.