Code style guide

The following introductory section is taken, in slightly adapted form, from Jae Yeon Kim’s “Computational Thinking for Social Scientists.” It is reproduced here for ease of access. To consult the full book, go to https://jaeyk.github.io/comp_thinking_social_science/.

The Big Picture

  • What is code style?

Every major open-source project has its style guide: a set of conventions (sometimes arbitrary) about writing code for that project. It is much easier to understand a large codebase when all the code in it is in a consistent style. - Google Style Guides

10 Tips For Clean Code - Michael Toppa

Code smells and feels - Jenny Bryan

"Code smell" is an evocative term for that vague feeling of unease we get when reading certain bits of code. It's not necessarily wrong, but neither is it obviously correct. We may be reluctant to work on such code because past experience suggests it's going to be fiddly and bug-prone. In contrast, there's another type of code that just feels good to read and work on. What's the difference? If we can be more precise about code smells and feels, we can be intentional about writing code that is easier and more pleasant to work on. I've been fortunate to spend the last couple of years embedded in a group of developers working on the tidyverse and r-lib packages. Based on this experience, I'll talk about specific code smells and deodorizing strategies for R. - Jenny Bryan

Write readable code

  • Naming matters

    • When naming files, remember the following three rules:
      • Machine-readable (avoid spaces, punctuation, periods, and any other special characters except _ and -)
      • Human readable (should be meaningful. No text1, image1, etc.,)
      • Ordering (e.g., 01, 02, 03, … )
# Good
fit_models.R

# Bad
fit models.R
  • When naming objects:
    • Don’t use special characters.
    • Don’t capitalize.
# Good 
day_one
    
# Bad 
DayOne
  • When naming functions:
    • Don’t use special characters.
    • Don’t capitalize.
    • Use verbs instead of nouns. (Functions do something!)
# Good 
run_rdd 

# Bad 
rdd
  • Spacing

Some people do spacing by pressing the Tab key, and others do it by pressing the Space key multiple times.

# Good
x[, 1] 

mean(x, na.rm = TRUE) 

# Bad

x[,1]

mean (x, na.rm = TRUE)
  • Indenting

Indent at least 4 spaces. Note that some people, including none other than Roger Peng, indent 8 spaces. You can change the default indentation setting using the RStudio configuration. Go to Tools > Global Options… > Code > Editing, and look for the option “Insert spaces for Tab”.

# Good
if (y < 0) {
  message("y is negative")
}

# Bad
if (y < 0) {
message("Y is negative")}
  • Long lines
# Good
do_something_very_complicated(
  something = "that",
  requires = many,
  arguments = "some of which may be long"
)

# Bad
do_something_very_complicated("that", requires = many, arguments =
                              "some of which may be long"
                              )
  • Comments
    • Use comments to explain your decisions.
    • But, show your code; Do not try to explain your code by comments.
    • Also, try to comment out rather than delete the code you experiment with.
# Average sleep hours of Jae
jae %>%
  # By week
  group_by(week) %>%
  # Mean sleep hours 
  summarise(week_sleep = mean(sleep, na.rm = TRUE))
  • Pipes (chaining commands)
# Good
iris %>%
  group_by(Species) %>%
  summarize_if(is.numeric, mean) %>%
  ungroup() %>%
  gather(measure, value, -Species) %>%
  arrange(value)

# Bad
iris %>% group_by(Species) %>% summarize_all(mean) %>%
ungroup %>% gather(measure, value, -Species) %>%
arrange(value)
  • Additional tips

  • Use lintr to check whether your code complies with a recommended style guideline (e.g., tidyverse) and styler package to format your code according to the style guideline.

Write reusable code

  • Pasting

Copy-and-paste programming, sometimes referred to as just pasting, is the production of highly repetitive computer programming code, as produced by copy and paste operations. It is primarily a pejorative term; those who use the term are often implying a lack of programming competence. It may also be the result of technology limitations (e.g., an insufficiently expressive development environment) as subroutines or libraries would normally be used instead. However, there are occasions when copy-and-paste programming is considered acceptable or necessary, such as for boilerplate, loop unrolling (when not supported automatically by the compiler), or certain programming idioms, and it is supported by some source code editors in the form of snippets. - Wikipedia

  • It’s okay for pasting for the first attempt to solve a problem. But if you copy and paste three times (a.k.a. Rule of Three in programming), something’s wrong. You’re working too hard. You need to be lazy. What do I mean, and how can you do that?

  • The following exercise was inspired by Wickham’s example.

  • Let’s imagine df is a survey dataset.

    • a, b, c, d = Survey questions

    • -99: non-responses

    • Your goal: replace -99 with NA

# Data
library(tibble)
set.seed(1234) # for reproducibility 

df <- tibble("a" = sample(c(-99, 1:3), size = 5 , replace= TRUE),
             "b" = sample(c(-99, 1:3), size = 5 , replace= TRUE),
             "c" = sample(c(-99, 1:3), size = 5 , replace= TRUE),
             "d" = sample(c(-99, 1:3), size = 5 , replace= TRUE))
# Copy and paste 
df$a[df$a == -99] <- NA
df$b[df$b == -99] <- NA
df$c[df$c == -99] <- NA
df$d[df$d == -99] <- NA

df
# A tibble: 5 × 4
      a     b     c     d
  <dbl> <dbl> <dbl> <dbl>
1     3     3     3     1
2     3     2     3     1
3     1    NA     1     2
4     1    NA     2     1
5    NA     1     1     3
  • Using a function
    • function: input + computation + output
    • If you write a function, you gain efficiency because you don’t need to copy and paste the computation part.
# Create a custom function
fix_missing <- function(x) { # INPUT
  x[x == -99] <- NA # COMPUTATION
  x # OUTPUT 
}

# Apply the function to each column (vector)
# This iterated part can and should be automated.
df$a <- fix_missing(df$a)
df$b <- fix_missing(df$b)
df$c <- fix_missing(df$c)
df$d <- fix_missing(df$d)

df
  • Automation
    • Many options for automation in R: for loop, apply family, etc.
    • Here’s a tidy solution that comes from the purrr package.
    • The power and joy of one-liner.
df <- purrr::map_df(df, fix_missing) # What is this magic? We will unpack the blackbox (`map_df()`) later.

df
  • Takeaways
  1. Your code becomes more reusable when it would be easier to change, debug, and scale-up. Don’t repeat yourself and embrace the power of lazy programming.

Lazy, because only lazy programmers will want to write the kind of tools that might replace them in the end. Lazy, because only a lazy programmer will avoid writing monotonous, repetitive code—thus avoiding redundancy, the enemy of software maintenance and flexible refactoring. Mostly, the tools and processes that come out of this endeavor fired by laziness will speed up the production. - Philipp Lenssen

  1. Only when your code becomes reusable, you would become efficient in your data work. Otherwise, you need to start from scratch or copy and paste, when you work on a new project.

Code reuse aims to save time and resources and reduce redundancy by taking advantage of assets that have already been created in some form within the software product development process.[2] The key idea in reuse is that parts of a computer program written at one time can be or should be used in the construction of other programs written at a later time. - Wikipedia

Test your code systematically

I strongly recommend switching from adhoc testing to formal automated testing (i.e., unit testing).

Whenever you are tempted to type something into a print statement or a debugger expression, write it as a test instead. — Martin Fowler the author of Refactoring

R language tip: Test your code with testthat by InfoWorld /p>

if (!require(testthat)) install.packages("testthat")
Loading required package: testthat
pacman::p_load(testthat)

context("Variable check")

test_that("Check whether instructor variable is setup correctly", {
  
  instructors <- c("Jae", "Nick")

  expect_equal(class(instructors), "character")

}
)
Test passed 🎉

Inspired by an example in Hadley Wickham’s R Journal paper (2011).

context("Model check")

test_that("Check whether the model is lm", {
  
  model <- lm(mpg ~ wt, data = mtcars)
  
  # Passes
  expect_that(model, is_a("lm"))

  # Fails
  expect_that(model, is_a("glm"))

})

Run tests

test_file(file.choose()) # file 

test_dir() # directory

auto_test() # the test code tested when you save the file 

Asking questions: Minimal reproducible example

  • Chances are you’re going to use StackOverflow a lot to solve a pressing problem you face. However, others can’t understand/be interested in your problem unless you provide an example that they can understand with minimal effort. Such an example is called a minimal reproducible example (MRE).

  • Read this StackOverFlow post to understand the concept and best practices.

  • Simply put, an MRE consists of the following items:

    • A minimal dataset
    • The minimal burnable code
    • The necessary information on package, R version, system (use sessionInfo())
    • A seed for reproducibility (set.seed()), if you used a random process.

In practice, use the reprex package to create the code component of the MRE.

if (!require(reprex)) install.packages("reprex")

Copy the following code and type reprex() in the console.

gpa <- c(3, 4, 4, 3)
mean(gpa)

References