Writing Internal Libraries for Analytics Work

Or packages, or modules, or whatever you wish to call them

R
data science
Author

Matt Kaye

Published

April 1, 2023

This post is part of a series called The Missing Semester of Your DS Education.

Introduction

By definition, library code is code that’s written to be reused by programs other than itself, often programs that are unrelated to each other. For instance, dplyr (R) and pandas (Python) are common examples of library code: Instead of writing code from scratch to work with tabular data, you might use one of those two fantastic libraries. You also get some additional benefits from using them:

  1. Those libraries are well-documented, so it’s easy to figure out how to use them.
  2. They’re well-tested, so you (presumably) know that bugs are less likely than if you were to try to write the same functionality from scratch.
  3. They’re (relatively) performant.
  4. They’re widely used, so it’s easy to find answers to questions and get help from the communities using them.

A Common Library

At CollegeVine, we have collegeviner: An R package containing a lot of code that we use very often for all kinds of analytics projects. Some things that live in collegeviner include:

  1. Plot theming code, so that we can consistently theme graphics across all of our work.
  2. A custom implementation of item-based collaborative filtering, which is core to our school recommendation system.
  3. DBI and aws.s3 wrappers for connecting to and querying our databases and working with data in S3.
  4. Helper methods for common math we do, such as converting between odds, log-odds, and probabilities.
  5. Miscellaneous helper code for things that don’t exist natively in R, such as yeet for removing an item from a list and %notin%, the inverse of the %in% operator (sketched below).
  6. An implementation of the Brier Skill Score, which is a metric we often use for evaluating classification models.
  7. A lot more!

You might think of collegeviner as being a common library of things that our team does often, so we don’t need to repeat ourselves or reinvent the wheel.
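
To give a flavor of what a couple of those helpers might look like, here are minimal sketches of %notin% and a Brier Skill Score function. These are illustrative implementations, not the actual collegeviner code:

## The inverse of the %in% operator
`%notin%` <- function(x, table) {
  !(x %in% table)
}

## Brier Skill Score: 1 minus the ratio of the model's Brier score to the
## Brier score of a reference forecast that always predicts the base rate
brier_skill_score <- function(predicted_prob, actual) {
  brier <- mean((predicted_prob - actual)^2)
  brier_reference <- mean((mean(actual) - actual)^2)
  1 - brier / brier_reference
}

c("a", "b") %notin% c("b", "c")
[1]  TRUE FALSE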

A Toy Example

Let’s imagine that you’re setting up a common library (in this example, an R package) for your team. The first thing you might want to add is some logic to help your team connect to your data warehouse. For this example, let’s just imagine that the “warehouse” is your local Postgres instance. Then, you might write a method for your library called connect_to_dwh like this:

library(DBI)
library(rlang)
library(httr)
library(RPostgres)

connect_to_dwh <- function(url = Sys.getenv("DATA_WAREHOUSE_URL")) {
  ## Fail fast if no connection string was provided
  if (url == "") abort("You must specify a dwh URL.")

  ## Make sure the packages we need are actually installed
  check_installed(pkg = "httr")
  check_installed(pkg = "RPostgres")

  ## Split the connection string into its component parts
  db_params <- parse_url(url)

  db_drv <- Postgres()
  db_user <- db_params$username
  db_password <- db_params$password
  db_host <- db_params$hostname
  db_port <- db_params$port %||% 5432 ## Fall back to the default Postgres port
  db_name <- db_params$path

  dbConnect(
    db_drv,
    dbname = db_name,
    host = db_host,
    port = db_port,
    user = db_user,
    password = db_password
  )
}

Now you have a single function that your whole team can share to connect to your data warehouse, assuming they can provide the connection string. Let’s see what the workflow for connecting and querying looks like.

Connecting

Sys.setenv(DATA_WAREHOUSE_URL = "postgresql://localhost:5432/postgres")

conn <- connect_to_dwh()

And that’s it! You’re connected. You can now query away using dbGetQuery(), a custom wrapper, dbplyr, or any other method of choice.
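
For instance, once there’s a table to query (we’ll create blog.iris in the next section), the dbplyr route might look something like this sketch. It’s just one option, and the exact pipeline is up to you:

library(dplyr)
library(dbplyr)

## Write the query lazily in dplyr syntax and let dbplyr translate it
## to SQL and run it in the warehouse
tbl(conn, in_schema("blog", "iris")) |>
  group_by(species) |>
  summarise(max_sepal_length = max(sepal_length, na.rm = TRUE)) |>
  collect()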

Querying

## Put some data into the warehouse for example purposes
dbWriteTable(conn, Id(schema = "blog", table = "iris"), janitor::clean_names(iris))

result <- dbGetQuery(
  conn = conn,
  "
  SELECT species, MAX(sepal_length) AS max_sepal_length
  FROM blog.iris
  GROUP BY species
  ORDER BY 2 DESC
  "
)

pretty_print(result)
species     max_sepal_length
virginica   7.9
versicolor  7.0
setosa      5.8

Testing

It’s also important to test your code. testthat makes writing unit tests for your new connect_to_dwh function very simple.

library(testthat)

test_that("Connecting works as expected", {
  ## This should error because the URL is empty
  expect_error(
    connect_to_dwh(""),
    "You must specify a dwh URL"
  )
  
  conn <- connect_to_dwh()

  ## Should return a PqConnection object
  expect_s4_class(conn, "PqConnection")
  
  ## Should be able to query an example table
  expect_equal(
    dbGetQuery(conn, "SELECT COUNT(*) FROM blog.iris")$count,
    150
  )
})
Test passed 😀

Versioning

Lastly, it’s important to version your code. Semantic Versioning (SemVer) is a very common standard for versioning library code. In R specifically, you can read about versioning in Chapter 22 of R Packages.

In our toy example, this means that if you change how the logic of your connect_to_dwh function works, you should bump the version of your package so that your users (your teammates) don’t get blindsided by your change. Incrementing your package’s version signals to your teammates that something has changed in your library, and they can update their code to rely on the latest version (if they wish) or keep using the version they’re already on.
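
As a concrete sketch, one way to bump the version following SemVer is with usethis, assuming you’re working inside the package’s project:

## Backwards-compatible bug fix: e.g. 1.2.3 -> 1.2.4
usethis::use_version("patch")

## New, backwards-compatible functionality: e.g. 1.2.4 -> 1.3.0
usethis::use_version("minor")

## Breaking change, like changing connect_to_dwh()'s behavior: e.g. 1.3.0 -> 2.0.0
usethis::use_version("major")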

Note that being able to control which version of a library your code is using requires some manner of managing dependencies. In R, I would highly recommend renv. In Python, I like Poetry.
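
As a quick sketch, the core renv workflow looks roughly like this:

renv::init()     ## set up a project-local library and a lockfile
renv::snapshot() ## record the package versions the project uses in renv.lock
renv::restore()  ## reinstall exactly the versions recorded in renv.lock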

One Library Per Model

In addition to a common library for sharing code that’s very often used across the data org, our team has also gotten into the habit of having a library per ML model in production. This definition can be a bit flexible (both in terms of what “ML model” means, and also what “production” means), but the basic principle should be the same: ML in production requires at least some training logic and some monitoring logic. It’s a good idea to share code between those two things. Let’s consider another simple example.

Iris Species Prediction

Let’s imagine that we work for a florist. On our website, we have a service where someone can provide some measurements of an iris (either a setosa or a virginica), and we’ll tell them which of the two we think it is. We know we’ll want to retrain the model periodically as we get more data, and we’ll also want to monitor how our model performs out-of-sample. Both training the model and monitoring it will require some shared logic: loading raw data, doing feature engineering, and making predictions. So it would make sense to have those two jobs rely on a single library, as opposed to repeating the logic in each. Let’s write that library here.

Fetching the Data

First, let’s write a method to fetch the raw data from our data warehouse. In practice, it probably makes sense to factor the SQL out into individual SQL scripts (there’s a sketch of that after the function below), but for this example I’ll just include the SQL directly as a string.

fetch_raw_data <- function(conn) {
  dbGetQuery(
    conn = conn,
    "
    SELECT *
    FROM blog.iris
    WHERE species IN ('setosa', 'virginica')
    "
  )
}
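
For what that factoring might look like, here’s a hypothetical sketch that reads the query from a file shipped with the package. The inst/sql/fetch_iris.sql path and the irislib package name are made up for illustration:

fetch_raw_data <- function(conn) {
  ## Read the query from a SQL file installed with the (hypothetical) package
  sql_path <- system.file("sql", "fetch_iris.sql", package = "irislib")
  query <- paste(readLines(sql_path), collapse = "\n")

  dbGetQuery(conn = conn, query)
}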

Feature Engineering

Next, let’s write some methods to create features.

create_sepal_length_feature <- function(sepal_length) {
  sepal_length + rnorm(n = length(sepal_length))
}

create_petal_width_feature <- function(petal_width) {
  petal_width + rgamma(n = length(petal_width), shape = 2)
}

And then we can write a function to take our raw data, run the feature engineering steps, and return the features.

create_features <- function(raw_data) {
  sepal_length <- create_sepal_length_feature(raw_data$sepal_length)
  petal_width <- create_petal_width_feature(raw_data$petal_width)
  is_setosa <- raw_data$species == "setosa"
  
  data.frame(
    sepal_length,
    petal_width,
    is_setosa
  )
}

Model Fitting and Prediction

Next, let’s write methods to fit the model and make predictions.

fit_model <- function(features) {
  formula <- is_setosa ~ sepal_length + petal_width

  ## Dynamically extract the variables in the formula
  ## so we don't need to repeat ourselves
  predictors <- as.character(labels(terms(formula)))
  target <- as.character(formula[[2]])
  
  ## Very basic error handling
  if (!all(c(predictors, target) %in% colnames(features))) {
    abort("Some required columns were missing from `features`")
  }
  
  model <- glm(
    formula,
    data = features,
    family = binomial
  )
  
  class(model) <- c("irises", class(model))
  
  model
}

predict.irises <- function(object, newdata = NULL, ...) {
  probs <- predict.glm(
    object,
    newdata,
    type = "response"
  )
  
  ## If the predicted probability is > 50%,
  ## return TRUE, else return FALSE
  probs > 0.50
}

Model Evaluation

And finally, let’s add a function to compute the model’s accuracy on some evaluation data.

compute_accuracy <- function(prediction, is_setosa) {
  100 * sum(prediction == is_setosa) / length(prediction)
}

Testing

It’s important to note that all of the methods above can and should be unit tested in the same way we tested our helper for connecting to the database. Testing is a great way to ensure the correctness of your code, and it makes your code more maintainable by making future refactoring safer. Putting all of your modeling logic into a library like this makes it very easy to test. For instance, here’s how you might write a couple of unit tests for the petal width feature.

test_that("Petal width feature is created correctly", {
  ## The feature should be positive even when the
  ## petal width is zero, since we're adding gamma
  ## random noise.
  expect_gt(create_petal_width_feature(0), 0)
  
  ## It should be extremely unlikely that a single 
  ## draw from a gamma(2) is >10, which means this
  ## feature should be < 10 when the input is 0 in 
  ## the vast majority of cases.
  ##
  ## NOTE: This is by definition a brittle test, and
  ## I wouldn't recommend writing tests that are
  ## probabilistic like this in practice unless
  ## you really need to. If you do, this will
  ## fail _some_ of the time, at random, even
  ## if "some" is a very small percentage.
  purrr::walk(
    rep(0, 100), 
    function(x) {
      expect_lt(
        create_petal_width_feature(x),
        10
      )
    } 
  )
})
Test passed 🎉

A Retraining Job

Great! Now that we have all of that library code written, we can package it up into a retraining job. A very simple training job might look like this:

First, connect to the data warehouse

conn <- connect_to_dwh()

Next, fetch the raw data from the warehouse that we need to train the model.

raw <- fetch_raw_data(conn)

pretty_print(head(raw))
sepal_length  sepal_width  petal_length  petal_width  species
5.1           3.5          1.4           0.2          setosa
4.9           3.0          1.4           0.2          setosa
4.7           3.2          1.3           0.2          setosa
4.6           3.1          1.5           0.2          setosa
5.0           3.6          1.4           0.2          setosa
5.4           3.9          1.7           0.4          setosa

Next, create the features from the raw data by running the feature engineering pipeline.

features <- create_features(raw)

pretty_print(head(features))
sepal_length  petal_width  is_setosa
4.340         2.766        TRUE
7.251         0.913        TRUE
4.519         5.098        TRUE
5.627         0.919        TRUE
5.339         4.325        TRUE
6.146         4.990        TRUE

Then fit a model using the features.

model <- fit_model(features)

pretty_print(coef(model))
              x
(Intercept)   7.337
sepal_length  -0.896
petal_width   -0.592

Finally, evaluate the performance of the model by making predictions and computing the accuracy of those predictions.

preds <- predict(model)
accuracy <- compute_accuracy(preds, features$is_setosa)

cat(paste0("Model accuracy is ", accuracy, "%"))
Model accuracy is 75%

And that’s it – you have a simple retraining job. This is a very minimal example, but this general framework is very flexible and modular, and it makes up the foundation for how we write our retraining jobs at CollegeVine. You can plug and play all different kinds of feature engineering logic, logic to fetch raw data, metrics, etc. We also use MLFlow for versioning models and tracking experiments, so our retraining jobs have lots of logging of artifacts, parameters, and metrics to our MLFlow instance.
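
As a rough sketch of what that logging might look like with the mlflow R package (the parameter and metric names here are made up for illustration):

library(mlflow)

## Start a run, log a parameter and a metric from the training job, and finish
mlflow_start_run()

mlflow_log_param("trained_at", as.character(Sys.time()))
mlflow_log_metric("training_accuracy", accuracy)

mlflow_end_run()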

A Monitoring Job

Next, let’s imagine we want to monitor the model’s out-of-sample performance. For the sake of the example, let’s modify the table holding our raw data in the database.

transaction_result <- dbWithTransaction(
  conn = conn,
  {
    dbExecute(
      conn = conn,
      "
      ALTER TABLE blog.iris
      ADD COLUMN created_at TIMESTAMP
      "
    )
    
    dbExecute(
      conn = conn,
      "
      UPDATE blog.iris
      SET created_at = NOW() - random() * INTERVAL '2 weeks'
      "
    )
  }
)

Great. Now let’s make one or two small modifications to the code from above that pulls the raw data from the data warehouse.

fetch_raw_data <- function(conn, created_after = '1970-01-01 00:00:00') {
  dbGetQuery(
    conn = conn,
    sprintf(
      "
      SELECT *
      FROM blog.iris
      WHERE 
        species IN ('setosa', 'virginica')
        AND created_at > '%s'
      ",
      created_after
    )
  )
}

All we’ve done here is add the ability to specify a created_after cutoff, so that we only include records in our raw data that were created after that point. In practice, this lets us filter our raw data down to only the records that were created after the model was trained (the out-of-sample data).

## In practice, this would be set at training time and "frozen"
## possibly by logging the value as a parameter in the MLFlow run
model_trained_at <- median(fetch_raw_data(conn)$created_at)

And now that we’ve artificially created a trained_at date for the model, we can run our monitoring job. It’s quite simple, and very similar to the retraining job. All we do here is pull raw data that has been created since the model was trained, run the feature engineering pipeline, make predictions, and compute the accuracy of the model out-of-sample.

raw <- fetch_raw_data(conn, created_after = model_trained_at)

features <- create_features(raw)
predictions <- predict(model, features)

accuracy <- compute_accuracy(predictions, features$is_setosa)

cat(paste0("Out-of-sample accuracy is ", accuracy, "%"))
Out-of-sample accuracy is 74%

Tying it Together

The key piece to notice is how much we’re leveraging our library code in both the retraining and monitoring job. In both cases, we’re doing some very similar things – pulling raw data, creating features, making predictions, computing accuracy – so it makes a lot of sense that we’d want to reuse the code for those two jobs.

There might also be more use cases for the code: More retraining or monitoring jobs, REST APIs, ETL jobs, etc. The more times you need to rely on the same logic, the more benefit you’ll derive from having a single source of truth for all of the logic for your modeling process.

It also might be useful to separate this library from the common library proposed at the start. There are important tradeoffs to consider: On one hand, a single library is convenient because all of your logic lives in one place. On the other hand, as your library grows in scope, it’ll necessarily have a bigger footprint, rely on more dependencies, and so on, which will make it harder to use and maintain. A happy middle ground for us has been having a single library per “model” or use case. For instance, at CollegeVine we have a package called mlchancing for our chancing model and a separate package called schoolrecommendr for school recommendations and affinity scoring. Keeping these separate has made it easier to iterate on each model individually without either package becoming a maintenance or ramp-up headache.

It’s my view that models and other analytics work that is in production is software, and should be treated as such. If a model is going to be shipped to production, it at the very least needs to be tested, documented, versioned, and put through some kind of CI/CD process. It’d be even better if it’s monitored automatically so that the data scientists working on it can be notified quickly if things start going wrong. Ultimately, writing library code for your modeling work is very well-suited to meeting all of these expectations. And it also just makes everyone’s lives easier by not needing to reinvent the wheel all the time.