A Gentle Introduction to Docker

I need to run my code somewhere other than my machine. How do I do it?

Categories: R, data science

Author: Matt Kaye

Published: June 6, 2023

This post is part of a series called The Missing Semester of Your DS Education.

Introduction

If you’re doing data science work, it’s likely you’ll eventually come across a situation where you need to run your code somewhere else. Whether that “somewhere” is a teammate’s machine, an EC2 box, a pod in a Kubernetes cluster, a runner in your team’s CI/CD pipeline, or a Spark cluster depends greatly on the problem you’re solving. But the ultimate point is the same: Eventually, you’ll need to be able to package your code up, put it somewhere in the world other than your local machine, and have it run just like it has been running for you.

Enter: Docker

Docker seems very complicated at first glance. And there’s a lot of jargon: images, containers, volumes, and more, and that doesn’t even begin to cover the world of container orchestration: Docker Compose, Kubernetes, and so on. But at its core, you can think of Docker as a little environment – not unlike your local machine – with its own file system, configuration, and so on, packaged up into a magical box that you can run on any computer, anywhere. Or at least on any computer that has Docker installed.

It might be simplest to consider a small example.

Note that to run the example that follows, you’ll need to have Docker installed on your machine. All of the files necessary to run this example can be found in my blog’s GitHub repo.

I’ll use R for the example I provide in this post, but note that the same principles apply if you’re doing your work in Python, or in any other programming language.

Let’s imagine we want to print “Hello from Docker!” from R. First, make a new directory called docker-example (or whatever you want to call it):

mkdir docker-example && cd docker-example

And then we might do something like the following:

## Dockerfile

FROM rocker/r-ver:4.2.0

CMD ["Rscript", "-e", "'Hello from Docker!'"]

If you paste that into a file called Dockerfile, you can then run:

docker build --tag example .

This will build the Docker image by running each instruction you’ve specified. Going line by line, those instructions are:

  1. Use the rocker/r-ver:4.2.0 image as the base image. In Docker, base images are useful because they come with things (such as the R language itself) pre-installed, so you don’t need to install them yourself. rocker/r-ver:4.2.0 ships with R version 4.2.0 pre-installed, which means you can run R just as you would on your local machine.
  2. After declaring the base image, we specify a command to run when docker run is invoked. This command is simple – it just prints Hello from Docker!
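
As an aside, the JSON-array syntax used in the CMD line is Docker’s “exec form,” which runs the command directly, without a shell. There’s also a “shell form,” which wraps the command in /bin/sh -c, so shell features like variable expansion work. Both of the following produce the same output here; the comparison is just for illustration:

## Exec form: arguments as a JSON array, no shell involved
CMD ["Rscript", "-e", "'Hello from Docker!'"]

## Shell form: runs via /bin/sh -c, so things like $HOME would expand
CMD Rscript -e "'Hello from Docker!'"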

Once the build has completed, you can:

docker run example

and you should see:

[1] "Hello from Docker!"

Tada 🎉! You just ran R in a Docker container. And since you have your code running in Docker, you could now run the same code on any other machine that supports Docker.

More Complicated Builds

Of course, this example was trivial. In the real world, our projects are much more complex. They have dependencies, they rely on environment variables, they have scripts that need to be run, and so on.

Copying Files

Let’s start with running a script instead of running R from the command line as we have been.

Create an R script called example.R that looks like this:

## example.R

print("Hello from Docker!")

And then you can update the Dockerfile by adding a COPY instruction to copy the script into your image, as follows.

## Dockerfile

FROM rocker/r-ver:4.2.0

COPY example.R example.R

CMD ["Rscript", "example.R"]

The COPY instruction tells Docker that you want to take example.R from your local directory and put it into your image at /example.R. You can also specify a different destination path inside the image, but here I’m just putting the files I copy at the root.
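
For example, if you’d rather keep your code in a dedicated directory inside the image instead of at the root, you can pair COPY with a destination path and WORKDIR. A quick sketch (the /app directory is an arbitrary choice, not something from the example repo):

## Dockerfile

FROM rocker/r-ver:4.2.0

## Create /app if it doesn't exist and make it the working directory
WORKDIR /app

## Copies example.R to /app/example.R
COPY example.R example.R

## Relative paths now resolve against /app
CMD ["Rscript", "example.R"]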

Finally, let’s build and run our Docker image again:

docker build --tag example .

docker run example

Amazing! You can see in the build logs that the example.R script was copied into the image:

 => [2/3] COPY example.R example.R

and then running the image gives the same result as before:

[1] "Hello from Docker!"

Installing Dependencies

You’ll generally also need to install dependencies, which you can do using the RUN instruction. Let’s update the Dockerfile to install glue.

## Dockerfile

FROM rocker/r-ver:4.2.0

COPY example.R example.R

RUN Rscript -e "install.packages('glue')"

CMD ["Rscript", "example.R"]

Now, the third step in the build installs glue. And to show that it works, we’ll use glue to do a bit of string interpolation, printing the running R version, which the rocker images expose via the R_VERSION environment variable. Update example.R as follows:

## example.R

library(glue)

print(glue('Hello from Docker! I am running R version {Sys.getenv("R_VERSION")}'))

Building and running again should give you some new output. First, you should see glue installing in the build logs:

 => [3/3] RUN Rscript -e "install.packages('glue')"

And once you run the image, you should see:

Hello from Docker! I am running R version 4.2.0                                                                  

Woohoo! 🥳🥳

Using renv

But as I wrote about in my last post, global dependency installs are usually a bad idea. So we probably don’t want an install.packages() call as a RUN step in the Dockerfile. Instead, let’s use renv to manage our dependencies.

From the command line, run:

Rscript -e "renv::init()"

Since you’re already using glue in your project, this will generate a lockfile (renv.lock) that looks something like this:

{
  "R": {
    "Version": "4.2.0",
    "Repositories": [
      {
        "Name": "CRAN",
        "URL": "https://cloud.r-project.org"
      }
    ]
  },
  "Packages": {
    "glue": {
      "Package": "glue",
      "Version": "1.6.2",
      "Source": "Repository",
      "Repository": "CRAN",
      "Requirements": [
        "R",
        "methods"
      ],
      "Hash": "4f2596dfb05dac67b9dc558e5c6fba2e"
    },
    "renv": {
      "Package": "renv",
      "Version": "0.17.3",
      "Source": "Repository",
      "Repository": "CRAN",
      "Requirements": [
        "utils"
      ],
      "Hash": "4543b8cd233ae25c6aba8548be9e747e"
    }
  }
}

It’s important to keep the version of R you’re running in your Docker containers in sync with the version you have locally. I’m using 4.2.0 in my Docker image, which I declared with FROM rocker/r-ver:4.2.0, and that’s the same version recorded in my renv.lock file. In Python, you might use a tool like pyenv to manage Python versions.
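
As an aside, the base image tag can be parameterized with a build arg, since an ARG declared before FROM can be used in the FROM line itself. That makes it possible, for example, for a CI job to read the R version out of renv.lock and pass it in, rather than hard-coding it in the Dockerfile. A sketch:

## Dockerfile

## Declared before FROM so it can parameterize the base image tag
ARG R_VERSION=4.2.0

FROM rocker/r-ver:${R_VERSION}

And then, to override it: docker build --tag example --build-arg R_VERSION=4.2.0 .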

Now that we have renv set up, we can update the Dockerfile a bit more:

## Dockerfile

FROM rocker/r-ver:4.2.0

COPY example.R example.R
COPY renv /renv
COPY .Rprofile .Rprofile
COPY renv.lock renv.lock

RUN Rscript -e "renv::restore()"

CMD ["Rscript", "example.R"]

Now, we’re copying all of the renv scaffolding into the image. And instead of running install.packages(...), we’ve replaced that line with renv::restore(), which looks at the lockfile and installs the packages exactly as they’re defined there. Rebuilding and running the image again will give you the same result as before.
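
One refinement worth knowing about here: Docker caches each instruction as a layer, and changing a copied file invalidates the cache for that COPY and everything after it. Since example.R will change far more often than renv.lock, copying the script after the restore step means that edits to your code won’t re-trigger a full renv::restore(). A sketch of the reordered Dockerfile:

## Dockerfile

FROM rocker/r-ver:4.2.0

## Copy only the dependency manifests first...
COPY renv /renv
COPY .Rprofile .Rprofile
COPY renv.lock renv.lock

## ...so this slow layer stays cached until the lockfile changes
RUN Rscript -e "renv::restore()"

## Code changes only invalidate the layers from here down
COPY example.R example.R

CMD ["Rscript", "example.R"]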

Now that we have a script running in Docker and are using renv to declare and install dependencies, let’s move on to…

Environment Variables

Sometimes we need environment variables like a GitHub token or a database URL, either to install our dependencies or to run our code. Depending on when the variable will be used, we can specify it either at build time (as a build arg) or at runtime. Generally, it’s a good idea to pass as build args only the variables you truly need at build time.

Build Time Config

For instance, if your build requires downloading a package from a private GitHub repository (for which you need to have a GITHUB_PAT set), then you would specify your GITHUB_PAT as a build arg. Let’s try that:

## Dockerfile

FROM rocker/r-ver:4.2.0

ARG GITHUB_PAT

## Note: don't actually do this
## It's just for the sake of example
RUN echo "$GITHUB_PAT"

COPY example.R example.R
COPY renv /renv
COPY .Rprofile .Rprofile
COPY renv.lock renv.lock

RUN Rscript -e "renv::restore()"

CMD ["Rscript", "example.R"]

Now, the second instruction declares a build arg using ARG. Next, run the build:

docker build --tag example --build-arg GITHUB_PAT=foobar .

You should see the following in the logs:

 => [2/7] RUN echo "foobar" 

This means that your variable GITHUB_PAT has been successfully set, and can be used at build time for whatever it’s needed for.

This is just an example, but it’s important that you don’t expose secrets in your builds like I’ve done here. If you’re using (e.g.) a token as a build arg, make sure it isn’t printed in plain text to your logs. Also be aware that build args can be recovered from the image itself with docker history, so they’re a poor fit for truly sensitive values.
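
If you genuinely need a secret during the build, one safer option is BuildKit’s secret mounts, which expose the secret to a single RUN step without writing it into the image or its metadata. A rough sketch, assuming BuildKit is enabled and your token lives in a local file called .github_pat:

## Dockerfile

## The secret is mounted at /run/secrets/github_pat for this step only
RUN --mount=type=secret,id=github_pat \
    GITHUB_PAT="$(cat /run/secrets/github_pat)" Rscript -e "renv::restore()"

And the corresponding build command:

docker build --secret id=github_pat,src=.github_pat --tag example .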

Runtime Config

Other times, you want config to be available at container runtime. For instance, if you’re running a web app, you might not need to be able to connect to your production database when you’re building the image that houses your app. But you need to be able to connect when the container boots up. To achieve this, use --env (or --env-file). We’ll update our example.R a bit to show how this works.

## example.R

library(glue)

print(glue('Github PAT: {Sys.getenv("GITHUB_PAT")}'))

print(glue('Database URL: {Sys.getenv("DATABASE_URL")}'))

print(glue('Hello from Docker! I am running R version {Sys.getenv("R_VERSION")}.'))

And then, let’s rebuild:

docker build --tag example --build-arg GITHUB_PAT=foobar .

and now we’ll run our image, but this time with the --env flag:

docker run --env DATABASE_URL=loremipsum example

This tells Docker that you want to pass the environment variable DATABASE_URL=loremipsum into the container running your example image when the container boots up.

And after running, you’ll see something like this:

Github PAT: 
Database URL: loremipsum
Hello from Docker! I am running R version 4.2.0.

There are a few things to note here.

  1. The GITHUB_PAT that you set as a build arg is no longer accessible at runtime. It’s only accessible at build time.
  2. The DATABASE_URL we provided with the --env flag is now accessible as an environment variable named DATABASE_URL.
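
As an aside, once you have more than a variable or two, the --env-file flag mentioned earlier is usually more ergonomic than repeating --env. A sketch, assuming a file called .env sitting next to your Dockerfile:

## .env

DATABASE_URL=loremipsum
API_KEY=some-other-secret

docker run --env-file .env example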

Very often, deployment platforms like Heroku, DigitalOcean, and AWS Batch will allow you to specify environment variables via their CLI or UI, which they will then inject into your container for you when it boots up.

Advanced Topics

This post is intended to be a gentle introduction to Docker, but there’s a lot that’s missing. I’d like to quickly address a couple more topics that have been helpful to me and our team as we’ve relied more and more heavily on Docker.

Custom Base Images

You might have noticed that your builds take longer as they do more. For instance, the glue install, even though glue is extremely lightweight, takes a few seconds. If you have many dependencies (R packages, C libraries, etc.), building an image for your project can get prohibitively slow. In the past, our builds have taken over an hour just restoring the dependencies recorded in the lockfile.

A convenient way around this is to install the “base” dependencies that you update rarely and use often into a base image, push that image to a registry like Docker Hub, and then reference it in the FROM ... line of the Dockerfile for any particular project. This saves you from installing the same, unchanged dependencies over and over again.

We’ve had a lot of success using this strategy on a few particular fronts:

  1. Making sure we’re using the same version of R everywhere is simple if we define the R version in one place with FROM rocker/r-ver:4.2.0 in our base image (which is called collegevine/r-prod-base), and then use FROM collegevine/r-prod-base as the base image for all of our other Docker builds.
  2. Installing Linux dependencies, such as curl, unzip, etc., which we’re happy to keep on a single version, can happen once in the base image, and then every downstream image can rely on those same dependencies.
  3. Installing CLIs like the AWS CLI, which, again, really doesn’t need to happen on every build.
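
To make this concrete, here’s a rough sketch of what a base image along these lines might look like (the package list is illustrative, not our actual one):

## Dockerfile.base

FROM rocker/r-ver:4.2.0

## Linux dependencies and CLIs that rarely change
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl unzip && \
    rm -rf /var/lib/apt/lists/*

You’d build and push it once (e.g. docker build --file Dockerfile.base --tag collegevine/r-prod-base . followed by docker push collegevine/r-prod-base), and then every project’s Dockerfile begins with FROM collegevine/r-prod-base instead of FROM rocker/r-ver:4.2.0.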

CI / CD

The other time-saving strategy we’ve benefited from greatly is aggressively caching R packages in our CI/CD process. renv has great docs on using it within a CI/CD rig, which I’d highly recommend.

At a high level, what we do is run renv::restore() in the CI itself (before running the docker build ...), which installs all of the packages our project needs. Then we COPY the cache of packages into our image, so that they’re available inside of it. This means we don’t need to reinstall every dependency on every build, and it has probably sped up our image build times by 100x.
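
The exact setup depends on your CI system, but here’s a rough sketch of the idea, assuming you point renv’s package cache somewhere inside the build context via the RENV_PATHS_CACHE environment variable (which renv respects):

## In CI, before the image build
export RENV_PATHS_CACHE="$PWD/renv/cache"
Rscript -e "renv::restore()"
docker build --tag example .

## Dockerfile (addition)

## The cache was populated in CI under renv/cache, which the existing
## COPY renv /renv line brings along; this tells renv where to find it
ENV RENV_PATHS_CACHE=/renv/cache

With the cache in place, the existing RUN Rscript -e "renv::restore()" step links packages out of the cache rather than downloading and compiling them from scratch.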

Wrapping Up

I hope this post has demystified Docker a bit and helped clarify some of the basics of how it’s used. At the highest level, Docker lets you package up your code so that it can run anywhere, whether that’s on your machine, on the machine of a coworker, in a CI/CD tool, on a cloud server like an EC2 box, or anywhere else. All you need to do is build and push your image!
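
For completeness, “pushing” an image just means tagging it with a repository name and uploading it to a registry, after which any machine with access can docker pull and docker run it. A sketch, with a hypothetical Docker Hub username:

docker tag example your-username/example:latest
docker push your-username/example:latest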