Simple ETL Pipeline in R

Kolapo Obajuluwa
4 min read · Feb 13, 2022

Caution! No for loops were used in the making of this post.

Photo by Mike Benna on Unsplash

Hi! How’s it going? I’m well too, thanks for asking! It’s been too long since my last post. Why? Let’s just thank God for love and life.

This will be a short read (at least I hope it will), mostly because it’s a mirror post. Let me explain.

I came across an interesting Medium post by Nazia Habib (huge shout-out) where she detailed how to create a simple ETL pipeline in Python. I recommend quickly going through it, as I shall try to avoid repeating anything she discussed there.

My goal after reading her post was to replicate it in R, so the following details how I achieved that.

For some context (as I suspect you may not have checked out her post 🙄), the aim is to query movie data from The Movie Database (TMDB) API, transform the data, extract three tables (movies, genres list and datetimes) and load them into a desired location/format (an Excel workbook for me, CSVs for Nazia). Really, this could go into a database, a data warehouse, Google Sheets or whatever destination you please, but for simplicity, an Excel workbook works just fine for me now.

You can skip the explanation and view the full commented source code on my GitHub.

Getting Started

Nazia already explains how to acquire an access key for the TMDB API, so I’ll just skip ahead to loading the api_key and importing the relevant libraries.
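In sketch form (the package choices here are mine: httr for requests, the tidyverse for wrangling, glue for building URLs, lubridate for dates and openxlsx for the workbook; and I keep the key in a local file out of version control):

# Packages used throughout the pipeline
library(httr)      # HTTP requests
library(tidyverse) # dplyr, tidyr, purrr, tibble...
library(glue)      # string interpolation for the URLs
library(lubridate) # date handling for the datetimes table
library(openxlsx)  # writing the Excel workbook

# Read the API key from a local file (kept out of version control)
api_key <- readLines("api_key.txt")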

Extract

Data is queried from the API by movie_id, an integer. To query a single data point, we just call the URL with the desired movie_id and our API key, as shown below.
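A sketch of that single request (560 is just an example ID):

# Query a single movie, id 560, from the TMDB movie endpoint
movie_id <- 560
url <- glue("https://api.themoviedb.org/3/movie/{movie_id}?api_key={api_key}")

response <- GET(url)
content(response, as = "parsed")  # parsed JSON: a named list of movie fields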

Now let’s get data for a range of movies, say movies with ids from 560 to 565.

To do this, we first declare our range of IDs. Next, using the glue function, we generate a vector of URLs, one for each ID, with our api_key appended.
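Something like this; glue() is vectorised, so one call builds every URL:

# Declare the range of movie IDs we want
movie_ids <- 560:565

# One URL per ID; api_key is recycled across the vector
url_list <- glue(
  "https://api.themoviedb.org/3/movie/{movie_ids}?api_key={api_key}"
)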

Next, we write a function to query each URL, extract the required data and return our desired response. This is done by customizing our GET call to pull out only the desired content.
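A minimal version of that wrapper (the name get_movie is mine):

# Wrap GET so each call returns the parsed body instead of a raw response
get_movie <- function(url) {
  response <- GET(url)
  content(response, as = "parsed")
}

# Sanity check on the first URL before scaling up
get_movie(url_list[1])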

After verifying that our function works, we can vectorize it, with the iterative argument being the URL. This lets R loop through the url_list efficiently and return a list of responses, ready to be bound into a dataframe.
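Roughly like so; SIMPLIFY = FALSE is my addition, so we get a plain list back rather than anything auto-simplified:

# Vectorize maps get_movie over the url argument internally (no for loop!)
get_movies <- Vectorize(get_movie, vectorize.args = "url", SIMPLIFY = FALSE)

# A list of parsed responses, one named list per movie
responses <- get_movies(url_list)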

Transform

Next, we want to select only the columns we’re interested in and unnest them as they’re still named lists.
We’ll address the ‘genres’ column in a bit.
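In sketch form (the column subset is my guess; TMDB returns many more fields):

# One row per movie; unnest_wider spreads each named list into columns
movies_df <- tibble(results = responses) %>%
  unnest_wider(results) %>%
  select(id, title, genres, release_date, budget, revenue, popularity)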

To create the genres table, we pull our ‘genres’ column and extract all unique genres into a table as shown below.
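A sketch of that extraction:

# Build a genre lookup table: one row per distinct genre
genres_table <- movies_df %>%
  select(genres) %>%
  unnest_longer(genres) %>%  # one row per movie-genre pair
  unnest_wider(genres) %>%   # split each genre into id and name columns
  distinct(id, name)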

For our main movies table, we want to be able to filter through the genres easily such that if a movie belongs to a genre, it has a row value of 1 and if not, the value is 0.

To do this, we unnest the ‘genres’ column and pivot_wider (spread) the column with column names as each unique genre and values as the id (genre id). We then run a simple custom function (serializer_fn) to replace all values in the genre columns with 1 or 0 depending on whether or not they exist (are not NA).
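A sketch of both steps, with serializer_fn written as my minimal reading of the description (not NA becomes 1, NA becomes 0):

# Replace a genre id with 1 if present, 0 if the movie lacks the genre
serializer_fn <- function(x) if_else(is.na(x), 0, 1)

movies_table <- movies_df %>%
  unnest_longer(genres) %>%
  unnest_wider(genres, names_sep = "_") %>%  # genres_id, genres_name
  pivot_wider(names_from = genres_name,      # one column per genre name
              values_from = genres_id) %>%   # values are the genre ids
  mutate(across(any_of(genres_table$name), serializer_fn))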

For the last table, the datetime table, using the lubridate package, we can easily extract the day, month, year and day_of_week from the release_date column.
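For example, leaning on lubridate’s ymd(), day(), month(), year() and wday() helpers:

# Derive calendar features from release_date
datetimes_table <- movies_df %>%
  select(id, release_date) %>%
  mutate(
    release_date = ymd(release_date),
    day          = day(release_date),
    month        = month(release_date),
    year         = year(release_date),
    day_of_week  = wday(release_date, label = TRUE)
  )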

Load

All three tables are created and ready to be loaded into our Excel Workbook destination. This file will be created in our working directory.
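With openxlsx (my pick; writexl would do just as well), a named list of dataframes writes out as one sheet per table:

# One workbook, one sheet per table, written to the working directory
write.xlsx(
  list(
    movies    = movies_table,
    genres    = genres_table,
    datetimes = datetimes_table
  ),
  file = "movies_etl.xlsx"
)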

Well, this has been fun! Thanks for reading this far! My argument for writing ETL scripts in R is that translating my data-transformation intentions into code feels more intuitive in R than in Python, but hey! Do what works for you!

Have a great one and feel free to leave comments and/or questions.

Also connect with me on LinkedIn to collaborate on projects and talk about data in all flavours!
