In this post, I’ll show you how to automate data scraping and web deployment using a combination of Jupyter notebooks, papermill, Google Sheets API, and cron jobs running on a virtual machine. The main focus is on how these pieces fit together to create a fully automated pipeline that scrapes data, processes it, and deploys a styled webpage - all without manual intervention.
This approach is particularly useful for creating automated dashboards that need regular data updates, monitoring systems that track changing metrics over time, or any scenario where you want to maintain historical data from sources that don’t preserve it themselves. It’s essentially a low-overhead, fully automated CI/CD pipeline for small datasets that requires no dedicated servers or complex infrastructure.
The beauty is that everything runs in a single Jupyter notebook environment, making it easy to debug, modify, and understand the entire pipeline at a glance. A simple cron job triggers a notebook that scrapes data, stores it, converts itself to HTML, and deploys automatically - turning Jupyter into a complete web application platform.
The Big Picture
This automation setup demonstrates how to:
- Automatically execute a Jupyter notebook on a schedule
- Scrape web data and store it in Google Sheets
- Convert the notebook to HTML and deploy it to GitHub Pages
- Host everything on a Google Cloud VM with cron job scheduling
The result is a self-updating webpage that displays fresh data daily, hosted automatically through GitHub Pages.
Project Overview
I built this system to track COVID-19 metrics for Santa Barbara County. Every day, the California Department of Public Health publishes updated COVID tracking data, but it doesn’t preserve historical data. This tool automatically scrapes the daily metrics and maintains a running table of historical data. Specifically, I wanted to track when the county’s improving metrics would qualify it for removal from the state’s monitoring list and allow businesses to reopen.
Technical components:
- Jupyter notebook for scraping and data processing
- Google Sheets API for data storage
- Papermill for notebook automation
- Cron job for scheduling
- GitHub Pages for hosting
- Google Cloud Platform VM for execution
The live result is hosted at https://alicelepissier.com/COVID-SB, automatically updating daily at noon.
Step 1: Setting Up the Compute Environment with a Virtual Machine + Docker container
1. Create the VM
In a previous post, we learned how to create a virtual machine (VM) on Google Cloud Platform. We are going to be using the machine we created then as our computing environment.
Setting up the project on Windows. You can also run this entire setup on your local machine if you prefer - the repository includes a Windows batch file for automating execution through Task Scheduler instead of cron. Simply update windows_scheduled_task.bat to point to your own paths for the Python executable and the folder where this project will live. Then open Task Scheduler by pressing Win + R and entering taskschd.msc, and schedule a task to execute the .bat file daily. The rest of the tutorial assumes we are running on Linux, but the core automation logic remains identical since everything runs through Python inside a Jupyter notebook. You can find the list of required packages for this project in requirements.txt.
2. Spin up a Docker Container
A full Docker tutorial is outside the scope of this post, but we will set up a Docker container to ensure our scraping environment stays consistent and portable across systems. The repo contains the Dockerfile and the docker-compose.yml you need to replicate the compute environment, with all of the required packages pre-installed to run this project.
First, install Docker: sudo apt-get update && sudo apt-get install -y docker.io docker-compose
.
Second, add yourself to the docker
group to give your user account permission to run Docker Compose without needing sudo
every time: sudo usermod -aG docker $USER
.
Third, log out and back in to your instance so that the new group membership takes effect: logout.
Finally, log back in to your Google VM instance by running gcloud beta compute ssh --zone "<your instance zone>" "<your instance name>" --project "<your project ID (not name)>"
. Find more explanations on SSH access to a Google Cloud VM here.
You are now ready to spin up the Docker container: docker-compose up -d
. The flag -d
stands for detached mode: this starts the container in the background, and keeps it running persistently until you explicitly stop it (with docker-compose down
or docker stop
).
For this automation project, you only need to spin up the container once; the automation logic explained below will handle updating the Jupyter notebook and deploying it automatically to GitHub.
3. Connect to a GitHub Repository
Since the project relies on GitHub Pages to host the self-updating webpage, you’re going to need a dedicated repo and to be able to push to it via the CLI.
In our case, the GitHub repo is walice/COVID-SB/.
Feel free to clone this repository with git clone git@github.com:walice/COVID-SB.git
, or create a new empty repository on GitHub[^1] and clone that. This is where the project files should live.
Check out this tutorial if you need help with Git.
Step 2: The Core Jupyter Notebook
The heart of our automation is index.ipynb
. It consists of these key sections:
- Preamble and setup: Import libraries and configure the working environment
- Web scraping logic: Extract COVID metrics from the California Department of Public Health website
- Google Sheets API integration: Store scraped data persistently in the cloud
- Data presentation and styling: Format the historical data into an attractive web table
- Automation components: Save the notebook and convert it to HTML for web deployment
- Automated Git deployment: Commit changes and push the updated HTML to GitHub Pages
Let me walk through its key sections. At this stage, we are working inside the Jupyter notebook called index.ipynb
. The automation will then convert it to an index.html
file. GitHub Pages automatically looks for and serves index.html
as the default file when someone visits your site’s root URL (i.e., in this case, the published repo at walice.github.io/COVID-SB).
If you want to name it something else, say dashboard.html, then visitors will need to go to https://<GitHub Username>.github.io/<GitHub Repo>/dashboard.html in order to view the live webpage.
1. Preamble and Setup
Nothing special here, just importing the libraries we need and setting up our working directory.
2. Web Scraping Logic
The scraping logic uses Beautiful Soup to extract specific data from the California Department of Public Health website.
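The exact page URL and HTML structure aren’t reproduced in this post, so the snippet below is only a minimal sketch of the requests + Beautiful Soup pattern; the URL and the CSS class are placeholders, not the actual selectors used in index.ipynb.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL and selector; the real notebook parses the CDPH
# county data page, whose exact markup isn't reproduced here.
URL = "https://example.com/county-data"

response = requests.get(URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Grab the table cell holding the metric of interest,
# identified here by a hypothetical CSS class.
cell = soup.find("td", class_="case-rate")
case_rate = float(cell.get_text(strip=True))
```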
3. Google Sheets API Integration
The advantages of Google Sheets over a traditional database in this case are clear: it’s free, requires no server setup, offers a simple web interface for manual data inspection, and has excellent API support. For small-to-medium datasets like daily COVID metrics, it’s ideal - no need to spin up PostgreSQL or manage database connections.
First, you’ll need to authenticate with Google. Follow the instructions for setting up the Google Sheets API here.
This is the section of the notebook where we store our scraped data persistently.
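The notebook’s own credentials and sheet layout aren’t shown in this post, but the gist of the step looks something like the sketch below, using the gspread client as one convenient wrapper around the Sheets API. The key file name, sheet name, and columns are all assumptions, and case_rate carries over from the scraping sketch above.

```python
import datetime

import gspread
import pandas as pd

# Authenticate with a service-account key file (an assumption; the
# notebook may use a different credential flow).
gc = gspread.service_account(filename="credentials.json")
worksheet = gc.open("COVID-SB-data").sheet1

# Append today's scraped metric as a new row of the historical table...
worksheet.append_row([str(datetime.date.today()), case_rate])

# ...then pull the full history back into a DataFrame for styling later.
history = pd.DataFrame(worksheet.get_all_records())
```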
4. Data Presentation and Styling
We style the data for web presentation:
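The styling code itself isn’t reproduced in this post; as a rough sketch, it boils down to chaining a pandas Styler along these lines (continuing from the hypothetical history DataFrame above; the column name and colour map are illustrative):

```python
# Add a caption, format the numbers, and colour-code the cells.
# "case_rate" is an assumed column name, not the notebook's actual one.
styled = (
    history.style
    .set_caption("COVID-19 metrics for Santa Barbara County")
    .format("{:.1f}", subset=["case_rate"])
    .background_gradient(cmap="Reds", subset=["case_rate"])
)
styled  # the rendered table is what ends up in index.html
```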
Check out the nifty jupyterthemes package, which lets you install notebook-wide themes. This is what the command ! jt -t grade3 -tf robotosans -cellw 1100 accomplishes in the first cell of the notebook (grade3 is the name of a theme).
5. The Key Automation Components
This is the crucial part that enables automation. The notebook saves itself and converts to an HTML file, index.html, which is then automatically served by GitHub Pages at the URL https://<GitHub Username>.github.io/<GitHub Repo>.
The first cell is a JavaScript magic cell that programmatically saves the current notebook, ensuring that any changes from the automated run are written to disk (in this case, inside the Docker container on the Google VM).
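As a sketch, such a cell typically looks like the following, assuming the classic Jupyter Notebook front-end; the exact contents of the cell in index.ipynb may differ.

```
%%javascript
// Ask the notebook front-end to write the current notebook, including
// freshly executed outputs, to disk before converting it to HTML.
IPython.notebook.save_notebook();
```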
Then, the second cell runs nbconvert to export the notebook as a clean HTML file, removing code inputs and some outputs via the --TagRemovePreprocessor flags and keeping only the Markdown cells along with the styled Pandas DataFrame for display.
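As a rough sketch, the underlying nbconvert call looks something like this (run from a notebook cell with the ! prefix, as described in the next section, or from a terminal). The tag names remove_cell and remove_input are placeholders for whatever tags the notebook’s cells actually carry.

```
jupyter nbconvert index.ipynb --to html --no-prompt \
    --TagRemovePreprocessor.enabled=True \
    --TagRemovePreprocessor.remove_cell_tags='{"remove_cell"}' \
    --TagRemovePreprocessor.remove_input_tags='{"remove_input"}'
```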
6. Automated Git Deployment
Finally, the notebook commits and pushes changes automatically, making use of the fact that you can run a terminal command directly within a code cell by prefixing it with an exclamation mark !
.
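A minimal sketch of such a cell is shown below; the commit message is arbitrary, and the bare git push assumes the branch’s upstream was set once beforehand (e.g. with git push -u origin gh-pages).

```
# Stage, commit, and push whatever branch is currently checked out
# (gh-pages, per the setup in Step 5).
!git add .
!git commit -m "Automated data update"
!git push
```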
Step 3: Papermill Execution Script
So far, we’ve created a Jupyter Notebook that will scrape the COVID metrics and display them in a nicely-formatted table for web viewing. This is all well and fine, but for the table to be updated with new data, the Jupyter Notebook needs to be run. This is where the package Papermill comes in.
Psst. If you use the Docker container from Step 1.2, the package papermill
will already be installed in your compute environment.
Create a file called execute_notebook.py to automate notebook execution. Save it in the directory that contains your cloned repo (here, /home/alice_lepissier); the cron job in Step 4 runs it from there, so the relative path to COVID-SB/index.ipynb resolves correctly.
#!/usr/bin/env python3
import os

import papermill as pm

# Run index.ipynb in place: the output overwrites the input, so the repo
# only ever holds the latest notebook (and the index.html generated from
# it). cwd is the run directory, which contains the cloned COVID-SB repo.
cwd = os.getcwd()
pm.execute_notebook(cwd + "/COVID-SB/index.ipynb",
                    cwd + "/COVID-SB/index.ipynb")
papermill executes the entire notebook programmatically, running all cells in sequence and recording any cell errors in the output notebook. This is what allows us to automate the Jupyter notebook without manual intervention.
The execute_notebook()
command takes two arguments: input_file
and output_file
. Since the persistent address of the self-updating webpage will render index.html
, we want the file index.ipynb
to overwrite itself.
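As an aside, although this project doesn’t need it, papermill can also inject parameters into a cell tagged parameters, which comes in handy if you adapt this pipeline to several scraping targets. A minimal sketch, where the county parameter is purely illustrative:

```python
import papermill as pm

# Inject a value into the notebook's "parameters"-tagged cell.
pm.execute_notebook(
    "index.ipynb",
    "index.ipynb",
    parameters={"county": "Santa Barbara"},
)
```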
The logic inside the Jupyter notebook scrapes (Step 2.2) and appends (Step 2.3) new data to the table, and then converts (Step 2.5) the current iteration of the notebook into an index.html
file, before pushing itself (Step 2.6) to the GitHub repo that serves this index.html
file using GitHub Pages.
So there will only ever be one index.ipynb
and one corresponding index.html
at any time in the repo. Those files are dynamically updated each day (or whatever increment you choose).
Read on to learn how to set up the automatic deployment of the notebook using GitHub Pages and cron
.
Step 4: Cron Job Automation
If you run python execute_notebook.py
in your terminal, you’ll see that a fresh row with new data has been appended to the table and the file index.html
has been updated.
Now, we can automate this process with Linux’s built-in scheduler, cron.
When you run crontab -e
(in your terminal, not in a notebook cell), your user’s cron table opens up in a text editor. Here you can add commands, one per line, to be run at scheduled times. For example:
PATH="/home/alice_lepissier/.local/bin:/usr/local/bin:/usr/bin:/bin"
0 12 * * * cd /home/alice_lepissier && python3 execute_notebook.py
The syntax for cron
is a bit esoteric. What’s happening here:
- The first line sets the PATH variable so cron knows where to find your programs, namely the Python executable. Unlike a normal shell, cron runs in a stripped-down environment, so it’s safest to define this explicitly.
- The second line is the actual job. The 0 12 * * * part is cron’s time syntax, and corresponds to <minute> <hour> <day of month> <month> <day of week>. So we have:
  - 0 minute (0 = on the hour)
  - 12 hour in 24-hour time (12 = noon)
  - the * in the remaining fields means “every day, every month, every weekday”
- After the schedule is set, the command changes directory to /home/alice_lepissier (which is where execute_notebook.py lives in this tutorial; update your path accordingly) and executes the Python script with python3 execute_notebook.py.
This setup runs the notebook every day at noon, ensuring fresh data is scraped and keeping the webpage updated, without having to lift a finger.
If you want to add or edit scheduled jobs in the future, simply run crontab -e again to open up your personal cron table.
Step 5: GitHub Pages Deployment
We’re almost there! We’ve got a self-updating dashboard in the form of index.html
- now let’s make it viewable in the browser with GitHub Pages.
The magic happens through GitHub Pages hosting:
- Repository setup: Your repository (e.g., COVID-SB) contains the notebook and generated HTML.
- Publish with GitHub Pages: Enable GitHub Pages for the repository, pointing to the gh-pages branch.
GitHub Pages allows you to publish sites from several repos: there is a main website at username.github.io (more often than not, this will be your personal website), which you can publish from any branch. To publish additional repos, the source files (e.g., index.html) must be pushed to a dedicated gh-pages branch.
You just need to run this once (in the terminal of your VM): git checkout -b gh-pages
. This creates a new branch gh-pages
and switches you to it immediately. You can verify this by running git branch
which will list all local branches and highlight the one you’re currently on with *
.
You can see that the Jupyter Notebook cell responsible for deployment simply stages all changes, commits them, and pushes the current branch to GitHub (git add .
means “add everything in the current directory”). Because this deployment cell always pushes to whatever branch you’re on, you only need to check out gh-pages
once during setup. After that, every automated update will be committed to and pushed from this branch without any extra steps.
You also need to enable publishing by going into the settings of your repo on the GitHub website. The additional repo will then be accessible at username.github.io/repo-name
.
- Custom domain (optional): If you have a custom domain pointing to username.github.io, the URL for your published dashboard will be yourdomain.com/repo-name.
Since my custom domain alicelepissier.com
points to walice.github.io
, the COVID-SB repository becomes automatically accessible at alicelepissier.com/COVID-SB
, displaying the index.html
file as a live webpage.
GitHub Pages offers excellent free hosting, with the important constraint that it only serves static content. This means it does not support server-side processing or dynamic web applications.
With the Docker container and the Google Sheets API, we have essentially outsourced the execution and database-management parts of the pipeline, which live persistently on a VM on Google Cloud. The file index.html remains static and is pushed automatically to the repo, where GitHub Pages serves it as a live dashboard.
The Complete Automation Pipeline
Here is how all the pieces work together:
1. Cron job triggers execute_notebook.py daily at noon
2. Papermill executes the Jupyter notebook automatically
3. Notebook scrapes fresh data from the CA Department of Public Health
4. Data gets stored in Google Sheets via API
5. Notebook converts itself to HTML with a styled table
6. Git commits and pushes changes automatically
7. GitHub Pages serves the updated HTML as a live webpage
The result is a completely hands-off system that maintains an up-to-date COVID tracking dashboard without any manual intervention.
Key Takeaways
This automation approach demonstrates several powerful concepts:
- Notebook-driven automation: Jupyter notebooks aren’t just for exploration; they can be production automation tools
- HTML conversion: The nbconvert command transforms notebooks into deployable web content
- API integration: Google Sheets API provides simple data persistence without database setup
- Git automation: Automated commits enable continuous deployment workflows
- VM scheduling: Cloud VMs with cron jobs provide reliable, scheduled execution
The beauty of this setup is its simplicity - no complex deployment pipelines, just a notebook that knows how to update itself and push changes to a hosted webpage.
You can adapt this pipeline for any regular data scraping task, whether you’re tracking stock prices, weather data, or social media metrics. The key is combining notebook automation with web deployment to create self-maintaining data dashboards.
Project Files
- Live dashboard: https://alicelepissier.com/COVID-SB
- GitHub repository: https://github.com/walice/COVID-SB
TL;DR. This is the complete Notebook file index.ipynb
:
[^1]: You can also create the repository locally and then add GitHub as a remote origin later, but cloning from GitHub is simpler.