Short Rant about Databricks Notebooks in Production

When I first discovered Data Science in 2018, I was programming in R. R and RStudio (now Posit) have created a beautiful infrastructure for working with notebooks: RMarkdown.¹

Notebooks are great for data exploration and analysis: you can mix languages (R, Python, SQL) and text (markdown, LaTeX) in one document and render it into one HTML report. Because notebooks were so popular with data scientists, they became the default file type to work in on Databricks.

Five years later, as a data engineer, I find notebooks rather annoying to work with. Especially in production. While there are some nifty features, like widgets that allow passing parameters to notebooks, I often ask myself “couldn’t this be a script?”
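To make the "couldn't this be a script?" question concrete, here is a minimal sketch of passing parameters to a plain Python script with argparse, roughly the job a widget lookup such as dbutils.widgets.get does in a notebook. The parameter names (run-date, env) and the job itself are hypothetical, chosen only for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse job parameters from the command line (or from a given list)."""
    parser = argparse.ArgumentParser(description="Hypothetical daily load job")
    # Hypothetical parameters, standing in for notebook widgets.
    parser.add_argument("--run-date", required=True, help="date to process")
    parser.add_argument("--env", default="dev", choices=["dev", "prod"])
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Real work would go here; this sketch only reports what it received.
    print(f"Processing {args.run_date} in {args.env}")


# Example call with explicit arguments instead of reading sys.argv.
main(["--run-date", "2023-01-31", "--env", "prod"])
```

In an actual script you would call main() without arguments under the usual `if __name__ == "__main__":` guard, so the values come from the command line.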

My main pain with notebooks comes from version control systems, though. Why can’t Azure DevOps render notebooks so that I can compare diffs? Instead it shows me loads of raw JSON that I have to wade through to find the change I actually care about.
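To illustrate the noise, here is a rough sketch that pulls only the code out of a notebook's JSON, assuming the notebook is stored in the Jupyter .ipynb layout (a top-level "cells" list whose entries have "cell_type" and "source"). The helper name code_cells and the tiny example notebook are made up for this post.

```python
import json


def code_cells(notebook_json: str) -> list[str]:
    """Return the source of each code cell, ignoring outputs and metadata."""
    notebook = json.loads(notebook_json)
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The source is stored either as one string or as a list of lines.
        sources.append(source if isinstance(source, str) else "".join(source))
    return sources


# Tiny hand-written notebook, not real project data.
example = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {
            "cell_type": "code",
            "source": ["x = 1\n", "print(x)"],
            "outputs": [{"output_type": "stream", "text": ["1\n"]}],
        },
    ],
    "nbformat": 4,
    "nbformat_minor": 5,
})
print(code_cells(example))
```

Everything else in that file (outputs, metadata, format version) is what clutters the diff, and a plain script simply never has it.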

Now I default to Python scripts whenever I’m working in Databricks.

¹ In fact, I wrote my master's thesis entirely in RMarkdown and LaTeX.