🧠 Second Brain



Use notebooks to open up the data silos

Last updated Feb 9, 2024

Data notebooks, popularized by Jupyter Notebooks, are a centralized, browser-based IDE for collaborative work.

Notebooks exist today in different stages of evolution:

  1. Notebooks that are popularized and in heavy use today.

With notebooks, everyone has access to the data and can explore it and write analytics with all the advanced statistical and data science libraries. They are straightforward to set up: there is no need to install a local environment or IDE, and they work in your browser of choice. You can spread your visualisations by quickly sharing a link, and you can work together in the same notebook. With integrated Markdown, you can naturally explain the whole story and thought process behind the numbers.

The downside of notebooks is that your code gets duplicated and scattered all over the place. A little bit like everyone having their own Excel files, everyone will start their own notebooks (though unlike Excel, the data itself is stored centrally). I suggest you set strict and clear rules or governance around notebooks and start integrating stable notebooks into your common data pipelines.

For a fast and smooth transition, have a look at Papermill, which allows you to parameterise, execute and analyse notebooks. A step further is to use Dagster and either run the notebooks as a step of your pipeline (Dagster integrates with Papermill) or incorporate them fully into your pipeline with dedicated solids. This way, you avoid code duplication and can reuse solids.
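To make the Papermill idea concrete, here is a minimal sketch of running one template notebook once per parameter set. The notebook name `sales_report.ipynb`, the parameters `region` and `run_date`, and the helpers `build_runs` and `execute_runs` are all hypothetical and only there for illustration; the one real library call is `papermill.execute_notebook`, which injects the given values after the notebook cell tagged `parameters`.

```python
from pathlib import Path


def build_runs(regions, run_date, output_dir="runs"):
    """Plan one (output_path, parameters) pair per region.

    The parameter names `region` and `run_date` are assumptions; they must
    match variables defined in the template's cell tagged `parameters`.
    """
    runs = []
    for region in regions:
        output_path = Path(output_dir) / f"sales_report_{region}_{run_date}.ipynb"
        parameters = {"region": region, "run_date": run_date}
        runs.append((output_path, parameters))
    return runs


def execute_runs(template_path, runs):
    """Execute the template notebook once per planned run."""
    # Imported lazily so the planning helper above works without Papermill.
    import papermill as pm

    for output_path, parameters in runs:
        output_path.parent.mkdir(parents=True, exist_ok=True)
        # Each run writes its own executed copy, so results stay reproducible.
        pm.execute_notebook(
            str(template_path),
            str(output_path),
            parameters=parameters,
        )
```

With Papermill installed and a template notebook in place, you would call something like `execute_runs("sales_report.ipynb", build_runs(["emea", "apac"], "2024-02-09"))`. Everyone then shares one template instead of copying the notebook for each variation, which is exactly the duplication you want to avoid.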

See also Machine Learning part – Jupyter Notebook.

# Orchestrating

Notebooks can be orchestrated with Papermill (which Dagster integrates with), or with newer tools such as orchest.io and others (see Data Orchestrators).

Origin: “Use notebooks to open up the data silos.”, Machine Learning part – Jupyter Notebook
References: Dagster, Jupyter Notebook, Databricks
Created 2023-02-27