Use notebooks to open up the data silos
Data notebooks, popularised by Jupyter, are a centralised, browser-based IDE for doing collaborative work.
Notebooks exist today in several evolutions:
- The classic notebooks that are popular and in heavy use today.
- Newer generations of Jupyter-style notebooks with more integrated features and a powerful cloud back end.
- Spreadsheet hybrids (a newer category that resembles notebooks, since they can run Python and JavaScript inside cells as well).
With notebooks, everyone has access to the data and can explore it and write analytics with all the advanced statistical and data-science libraries. They are straightforward to set up: there is no local environment or IDE to install, and they work in your browser of choice. You can spread your visualisations by quickly sharing a link, and you can work together in the same notebook. With integrated Markdown cells, you can naturally explain the whole story and the thought process behind the numbers.
The downside of notebooks is that your code gets duplicated and scattered around. Much like everyone keeping their own Excel files, everyone will start their own notebooks (though unlike with Excel, the data itself stays stored centrally). I suggest you set strict, clear rules or governance around notebooks and start integrating stable notebooks into your common data pipelines.
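One lightweight form of that governance is to pull logic that keeps getting copy-pasted between notebooks into a small shared module, which both notebooks and pipeline jobs import. A minimal sketch; the module and function names here are hypothetical, not from any particular library:

```python
# analytics_common.py -- a hypothetical shared module; notebooks and
# pipeline jobs both import from here instead of copy-pasting cells.
from statistics import mean, median


def summarise_revenue(amounts):
    """Return the summary stats every team notebook used to recompute."""
    if not amounts:
        return {"count": 0, "total": 0.0, "mean": None, "median": None}
    return {
        "count": len(amounts),
        "total": sum(amounts),
        "mean": mean(amounts),
        "median": median(amounts),
    }


# In a notebook cell (or a pipeline step) this becomes one import and one call:
stats = summarise_revenue([120.0, 80.0, 100.0])
```

When a notebook stabilises, its cells shrink to imports and calls like this, and the shared function is the single place to fix bugs or add tests.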
For a fast and smooth transition, have a look at Papermill, which allows you to parameterise, execute and analyse notebooks. A step further is to use Dagster: either run notebooks as steps of your pipeline (Dagster integrates with Papermill) or port them fully into the pipeline as dedicated solids. This way you avoid code duplication and can reuse solids.
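As a rough sketch of how Papermill parameterisation looks in practice: you tag one cell in the notebook as `parameters`, and Papermill injects new values into it at run time. The notebook filenames and parameter names below are made up for illustration:

```shell
# Run the same analysis notebook twice with different parameters;
# papermill writes each executed copy to a separate output notebook.
papermill sales_report.ipynb out/sales_eu.ipynb -p region EU -p year 2021
papermill sales_report.ipynb out/sales_us.ipynb -p region US -p year 2021
```

The executed output notebooks keep all cell results, so they double as shareable, reviewable run artefacts.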
See also Machine Learning part – Jupyter Notebook.