# Traditional OLAP Cube Replacements
Depending on whether you want a one-to-one replacement, a faster cloud backend, a virtualized Semantic Layer, or a serviced cloud provider, I categorized the different technologies into the following groups:
# Modern OLAP Systems
With modern OLAP technologies, you replace your cubes one to one with another technology. You don't change anything in your current architecture; you simply swap your cubes for a modern, big-data-optimized technology that focuses on the fastest possible query response time. See the Appendix for a comparison of modern OLAP technologies.
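To make this more concrete, here is a minimal sketch of what querying such an engine can look like, using Apache Druid's SQL endpoint as an example. The broker address and the `sales` datasource are assumptions for illustration only:

```python
# Minimal sketch: querying a modern OLAP engine (here Apache Druid) over its SQL
# endpoint. The broker at localhost:8082 and the "sales" datasource are assumed.
import requests

DRUID_SQL_URL = "http://localhost:8082/druid/v2/sql/"

query = """
SELECT channel, SUM(revenue) AS total_revenue
FROM sales
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY channel
ORDER BY total_revenue DESC
"""

# Druid accepts SQL as a JSON payload and returns the rows as JSON objects.
response = requests.post(DRUID_SQL_URL, json={"query": query})
response.raise_for_status()

for row in response.json():
    print(row["channel"], row["total_revenue"])
```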
# Cloud Data Warehouses
Another approach is to move your on-premise Data Warehouse to a Cloud Data Warehouse to get more scalability, more speed, and better availability. This option is best suited for you if you do not necessarily need the fastest response time and you do not have petabytes of data. The idea is to speed up your DWH and skip the cube layer entirely. This way, you save a lot of time on the development, processing, and maintenance of cubes. On the other hand, you pay with higher query latency while you build your dashboards. If you mainly have reports anyway, which can be run beforehand, then this is perfect for you.
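As a rough sketch of what "skipping the cube layer" means in practice, here is an illustrative query run straight against a cloud DWH (BigQuery in this example). The `analytics.orders` table is hypothetical, and credentials are assumed to be configured in your environment:

```python
# Sketch: querying a cloud data warehouse directly instead of a pre-built cube.
# Uses the BigQuery Python client; the `analytics.orders` table is hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT country, DATE_TRUNC(order_date, MONTH) AS month, SUM(amount) AS revenue
FROM `analytics.orders`
GROUP BY country, month
ORDER BY month, revenue DESC
"""

# The warehouse does the aggregation at query time; there is no cube processing step.
for row in client.query(sql).result():
    print(row.country, row.month, row.revenue)
```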
# Data Virtualization
If you have many source systems built on different technologies, but all of them respond reasonably fast and you don't run a lot of operational applications on them, you might consider Data Virtualisation. This way, you don't move, copy, and pre-aggregate data; instead, you have a semantic middle layer where you create your business models (like cubes), and only when you query this data virtualisation layer does it query the underlying data sources. Some of these tools use Apache Arrow technology, which caches and optimises a lot in memory for you, so that you still get astonishingly fast response times.
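To give a feel for the Arrow part, here is a small sketch of the underlying idea: the data is held once in a columnar, in-memory Arrow table and aggregated there, without copying it into yet another system. The file name and column names are made up for illustration:

```python
# Sketch of the idea behind Arrow-based caching: data sits in a columnar,
# in-memory table and is aggregated in place. Path and columns are hypothetical.
import pyarrow.parquet as pq

# Load the source once; the Arrow table lives in memory in columnar form.
table = pq.read_table("orders.parquet")

# Aggregate directly on the in-memory table (group by region, sum revenue).
summary = table.group_by("region").aggregate([("revenue", "sum")])
print(summary.to_pandas())
```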
# Serviced Cloud and Analytics
The last option is to buy from serviced cloud storage or analytics vendors like Looker, Sisense, or Panoply. These are very easy to use and create implicit cubes for you: you just join your data together in the semantic layer of the respective tool, and all the rest is handled by the tool, including the reporting and dashboarding. This way you are more dependent on the individual vendor, and it might also be more expensive (prices are not always transparent and can be hard to get), but you are up and running very fast.
# Additional featured Tools
If you go one step further and choose one of the above technologies, you will most probably run into the need to handle intermediate steps in between: preparing, wrangling, cleaning, and copying the data from one system or format to another, especially if you are working with unstructured data, which needs to be brought into a structured form at the end in one way or another. To keep the overview and handle all these challenging tasks, you need an orchestrator and a cluster-computing framework, which I will explain in the two following chapters, to complete the full architecture.
# Orchestrator
After you have chosen your group and even the particular technology you want to go for, you will want an orchestrator. This is one of the most critical pieces, and it gets forgotten most of the time.
# Why would you need this?
As companies grow, their workflows become more complex, comprising many processes with intricate dependencies that require increased monitoring, troubleshooting, and maintenance. Without a clear sense of data lineage, accountability issues can arise, and operational metadata can be lost. This is where these tools come into play with their directed acyclic graphs (DAGs), data pipelines, and workflow managers.
Complex workflows can be represented through DAGs. DAGs are graphs where information must travel between the vertices in a specific direction, but there is no way for information to travel through the graph in a loop that circles back to the starting point. The building blocks of DAGs are data pipelines: chains of processes where the output of one process becomes the input for the next. Building these pipelines can be tricky, but luckily there are several open-source workflow managers available, allowing programmers to focus on individual tasks and dependencies:
- Apache Airflow (created at Airbnb)
- Luigi (created at Spotify)
- Azkaban (created at LinkedIn)
- Apache Oozie (for Hadoop systems)
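As a minimal sketch of what such a DAG looks like in practice, here is an illustrative Apache Airflow (2.x) pipeline with three placeholder tasks. The DAG id, schedule, and task bodies are assumptions for demonstration only:

```python
# Minimal sketch of a DAG in Apache Airflow (2.x style): three tasks whose
# dependencies form a directed acyclic graph. Function bodies are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull raw data from the source system")


def transform():
    print("clean and wrangle the extracted data")


def load():
    print("write the result into the warehouse / OLAP store")


with DAG(
    dag_id="example_elt_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # The >> operator defines the edges of the DAG: extract -> transform -> load.
    extract_task >> transform_task >> load_task
```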
# Cluster-computing frameworks
To complete the list, we also need to address the cluster-computing frameworks, the best known of which is probably Spark. Cluster-computing frameworks are unified analytics engines for large-scale data processing, which means you can wrangle, transform, and clean your data at scale with a lot of parallelization. These jobs can also be started and coordinated from the orchestrator tools mentioned above.
- Apache Spark (→ Databricks / Azure Databricks)
- Apache Flink (the main difference to Spark is that Flink was built from the ground up as a streaming product, whereas Spark added streaming to its product later)
- Dask (distributed Python with API compatibility for pandas, NumPy, and scikit-learn)
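For illustration, here is a small PySpark sketch of such a large-scale wrangling job. The input and output paths and the column names are hypothetical:

```python
# Sketch of large-scale wrangling with a cluster-computing framework (PySpark).
# Paths and column names are made up; the same job scales from laptop to cluster.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wrangle_orders").getOrCreate()

# Read, clean, and aggregate in parallel across the cluster.
orders = spark.read.parquet("s3://my-bucket/raw/orders/")
cleaned = (
    orders.dropna(subset=["order_id"])
    .withColumn("amount", F.col("amount").cast("double"))
)

daily_revenue = (
    cleaned.groupBy(F.to_date("order_ts").alias("day"))
    .agg(F.sum("amount").alias("revenue"))
    .orderBy("day")
)

daily_revenue.write.mode("overwrite").parquet("s3://my-bucket/curated/daily_revenue/")
```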
Origin:
OLAP, what’s coming next? | ssp.sh
References: Apache Druid