Analytics API
The struggle of maintaining a single performant, secure, and reliable data endpoint is real, especially when everyone needs to select the same metric and dimension and get a number they all agree on, right?
A unified, analytical API solves this.
# What is an Analytics API?
Now that we’ve seen what GraphQL can do, we’ll discuss building an API that takes data engineering to the next level, which I call the Analytics API in this article. The API will empower all stakeholders to use a single source for accessing analytics data in a consistent, semantically decoupled way (and if you know a better name, please let me know!).

The Analytics API architecture with a single GraphQL endpoint | Image from Building an Analytics API with GraphQL: The Next Level of Data Engineering? | ssp.sh
The Analytics API consists of five main components, where GraphQL is the natural fit for the gateway API and the query interface. Besides that, the SQL Connector connects legacy or traditional BI systems that talk SQL natively. The metrics or business logic, also called Headless BI, is stored in a Metrics Store. If you’re in a large organization with a lot of variety, it’s helpful to have a data catalog that helps you discover your data and add owners, comments, ratings, and other metadata to datasets so you can navigate between them. The orchestrator updates your content in the data stores consistently and reliably. More about each component a bit later.
# Components of an Analytics API
Let’s now look into each component in more detail and what each does.
# API and Query Engine
The first component of the Analytics API is the interface and the Query Engine. This interface is the single GraphQL endpoint that all tools access. Call it a proxy, router, or gateway: it forwards every query, mutation, or subscription to the correct service or pipeline.
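To make the routing idea concrete, here is a minimal sketch of such a gateway in plain Python. It is only an illustration under my own assumptions: the three handler functions, the ROUTES table, and the GraphQL field names in the usage example are hypothetical, and a real gateway would use a proper GraphQL parser instead of a regular expression (this sketch ignores comments and multi-operation documents).

```python
import re
from typing import Callable, Dict

# Hypothetical downstream services; the names are illustrative only.
def query_engine(document: str) -> str:
    return f"query-engine handled: {document}"

def orchestrator(document: str) -> str:
    return f"orchestrator handled: {document}"

def event_stream(document: str) -> str:
    return f"event-stream handled: {document}"

# Assumed mapping: reads go to the query engine, updates to the
# orchestrator, and subscriptions to some event stream.
ROUTES: Dict[str, Callable[[str], str]] = {
    "query": query_engine,
    "mutation": orchestrator,
    "subscription": event_stream,
}

_OPERATION = re.compile(r"\s*(query|mutation|subscription)\b")

def operation_type(document: str) -> str:
    """Return the operation type of a single-operation GraphQL document."""
    # The shorthand form `{ ... }` is a query by definition.
    if document.lstrip().startswith("{"):
        return "query"
    match = _OPERATION.match(document)
    if match is None:
        raise ValueError("unrecognized GraphQL operation")
    return match.group(1)

def route(document: str) -> str:
    """Forward the document to the service registered for its operation type."""
    return ROUTES[operation_type(document)](document)

if __name__ == "__main__":
    print(route("{ metrics { name } }"))
    print(route('mutation { launchPipeline(name: "daily") { id } }'))
```

The point of the design is that clients only ever see one endpoint, while the decision about which backend answers stays inside the gateway.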
The query engine helps if you have centrally calculated measures or any data stores that do not speak SQL; it translates the GraphQL query into that store's specific query language. The critical difference from the SQL Connector is that it uses more advanced, general-purpose patterns to query data. E.g., instead of writing SELECT COUNT(DISTINCT userid) AS distinct_users FROM customers, we would issue a more generalized query that references a centrally defined metric such as distinctUsers by name.
For that, we need an intermediate layer that translates the generic query into an actual SQL query: the Query Engine.
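The translation step can be sketched roughly as follows. This is a minimal illustration under my own assumptions, not how any specific Headless BI tool implements it: the Metric and MetricQuery classes, the in-memory METRICS store, and the to_sql function are names made up for this example. Only the distinctUsers metric and the customers query come from this section.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class Metric:
    name: str        # name exposed through the API, e.g. "distinctUsers"
    expression: str  # SQL aggregation, defined once and stored centrally
    alias: str       # column name in the generated SQL

# Hypothetical in-memory metrics store; a real one would be a service.
METRICS: Dict[str, Metric] = {
    "distinctUsers": Metric(
        name="distinctUsers",
        expression="COUNT(DISTINCT userid)",
        alias="distinct_users",
    ),
}

@dataclass(frozen=True)
class MetricQuery:
    metrics: List[str]  # metric names requested by the client
    source: str         # dataset to query

def to_sql(query: MetricQuery) -> str:
    """Translate a generic metric query into a concrete SQL statement."""
    # Reject anything that is not a plain identifier to avoid SQL injection.
    if not query.source.isidentifier():
        raise ValueError(f"invalid source name: {query.source!r}")
    select_list = []
    for name in query.metrics:
        metric = METRICS.get(name)
        if metric is None:
            raise KeyError(f"unknown metric: {name}")
        select_list.append(f"{metric.expression} AS {metric.alias}")
    return f"SELECT {', '.join(select_list)} FROM {query.source}"

if __name__ == "__main__":
    print(to_sql(MetricQuery(metrics=["distinctUsers"], source="customers")))
    # SELECT COUNT(DISTINCT userid) AS distinct_users FROM customers
```

The client only names the metric; the aggregation itself lives in one place, so changing the definition of distinctUsers changes it for every consumer at once.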
I hope you notice the benefits and the small revolution for all business intelligence engineers here. We have one definition instead of writing long and complex queries for all data stores with slightly different syntax. And rather than defining the metrics such as distinctUsers in various places, we store it once and apply it to all systems. No need to worry if you got the latest version or if anyone changed the calculation. More on how you store one metric definition centrally in the next chapter.
We’re seeing more abstractions emerging in the transform layer. The metrics layer (popularised by Airbnb’s Minerva, Transform.co, and MetriQL), feature engineering frameworks (closer to MLops), A-B Testing frameworks, and a cambrian explosion of homegrown computation frameworks of all shapes and flavours. Call this “data middleware”, “parametric pipelining” or “computation framework”, but this area is starting to take shape. From How the Modern Data Stack is Reshaping Data Engineering.
As seen in the Analytics API image above, the Query Engine integrates through GraphQL with the other components, either to read data from the metrics store and the data catalog or to trigger an update through the orchestrator. No single tool covers all of this yet; the closest are the Headless BI tools, which implement only certain parts. In The Recent Hype Around Headless BI chapter, you can find more about them.
# Metrics Layer
See more Metrics Layer.
# Data Catalog
See more Data Catalog.
# Orchestration
The Orchestration part is where most of the business logic and transformations land. Instead of building everything into the Query Engine directly in GraphQL, it’s better to use a proper tool to reuse code and integrate it more effectively.
I see Dagster as the modern business rule engine, where you express the logic in Python code, making it testable and scalable compared to no-code/low-code approaches. Dagster offers tons of tools, such as resources, to capture reusable code, including connecting to Druid, creating a delta table, and starting a Spark job, all of which are used in the pipelines. Another building block in the Analytics API is an Op, which condenses your business logic as functional tasks within a data pipeline. It is well-defined, with typed inputs and outputs, and uses context such as the above resources, making it easy to run a Spark job as part of an op.
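To illustrate why this pattern is easy to test, here is a plain-Python sketch of the idea: a typed task that receives its resources through a context. To be clear, this is not Dagster's API. FakeSparkResource, Context, and count_distinct_users are hypothetical names for this example; in Dagster you would use its own op and resource constructs instead.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FakeSparkResource:
    """Stand-in for a reusable resource, e.g. a wrapper that submits Spark jobs."""
    submitted: List[str] = field(default_factory=list)

    def submit(self, job_name: str) -> None:
        # A real resource would start a Spark job; this one only records the call.
        self.submitted.append(job_name)

@dataclass
class Context:
    """Carries the resources a task may use, so they can be swapped in tests."""
    spark: FakeSparkResource

def count_distinct_users(context: Context, user_ids: List[str]) -> int:
    """Op-like task: typed input and output, resources taken from the context."""
    context.spark.submit("count_distinct_users")
    return len(set(user_ids))

if __name__ == "__main__":
    ctx = Context(spark=FakeSparkResource())
    print(count_distinct_users(ctx, ["u1", "u2", "u1"]))  # 2
    print(ctx.spark.submitted)  # ['count_distinct_users']
```

Because the business logic never creates its own connections, a test can pass in a fake resource and assert on both the result and the calls made.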
The integration within the Analytics API happens through GraphQL, as Dagster has a GraphQL interface built in. Dagster uses this interface to query various metadata, start pipelines/sensors (via mutations), or subscribe to specific information. Side-note: This does not come out of thin air, as the founder of Dagster, Nick Schrock, is a co-creator of GraphQL 😉. Instead of running and using the Dagster UI, we use that interface for developers and abstract it away with the Analytics API.
# SQL Connector
SQL is the data language besides Python, as elaborated in earlier articles. That’s why we need to provide an interface for that as well. The SQL Connector integrates all BI, SQL-speaking, or legacy tools. For example, the connector mainly implements an ODBC or JDBC driver, such as Avatica, which is built on Apache Calcite and used by Apache Druid. With that, we have a way to interface with ANSI SQL, including all our metrics and dimensions in the metrics store, with no additional effort on the accessing side as long as the tools talk SQL.
# Further Reads
Read my full article at Building an Analytics API with GraphQL: The Next Level of Data Engineering?.
Other related notes:
Origin:
Building an Analytics API with GraphQL: The Next Level of Data Engineering? | ssp.sh
References: Open Semantic Interchange (OSI) and Semantic Layer