
As data engineers, we spend countless hours combing through logs - tracking pipeline states, monitoring Spark cluster performance, reviewing SQL queries, investigating errors, and validating data quality. These logs are the lifeblood of our data platforms, but parsing and analyzing them efficiently remains a persistent challenge. This comprehensive guide explores why data stacks are fundamentally built on logs and why skilled log analysis is critical for the data engineer’s success.
Throughout this article, we’ll categorize the various log types and formats you’ll encounter in your daily work, compare popular analysis tools, and most importantly, demonstrate practical, code-driven examples of parsing complex logs using DuckDB. You’ll see how DuckDB’s super fast parsers and flexible SQL syntax make it an ideal tool for log analysis across various formats including JSON, CSV, and syslog files.
For those working with larger datasets, we’ll also show how to analyze massive JSON log datasets at scale with MotherDuck, providing optimized query patterns for common log analysis scenarios. Whether you’re troubleshooting pipeline failures, monitoring system health, or extracting insights from operational metadata, this guide will help you transform log analysis from a tedious chore into a powerful competitive advantage for your data team.
Understanding Log Types and Their Purpose in Data Engineering
The key questions for data engineering workloads are: What are we using logs for? What information do they contain? And what is each type of log specifically for?
Categories of logs (application logs, system logs, etc.)
There are many kinds of logs. To better understand them, we need to know who produces them. Let’s look at the categories of logs and the file formats they usually come in.
From a high-level perspective, we have different domains like application logs, system logs, error logs, and transaction logs:

As a data engineer, you’ll typically need to analyze several types of logs to monitor, troubleshoot, and optimize data pipelines and systems.
There are many more log types (such as security, perimeter device, Windows, or endpoint logs), but these are the major ones you’ll encounter most of the time:
- Operational Logs:
  - Application Logs: Track events within data processing applications, ETL tools, and analytics platforms, capturing pipeline execution details, transformations, and failures.
  - System Logs: Monitor infrastructure health when run in Kubernetes or similar platforms for data workloads, helping diagnose resource constraints and system-level failures.
  - Error Logs: Critical for troubleshooting failed data jobs and pipelines, identifying bottlenecks and failure points in workflows.
- Data Management Logs:
  - Data Pipeline Logs: Records from orchestration tools documenting each step of a pipeline; essential for reconstructing what happened and finding bugs when errors occur.
  - Transaction Logs: Track database operations and changes to ensure data integrity, critical for recovery and auditing.
  - Audit Logs: Document changes to data schemas, permissions, and configurations, essential for compliance and data governance.
  - IoT Logs: Capture data from Internet of Things devices and sensors.
- Security and Access Logs:
  - Access Logs: Monitor who’s accessing data systems and when, important for security and compliance.
  - Network Logs: Track data movement across systems, useful for monitoring transfer performance and detecting issues.
Different Types of Metadata
On a high level, we have different types of Metadata: social, technical, business, and operational. What we, as data engineers, mostly deal with are operational logs like job schedules, run times, data quality issues, and, most critically, error logs.

These operational data logs are called pipeline and execution metadata logs. They have certain formats and types (technical aspect), contain business terms in some cases, and have some social and business impact on the people and the organization.
Let’s now look at how these logs appear and what formats they use.
Data Types and Formats of Data Logs
What information does a log typically hold? Log files hold various data types, but two elements are almost always present: a timestamp and some kind of message (a log entry, error, or event description).
Further columns could include a user, event type (like a specific action or occurrence that triggered it), or running application (e.g., started within Airflow). Others include system errors and any metadata that helps debug the errors.
These logs come in all shapes, styles, and formats. Most common are structured logs for metadata as JSON or key-value pairs and plaintext-based logs for execution sequences often in syslog-like formats. The JSON format has the advantage of a flexible schema, meaning columns can change each time, and the producers don’t need to think about types or fit into a pre-defined structure—leaving that job to the analyst later.
A range of different log formats is shown below.
Structured Formats
- JSON: Most common. JSON provides a hierarchical structure with nested objects and arrays, making it ideal for complex logging needs while remaining machine-parsable.
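An illustrative JSON entry for a pipeline run might look like this (field names are made up):

```json
{"timestamp": "2024-03-18T14:22:05Z", "level": "ERROR", "service": "etl-orders", "run_id": "a1b2c3", "message": "Failed to load table orders_raw", "rows_processed": 10452}
```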
- CSV/TSV: Used for logging tabular data. This format is compact and easily imported into spreadsheet software or databases, though it lacks descriptive field names unless headers are included.
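An illustrative CSV log with a header row:

```csv
timestamp,level,service,message
2024-03-18T14:22:05Z,ERROR,etl-orders,Failed to load table orders_raw
2024-03-18T14:22:09Z,INFO,etl-orders,Retrying load (attempt 2 of 3)
```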
- Key-Value Pairs: Common in many logging systems. This format offers a good balance between human readability and machine parseability while remaining flat and avoiding the overhead of more structured formats.
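An illustrative key-value log line:

```text
ts=2024-03-18T14:22:05Z level=error service=etl-orders run_id=a1b2c3 msg="Failed to load table orders_raw"
```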
Semi-structured Formats
- Syslog Format: A standardized format that includes a priority field, a header with information like timestamps and hostnames, and the actual message content. This format allows for centralized logging and easy analysis of logs across different systems and applications.
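An illustrative syslog line with the priority field, the header (timestamp and hostname), and the message:

```text
<34>Mar 18 14:22:05 etl-worker-01 airflow[2342]: Task orders_load failed with exit code 1
```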
Common Event Format (CEF)
- CEF: Used in security and event management systems. This vendor-neutral format was developed by ArcSight and has become widely adopted for security event interchange between different security products and security information and event management (SIEM) systems.
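An illustrative CEF entry (vendor, product, and extension fields are made up):

```text
CEF:0|ExampleVendor|PipelineGuard|1.0|100|Unauthorized table access|7|src=10.0.0.5 duser=analyst_42 msg=SELECT on restricted schema
```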
.log File
The .log file is a common file extension used for logging data, but not a format itself. The .log extension indicates that the file contains log information, while the actual content could be any of the previously mentioned formats.
Why Data Stacks Are Built on Logs
As data engineers, we have to deal with all of these various log types and formats because our data pipelines touch the full lifecycle of a business. From reading from many different source systems with potential network latencies or issues, to loading large tables that need more performance, to the whole ETL process where we transform data and need to make sure we don’t compromise granularity or aggregated KPIs with duplications or incorrect SQL statements.
Data stacks and data platforms are essentially built around logs. We can’t debug the data stack; the logs are our way to find the error later on. Software engineers can debug more easily, as they are in control of what the user can and can’t do. But data is different, constantly changing and flowing from A to B. We have external producers that we can’t influence, and the business and requirements are changing too.
On the consumer side, we have the visualization tools that need to be fast and nice looking. We have security, data management, DevOps on how we deploy it, the modeling and architecture part, and applying software engineering best practices along with versioning, CI/CD, and code deployments. All of this happens under the umbrella of data pipelines and is part of the Data Engineering Lifecycle. On each level, we can have different data logs, performance and monitoring logs, data quality checks, and result sets of running pipelines with their sub-tasks.
That’s why our data stacks run on metadata, and they are as important today as they were two decades ago. However, with more sophisticated tools, we can now analyze and present them more efficiently.
With a declarative, code-first pipeline, the outcome of each run, for example success, is written in the log. This is possible with non-declarative and UI-first solutions too, but it is more natural for the code-first solution.
Log Analysis Use Cases and When to Use Log Files
What are we doing when we analyze logs? Data engineers typically focus on several key use cases:
Debugging is the most common use case. As we can’t simply use a debugger with complex data pipelines, we must log our way through problems. Good logs should identify errors clearly. Since we work with complex business logic most of the time, on top of the technical stack, this requires significant expertise from data engineers and is where we can spend much of our time. But the better the logs, the less we need to search, and the more we can focus our time on fixing the bugs.
Tracing helps pinpoint the origin of errors in pipelines with many sub-tasks, while performance analysis uses logs from BI tools or orchestrators like dbt to identify bottlenecks.
Error pattern analysis examines changes over time to prevent recurring issues.
For monitoring, we often load logs into tools like DataDog, Datafold, ELK Stack, or InfluxDB, standardize metrics with Prometheus, and visualize using Grafana. For more, see the next chapter.
Tools and Solutions for Effective Log Analysis
The tools we use to analyze the logs have changed over time and have become more numerous but also better in quality. Traditionally, we had to do all the log reporting manually. More recently, however, we have monitoring and observability tools with dedicated log analyzer capabilities included. These vary in their specific use cases, but all of them analyze some kind of log.
Here’s an overview of some of these tools, categorized along two dimensions: log analysis versus monitoring/observability, and the degree of automation versus manual effort required. Open-source tools are marked in green.

These tools fall into several categories:
- Auto-profiling solutions like Bigeye, Monte Carlo, and Metaplane offer automated monitoring with unique features ranging from ML-driven alerts to enterprise data lake integrations
- Pipeline testing tools such as Great Expectations, Soda, and dbt tests provide granular validation within data workflows
- Infrastructure monitoring platforms including DataDog and New Relic focus on system health and resource utilization
- Hybrid solutions like Databand and Unravel unify infrastructure monitoring with data-specific observability
DuckDB as the Ultimate Log Parser?
But how about using DuckDB as a log parser? Let’s imagine we have all the logs parked on an S3 storage or somewhere in our data warehouse. DuckDB is a very efficient tool for quickly analyzing the overall status.
Whereas the tools above mostly do real-time monitoring, analyzing what is happening every second and minute, DuckDB can be used for analytics on the overall state. This enables advanced log analysis techniques such as:
- Time-series analysis of log data
- Combining logs from multiple sources
- Creating dashboards and monitoring systems
DuckDB is the ultimate log parser. It can run with zero-copy, meaning you don’t need to ingest or insert logs into DuckDB first; you can read directly from your data lake on S3, from your Snowflake warehouse, or from your servers via HTTPS, all within a single binary.
DuckDB has one of the fastest JSON and CSV parsers. This comes in very handy, as we learned that most logs are in these exact formats. The ability to query multiple file formats with consistent SQL syntax and the local processing capabilities that reduce network overhead are just two other big advantages that make DuckDB a great tool for log parsing.
With MotherDuck as an extension of DuckDB, we can scale the log analysis when DuckDB alone can’t handle it, share quick analytics in a notebook, or share the data as a shared DuckDB database. You can scale up your parser without making the code more complex, just by using a different engine with the same syntax and semantics as DuckDB itself.
Practical Log Analytics: Analyzing Logs with DuckDB and MotherDuck
Below, we take a look at two datasets: the first with various formats, and the second with real-life JSON from Bluesky to benchmark larger log analytics.
Parsing Various Log Formats with DuckDB
Before we go any further, let’s analyze some logs to get a better understanding of what logs are and how they can look. The idea is to analyze completely different log files to understand how to parse them all with DuckDB using various strategies.
Parsing One Big Apache Log: From Unstructured Text to Actionable Insights
In this first example, we analyze one large log file called Apache.log with 56,481 lines and 4.90 MB (compressed as .gz). The size is small, but the log is only semi-structured: each line has a timestamp, an error type, and a message, plus there are outliers we need to deal with.
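A few illustrative lines of this shape (the [client ...] part, for example, is not present on every line):

```text
[Sun Dec 04 04:47:44 2005] [notice] workerEnv.init() ok /etc/httpd/conf/workers2.properties
[Sun Dec 04 04:51:08 2005] [error] [client 24.70.56.49] Directory index forbidden by Options directive: /var/www/html/
[Sun Dec 04 04:51:37 2005] [error] mod_jk child workerEnv in error state 6
```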
Remember, this is a good opportunity to use an LLM. If you give it a schema description plus the first 100 lines, it can do an excellent job of helping create the complex RegExp patterns needed to parse otherwise random-looking log files such as the Apache.log above. That is exactly what I did to generate the initial parsing query.
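A sketch of what that query can look like, reading each raw line into a single VARCHAR column and pulling the fields out with regexp_extract (the exact patterns and column names are assumptions based on the structure above):

```sql
SELECT
    regexp_extract(line, '^\[([^\]]+)\]', 1)          AS ts,         -- timestamp between the first brackets
    regexp_extract(line, '^\[[^\]]+\] \[(\w+)\]', 1)  AS level,      -- notice, error, warn, ...
    regexp_extract(line, '\[client ([0-9.]+)\]', 1)   AS client_ip,  -- empty string when no client IP is logged
    regexp_extract(line, '\] ([^\[].*)$', 1)          AS message     -- everything after the bracketed prefix fields
FROM read_csv('Apache.log.gz',
              columns = {'line': 'VARCHAR'},
              delim = '\t', header = false, quote = '');  -- a delimiter that never appears keeps each line in one column
```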
If we run this, we can check that the RegExp works and confirm that the timestamp, level, client IP, and message come out as expected.
Let’s now count the errors by client IP (when available) to get some insights. To do that, we create a table based on the above query to reuse and simplify the following query:
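For example, reusing the regular expressions from the sketch above:

```sql
CREATE TABLE apache_logs AS
SELECT
    regexp_extract(line, '^\[([^\]]+)\]', 1)          AS ts,
    regexp_extract(line, '^\[[^\]]+\] \[(\w+)\]', 1)  AS level,
    regexp_extract(line, '\[client ([0-9.]+)\]', 1)   AS client_ip,
    regexp_extract(line, '\] ([^\[].*)$', 1)          AS message
FROM read_csv('Apache.log.gz',
              columns = {'line': 'VARCHAR'},
              delim = '\t', header = false, quote = '');
```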
Then we can query the IP with the most errors:
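For instance (column names as sketched above):

```sql
SELECT client_ip, count(*) AS error_count
FROM apache_logs
WHERE level = 'error'
  AND client_ip <> ''
GROUP BY client_ip
ORDER BY error_count DESC
LIMIT 10;
```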
The result comes back in a couple of seconds.
Handling Big Data Logs: HDFS Example
Another example is the HDFS Logs that are available on this same GitHub repo. Let’s look at how DuckDB can handle HDFS logs, which are common in big data environments.
This dataset is 1.47 GB in size and has 11,175,629 lines; we only look at the single HDFS.log file, which has more than 11 million rows. If you want to follow along, download the file and unzip it. I unzipped it into ~/data/HDFS_v1.
Let’s now create a table again to simplify our querying:
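A sketch of what this can look like, assuming the usual <date> <time> <pid> <level> <component>: <message> layout of HDFS log lines (adjust the path and patterns to your setup):

```sql
CREATE TABLE hdfs_logs AS
SELECT
    regexp_extract(line, '^(\d{6}) \d{6}', 1)               AS log_date,   -- e.g. 081109
    regexp_extract(line, '^\d{6} (\d{6})', 1)               AS log_time,   -- e.g. 203615
    regexp_extract(line, '^\d{6} \d{6} \d+ (\w+)', 1)       AS level,      -- INFO, WARN, ...
    regexp_extract(line, '^\d{6} \d{6} \d+ \w+ (\S+):', 1)  AS component,  -- e.g. dfs.DataNode$PacketResponder
    regexp_extract(line, ': (.*)$', 1)                      AS message
FROM read_csv('~/data/HDFS_v1/HDFS.log',
              columns = {'line': 'VARCHAR'},
              delim = '\t', header = false, quote = '');
```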
If we check, we see that we have 11.18 million logs—querying this directly takes about 3 seconds on my MacBook M1.
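For example, counting the lines directly over the raw file:

```sql
SELECT count(*) AS log_lines
FROM read_csv('~/data/HDFS_v1/HDFS.log',
              columns = {'line': 'VARCHAR'},
              delim = '\t', header = false, quote = '');
```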
If we plan to query that data often, we could again create a TABLE, as shown above. Another interesting analysis is to look at block operations in these HDFS logs with an analytical query over our log table:
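A sketch of such a query, counting block-related lines per component (using the hdfs_logs table and columns sketched above):

```sql
SELECT
    component,
    count(*) AS block_operations
FROM hdfs_logs
WHERE message LIKE '%blk_%'   -- only lines that reference a block id
GROUP BY component
ORDER BY block_operations DESC;
```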
The result reveals the distribution of block operations across different HDFS components, with the NameSystem managing the most operations while DataNode components handle various aspects of data transfer and storage.
Or we identify potential failures with this query:
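For example, something along these lines:

```sql
SELECT
    component,
    count(*) AS suspicious_lines
FROM hdfs_logs
WHERE level IN ('WARN', 'ERROR', 'FATAL')
   OR message ILIKE '%exception%'
   OR message ILIKE '%failed%'
GROUP BY component
ORDER BY suspicious_lines DESC
LIMIT 10;
```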
As you can see, with a few simple queries you can either query your files directly or, if you have many files, create a table first, or even unnest JSON structures to improve query performance. More on this later.
JSON Log Analytics with Bluesky Data: Scale-Up If Needed
As DuckDB is an analytics tool, besides just parsing logs, we can also create analytics dashboards. In this demo, we cover two use cases: first, analyzing the logs sitting directly on S3, with no normalization or unnesting beforehand, once with DuckDB and once with MotherDuck.
Then we unnest JSON files and store them as struct or flat tables, and see how this affects the speed. For more complex log analysis, let’s examine JSON-formatted logs from Bluesky (real-world data), and see some benchmarks when it would make sense to use MotherDuck.
We can query the data like this quite easily:
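For instance, pointing read_json at a single file on S3 (the httpfs extension is needed for S3 access; the exact file name here is an assumption based on the naming used later in this article):

```sql
INSTALL httpfs;
LOAD httpfs;

SELECT *
FROM read_json('s3://clickhouse-public-datasets/bluesky/file_0001.json.gz')
LIMIT 5;
```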
The result comes back in 5-10 seconds for a single file.
So we can imagine that loading all 100 million rows (100 files), or even the full dataset of 1,000 million rows, needs a different mechanism. Loading the 100 million rows, about 12 GB of data, no longer runs on my MacBook M1 Max.
I tried downloading the 100 million rows locally and running the query over all or some of the files, but it didn’t finish in a useful time. You can see that DuckDB uses most of your resources, specifically the CPU (shown in btop):

The macOS Activity Monitor shows full CPU usage as well.
Here is the syntax to load partially (a couple of files) or load them all:
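A sketch, assuming the files were downloaded to ~/data/bluesky/:

```sql
-- A couple of files, using the list notation
SELECT count(*)
FROM read_json(['~/data/bluesky/file_0001.json.gz',
                '~/data/bluesky/file_0002.json.gz',
                '~/data/bluesky/file_0003.json.gz']);

-- All downloaded files, using a glob
SELECT count(*)
FROM read_json('~/data/bluesky/*.json.gz');
```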
Scaling Beyond Local Resources with MotherDuck
For this job, I used MotherDuck. It scales nicely without requiring syntax changes or purchasing a new laptop 😉. Plus, I can share the data set or the collaborative notebook. We can use MotherDuck to parse logs at scale.
Let’s check if the data is queryable directly via S3:
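For example, a quick count over a handful of files straight from the public bucket:

```sql
SELECT count(*) AS events
FROM read_json(['s3://clickhouse-public-datasets/bluesky/file_0010.json.gz',
                's3://clickhouse-public-datasets/bluesky/file_0011.json.gz',
                's3://clickhouse-public-datasets/bluesky/file_0012.json.gz']);
```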
Performance Optimization: Pre-Materializing JSON Data
This works but is still quite slow (29.7s), as we need to download the larger Bluesky data over the network. And if we want to run analytical queries with GROUP BY on top of it, we need a different strategy. That’s where materialization into a simple table comes into play. And because we work with JSON data, flattening and unnesting the JSON lets us run even faster analytical queries.
This is good practice and will always speed things up drastically, both on DuckDB locally and on MotherDuck. For example, we can do this:
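A sketch of such a materialization; the table name and the extracted fields (kind, did, time_us, and the commit object with collection and operation) are assumptions based on the Bluesky Jetstream event format:

```sql
CREATE TABLE bluesky AS
SELECT
    kind,
    did,
    to_timestamp(time_us / 1000000)  AS event_ts,
    "commit".collection              AS collection,
    "commit".operation               AS operation
FROM read_json(
        ['s3://clickhouse-public-datasets/bluesky/file_0010.json.gz',
         's3://clickhouse-public-datasets/bluesky/file_0011.json.gz',
         's3://clickhouse-public-datasets/bluesky/file_0012.json.gz'],  -- extend the list up to file_0039 as needed
        columns = {
            'kind': 'VARCHAR',
            'did': 'VARCHAR',
            'time_us': 'BIGINT',
            'commit': 'STRUCT(collection VARCHAR, operation VARCHAR)'
        });
```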
This query took 8m 5s on MotherDuck, as it had to load the full data from S3 into MotherDuck. Once the data is in, queries are fast. This is always a tradeoff - when you just want a live view without materializing, you can also filter more narrowly and run the query directly without creating the table first.
Note: instead of reading all files with read_json('s3://clickhouse-public-datasets/bluesky/*.json.gz'), I used the above list notation to read file_0010 to file_0039.json.gz.
Practical Analytics: Real-world Query Example
Let’s now run some analytical queries, for example counting event types:
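A sketch against the flattened bluesky table from above:

```sql
SELECT
    collection AS event_type,
    count(*)   AS events
FROM bluesky
GROUP BY collection
ORDER BY events DESC
LIMIT 10;
```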
We can also do time-based analysis (events per hour), or basically any query:
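For instance, bucketing events by hour:

```sql
SELECT
    date_trunc('hour', event_ts) AS hour,
    count(*)                     AS events
FROM bluesky
GROUP BY hour
ORDER BY hour;
```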
Or find the most active users:
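For example, counting events per DID, the user identifier in the Bluesky data:

```sql
SELECT
    did,
    count(*) AS events
FROM bluesky
GROUP BY did
ORDER BY events DESC
LIMIT 10;
```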
That’s it; these are some tricks and examples of how to analyze logs, from simple logs to large JSON data sets. Please go ahead and try it yourself with your own data logs, or follow along with the GitHub repos shared in this article.
What Did We Learn?
In wrapping up, we saw that logs are not as simple as we think and that data engineering platforms are fundamentally built on logs. We can use DuckDB for parsing logs and MotherDuck for parsing logs at scale with collaboration and sharing features.
Log files provide crucial visibility into every aspect of our data stack. From application errors to performance metrics, from transaction records to security events, these logs form the digital breadcrumbs that allow us to trace, troubleshoot, and optimize our data platforms.
The power of DuckDB as a log parser lies in its flexibility and performance. We’ve seen how it effortlessly handles different log formats—from simple text files to complex JSON structures—without requiring data to be pre-loaded into a database. The ability to query logs directly where they sit, whether on S3, in Snowflake or on local storage, makes DuckDB an incredibly powerful tool for ad hoc analysis.
For larger-scale log analysis, MotherDuck extends these capabilities, allowing teams to collaboratively analyze massive log datasets without being constrained by local hardware limitations. The ability to seamlessly scale from local analysis to cloud-based processing with the same familiar syntax makes this combination particularly powerful for data teams of all sizes.
We’ve learned that effective log analysis is not only about which tools to use, but about understanding the structure and purpose of different log types, knowing when to materialize or unnest data for performance, and being able to craft queries that extract meaningful insights from what might otherwise be overwhelming volumes of information.
Knowing how to analyze logs straightforwardly and efficiently is a competitive advantage in today’s data-driven world. It allows data engineers to spend less time troubleshooting and more time building reliable data platforms that drive business value.
Full article published at MotherDuck.com - written as part of my services