🧠 Second Brain

Search

Search IconIcon to open search

Change Data Capture (CDC) - Real-Estate Project

Last updated Feb 16, 2025

CDC is a powerful tool and especially in cloud environments with event-driven architectures. I used it to minimize the downloads of already downloaded properties. Besides existing open-source CDC solutions like Debezium, I implemented my own simple logic to detect the changes. Also because I have no access to the source-OLTP database where the properties are stored which you’d need.

I accomplish the CDC by creating two functions. The first one lists all properties to certain search criteria and the second one compares these properties with existing once. How am I doing that? Primarily, I create a fingerprint from each property that will tell me if the one is new or already exstinging. You might ask why I'm not using the unique property-ID? The reasons are I didn't just want to check if I have the property or not. As mentioned in the intro I also wanted to check if the seller lowered the price over time to be able to notify when the seller can't get rid of his house or flat. My fingerprint combines the property-ID and the selling price (called normalized_pricein my data). One more benefit if more columns getting relevant, I could just add them to the fingerprint and my CDC mechanism would be extended without changing any other code.

To have the relevant selling price for each property-ID I will scrape them separately from the website same as the IDs themselves. You can check that code in solid_scraping.py. The function is called list_props_immo24 which returns all properties as a data frame for my search criteria.

The logic for CDC happens in get_changed_or_new_properties in solids_spark_delta.py where I compare the existing once in my delta table with the new coming from list function above. As Delta Lake supports an SQL-API I can use plain SQL to compare the two with this simple SELECT-Statement:

1
2
3
4
5
6
SELECT p.id, p.fingerprint, p.is_prefix, p.rentOrBuy, p.city, p.propertyType, p.radius, p.last_normalized_price
  FROM pd_properties p
  LEFT OUTER JOIN pd_existing_props e
    ON p.id = e.propertyDetails_id
 WHERE p.fingerprint != e.fingerprint
	OR e.fingerprint IS NULL

Origin: Building a Data Engineering Project in 20 Minutes
References: Python SQL Debezium OLTP What is CDC (Change Data Capture)