
Getting the Data – Scraping

Last updated Feb 9, 2024

Disclaimer

Everything shown here is demonstrated for learning purposes only. Before you begin, make sure you don’t violate the copyright of any website, and always be friendly when scraping.

The internet holds a near-infinite amount of information, which is why scraping is a valuable skill, even though it is less common among data engineers. As a first step, we fetch the property listings from a real estate portal. In my case, I chose a Swiss portal, but you can pick any from your country. There are two main Python libraries for this, Scrapy and BeautifulSoup; I used the latter for its simplicity.

My initial goal was to scrape properties from the web page by determining how many pages of search results there were and then scraping each property on each page. While experimenting with BeautifulSoup and IPython (IPython is an excellent way to test your code interactively) and asking my way through StackOverflow, I found that certain websites provide open APIs, which you can find in their documentation or with the interactive developer tools (F12) explained below. This saves you from scraping everything manually and therefore also produces less traffic on the provider’s site.

If you want to check whether another website has an open API, press F12, switch to the network tab, and watch the HTTP requests your browser sends when you click on a property.


An example with the Brave web browser (Chromium-based)
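
If the network tab does reveal such an endpoint, you can fetch it directly with requests. A minimal sketch, assuming a hypothetical JSON endpoint (the URL below is made up; the real path and parameters depend on the portal you are inspecting):

```python
import requests

# Hypothetical endpoint spotted in the network tab (F12); the real path
# and parameters depend on the portal you are inspecting.
api_url = 'https://www.example-portal.ch/api/v1/properties/6331937'

response = requests.get(api_url, headers={'Accept': 'application/json'})
response.raise_for_status()

# An open API returns structured JSON, so no HTML parsing is needed.
print(response.json())
```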

To get started with web scraping, it helps to know some basic HTML. To get an overview of the site you want to scrape, use the interactive developer tools mentioned above. You can inspect which <table> or <div>, or which id, class, or href, contains your information. Most websites with valuable data have ever-changing IDs or classes, which makes it a bit harder to grab exactly the data you need.
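
For illustration, here is a small made-up HTML snippet and the matching BeautifulSoup lookups by id, class, and attribute:

```python
from bs4 import BeautifulSoup

# Made-up markup just to show the selectors; real listing pages usually
# have generated class names that change between deployments.
html = """
<div id="listing">
  <table class="results">
    <tr><td><a href="/en/d/house-bern/6331937">Nice house</a></td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.find('div', id='listing'))        # select by id
print(soup.find('table', class_='results'))  # select by class
print(soup.find('a', href=True)['href'])     # select by href attribute
```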

Let’s say we want to buy a house in Bern, the capital of Switzerland. We can, for example, use this search URL: https://www.immoscout24.ch/en/house/buy/city-bern?r=7&map=1. Here r is the radius around Bern, and map=1 means we only want properties with a price tag. As mentioned, we need to find out how many pages of results we have; that information sits at the very bottom of the page. A hacky approach that worked for me: collect all buttons on the page and keep those whose non-empty text is at most three characters long; the last of them is the highest page number (two, as of today). Example code to scrape how many pages of search results we have:

```python
from bs4 import BeautifulSoup
import requests

url = 'https://www.immoscout24.ch/en/house/buy/city-bern?r=7&map=1'

html = requests.get(url)
soup = BeautifulSoup(html.text, "html.parser")
buttons = soup.find_all('button')

# Pagination buttons are the only ones whose text is a short number,
# so keep every non-empty button text of at most three characters.
pages = []
for item in buttons:
    if 0 < len(item.text) <= 3:
        print(item)
        pages.append(item.text)

# The last pagination button carries the highest page number;
# default to 1 if no pagination is rendered.
last_page = int(pages.pop()) if pages else 1

print(last_page)

## -- Output --
<button aria-disabled="true" aria-pressed="true" class="bkivry-0 as6woy-0 c2ol4x-0 hXMMbP" disabled="" type="button">1</button>
<button aria-disabled="true" class="bkivry-0 as6woy-0 c2ol4x-0 hXMMbP" disabled="" type="button">2</button>
2
```

To get a list of property IDs, I assembled a search URL for each results page and grabbed the links that started with “/en/d” and contained a number. Some sample code below:

```python
import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.immoscout24.ch/en/house/buy/city-bern?pn=1&r=7&se=16&map=1'

ids = []
html = requests.get(url)
soup = BeautifulSoup(html.text, "html.parser")
links = soup.find_all('a', href=True)
hrefs = [item['href'] for item in links]

# Property detail pages live under /en/d/..., so filter on that prefix
# and take the first number in each link, which is the property ID.
hrefs_filtered = [href for href in hrefs if href.startswith('/en/d')]
ids += [re.findall(r'\d+', item)[0] for item in hrefs_filtered]

print(ids)

## -- Output --
['6331937', '6330580', '6329423', '6298722', '6261621', '6311343', '6318070', '6313553', '6317089', '6306531', '6305793', '6296041', '6294327', '6284892', '6283242', '6282624', '6274328', '6251376', '6237199', '6237144', '6231495', '6224144', '6223578', '6209944']
```
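
Putting the two snippets together, here is a minimal sketch that walks every results page and collects all property IDs. I am assuming pn is the page-number parameter (it appears in the search URL above), and last_page comes from the pagination snippet:

```python
import re
import requests
from bs4 import BeautifulSoup

# Assumption: pn is the page-number parameter, as seen in the search URL.
base_url = 'https://www.immoscout24.ch/en/house/buy/city-bern?pn={}&r=7&se=16&map=1'

all_ids = []
for page in range(1, last_page + 1):  # last_page from the pagination snippet
    soup = BeautifulSoup(requests.get(base_url.format(page)).text, "html.parser")
    hrefs = [a['href'] for a in soup.find_all('a', href=True)]
    all_ids += [re.findall(r'\d+', h)[0] for h in hrefs if h.startswith('/en/d')]

print(all_ids)
```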

The complete code for the above can be found on GitHub in solids_scraping.py, in the functions list_props_immo24 and cache_properies_from_rest_api.

Another tool worth a look is Selenium, which drives a real browser and therefore also works on pages that render their content with JavaScript.
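
A minimal headless sketch, assuming Chrome is installed (recent Selenium versions can fetch a matching driver themselves):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without a visible window.
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

# Unlike plain requests, Selenium executes JavaScript, so listings
# rendered client-side still show up in the page source.
driver.get("https://www.immoscout24.ch/en/house/buy/city-bern?r=7&map=1")
html = driver.page_source
driver.quit()
```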


Origin: Building a Data Engineering Project in 20 Minutes | ssp.sh
References: Python