🍜

Requests + BeautifulSoup

A guide on how to scrape websites
Here's a guide on how to use requests and BeautifulSoup to scrape websites.

Setup

To follow along with this guide, you need to have:
  • A Databutton app
  • The following packages installed: requests and beautifulsoup4.
    Visit the Package installation section for more info about how to install packages.
Databutton only supports playwright==1.29.1. This is due to playwright needing specific browsers for each version. If this is a problem, please reach out to a Databutler and we'll help you out!

Step 1: Pick your target

Before you can start scraping, you need to choose a website to steal data from... umm, we mean, legally retrieve information from 😄
⚠️: Jokes aside, it's important to only scrape websites you have the legal right to access.
In this example, we'll be using the following URL:
https://docs.databutton.com/howto/keyboard-shortcuts

Step 2: Send the request

Next, use the requests library to send a request to the website's server, asking for all the juicy information you want. Think of it like breaking into a cookie jar, but with a computer instead of a crowbar.
You can do this as a job in Databutton:
import requests
url = 'https://docs.databutton.com/howto/keyboard-shortcuts'
response = requests.get(url)
html_content = response.content
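A slightly more defensive version of the request above could look like the sketch below. It assumes you want the job to fail loudly on HTTP errors rather than parse an error page, and that you don't want it to hang on a slow server. The helper name fetch_html and the 10-second timeout are illustrative choices, not part of Databutton or of the original example.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw HTML bytes for `url` (hypothetical helper name)."""
    # Without a timeout, requests can wait indefinitely for a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of returning an error page.
    response.raise_for_status()
    return response.content
```

Calling fetch_html with the URL from Step 1 should give you the same bytes as html_content above, or raise an exception if the request failed.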

Step 3: Parse the response

Once you've received the response, it's time to parse the information with BeautifulSoup. You can think of this as sorting through a pile of coins to find only the gold ones.
BeautifulSoup gives us a structured way of finding information in a large HTML page:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")
print(soup.prettify())
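If you want to get a feel for the parsed object without hitting the network, here is a small sketch that runs on a made-up HTML snippet. The sample_html string is an assumption for illustration only and is not the real page.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page (not the real docs page).
sample_html = """
<html>
  <head><title>Keyboard shortcuts</title></head>
  <body>
    <h1>Keyboard shortcuts</h1>
    <h2>shift+enter (run code)</h2>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Tags can be reached as attributes; .string is the text inside a tag
# that contains a single string.
page_title = soup.title.string

# find() returns the first matching tag (or None if nothing matches).
first_h2 = soup.find("h2").get_text(strip=True)

print(page_title)
print(first_h2)
```

The same calls work on the soup built from html_content, as long as the page actually contains the tags you ask for.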

Step 4: Extract the data you want

This is where the real fun begins!
Now that you have all the information in a neat and organized format, it's time to extract the data you want (aka get your 🍪 from the 🫙). Just pick and choose the information that you want.
In this example, I'll get all the keyboard shortcuts from the URL mentioned above. After a quick inspection in Chrome, I learned that all the headers are represented as h2 elements in the HTML.
BeautifulSoup gives me a neat way to get all those h2 elements:
headers = soup.find_all("h2")
# find_all returns a list, so let's iterate through it and print out the text
for header in headers:
    print(header.text)
Yay! This shows me the following output (at the time of writing this article):
cmd/ctrl+k (command palette)
cmd/ctrl+s (format and clean up code)
shift+enter (run code)
cmd/ctrl + . (show/hide sidebar)
Which is correct!
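If you'd rather have structured data than printed lines, one possible next step is to split each header into the key combination and its description. The sketch below assumes every header follows the "keys (description)" shape seen in the output above. The sample_html string, the split_shortcut helper, and the regular expression are all made up for illustration.

```python
import re

from bs4 import BeautifulSoup

# Made-up HTML that mimics the h2 headers shown above.
sample_html = """
<h2>cmd/ctrl+k (command palette)</h2>
<h2>shift+enter (run code)</h2>
<h2>cmd/ctrl + . (show/hide sidebar)</h2>
"""

# Expected shape: "keys (description)".
SHORTCUT_PATTERN = re.compile(r"^(?P<keys>.+?)\s*\((?P<description>.+)\)$")


def split_shortcut(text):
    """Split 'keys (description)' into a (keys, description) tuple."""
    match = SHORTCUT_PATTERN.match(text.strip())
    if match is None:
        # The header does not follow the expected shape; keep the text as-is.
        return text.strip(), ""
    return match.group("keys"), match.group("description")


soup = BeautifulSoup(sample_html, "html.parser")
shortcuts = [split_shortcut(h2.get_text(strip=True)) for h2 in soup.find_all("h2")]

for keys, description in shortcuts:
    print(f"{keys}: {description}")
```

Headers that don't match the pattern come back with an empty description, so a change in the page layout shows up in the data instead of crashing the job.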

Step 5: Profit

Finally, it's time to enjoy the cookies of your labor (we will stop with this analogy now 😄). You can use the information you've gathered for your own purposes, whether that's to create a new and improved website or just to impress your friends.

Step 6: Wrapping up

Here's the code snippet that fetches the keyboard shortcuts from Databutton's docs. If you have both requests and beautifulsoup4 installed, this should just work in your app!
import requests
from bs4 import BeautifulSoup
url = "https://docs.databutton.com/howto/keyboard-shortcuts"
response = requests.get(url)
html_content = response.content
soup = BeautifulSoup(html_content, "html.parser")
headers = soup.find_all("h2")
for header in headers:
    print(header.text)
⚠️: Reminder, this guide is for educational purposes only. It's important to only scrape websites that you have the legal right to access. Happy scraping!