
Mimic a browser

A guide on how to pretend you're a browser
So you're trying to scrape a page, but the HTML you get back doesn't contain what you see on screen (for example, because the content is rendered with JavaScript)? You might need to pretend you're a browser, but you aren't sure how. Worry not, Playwright has your back!
Playwright is a library for controlling a browser from Python, so you can see websites the same way you would see them in your own browser.
In this example, we'll accomplish the same task as we did in Broken link, which was to find all the keyboard shortcuts in Databutton's docs, but this time we'll do it using playwright instead of requests and beautifulsoup.

Setup

To follow along with this guide, you will need:
  • A Databutton app
  • playwright added to your packages
Visit the Package installation section for more info about how to install packages.
Note that Databutton only supports a single playwright version, currently 1.37.0. This is due to playwright needing specific browsers for each version. If this is a problem, please reach out to a Databutler and we'll help you out!

Step 1: Pick your target

Before you can start scraping, you need to choose a website to steal data from... umm, we mean, legally retrieve information from 😄
⚠️: Jokes aside, it's important to only take data that you have the legal right to access.
In this example, we'll be using the following URL:
https://docs.databutton.com/howto/keyboard-shortcuts

Step 2: Open the URL in our browser

Next, it's time to open a browser and navigate to the URL. As you can see from the code below, we first launch a browser, in this case Chromium (you can also use Firefox or other browsers; see the Playwright docs for more info), then create a page, and finally navigate to the URL.
from playwright.sync_api import sync_playwright
url = "https://docs.databutton.com/howto/keyboard-shortcuts"
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
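If you want to switch between browsers without rewriting the launch line, here is a minimal sketch. The helper names launch_browser and fetch_title are made up for this example and are not part of Playwright. It assumes the browser you ask for is actually installed in your environment, which may not be the case for anything other than Chromium.

```python
def launch_browser(playwright, name="chromium", headless=True):
    """Launch one of Playwright's browser types by name (hypothetical helper)."""
    supported = ("chromium", "firefox", "webkit")
    if name not in supported:
        raise ValueError(f"unknown browser {name!r}; expected one of {supported}")
    # p.chromium, p.firefox and p.webkit all expose the same launch() method.
    return getattr(playwright, name).launch(headless=headless)


def fetch_title(url, browser_name="chromium"):
    """Open url in the chosen browser and return the page title (hypothetical helper)."""
    # Imported here so launch_browser can be read and tested on its own.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = launch_browser(p, browser_name)
        page = browser.new_page()
        page.goto(url)
        title = page.title()
        browser.close()
    return title
```

Calling fetch_title(url, "firefox") would then run the same steps in Firefox instead of Chromium.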
Some pages may take 1-2 seconds to load. Playwright has some easy-to-use methods to wait until the page is ready, like page.wait_for_url or page.wait_for_load_state. See the Playwright docs for more info.
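One way to handle slow pages is to wait for the element you care about rather than for a fixed amount of time. The sketch below assumes the content lives in h2 elements and that five seconds is a generous enough timeout. The function name open_and_wait is made up for this example.

```python
def open_and_wait(page, url, selector="h2", timeout_ms=5000):
    """Navigate to url, then wait for the first element matching selector."""
    # Return from goto as soon as the HTML has been parsed...
    page.goto(url, wait_until="domcontentloaded")
    # ...then block until a matching element is visible.
    # Playwright raises a timeout error if nothing shows up within timeout_ms.
    page.wait_for_selector(selector, timeout=timeout_ms)
    return page
```

Inside the with block from this step, you would call open_and_wait(page, url) in place of page.goto(url).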

Step 3: Extract the data you want

This is where the real fun begins!
Now that you have all the information in a neat and organized format, it's time to extract the data you want. Like a kid unsupervised in a candy store, you have the liberty to pick and choose the information that you want.
In this example, I'll get all the keyboard shortcuts from the URL mentioned above. We learned in Broken link that the text I'm after is inside h2 elements in the HTML.
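Let's see how we can get this out using Playwright. This loop goes inside the with block from Step 2, since the page is only usable while the browser is open:
for text in page.locator("h2").all_text_contents():
    print(text)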
Yay! This shows me the following output (at the time of writing this article):
cmd/ctrl+k (command palette)
cmd/ctrl+s (format and clean up code)
shift+enter (run code)
cmd/ctrl + . (show/hide sidebar)
Which is correct!
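If you'd rather have structured data than raw strings, here is a small sketch that splits each heading into a key combination and a description. It assumes every heading follows the "keys (description)" pattern shown in the output above. The names parse_shortcut and parse_shortcuts are made up for this example.

```python
import re

# Matches "keys (description)", e.g. "shift+enter (run code)".
_SHORTCUT = re.compile(r"^(?P<keys>.+?)\s*\((?P<description>[^()]*)\)\s*$")


def parse_shortcut(text):
    """Return (keys, description), or None if text doesn't fit the pattern."""
    match = _SHORTCUT.match(text.strip())
    if match is None:
        return None
    # Normalise "cmd/ctrl + ." to "cmd/ctrl+." so all keys look the same.
    keys = re.sub(r"\s*\+\s*", "+", match.group("keys").strip())
    return keys, match.group("description").strip()


def parse_shortcuts(texts):
    """Build a {keys: description} dict, skipping headings that don't parse."""
    parsed = (parse_shortcut(t) for t in texts)
    return dict(p for p in parsed if p is not None)
```

You could feed it the headings directly with parse_shortcuts(page.locator("h2").all_text_contents()).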

Step 4: Profit

Finally, it's time to enjoy the fruits of your labor. You can use the information you've gathered for your own purposes, whether that's to create a new and improved website or just to impress your friends.

Step 5: Wrapping up

Here's the complete code snippet that fetches the keyboard shortcuts from Databutton's docs. It should work as-is if you have installed playwright==1.37.0:
from playwright.sync_api import sync_playwright
url = "https://docs.databutton.com/howto/keyboard-shortcuts"
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    for text in page.locator("h2").all_text_contents():
        print(text)
    browser.close()
Reminder, this guide is for educational purposes only. It's important to only take data that you have the legal right to access. Happy miming!