How to scrape dynamically generated data with Python

How to beat a JavaScript code wall

Cris Torres
7 min read · May 6, 2022
image source: https://unsplash.com

Table of Contents

  1. Intro
  2. Pre-requisites
  3. Page description
  4. Project creation
  5. Product URL scraping (Selenium)
  6. Single product scraping (Scrapy)
  7. Complete code
  8. Conclusion

1. Intro

Have you ever tried to scrape data from a page where the content remains hidden until the user performs a certain action, such as scrolling down or pressing a button, and your scraping method is not able to access it? If so, this tutorial is for you.

Here you will learn a technique to download dynamically generated data. It consists of combining Scrapy, a library designed to scrape data on a large scale, and Selenium, a framework that simulates human behavior in a browser.

2. Pre-requisites

- Environment preparation

First you will need to create a new environment. Using conda, type the following in the terminal.

conda create --name scraper
conda activate scraper

You will also need the libraries that we mentioned above.

pip install Scrapy
pip install selenium

- Chrome driver installation

We must install the browser that Selenium will use for the simulations. In our case it will be Chrome. Get it from the following link:

https://www.google.com/chrome/

We also need to install the Chrome driver from the following link:

https://chromedriver.chromium.org/

3. Page description

The page we will use as an example sells products related to video games:

https://www.microplay.cl/

On the page, clicking (with the mouse wheel) on the “VIDEOJUEGOS” category takes us to a listing where we can select a product to buy.

If you click on any product, it takes you to its purchase page, which shows the name, price and description. We are going to collect that data for all the products.

4. Project creation

To initialize the project, we need to create the files by issuing the following commands in the terminal.

scrapy startproject scraper
touch scraper/scraper/spiders/spiders.py

This creates the following structure, in which we are going to modify the files marked with an asterisk (*).

scraper
├── scraper
│ ├── spiders
│ │ ├── __init__.py
│ │ └── spiders.py *
│ ├── __init__.py
│ ├── items.py *
│ ├── middlewares.py
│ ├── pipelines.py
│ └── settings.py *
└── scrapy.cfg

5. Product URL scraping (Selenium)

We will extract all the product URLs and then pass them to a function that collects the data.

In the file spiders.py, import the Scrapy and Selenium libraries and create a spider class with the initial parameter name and the start_urls list.

Spiders are classes that define how a certain site (or group of sites) will be scraped. The name is used to identify each class, in case you have more than one scraper.
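
A minimal sketch of that skeleton might look like this (the imports cover everything used later in this section):

import scrapy
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class MicroplaySpider(scrapy.Spider):
    # "name" identifies this spider when running `scrapy crawl microplay`
    name = "microplay"
    # the category page whose products we want to collect
    start_urls = ["https://www.microplay.cl/productos/juegos"]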

Now, on the page https://www.microplay.cl/productos/juegos, to get the URLs we need to identify their XPath (XML path) expressions. To do that, right-click on any product image and select “Inspect”.

It will show us the HTML code. Check the path that leads to the selected item. In our example, the image corresponds to an img tag that lives inside an a tag.

There may be elements with the same path that have nothing to do with the products. A way to distinguish them is to choose a tag with a specific class, in order to avoid selecting other HTML elements.

In this case, the class card__item inside a div tag is repeated 20 times on the page, which corresponds to the 20 product cards loaded on the screen.

So the XML path would be:

".//div[@class='card__item']/a"

The category we are dealing with has more than 20 items, but the rest are loaded dynamically, that is, a JavaScript function loads them.

If we scroll to the bottom of the page, we will see a “VER MÁS” button. So we would like to simulate the action of a user pressing this button to get the remaining hidden data.

For this we need to get the XML path of the button.

It corresponds to an img inside an a. As it is highly likely that there is more than one element with the same tag, we choose a class that distinguishes it. In this case, the load class.

Make sure it only appears in the element you’re looking for.

The XML path would be:

".//a[@class='load']"

In the spiders.py file, define the following variables.

Inside the MicroplaySpider class, set the headless option so that the browser runs in the background.
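
Continuing the class from the previous sketch, this is roughly what those variables look like (using add_argument("--headless") is one way to get a background browser; any equivalent headless setting works):

class MicroplaySpider(scrapy.Spider):
    name = "microplay"
    start_urls = ["https://www.microplay.cl/productos/juegos"]

    # XPaths identified in the previous steps
    card_xpath = ".//div[@class='card__item']/a"  # product cards
    btn_xpath = ".//a[@class='load']"             # "VER MÁS" button

    # run Chrome in the background, without opening a visible window
    chrome_options = Options()
    chrome_options.add_argument("--headless")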

Create a function that loads all the products. The function is a loop that first waits for the “VER MÁS” button to become clickable with

wait.until(EC.element_to_be_clickable((By.XPATH, self.btn_xpath)))

and then clicks it with

driver.find_element(by=By.XPATH, value=self.btn_xpath).click()

After a short pause, it looks for the button again and repeats. The loop stops when it no longer finds a button. In Python:
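
(The following is a sketch written as a method of MicroplaySpider; the method name load_all_products and the two-second pause are assumptions, not taken from the original code.)

import time
from selenium.common.exceptions import NoSuchElementException, TimeoutException


class MicroplaySpider(scrapy.Spider):
    # ... attributes from the previous sketches ...

    def load_all_products(self, driver, wait):
        # keep clicking "VER MÁS" until no button is left on the page
        while True:
            try:
                # wait until the button is present and clickable
                wait.until(EC.element_to_be_clickable((By.XPATH, self.btn_xpath)))
                # click it so the JavaScript appends the next batch of products
                driver.find_element(by=By.XPATH, value=self.btn_xpath).click()
                time.sleep(2)  # small pause while the new cards load
            except (TimeoutException, NoSuchElementException):
                break  # no more buttons: every product is now on the page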

Also create the function that collects the URLs and sends them to the scraper we will define in the next section. With the following

driver = Chrome(options=self.chrome_options)
wait = WebDriverWait(driver, 10)
driver.get(url)

this initializes the driver, sets up the wait object, and loads the URL into the driver. With

products = driver.find_elements(by=By.XPATH, value=self.card_xpath)
for prod in products:
    prod_url = prod.get_attribute("href")
    yield scrapy.Request(prod_url)

we get all the URLs and pass them to the scraper. In Python it looks like:
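
(This sketch assumes the logic lives in start_requests and calls the loading helper from the previous sketch; chromedriver must be reachable, either on the PATH or via the path configured later in settings.py.)

class MicroplaySpider(scrapy.Spider):
    # ... attributes and load_all_products from the previous sketches ...

    def start_requests(self):
        for url in self.start_urls:
            driver = Chrome(options=self.chrome_options)  # start the headless browser
            wait = WebDriverWait(driver, 10)              # wait up to 10 s for elements
            driver.get(url)                               # open the category page

            # click "VER MÁS" until every product card is on the page
            self.load_all_products(driver, wait)

            # collect the href of every product card and hand it to Scrapy,
            # which will call self.parse (defined in the next section) for each one
            products = driver.find_elements(by=By.XPATH, value=self.card_xpath)
            for prod in products:
                prod_url = prod.get_attribute("href")
                yield scrapy.Request(prod_url)

            driver.quit()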

6. Single product scraping (Scrapy)

Clicking on any product takes us to its purchase page. For example:

https://www.microplay.cl/producto/audifono-over-ear-pro-g1-gamer-pokemon-otl/

Right clicking and inspecting the name, we can get its corresponding tag.

The name belongs to an h1 tag, and we identify it with the content__ficha class on the section tag. (Make sure it only appears once.)

The XML path would be:

".//section[contains(@class, 'content__ficha')]/h1/text()"

Now repeat for the product price.

Price XML path:

".//span[@class='text_web']/strong"

For the description.

Description XML path:

".//div[@id='box-descripcion']"

With that, we have obtained the paths for all the data we need. Inside MicroplaySpider, write a function called parse. It will contain a loader, which is a data preprocessor class that we will define.

Note that we are collecting four features: name, price, url and description.
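
A sketch of that parse callback might look like this (assuming the item class is called ProductItem and lives in items.py; both names are assumptions, and the XPaths are the ones identified above):

from scrapy.loader import ItemLoader
from scraper.items import ProductItem  # the item class sketched in the next section


class MicroplaySpider(scrapy.Spider):
    # ... previous methods ...

    def parse(self, response):
        loader = ItemLoader(item=ProductItem(), response=response)
        loader.add_xpath("name", ".//section[contains(@class, 'content__ficha')]/h1/text()")
        loader.add_xpath("price", ".//span[@class='text_web']/strong")
        loader.add_xpath("description", ".//div[@id='box-descripcion']")
        loader.add_value("url", response.url)
        yield loader.load_item()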

To create the ItemLoader class, inside items.py first import the necessary libraries.

With the loader we can preprocess the data as much as we need. It is recommended to use it only to clean up the noise generated by the page, that is, HTML tags, extra spaces or extra symbols.

Lastly, write the class specifying each function by which the data will be processed. The parameter

input_processor=MapCompose()

is responsible for composing a set of functions through which the data will pass. And the parameter

output_processor=TakeFirst()

will collect the first result that the scraper delivers. In Python:
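
(The following items.py is a sketch: the field names match the four features collected in parse, while the cleaning helpers remove_tags and clean_spaces are illustrative assumptions.)

import scrapy
from itemloaders.processors import MapCompose, TakeFirst
from w3lib.html import remove_tags  # strips HTML tags; w3lib ships with Scrapy


def clean_spaces(value):
    # collapse the extra whitespace left over from the page markup
    return " ".join(value.split())


class ProductItem(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(remove_tags, clean_spaces),
        output_processor=TakeFirst(),
    )
    price = scrapy.Field(
        input_processor=MapCompose(remove_tags, clean_spaces),
        output_processor=TakeFirst(),
    )
    description = scrapy.Field(
        input_processor=MapCompose(remove_tags, clean_spaces),
        output_processor=TakeFirst(),
    )
    url = scrapy.Field(output_processor=TakeFirst())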

We are almost ready; all that remains is to configure the settings.py file.

Inside settings.py, indicate the driver directory (it may be different for each PC).

Next, add some delay to prevent the scraper from loading too many items at the same time.

Finally, modify the throttle to avoid an overload. Let’s be gentle with the page.
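
A sketch of those three changes (CHROME_DRIVER_PATH is a custom setting name assumed here and the values are only examples; DOWNLOAD_DELAY and the AUTOTHROTTLE_* options are standard Scrapy settings):

# path to the chromedriver executable (adjust to your machine)
CHROME_DRIVER_PATH = "/usr/local/bin/chromedriver"

# pause between consecutive requests so the site is not flooded
DOWNLOAD_DELAY = 3

# let Scrapy adapt the crawling speed to the server's response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0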

7. Complete Code

- scraper/scraper/spiders/spiders.py

- scraper/scraper/items.py

- scraper/scraper/settings.py

To run the scraper, inside the scraper/ folder, issue the following in the terminal (it will take a few minutes to complete the execution):

scrapy crawl microplay -O data.csv

And it will create a .csv file containing the data.

8. Conclusion

Now you know how to beat the JavaScript code wall, so you can scrape almost any page you want.

Have a good time scraping 😊 .
