Now let’s dive into how the web scraping is actually done. In Python, we use a package called bs4, which provides the BeautifulSoup class. In addition, we need the requests module to actually send the HTTP requests. Once you have installed both bs4 and requests, you can import them as shown below.
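A minimal sketch of those imports, assuming both packages are already installed (for example via pip install beautifulsoup4 requests):

```python
import requests                 # sends the HTTP requests
from bs4 import BeautifulSoup   # parses the HTML we get back
```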
First, you have to make a request to the intended URL and get the response body back. There are plenty of sites built specifically for practicing scraping; Quotes to Scrape and Books to Scrape are two of them. In this demo, I will be using the Quotes to Scrape site. You need to make the request and save the response like this:
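Something along these lines, assuming the site is served at http://quotes.toscrape.com (the url and response names are my own):

```python
# Fetch the first page of Quotes to Scrape
url = "http://quotes.toscrape.com"
response = requests.get(url)

print(response.status_code)  # 200 means the request succeeded
```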
The response body we get back is a plain string, so we can’t traverse or query it the way we want. The next step is to create a BeautifulSoup instance by passing in the response text.
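A sketch of that step, using Python’s built-in html.parser (other parsers such as lxml would also work if installed):

```python
# Turn the raw HTML string into a searchable BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")
```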
If you inspect Quotes to Scrape using devtools, you can see that the details of each quote are placed inside a div with the class ‘quote’.
After identifying the CSS selector of the container for each quote, all you have to do is select those containers. It can be done like this:
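For example, assuming the markup described above:

```python
# Grab every quote container on the page by its CSS class
quotes = soup.select(".quote")

print(len(quotes))  # number of quote divs found on the first page
```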
.select() returns a list even if there is only one result. In this case we have a list of divs, one per quote. We can loop over the quotes list we just created and extract each quote’s text and author, or anything else we want, and then append each record as a dictionary to another list for ease of use.
The quote text is located inside a span with the class ‘text’, and the author name inside a small tag with the class ‘author’. We use the .get_text() method to obtain the inner text of an element. With the trailing print statement you can see all the quotes and authors from the first page of the website in one list.
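A sketch of that loop, based on the selectors described above (the all_quotes name is my own):

```python
all_quotes = []

for quote in quotes:
    # .select_one() returns the first matching element inside this quote div
    text = quote.select_one(".text").get_text()
    author = quote.select_one(".author").get_text()
    all_quotes.append({"text": text, "author": author})

print(all_quotes)
```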
But more often than not, we need to extract data from all the available pages of the website, right? We want to follow the Next button again and again and get the data from every page rather than just one.
This is how it is done:
First we need to locate the Next button, or similar navigation to the next page, on the current page. Then we extract that next URL and keep grabbing data until there is no next URL anymore, which means we have reached the last page.
The final code to grab all of the available quotes looks like this:
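Here is a sketch of that final script. It assumes the Next button sits inside an li tag with the class ‘next’ (as it does on Quotes to Scrape), and the function and variable names are my own:

```python
from time import sleep

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://quotes.toscrape.com"


def scrape_quotes():
    all_quotes = []
    url = "/page/1/"

    while url:
        try:
            # Fetch and parse the current page
            response = requests.get(f"{BASE_URL}{url}")
            soup = BeautifulSoup(response.text, "html.parser")

            # Extract every quote on this page
            for quote in soup.select(".quote"):
                all_quotes.append({
                    "text": quote.select_one(".text").get_text(),
                    "author": quote.select_one(".author").get_text(),
                })

            # Look for the Next button; if it is missing, we are on the last page
            next_btn = soup.select_one(".next > a")
            url = next_btn["href"] if next_btn else None

            # Wait a second between requests so we are not harsh on the server
            sleep(1)
        except requests.RequestException:
            break

    return all_quotes
```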
In the above example, I have wrapped everything in a function for the sake of reusability. By introducing a function, we can run the scraping only when we need to, simply by calling the function, rather than having the scraping process run every single time the file is executed.
Note that I have added a one-second gap between requests inside the try block, using sleep(1), so we are not too harsh on the server. This is good practice when doing web scraping.
You can also save the result to a CSV file afterwards if you prefer.
First, you need to import DictWriter from the csv module:
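```python
from csv import DictWriter
```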
Then you can write a little function that takes in the quotes list we generated and writes it to a CSV file:
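A sketch of such a function; the write_csv name, the quotes.csv file name, and the field names are assumptions on my part:

```python
def write_csv(quotes, filename="quotes.csv"):
    # The dictionary keys double as the CSV column headers
    with open(filename, "w", newline="", encoding="utf-8") as file:
        writer = DictWriter(file, fieldnames=["text", "author"])
        writer.writeheader()      # first row: column names
        writer.writerows(quotes)  # one row per quote dictionary
```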
Finally, you can call both functions like so to generate the CSV file with all the quotes and authors:
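Putting it together, with the hypothetical names from the sketches above:

```python
quotes = scrape_quotes()  # crawl every page and collect the quotes
write_csv(quotes)         # write them out to quotes.csv
```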