The Basics of Web Scraping Using Python
Learn how to harvest information from any website.

Disclaimer: This article is for educational purposes only. I do not promote or encourage anyone to scrape websites.
The internet is filled with vast amounts of information in the form of data, found on websites and other sources. Websites are the most common place where this information is presented as content. There are a number of ways to access and retrieve data on the internet; the most popular is the application programming interface, or simply API.
APIs act as a middleman, or connection, between the source data and the end user. They retrieve information much like a waiter in a restaurant: the food is prepared by the chef in the kitchen, and the menu is handed to the customer by the waiter, who plays the role of the API. The customer uses the menu to choose the food they desire, the waiter takes the order, and after a few minutes the food is delivered to the customer, the end user. However, APIs can be limiting, since they only give you access to the information they were designed to expose.
Web scraping, on the other hand, is an alternative to APIs. It allows you to scrape the data or content you are interested in on a website. The technique has gained popularity over the years, although it is still prohibited in some countries. With web scraping you can extract information from virtually any web page with relative ease, but a number of challenges can arise when the structure of the website you are trying to scrape is not well understood.
In this article, web scraping, a technique used to retrieve information from a website, is explored. Although the technique is applicable to any website, a public web page is used here as an example to retrieve useful data about the weather.
Exploring web page structures:
A number of frameworks are used to develop websites, from JavaScript to Python's Django, and each brings its own complexities to the overall architecture. Within Python there exist a number of libraries that can be used to extract data from a web page, each with different capabilities for dealing with a unique set of challenges.
Before extracting data from a website, one needs to understand the structure and architecture of the web page. To explore a web page, the inspect option in a web browser is used. The South African Weather Service website is used as an example to explore the structure with the inspect option. See the link below,
For users on Google Chrome, to view the HTML of the webpage you simply use the inspect option by right-clicking anywhere on the page. An example of the HTML format is shown below.

Similarly, for Safari users the inspect option can be accessed by right-clicking anywhere on the webpage. Note that Safari requires developer features, including the Develop menu and its automation option, to be enabled in the preferences first.

In both Google Chrome and Safari, the South African Weather Service website can now be observed in HTML format. Within the HTML body there are elements such as the body class, scripts, div ids and div classes. The weather forecast data sits in one of those classes.
For instance, if you are interested in the weather alert information, you simply highlight that section and inspect it. In inspect mode the location of the weather section is highlighted in blue, as shown in the screen print below,

The inspect approach is the easiest way to view web pages in their HTML format, but there are libraries within Python that can be used for a similar purpose. For example, to explore the web page without the inspect approach, the following libraries are needed,
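As a minimal sketch, the two libraries assumed throughout this article are requests and Beautiful Soup:

    # requests fetches the raw page; BeautifulSoup parses the HTML.
    import requests
    from bs4 import BeautifulSoup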

The requests library is used to get the web page,
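A rough sketch of the request; the home page URL below is an assumption, not taken from the original screen print:

    # Fetch the South African Weather Service home page (assumed URL).
    url = "https://www.weathersa.co.za"
    response = requests.get(url)
    print(response.status_code)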

The request should return a status code of 200, which indicates that it has succeeded. Any output that is not 200 (for example 404 or 500) is considered an error.
Using Beautiful Soup, the HTML content of the page is parsed and displayed, as shown below,
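A sketch of the parsing step, building on the response from the previous snippet:

    # Parse the raw HTML and pretty-print it for inspection.
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.prettify())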

The different sections of the web page can now be viewed with relative ease. Isolating certain sections of the HTML can be achieved by specifying the class name; for instance, the row class is analysed here since it contains the important information.
The row class contains the different geographical locations and their corresponding weather forecasts. The screen print below displays the city name and other relevant information.
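A sketch of isolating that section, assuming the forecasts sit in div elements carrying the "row" class:

    # Find every element tagged with the "row" class.
    rows = soup.find_all("div", class_="row")

    # Peek at the first match to confirm it holds a city name and forecast.
    if rows:
        print(rows[0].get_text(separator=" ", strip=True))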

Extracting content from a web page:
The section above showed how content can be explored using different approaches: through the inspect option or through a Python library. In this section further steps are taken to extract the information, using Beautiful Soup.
Beautiful Soup is fairly easy to use; you just need to import it into the Python environment as shown below,
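The import lives in the bs4 package:

    # Beautiful Soup is imported from the bs4 package.
    from bs4 import BeautifulSoup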

The data that is needed is the geographical locations and their corresponding weather forecasts. As observed in the exploration section, this data sits in the row class.
The screen print below shows the process of gathering the locations.
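A sketch of that gathering step, reusing the rows found earlier:

    # Walk through each row and print its text to see the locations.
    for row in rows:
        print(row.get_text(separator=" ", strip=True))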

Once the target data is captured, the next step is to clean the text. Before the data is cleaned, however, it needs to be stored in a list.
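A sketch of the list step:

    # Store the raw text of each row in a list so it can be cleaned.
    scraped = [row.get_text(separator=" ", strip=True) for row in rows]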

By creating a list, the data can be cleaned effectively using the split function to remove the repeated (triple) words and the delimiters.
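One possible cleaning pass, assuming the repeated words appear back to back in each entry (the exact rule used in the original screen print may differ):

    # Split each entry into tokens and drop immediately repeated tokens.
    cleaned = []
    for entry in scraped:
        tokens = entry.split()
        deduped = [t for i, t in enumerate(tokens) if i == 0 or t != tokens[i - 1]]
        cleaned.append(" ".join(deduped))
    print(cleaned[:5])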

In the screen print above, the different locations and their corresponding weather forecasts can be observed.
Retrieving and storing the content:
Once the content is scraped from a web page it needs to be stored in a format that is readable and useful. In this subsection the scraped data is stored in a data frame. The procedure is fairly easy and straightforward.
To store the data, pandas is used and the data is placed in a data frame, as shown in the screen print below.
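A sketch of the data frame step, assuming each cleaned entry splits into a location followed by its forecast (the column names here are hypothetical):

    # Build a two-column data frame from the cleaned entries.
    import pandas as pd

    records = [entry.split(" ", 1) for entry in cleaned if " " in entry]
    df = pd.DataFrame(records, columns=["location", "forecast"])
    print(df.head())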

Conclusion:
Web scraping is a fun and interesting process, although it requires a lot of planning and a good understanding of the web page. It proves to be a useful tool for extracting exactly the information the user is interested in from a website, and its limitations are few. The process can also be automated so that updated data is extracted at regular intervals.
The notebook I used to model the problem can be found at the following link,
You can follow and contact me on the following platforms,
Twitter :@blessing3ke