Python : Using Beautiful Soup
This post is about using the BeautifulSoup and requests modules in a Python application for web scraping.
We are writing a simple weather application that uses the requests module to download an HTML page from a website, and then uses BeautifulSoup to scrape the page and capture the weather info for a particular zip code. After reading this post you will have a basic idea of how to use Beautiful Soup and the requests module in Python.
Note: I have recently updated this post to use BS4 and requests.
Beautiful Soup sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
Requests is a module that makes sending HTTP requests simple. So the first step is to install these packages and include them in our project. We will use pip to install the packages.
pip install beautifulsoup4
pip install requests
import requests
from bs4 import BeautifulSoup
import collections
We will be using weather underground to determine the conditions at a particular location. For the sake of our application's goal, instead of using their web service we will download the HTML and parse it to determine the weather. If you append a zip code to the base URL, weather underground will display the weather at that location as shown below.

As a first step, let us get the zip code from user.
def get_zip():
    zip_code = input("Enter your zip code : ")
    return zip_code
Now download the HTML from the URL using requests and parse it with BS4.
def get_url_html(url):
    response = requests.get(url)
    # response.status_code 200 OK
    # returns html
    return response.text
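The commented-out status check above hints at a real concern: the download can fail. A slightly hardened sketch of the same function (the `timeout` value here is my own assumption, not from the original code) might check for HTTP errors before handing back the text:

```python
import requests

def get_url_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

`raise_for_status()` turns a bad status code into an exception instead of silently returning an error page for BeautifulSoup to choke on.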
I used the Chrome developer tools to inspect the HTML source and determine which HTML element has the required info. The first item I am trying to read is the city name, and it resides in an H1 tag inside a div with id="location". In a similar way, 'condition' is in a div with id='curCond' and class='wx-value', and so on. So based on the HTML and CSS info, let us grab the required information from the web page.
def get_weather_from_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    loc = soup.find(id='location').find('h1').get_text().strip()
    loc = get_city(loc)
    condition = soup.find(id='curCond').find(class_='wx-value').get_text()
    condition = clean_text(condition)
    temp = soup.find(id='curTemp').find(class_='wx-data').find(class_='wx-value').get_text()
    temp = clean_text(temp)
    unit = soup.find(id='curTemp').find(class_='wx-data').find(class_='wx-unit').get_text()
    unit = clean_text(unit)
    # print("Weather @ {0} : Temperature = {1}{2} {3}".format(loc, temp, unit, condition))
    # return named tuple
    w_report = WeatherReport(location=loc, temperature=temp, scale=unit, cond=condition)
    return w_report
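To make the `find` / `class_` lookup pattern concrete, here is the same style of query run against a tiny hand-written snippet that mimics the structure described above (the HTML here is invented for illustration, not taken from the actual weather underground page):

```python
from bs4 import BeautifulSoup

sample = """
<div id="location"><h1>  San Diego, CA </h1></div>
<div id="curCond"><span class="wx-value">Clear</span></div>
"""

soup = BeautifulSoup(sample, 'html.parser')
# find(id=...) locates the element by its id attribute,
# find(class_=...) by its CSS class (class_ avoids the Python keyword)
city = soup.find(id='location').find('h1').get_text().strip()
cond = soup.find(id='curCond').find(class_='wx-value').get_text()
print(city, '-', cond)  # San Diego, CA - Clear
```

The trailing underscore in `class_` is required because `class` is a reserved word in Python.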
The above function returns its results as a named tuple, which is handy when a function has to return several values at once. This is how we create a namedtuple; it is part of the collections module.
# create a named tuple for returning multiple values
WeatherReport = collections.namedtuple('WeatherReport', 'location, temperature, scale, cond')
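Once created, the fields of a namedtuple can be accessed by name, which is what makes the return value self-documenting, while it still behaves like an ordinary tuple:

```python
import collections

WeatherReport = collections.namedtuple('WeatherReport', 'location, temperature, scale, cond')

# sample values for illustration
report = WeatherReport(location='San Diego, CA', temperature='75.8', scale='F', cond='Clear')
print(report.location)     # San Diego, CA
print(report.temperature)  # 75.8

# it still unpacks like a regular tuple
loc, temp, scale, cond = report
```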
It is also important to note that the results might need some clean-up. For example, the city name we got back has a newline character in it, so we need to do some clean-up there.
def clean_text(text: str):
    if not text:
        return text
    return text.strip()

def get_city(text: str):
    if not text:
        return text
    text = clean_text(text)
    parts = text.split('\n')
    return parts[0]
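A quick illustration of the two helpers in action (the sample string below is invented; the real page text may differ):

```python
def clean_text(text: str):
    if not text:
        return text
    return text.strip()

def get_city(text: str):
    if not text:
        return text
    text = clean_text(text)
    return text.split('\n')[0]

raw = "  San Diego, CA\nElev 662 ft  "
print(get_city(raw))  # San Diego, CA
```

`clean_text` strips the surrounding whitespace, and `get_city` keeps only the first line, discarding whatever the site tacks on after the newline.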
Given below is the __main__ entry point.
if __name__ == '__main__':
    print_header()
    zip_code = input("Enter your zip : ")
    url = base_url + '/' + zip_code
    html = get_url_html(url)
    w_report = get_weather_from_html(html)
    print("The weather in {} is {}{} and {}".format(w_report.location, w_report.temperature,
                                                    w_report.scale, w_report.cond))
Result
--------------------------------------
Weather App
--------------------------------------
Enter your zip : 92127
The weather in San Diego, CA is 75.8°F and Clear
Process finished with exit code 0
Coding is fun, enjoy…