Practice Python — Web Scraping : IMDB US Box Office
Web scraping is a form of data scraping used to extract data from websites. We can build a web scraping tool with Python.
In this tutorial I will show you how to extract simple data from IMDB using the requests and BeautifulSoup (bs4) modules. We will retrieve the top US box office list from https://www.imdb.com/chart/boxoffice/ .
First, we will design the project layout. We will create a package named boxoffice, which main.py will call later on. We will also create a requirements.txt file that lists the necessary packages; in this case we need requests and bs4. The file structure of our project will look like this:
boxoffice/
    __init__.py
main.py
requirements.txt
We will write the code in the file __init__.py under the boxoffice folder. First, we import requests and bs4:
import requests
from bs4 import BeautifulSoup
Next, we create a function named extract. Its first step is to fetch the response and parse the text from https://www.imdb.com/chart/boxoffice/ . The snippets in the rest of this tutorial all go inside this function.
def extract():
    # Download the page and parse its HTML
    response = requests.get('https://www.imdb.com/chart/boxoffice/')
    soup = BeautifulSoup(response.text, "html.parser")
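The request above assumes the download always succeeds. As a hedged sketch (not part of the original script), the helper below adds a timeout, a status check, and a browser-like User-Agent header, since some sites reject the default one sent by requests. The name fetch_soup, the header value, and the injectable get parameter are assumptions made for illustration.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://www.imdb.com/chart/boxoffice/"

# Hypothetical header value; any browser-like string would do.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; boxoffice-tutorial)"}


def fetch_soup(url=URL, get=requests.get):
    """Download url and return the parsed page.

    get is injectable so the function can be exercised without network access.
    """
    response = get(url, headers=HEADERS, timeout=10)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

Calling fetch_soup() with no arguments would request the box office page; the returned object can be used exactly like soup in the rest of the tutorial.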
Next, we check how the data is structured on this page. At the time of writing, the top US box office list looked like the capture below.
If we inspect the page with Chrome's developer tools and hover over "Halloween Kills", we get the following HTML:
<td class="titleColumn"><a href="/title/tt10665338?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=f9f31d04-fc22-4d12-86b4-f46e25aa2f6f&pf_rd_r=0N7Z36ETMH3T06CAA6RZ&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=boxoffice&ref_=cht_bo_1" title="David Gordon Green (dir.), Jamie Lee Curtis, Judy Greer">Halloween Kills</a></td>
To select all cells of this kind, we use the following command:
movies = soup.select("td.titleColumn")
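To see what this selection returns, the sketch below runs it against a trimmed copy of the cell shown above (the long query string in the href is cut for brevity, and the surrounding table tags are added only to make the sample well-formed). It shows that each match is a tag whose text and nested attributes can be read directly.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the cell from the page (query string removed for brevity)
SAMPLE = """
<table><tr>
<td class="titleColumn"><a href="/title/tt10665338" title="David Gordon Green (dir.), Jamie Lee Curtis, Judy Greer">Halloween Kills</a></td>
</tr></table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
movies = soup.select("td.titleColumn")  # list of matching <td> tags

first = movies[0]
title = first.get_text().strip()  # text inside the cell
link = first.a["href"]            # href attribute of the nested <a>
people = first.a["title"]         # director and cast from the title attribute

print(title)
print(link)
print(people)
```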
The weekend and gross values appear in the following HTML. Both cells use the same class, so a single selection returns both:
<td class="ratingColumn">$50.4M</td>
<td class="ratingColumn"><span class="secondaryInfo">$50.4M</span></td>
income = soup.select("td.ratingColumn")
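Because the weekend and gross cells share the class ratingColumn, the selection returns them interleaved: weekend, gross, weekend, gross, and so on. The sketch below illustrates this on a small hand-written table; the second row is made-up placeholder data, not real box office figures.

```python
from bs4 import BeautifulSoup

# Two rows in the shape shown above; the second row's values are placeholders.
SAMPLE = """
<table>
<tr><td class="ratingColumn">$50.4M</td>
    <td class="ratingColumn"><span class="secondaryInfo">$50.4M</span></td></tr>
<tr><td class="ratingColumn">$1.0M</td>
    <td class="ratingColumn"><span class="secondaryInfo">$2.0M</span></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
income = [td.get_text().strip() for td in soup.select("td.ratingColumn")]

weekend = income[0::2]  # even positions hold the weekend values
gross = income[1::2]    # odd positions hold the gross values

print(weekend)
print(gross)
```

This is why a loop over the rows reads the weekend value at index i*2 and the gross value at index i*2+1.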
The last value we will retrieve is the number of weeks. Its HTML looks like this:
<td class="weeksColumn">1</td>
To retrieve the value, we use the following command:
weeks = soup.select("td.weeksColumn")
Next, we use a for loop to collect all 10 entries of the box office list:
result = []
# The chart lists 10 movies, so loop over indexes 0 to 9
for i in range(10):
    movies_title = movies[i].get_text().split(",")[0].strip()
    # income holds two cells per movie: weekend first, then gross
    weekend = income[i*2].get_text().split(",")[0].strip()
    gross = income[i*2+1].get_text().split(",")[0].strip()
    weeks_long = weeks[i].get_text()
    data = {"movie": movies_title,
            "weekend": weekend,
            "gross": gross,
            "weeks": weeks_long
            }
    result.append(data)
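The loop above depends on the page having exactly 10 rows. As an alternative sketch, the three selections can be paired with zip so the loop adapts to however many rows the page returns. The function name parse_box_office and the sample table are assumptions for illustration; the sample's second row is placeholder data.

```python
from bs4 import BeautifulSoup


def parse_box_office(html):
    """Return one dict per chart row found in html."""
    soup = BeautifulSoup(html, "html.parser")
    movies = soup.select("td.titleColumn")
    income = soup.select("td.ratingColumn")
    weeks = soup.select("td.weeksColumn")

    result = []
    # income[0::2] are the weekend cells, income[1::2] the gross cells;
    # zip stops at the shortest sequence, so no index can go out of range
    for movie, weekend, gross, week in zip(movies, income[0::2], income[1::2], weeks):
        result.append({
            "movie": movie.get_text().strip(),
            "weekend": weekend.get_text().strip(),
            "gross": gross.get_text().strip(),
            "weeks": week.get_text().strip(),
        })
    return result


# Hand-written sample; the second row is placeholder data.
SAMPLE = """
<table>
<tr><td class="titleColumn"><a href="/title/tt10665338">Halloween Kills</a></td>
    <td class="ratingColumn">$50.4M</td>
    <td class="ratingColumn"><span class="secondaryInfo">$50.4M</span></td>
    <td class="weeksColumn">1</td></tr>
<tr><td class="titleColumn"><a href="/title/tt0000000">Example Movie</a></td>
    <td class="ratingColumn">$1.0M</td>
    <td class="ratingColumn"><span class="secondaryInfo">$2.0M</span></td>
    <td class="weeksColumn">3</td></tr>
</table>
"""

rows = parse_box_office(SAMPLE)
for row in rows:
    print(row)
```

Taking the HTML as a parameter keeps the parsing step separate from the download, so it can be tried on saved or hand-written markup.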
To print the values, we use the following script:
i = 0
for movie in result:
    i += 1
    print(f"{i}. {movie['movie']} - {movie['weekend']} - {movie['gross']} - {movie['weeks']}")
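The manual counter can also be replaced by enumerate, which yields the position together with each item. A minimal sketch, where format_rows is a hypothetical helper and the sample row reuses the values shown earlier:

```python
def format_rows(result):
    """Return one display line per movie, numbered from 1."""
    lines = []
    for i, movie in enumerate(result, start=1):
        lines.append(
            f"{i}. {movie['movie']} - {movie['weekend']} - {movie['gross']} - {movie['weeks']}"
        )
    return lines


sample = [{"movie": "Halloween Kills", "weekend": "$50.4M", "gross": "$50.4M", "weeks": "1"}]
for line in format_rows(sample):
    print(line)  # prints: 1. Halloween Kills - $50.4M - $50.4M - 1
```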
In main.py, write the following script:
from boxoffice import extract

if __name__ == '__main__':
    extract()
If you run main.py, it prints a numbered list showing each movie's title, weekend earnings, gross, and number of weeks.
The complete scripts, including the boxoffice package's __init__.py file, are available on GitHub.