Nasir Noor
2 min read · Oct 18, 2021


Practice Python — Web Scraping : IMDB US Box Office

Web scraping is a form of data scraping used to extract data from websites. We can build a web scraping tool using Python.

In this tutorial I will show you how to extract simple data from IMDB using the requests and beautifulsoup modules. We will retrieve the list of top box office movies in the US from this link: https://www.imdb.com/chart/boxoffice/ .

First, we lay out the project. We will create a package named boxoffice, which main.py will call later on. We also create a requirements.txt file that lists the necessary packages; in this case we need requests and bs4. The project will contain a boxoffice folder holding __init__.py, a main.py file next to that folder, and requirements.txt.

We will write the code in the file __init__.py under the boxoffice folder. First we import requests and bs4:

import requests
from bs4 import BeautifulSoup

Next we create a function named extract. Its first step is to get the response from https://www.imdb.com/chart/boxoffice/ and parse its text:

response = requests.get('https://www.imdb.com/chart/boxoffice/')
soup = BeautifulSoup(response.text, "html.parser")
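The two lines above assume the request always succeeds. A slightly more defensive sketch could look like the following. The helper name fetch_soup, the User-Agent string, and the 10-second timeout are illustrative assumptions and not part of the original project; some sites reject requests that lack a browser-like User-Agent, so sending one may help.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://www.imdb.com/chart/boxoffice/"

# Assumed header value; any descriptive User-Agent string could be used here.
HEADERS = {"User-Agent": "Mozilla/5.0 (boxoffice tutorial script)"}


def fetch_soup(url=URL, get=requests.get):
    """Download a page and return it as a BeautifulSoup object.

    `get` defaults to requests.get; it is a parameter only so the
    function can be exercised without network access.
    """
    response = get(url, headers=HEADERS, timeout=10)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

Calling fetch_soup() with no arguments would perform the real request; the returned object can then be used exactly like soup in the rest of this tutorial.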

Next we check how the data is structured on this page. At the time of writing, the top box office list in the US looked like the capture below.

If we open the developer tools in Chrome and hover over "Halloween Kills", we see the following markup.

<td class="titleColumn"><a href="/title/tt10665338?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=f9f31d04-fc22-4d12-86b4-f46e25aa2f6f&amp;pf_rd_r=0N7Z36ETMH3T06CAA6RZ&amp;pf_rd_s=center-1&amp;pf_rd_t=15506&amp;pf_rd_i=boxoffice&amp;ref_=cht_bo_1" title="David Gordon Green (dir.), Jamie Lee Curtis, Judy Greer">Halloween Kills</a></td>

To retrieve these cells, we use the following command:

movies = soup.select("td.titleColumn")
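To see what this selection returns, here is a minimal sketch that runs the same selector on a shortened copy of the cell shown above. The HTML string is a simplified stand-in (the tracking parameters in the link are dropped), not the live page.

```python
from bs4 import BeautifulSoup

# Simplified copy of the cell shown above (tracking parameters removed from href).
html = """
<table><tr>
<td class="titleColumn"><a href="/title/tt10665338" title="David Gordon Green (dir.), Jamie Lee Curtis, Judy Greer">Halloween Kills</a></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
movies = soup.select("td.titleColumn")  # list of matching <td> tags

cell = movies[0]
title = cell.get_text().strip()  # text inside the cell
people = cell.a["title"]         # value of the link's title attribute

print(title)
print(people)
```

select always returns a list, even for a single match, which is why the later loop indexes into movies.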

The weekend and gross values both use the class ratingColumn. Their markup, followed by the command that selects them:

<td class="ratingColumn">$50.4M</td>
<td class="ratingColumn"><span class="secondaryInfo">$50.4M</span></td>

income = soup.select("td.ratingColumn")
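Because both cells share one class, the selection returns two entries per movie: the weekend value first, then the gross value. The sketch below illustrates this on a small hand-written table; the second row's values are made up for illustration.

```python
from bs4 import BeautifulSoup

# Two sample rows; the values in the second row are invented placeholders.
html = """
<table>
<tr><td class="ratingColumn">$50.4M</td>
    <td class="ratingColumn"><span class="secondaryInfo">$50.4M</span></td></tr>
<tr><td class="ratingColumn">$1.0M</td>
    <td class="ratingColumn"><span class="secondaryInfo">$2.0M</span></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
income = soup.select("td.ratingColumn")

# get_text() also returns text nested inside the <span>.
values = [cell.get_text().strip() for cell in income]
weekend = values[0::2]  # even positions: weekend value of each movie
gross = values[1::2]    # odd positions: gross value of each movie

print(weekend)
print(gross)
```

This alternating order is why movie number i needs positions i*2 and i*2+1 of the list.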

The last value we will retrieve is the number of weeks. Its markup looks like this:

<td class="weeksColumn">1</td>

To retrieve the value, we use the following command:

weeks = soup.select("td.weeksColumn")

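Next we use a for loop to collect every movie in the box office list (10 at the time of writing). Each movie has two ratingColumn cells, so its weekend value sits at index i*2 and its gross value at i*2+1.

result = []
for i in range(len(movies)):
    # Clean up the text of each cell
    movies_title = movies[i].get_text().split(",")[0].strip()
    # income holds two cells per movie: weekend first, then gross
    weekend = income[i*2].get_text().split(",")[0].strip()
    gross = income[i*2+1].get_text().split(",")[0].strip()
    weeks_long = weeks[i].get_text()
    data = {"movie": movies_title,
            "weekend": weekend,
            "gross": gross,
            "weeks": weeks_long
            }
    result.append(data)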
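The same bookkeeping can be written without index arithmetic by using zip. This is a sketch of an alternative, not the code from the project: it assumes the cell texts have already been extracted into plain lists of strings, and the function name build_rows and the second sample movie are hypothetical.

```python
def build_rows(titles, income, weeks):
    """Combine per-column text lists into one dict per movie.

    `income` holds two entries per movie (weekend, then gross),
    matching the order of the td.ratingColumn cells.
    """
    weekend = income[0::2]
    gross = income[1::2]
    rows = []
    # zip stops at the shortest list, so a short page cannot raise IndexError.
    for title, wk_end, total, n_weeks in zip(titles, weekend, gross, weeks):
        rows.append({"movie": title, "weekend": wk_end,
                     "gross": total, "weeks": n_weeks})
    return rows


rows = build_rows(
    ["Halloween Kills", "Example Movie"],    # second title is a placeholder
    ["$50.4M", "$50.4M", "$1.0M", "$2.0M"],  # weekend, gross, weekend, gross
    ["1", "2"],
)
print(rows[0])
```

The trade-off is that zip silently drops movies whose cells are missing, whereas the indexed loop would fail loudly.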

To print the values, we use the following script:

# enumerate numbers the movies starting from 1
for i, movie in enumerate(result, start=1):
    print(f"{i}. {movie['movie']} - {movie['weekend']} - {movie['gross']} - {movie['weeks']}")

In main.py, write the following script:

from boxoffice import extract

if __name__ == '__main__':
    extract()

If you run main.py, you will see the following list:

Package boxoffice file __init__.py
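Since the steps were shown piece by piece, here is a sketch of how they might fit together in boxoffice/__init__.py. It is an assembly under assumptions rather than the author's exact file: the helper name parse, the timeout, and the raise_for_status call are additions, the split(",") calls are dropped on the assumption that they are not needed for these cells, and the selectors only work while the page uses the class names shown in this article.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://www.imdb.com/chart/boxoffice/"


def parse(html):
    """Turn the page HTML into a list of dicts, one per movie."""
    soup = BeautifulSoup(html, "html.parser")
    movies = soup.select("td.titleColumn")
    income = soup.select("td.ratingColumn")  # two cells per movie
    weeks = soup.select("td.weeksColumn")

    result = []
    for i in range(len(movies)):
        result.append({
            "movie": movies[i].get_text().strip(),
            "weekend": income[i * 2].get_text().strip(),
            "gross": income[i * 2 + 1].get_text().strip(),
            "weeks": weeks[i].get_text().strip(),
        })
    return result


def extract():
    """Fetch the chart, print it as a numbered list, and return the rows."""
    response = requests.get(URL, timeout=10)
    response.raise_for_status()  # stop on 4xx/5xx responses
    result = parse(response.text)
    for i, movie in enumerate(result, start=1):
        print(f"{i}. {movie['movie']} - {movie['weekend']} - "
              f"{movie['gross']} - {movie['weeks']}")
    return result
```

Keeping the parsing in its own function means it can be tried on a saved HTML string without hitting the network, while main.py still only needs to call extract().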

You can see the complete scripts on GitHub:

https://github.com/nasirnooruddin/boxoffice
