In this section of the report, I'll provide a brief introduction to the dataset I've selected for analysis. At the end of this section, I will describe the questions that I plan to explore over the course of the report.
Introduction to dataset:¶
I will be using the TMDb movie dataset. It contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.
# Import all the packages needed for analysis
import numpy as np
import pandas as pd
import csv
from datetime import datetime
import matplotlib.pyplot as plt
# Render our visualizations inline in the notebook
%matplotlib inline
Let's load the data and check some rows from the dataset to identify the questions that can be answered:
# Reading a csv file and storing the dataset in pandas dataframe variable tmdb_data
tmdb_data = pd.read_csv('Resources/tmdb-movies.csv')
# Printing the first five rows of the dataset
tmdb_data.head()
Questions that can be answered by looking at the dataset are:¶
Some general questions that can be answered are:¶
- Which movie had the highest and lowest profit?
- Which movie had the greatest and least runtime?
- What is the average runtime of all movies?
- Which movie had the highest and lowest budget?
- Which movie had the highest and lowest revenue?
Some questions that can be answered for movies with a profit of more than 25M dollars:¶
- What is the average budget of these movies?
- What is the average revenue of these movies?
- What is the average runtime of these movies?
- Which are the successful genres?
- Which cast members appear most frequently?
In this section of the report, I will check for cleanliness, and then trim and clean my dataset for analysis.
Observations from the dataset above are:¶
- The dataset does not state the currency of the budget and revenue columns, so we will assume they are in dollars.
- The vote count differs from movie to movie, so the values in the vote average column are not directly comparable.
Let's check the dataset and see what cleaning it requires.
# Perform operations to inspect data types and look for instances of missing or possibly errant data.
# Printing the first five rows of the dataset
tmdb_data.head()
# Let us check some summary statistics of the data
tmdb_data.describe()
# Let us check information on the datatypes of columns and missing values.
tmdb_data.info()
# Calculate the number of duplicated rows.
tmdb_data.duplicated().sum()
Cleaning that needs to be performed based on the data above:¶
- First, remove columns such as 'id', 'imdb_id', 'popularity', 'budget_adj', 'revenue_adj', 'homepage', 'keywords', 'director', 'tagline', 'overview', 'production_companies', 'vote_count' and 'vote_average'.
- Second, delete the one duplicated row that we have in our dataset.
- Third, many movies have a budget or revenue of 0, which means that the value was not recorded. We need to discard these rows, since we cannot calculate the profit of such movies.
- Fourth, convert the 'release_date' column to a date format.
- Fifth, convert the budget and revenue columns to the int datatype.
- Sixth, replace runtime values of 0 with NaN, since they would distort the results.
First, remove columns such as 'id', 'imdb_id', 'popularity', 'budget_adj', 'revenue_adj', 'homepage', 'keywords', 'director', 'tagline', 'overview', 'production_companies', 'vote_count' and 'vote_average'.
# Columns that need to be deleted
deleted_columns = [ 'id', 'imdb_id', 'popularity', 'budget_adj', 'revenue_adj', 'homepage', 'keywords', 'director', 'tagline', 'overview', 'production_companies', 'vote_count', 'vote_average']
# Drop the columns from the dataframe
tmdb_data.drop(deleted_columns, axis=1, inplace=True)
# Let's look at the new dataset
tmdb_data.head()
Let's see the number of entries in our dataset now.
# Store the number of rows and columns using the shape attribute.
rows, col = tmdb_data.shape
# shape counts only data rows (the header is not included), so no adjustment is needed.
print('We have {} total rows and {} columns.'.format(rows, col))
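As a quick sanity check on how `shape` counts rows, here is a minimal sketch on a toy frame (the titles and values are made up for illustration). The header holds the column labels and is not counted as a row.

```python
import pandas as pd

# Toy frame with made-up values: three data rows, two columns.
toy = pd.DataFrame({
    'original_title': ['A', 'B', 'C'],
    'runtime': [90, 120, 105],
})

# shape returns (number of data rows, number of columns);
# the column labels are stored separately in toy.columns.
n_rows, n_cols = toy.shape
print('{} rows and {} columns'.format(n_rows, n_cols))
```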
Second, let's delete the one duplicated row that we have in our dataset.
# Drop duplicate rows but keep the first one
tmdb_data.drop_duplicates(keep = 'first', inplace = True)
# Store the number of rows and columns using the shape attribute.
rows, col = tmdb_data.shape
print('Now we have {} total rows and {} columns.'.format(rows, col))
Third, many movies have a budget or revenue of 0, which means that the value was not recorded. We need to discard these rows, since we cannot calculate the profit of such movies.
# Columns that need to be checked.
columns = ['budget', 'revenue']
# Replace 0 with NaN (np.nan; the np.NaN alias was removed in NumPy 2.0)
tmdb_data[columns] = tmdb_data[columns].replace(0, np.nan)
# Drop rows which contain NaN
tmdb_data.dropna(subset = columns, inplace = True)
rows, col = tmdb_data.shape
print('We now have only {} rows.'.format(rows))
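To gauge how much data this step discards, the zero entries can be counted before dropping. The sketch below uses a toy frame with made-up numbers; on the real data, `toy` would be `tmdb_data` as it stands before the replacement.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): 0 stands for "not recorded".
toy = pd.DataFrame({
    'budget':  [100, 0, 50, 0, 30],
    'revenue': [250, 40, 0, 0, 10],
})
money_cols = ['budget', 'revenue']

# Zeros per column, and rows where at least one of the two is zero.
zeros_per_column = (toy[money_cols] == 0).sum()
rows_with_zero = (toy[money_cols] == 0).any(axis=1).sum()

# Same cleaning as above: zeros become NaN, then those rows are dropped.
cleaned = toy.copy()
cleaned[money_cols] = cleaned[money_cols].replace(0, np.nan)
cleaned = cleaned.dropna(subset=money_cols)

print(zeros_per_column)
print('Dropped {} of {} rows'.format(rows_with_zero, len(toy)))
```

Reporting this count alongside the results makes it easier to judge how much the cleaning may have shaped them.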
Fourth, the 'release_date' column must be converted into a date format.
# Convert column release_date to datetime
tmdb_data['release_date'] = pd.to_datetime(tmdb_data['release_date'])
# Let's look at the new dataset
tmdb_data.head()
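One thing worth checking after this conversion, assuming the CSV stores dates with two-digit years (for example `11/15/66`): such years are resolved with a pivot, so a 1966 release may be read as 2066. The sketch below uses a toy frame and assumes a `release_year` column holding the four-digit year is available. Both the date layout and that column are assumptions for illustration, not something verified here against the file.

```python
import pandas as pd

# Toy rows; the m/d/yy layout and the release_year column are assumptions.
toy = pd.DataFrame({
    'release_date': ['11/15/66', '6/9/15'],
    'release_year': [1966, 2015],
})

# With %y, years 00-68 map to 2000-2068 and 69-99 to 1969-1999.
parsed = pd.to_datetime(toy['release_date'], format='%m/%d/%y')

# Rebuild each date from the four-digit year plus the parsed month and day.
fixed = pd.to_datetime(pd.DataFrame({
    'year': toy['release_year'],
    'month': parsed.dt.month,
    'day': parsed.dt.day,
}))

print(parsed.dt.year.tolist())
print(fixed.dt.year.tolist())
```

If the real column already parses to the expected years, no correction is needed.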
Fifth, convert the budget and revenue columns to the int datatype.
# Columns whose datatype will be converted
columns = ['budget', 'revenue']
# Convert budget and revenue columns to int (astype replaces applymap, which is deprecated in recent pandas versions)
tmdb_data[columns] = tmdb_data[columns].astype('int64')
# Let's look at the new datatypes
tmdb_data.dtypes
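A minimal sketch of this conversion on toy values: after zeros are replaced with NaN the column holds floats, and once the NaN rows are gone it can be cast back to integers. The numbers are made up.

```python
import numpy as np
import pandas as pd

# Toy column (made-up values) as it looks after replacing 0 with NaN.
toy = pd.DataFrame({'budget': [1.5e8, np.nan, 2.0e7]})
print(toy['budget'].dtype)

# Casting to int fails while NaN is present, so drop those rows first.
toy = toy.dropna(subset=['budget']).astype({'budget': 'int64'})
print(toy['budget'].dtype)
```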
Sixth, replace runtime values of 0 with NaN, since they would distort the results.
# Replace runtime values of 0 with NaN so that they do not distort the statistics.
tmdb_data['runtime'] = tmdb_data['runtime'].replace(0, np.nan)
# Check the stats of dataset
tmdb_data.describe()
We have now completed our Data Wrangling section.
We will now compute statistics and create visualizations with the goal of addressing the research questions that we posed in the Introduction section.
So we will first add a column for profit in our dataset.
# To calculate profit, we need to subtract the budget from the revenue.
tmdb_data['profit'] = tmdb_data['revenue'] - tmdb_data['budget']
# Let's look at the new dataset
tmdb_data.head()
# Movie with highest profit
tmdb_data.loc[tmdb_data['profit'].idxmax()]
# Movie with lowest profit
tmdb_data.loc[tmdb_data['profit'].idxmin()]
Which movie had the highest and lowest profit?¶
Highest :: Avatar with profit of 2544505847 dollars
Lowest :: The Warrior's Way with profit of -413912431 dollars
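Since the same highest/lowest lookup is repeated for several columns, it could be wrapped in a small helper. This is a sketch only: `extremes` is a hypothetical name and the toy rows are made up.

```python
import pandas as pd

def extremes(df, column):
    """Return the rows holding the max and min of `column`, side by side."""
    highest = df.loc[df[column].idxmax()]
    lowest = df.loc[df[column].idxmin()]
    return pd.concat([highest, lowest], axis=1, keys=['highest', 'lowest'])

# Toy data with made-up titles and numbers.
toy = pd.DataFrame({
    'original_title': ['A', 'B', 'C'],
    'profit': [500, -20, 130],
})

result = extremes(toy, 'profit')
print(result)
```

On the real data the call would look like `extremes(tmdb_data, 'profit')`, and the same function would serve for runtime, budget and revenue.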
# Movie with greatest runtime
tmdb_data.loc[tmdb_data['runtime'].idxmax()]
# Movie with least runtime
tmdb_data.loc[tmdb_data['runtime'].idxmin()]
Which movie had the greatest and least runtime?¶
Greatest :: Carlos with runtime of 338 minutes
Least :: Kid's Story with runtime of 15 minutes
# Average runtime of movies
tmdb_data['runtime'].mean()
What is the average runtime of all movies?¶
So the average runtime of the movies is 109.22 minutes
Let us plot a histogram for the same.
# x-axis
plt.xlabel('Runtime of Movies in Minutes')
# y-axis
plt.ylabel('Number of Movies')
# Title of the histogram
plt.title('Runtime distribution of all the movies')
# Plot a histogram
plt.hist(tmdb_data['runtime'], bins = 50)
We can see that most of the movies have a runtime in the range of 100 to 120 minutes.
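The reading of the histogram can be checked numerically by computing the share of movies whose runtime falls in a given range. This is a sketch on made-up runtimes; on the real data the Series would be `tmdb_data['runtime']`.

```python
import pandas as pd

# Made-up runtimes in minutes.
runtime = pd.Series([88, 95, 101, 104, 110, 112, 118, 125, 140, 180])

# between() is inclusive on both ends by default.
in_range = runtime.between(100, 120)
share = in_range.mean()

print('{:.0%} of movies run between 100 and 120 minutes'.format(share))
```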
Let us check if there is a relation between runtime and profit.
# x-axis
plt.xlabel('Runtime in Minutes')
# y-axis
plt.ylabel('Profit in Dollars')
# Title of the scatter plot
plt.title('Relationship between runtime and profit')
plt.scatter(tmdb_data['runtime'], tmdb_data['profit'], alpha=0.5)
plt.show()
Most of the movies have a runtime in the range of 85 to 120 minutes.
# Movie with highest budget
tmdb_data.loc[tmdb_data['budget'].idxmax()]
# Movie with lowest budget
tmdb_data.loc[tmdb_data['budget'].idxmin()]
Which movie had the highest and lowest budget?¶
Highest :: The Warrior's Way with budget of 425000000 dollars
Lowest :: Lost & Found with budget of 1 dollar
Let us check if there is a relation between budget and profit.
# x-axis
plt.xlabel('Budget in Dollars')
# y-axis
plt.ylabel('Profit in Dollars')
# Title of the scatter plot
plt.title('Relationship between budget and profit')
plt.scatter(tmdb_data['budget'], tmdb_data['profit'], alpha=0.5)
plt.show()
We can see that there is no clear relationship between budget and profit. However, very few films with a budget greater than 20M dollars failed to make a profit.
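To put a number on this observation, one option is to compare the share of loss-making films above and below the 20M budget mark. The sketch uses made-up values, and the group labels are hypothetical.

```python
import pandas as pd

# Made-up budgets and profits in dollars.
toy = pd.DataFrame({
    'budget': [5e6, 1e7, 1.5e7, 3e7, 8e7, 1.5e8],
    'profit': [-1e6, 4e6, -2e6, 2e7, -5e6, 3e8],
})

# Label each film by budget group and flag the ones that lost money.
big_budget = toy['budget'] > 20_000_000
group = big_budget.map({True: 'over 20M', False: 'up to 20M'})
lost_money = toy['profit'] < 0

# The mean of a boolean column is the share of True values per group.
loss_share = lost_money.groupby(group).mean()
print(loss_share)
```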
# Movie with highest revenue
tmdb_data.loc[tmdb_data['revenue'].idxmax()]
# Movie with lowest revenue
tmdb_data.loc[tmdb_data['revenue'].idxmin()]
Which movie had the highest and lowest revenue?¶
Highest :: Avatar with revenue of 2781505847 dollars
Lowest :: Shattered Glass with revenue of 2 dollars
Let us check if there is a relation between revenue and profit.
# x-axis
plt.xlabel('Revenue in Dollars')
# y-axis
plt.ylabel('Profit in Dollars')
# Title of the scatter plot
plt.title('Relationship between revenue and profit')
plt.scatter(tmdb_data['revenue'], tmdb_data['profit'], alpha=0.5)
plt.show()
We can see that there is a strong relationship between profit and revenue: the higher the revenue, the higher the profit. This is partly by construction, since profit is computed as revenue minus budget.
Let us check if there is a relation between budget and revenue.
# x-axis
plt.xlabel('Revenue in Dollars')
# y-axis
plt.ylabel('Budget in Dollars')
# Title of the scatter plot
plt.title('Relationship between revenue and budget')
plt.scatter(tmdb_data['revenue'], tmdb_data['budget'], alpha=0.5)
plt.show()
Most of the movies have a revenue of up to 50M dollars.
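The scatter plots can be complemented with Pearson correlation coefficients, which summarize each linear relationship in a single number. This is a sketch on made-up values. Because profit is computed as revenue minus budget, some correlation between profit and revenue is expected by construction.

```python
import pandas as pd

# Made-up values (money in arbitrary units, runtime in minutes).
toy = pd.DataFrame({
    'budget':  [10, 20, 30, 40, 50],
    'revenue': [15, 60, 45, 120, 200],
    'runtime': [95, 110, 100, 130, 120],
})
toy['profit'] = toy['revenue'] - toy['budget']

# Pairwise Pearson correlations (the default method of corr()).
corr = toy[['budget', 'revenue', 'profit', 'runtime']].corr()
print(corr.round(2))
```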
Since all the remaining questions will be answered with respect to profit, we will now filter our dataset to include only movies that made a profit of more than 25M dollars.
# Dataframe of movies with a profit of 25M dollars or more (>= includes the threshold itself).
tmdb_profit_data = tmdb_data[tmdb_data['profit'] >= 25000000]
# Reindexing the dataframe
tmdb_profit_data.index = range(len(tmdb_profit_data))
# Showing the dataset
tmdb_profit_data.head()
# Printing the info of the new dataframe
tmdb_profit_data.info()
We can see that we have 1791 movies with a profit of more than 25M dollars.
# Find the average budget of movies which made a profit of more than 25M dollars
tmdb_profit_data['budget'].mean()
What is the average budget of movies with a profit of more than 25M dollars?¶
So the average budget of these movies is 51870307.75 dollars.
# Find the average revenue of movies which made a profit of more than 25M dollars
tmdb_profit_data['revenue'].mean()
What is the average revenue of movies with a profit of more than 25M dollars?¶
So the average revenue of these movies is 206359440.87 dollars.
# Find the average runtime of movies which made a profit of more than 25M dollars
tmdb_profit_data['runtime'].mean()
What is the average runtime of movies with a profit of more than 25M dollars?¶
So the average runtime of these movies is 112.56 minutes.
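Budgets and revenues can be strongly right-skewed, so a few very large films may pull the mean upward. As a sketch with made-up numbers, `agg` can report the mean and the median side by side, which helps judge how representative the averages are.

```python
import pandas as pd

# Made-up figures; one blockbuster dominates the revenue column.
toy = pd.DataFrame({
    'budget':  [20, 30, 40, 50, 260],
    'revenue': [60, 80, 90, 120, 1650],
    'runtime': [95, 100, 110, 115, 130],
})

# One row per statistic, one column per variable.
summary = toy.agg(['mean', 'median'])
print(summary)
```

In this toy example the mean budget is twice the median, while the two statistics coincide for runtime.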
# Join the whole column into one string separated by |, split it on |, and count how many times each value occurs.
genres_count = pd.Series(tmdb_profit_data['genres'].str.cat(sep = '|').split('|')).value_counts(ascending = False)
genres_count
Which are the successful genres among movies with a profit of more than 25M dollars?¶
So the top 10 genres are Drama, Comedy, Action, Thriller, Adventure, Romance, Crime, Family, Science Fiction, Fantasy
Let's visualize this with a plot
# Initialize the plot
diagram = genres_count.plot.bar(fontsize = 8)
# Set a title
diagram.set(title = 'Top Genres')
# x-label and y-label
diagram.set_xlabel('Type of genres')
diagram.set_ylabel('Number of Movies')
# Show the plot
plt.show()
We can see in the visualization that Drama is the most common genre among movies with a profit of more than 25M dollars.
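A high count among profitable movies may simply reflect that a genre is common overall. One way to separate the two, sketched below with made-up rows, is to compute for each genre the share of its movies that reach the 25M profit threshold. The column names `hit`, `movies` and `hit_share` are hypothetical.

```python
import pandas as pd

# Made-up rows; profit in dollars.
toy = pd.DataFrame({
    'genres': ['Drama|Comedy', 'Drama', 'Action', 'Action|Drama', 'Comedy'],
    'profit': [30_000_000, 1_000_000, 90_000_000, 40_000_000, -5_000_000],
})
toy['hit'] = toy['profit'] >= 25_000_000

# One row per (movie, genre) pair.
per_genre = toy.assign(genre=toy['genres'].str.split('|')).explode('genre')

# Number of movies and share of hits for each genre.
stats = per_genre.groupby('genre')['hit'].agg(['count', 'mean'])
stats.columns = ['movies', 'hit_share']
print(stats.sort_values('hit_share', ascending=False))
```

On the real data this would be run on the full cleaned `tmdb_data` rather than on the already filtered subset, so that unprofitable movies are part of the denominator.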
# Join the whole column into one string separated by |, split it on |, and count how many times each value occurs.
cast_count = pd.Series(tmdb_profit_data['cast'].str.cat(sep = '|').split('|')).value_counts(ascending = False)
cast_count.head(20)
Which cast members appear most frequently in movies with a profit of more than 25M dollars?¶
So the top 5 cast members are Tom Cruise, Tom Hanks, Brad Pitt, Robert De Niro, Bruce Willis
Let's visualize this with a plot
# Initialize the plot
diagram = cast_count.head(20).plot.barh(fontsize = 8)
# Set a title
diagram.set(title = 'Top Cast')
# x-label and y-label
diagram.set_xlabel('Number of Movies')
diagram.set_ylabel('List of cast')
# Show the plot
plt.show()
We can see in the visualization that Tom Cruise appears most often in the cast of movies with a profit of more than 25M dollars.
So the conclusion is that if we want to create movies which can give us a profit of more than 25M dollars, then:¶
The average budget of the movies can be around 51870307.75 dollars
The average runtime of the movies can be around 112.56 minutes
The top 10 genres we should focus on are Drama, Comedy, Action, Thriller, Adventure, Romance, Crime, Family, Science Fiction, Fantasy
The top 5 cast members we should focus on are Tom Cruise, Tom Hanks, Brad Pitt, Robert De Niro, Bruce Willis
The average revenue of the movies will be around 206359440.87 dollars
The limitations associated with the conclusions are:¶
The conclusion is not foolproof: meeting the above criteria does not guarantee that a movie will be a big hit, although it can be.
We also lost some of the data in the cleaning steps, where the revenue or budget of a movie was not recorded, which has affected our analysis.
This conclusion is therefore not free of error.