Project: Investigate a Dataset (TMDb movie Database)¶

Table of Contents¶

Introduction
Data Wrangling
Exploratory Data Analysis
Conclusions

Introduction¶

In this section of the report, I'll provide a brief introduction to the dataset I've selected for analysis. At the end of this section, I wil describe the questions that I plan on exploring over the course of the report.

Introduction to dataset:¶
I will be using TMDB movie dataset, This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.

# Import all the packages needed for analysis
import numpy as np
import pandas as pd
import csv
from datetime import datetime
import matplotlib.pyplot as plt
# to Print our visualizations
%matplotlib inline

Let's load the data and check some rows from the dataset to identify the questions that can be answered:

# Reading a csv file and storing the dataset in pandas dataframe variable tmdb_data
tmdb_data = pd.read_csv('Resources/tmdb-movies.csv')
# Printing the first five rows of the dataset
tmdb_data.head()

Questions that can be answered by looking at the datasets are:¶
Some general questions that can be answered are:¶

Which movie had the highest and lowest profit?

Which movie had the greatest and least runtime?

What is the average runtime of all movies?

Which movie had the highest and lowest budget?

Which movie had the highest and lowest revenue?

Some questions that can be answered based on the Profit of movies making more then 25M Dollars:¶

What is the average budget of the movie?

What is the average revenue of the movie?

What is the average runtime of the movie?

Which are the successfull genres?

Which are the most frequent cast involved?

Data Wrangling¶

In this section of the report, I will check for cleanliness, and then trim and clean my dataset for analysis.

Observations from above dataset are:¶

The dataset has not provided the currency for columns we will be dealing with hence we will assume it is in dollars.

Even the vote count is not same for all the movies and hence this affects the vote average column.

General Properties¶

Let's check the dataset and see what cleaning does it requires.

# Perform operations to inspect data types and look for instances of missing or possibly errant data.
# Printing the first five rows of the dataset
tmdb_data.head()

# lets us check some statistics of the data
tmdb_data.describe()

# Let us check infomation on datatypes of columns and missing values.
tmdb_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              10866 non-null float64
revenue_adj             10866 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB

# Calculate number of duplicated rows.
sum(tmdb_data.duplicated())

1

Cleaning that needs to be prformed by looking at above data:¶

First, remove columns such as 'id', 'imdb_id', 'popularity', 'budget_adj', 'revenue_adj', 'homepage', 'keywords', 'director', 'tagline', 'overview', 'production_companies', 'vote_count' and 'vote_average'.

Second, Lets delete the one duplicated row that we have in our dataset.

Third, There are lots of movies where the budget or revenue have a value of '0' which means that the values of those movies has not been recorded. So we need to discard this rows, since we cannot calculate profit of such movies

Fourth, The 'release_date' column must be converted into date format.

Fifth, Convert budget and revenue column to int datatype.

Sixth, Replace runtime value of 0 to NAN, Since it will affect the result.

Data Cleaning (Lets us perform all the steps that are discussed above for cleaning)¶

First, remove columns such as 'id', 'imdb_id', 'popularity', 'budget_adj', 'revenue_adj', 'homepage', 'keywords', 'director', 'tagline', 'overview', 'production_companies', 'vote_count' and 'vote_average'.

# Columns that needs to be deleted
deleted_columns = [ 'id', 'imdb_id', 'popularity', 'budget_adj', 'revenue_adj', 'homepage', 'keywords', 'director', 'tagline', 'overview', 'production_companies', 'vote_count', 'vote_average']
# Drop the columns from the database
tmdb_data.drop(deleted_columns, axis=1, inplace=True)
# Lets look at the new dataset
tmdb_data.head()

Let's see the number of entries in our dataset now.

# Store rows and columns using shape function.
rows, col = tmdb_data.shape
#since rows includes count of a header, we need to remove its count.
print('We have {} total rows and {} columns.'.format(rows-1, col))

We have 10865 total rows and 8 columns.

Second, Lets delete the one duplicated row that we have in our dataset.

# Drop duplicate rows but keep the first one
tmdb_data.drop_duplicates(keep = 'first', inplace = True)
# Store rows and columns using shape function.
rows, col = tmdb_data.shape
print('Now we have {} total rows and {} columns.'.format(rows-1, col))

Now we have 10864 total rows and 8 columns.

Third, There are lots of movies where the budget or revenue have a value of '0' which means that the values of those movies has not been recorded. So we need to discard this rows, since we cannot calculate profit of such movies

# Columns that need to be checked.
columns = ['budget', 'revenue']
# Replace 0 with NAN
tmdb_data[columns] = tmdb_data[columns].replace(0, np.NaN)
# Drop rows which contains NAN
tmdb_data.dropna(subset = columns, inplace = True)
rows, col = tmdb_data.shape
print('We now have only {} rows.'.format(rows-1))

We now have only 3853 rows.

Fourth, The 'release_date' column must be converted into date format.

# Convert column release_date to DateTime
tmdb_data.release_date = pd.to_datetime(tmdb_data['release_date'])
# Lets look at the new dataset
tmdb_data.head()

Fifth, Convert budget and revenue column to int datatype.

# Columns to convert datatype of
columns = ['budget', 'revenue']
# Convert budget and revenue column to int datatype
tmdb_data[columns] = tmdb_data[columns].applymap(np.int64)
# Lets look at the new datatype
tmdb_data.dtypes

budget                     int64
revenue                    int64
original_title            object
cast                      object
runtime                    int64
genres                    object
release_date      datetime64[ns]
release_year               int64
dtype: object

Sixth, Replace runtime value of 0 to NAN, Since it will affect the result.

# Replace runtime value of 0 to NAN, Since it will affect the result.
tmdb_data['runtime'] = tmdb_data['runtime'].replace(0, np.NaN)
# Check the stats of dataset
tmdb_data.describe()

We have now completed our Data Wrangling section

Exploratory Data Analysis¶

We will now compute statistics and create visualizations with the goal of addressing the research questions that we posed in the Introduction section.

Research Question 1.1 (Which movie had the highest and lowest profit?)¶

So we will first add a column for profit in our dataset.

# To calculate profit, we need to substract the budget from the revenue.
tmdb_data['profit'] = tmdb_data['revenue'] - tmdb_data['budget']
# Lets look at the new dataset
tmdb_data.head()

# Movie with highest profit
tmdb_data.loc[tmdb_data['profit'].idxmax()]

budget                                                    237000000
revenue                                                  2781505847
original_title                                               Avatar
cast              Sam Worthington|Zoe Saldana|Sigourney Weaver|S...
runtime                                                         162
genres                     Action|Adventure|Fantasy|Science Fiction
release_date                                    2009-12-10 00:00:00
release_year                                                   2009
profit                                                   2544505847
Name: 1386, dtype: object

# Movie with lowest profit
tmdb_data.loc[tmdb_data['profit'].idxmin()]

budget                                                    425000000
revenue                                                    11087569
original_title                                    The Warrior's Way
cast              Kate Bosworth|Jang Dong-gun|Geoffrey Rush|Dann...
runtime                                                         100
genres                    Adventure|Fantasy|Action|Western|Thriller
release_date                                    2010-12-02 00:00:00
release_year                                                   2010
profit                                                   -413912431
Name: 2244, dtype: object

Which movie had the highest and lowest profit?¶
Highest :: Avatar with profit of 2544505847 dollars

Lowest :: The Warrior's Way with profit of -413912431 dollars

Research Question 1.2 (Which movie had the greatest and least runtime?)¶

# Movie with greatest runtime
tmdb_data.loc[tmdb_data['runtime'].idxmax()]

budget                                                     18000000
revenue                                                      871279
original_title                                               Carlos
cast              Edgar RamÃ­rez|Alexander Scheer|Fadi Abi Samra...
runtime                                                         338
genres                                 Crime|Drama|Thriller|History
release_date                                    2010-05-19 00:00:00
release_year                                                   2010
profit                                                    -17128721
Name: 2107, dtype: object

# Movie with least runtime
tmdb_data.loc[tmdb_data['runtime'].idxmin()]

budget                                                           10
revenue                                                           5
original_title                                          Kid's Story
cast              Clayton Watson|Keanu Reeves|Carrie-Anne Moss|K...
runtime                                                          15
genres                                    Science Fiction|Animation
release_date                                    2003-06-02 00:00:00
release_year                                                   2003
profit                                                           -5
Name: 5162, dtype: object

Which movie had the greatest and least runtime?¶
Greatest :: Carlos with runtime of 338 minutes

Least :: Kid's Story with runtime of 15 minutes

Research Question 1.3 (What is the average runtime of all movies?)¶

# Average runtime of movies
tmdb_data['runtime'].mean()

109.22029060716139

What is the average runtime of all movies?¶
So the average runtime of the movies is 109.22 minutes

Let us plot a histogram for the same.

# x-axis
plt.xlabel('Runtime of Movies in Minutes')
# y-axis
plt.ylabel('Number of Movies')
# Title of the histogram
plt.title('Runtime distribution of all the movies')
# Plot a histogram
plt.hist(tmdb_data['runtime'], bins = 50)

(array([  1.,   1.,   0.,   0.,   1.,   0.,   0.,   3.,   5.,  31., 196.,
        437., 598., 653., 475., 458., 305., 243., 141., 117.,  49.,  47.,
         21.,  23.,   9.,  10.,  10.,   8.,   6.,   0.,   2.,   2.,   0.,
          0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   1.]),
 array([ 15.  ,  21.46,  27.92,  34.38,  40.84,  47.3 ,  53.76,  60.22,
         66.68,  73.14,  79.6 ,  86.06,  92.52,  98.98, 105.44, 111.9 ,
        118.36, 124.82, 131.28, 137.74, 144.2 , 150.66, 157.12, 163.58,
        170.04, 176.5 , 182.96, 189.42, 195.88, 202.34, 208.8 , 215.26,
        221.72, 228.18, 234.64, 241.1 , 247.56, 254.02, 260.48, 266.94,
        273.4 , 279.86, 286.32, 292.78, 299.24, 305.7 , 312.16, 318.62,
        325.08, 331.54, 338.  ]),
 <a list of 50 Patch objects>)

We can see that most of the movie are in the range of 100 minutes to 120 minutes.

Let us check if there a relation between the Runtime and Profit

# x-axis
plt.xlabel('Runtime in Minutes')
# y-axis
plt.ylabel('Profit in Dollars')
# Title of the histogram
plt.title('Relationship between runtime and profit')
plt.scatter(tmdb_data['runtime'], tmdb_data['profit'], alpha=0.5)
plt.show()

Most of the movies have runtime in range of 85 to 120 Minutes.

Research Question 1.4 (Which movie had the highest and lowest budget?)¶

# Movie with highest budget
tmdb_data.loc[tmdb_data['budget'].idxmax()]

budget                                                    425000000
revenue                                                    11087569
original_title                                    The Warrior's Way
cast              Kate Bosworth|Jang Dong-gun|Geoffrey Rush|Dann...
runtime                                                         100
genres                    Adventure|Fantasy|Action|Western|Thriller
release_date                                    2010-12-02 00:00:00
release_year                                                   2010
profit                                                   -413912431
Name: 2244, dtype: object

# Movie with lowest budget
tmdb_data.loc[tmdb_data['budget'].idxmin()]

budget                                                            1
revenue                                                         100
original_title                                         Lost & Found
cast              David Spade|Sophie Marceau|Ever Carradine|Step...
runtime                                                          95
genres                                               Comedy|Romance
release_date                                    1999-04-23 00:00:00
release_year                                                   1999
profit                                                           99
Name: 2618, dtype: object

Which movie had the highest and lowest budget?¶
Highest :: The Warrior's Way with budget of 425000000 dollars

Lowest :: Lost & Found with budget of 1 dollars

Let us check if there a relation between the Budget and Profit

# x-axis
plt.xlabel('Budget in Dollars')
# y-axis
plt.ylabel('Profit in Dollars')
# Title of the histogram
plt.title('Relationship between budget and profit')
plt.scatter(tmdb_data['budget'], tmdb_data['profit'], alpha=0.5)
plt.show()

We can see that there no as such relationship between budget and profits, But yes there are very less flims which didnt make profit when the budget was greater then 20M Dollar.

Research Question 1.5 (Which movie had the highest and lowest revenue?)¶

# Movie with highest revenue
tmdb_data.loc[tmdb_data['revenue'].idxmax()]

budget                                                    237000000
revenue                                                  2781505847
original_title                                               Avatar
cast              Sam Worthington|Zoe Saldana|Sigourney Weaver|S...
runtime                                                         162
genres                     Action|Adventure|Fantasy|Science Fiction
release_date                                    2009-12-10 00:00:00
release_year                                                   2009
profit                                                   2544505847
Name: 1386, dtype: object

# Movie with lowest revenue
tmdb_data.loc[tmdb_data['revenue'].idxmin()]

budget                                                      6000000
revenue                                                           2
original_title                                      Shattered Glass
cast              Hayden Christensen|Peter Sarsgaard|ChloÃ« Sevi...
runtime                                                          94
genres                                                Drama|History
release_date                                    2003-11-14 00:00:00
release_year                                                   2003
profit                                                     -5999998
Name: 5067, dtype: object

Which movie had the highest and lowest revenue?¶
Highest :: Avatar with revenue of 2781505847 dollars

Lowest :: Shattered Glass with revenue of 2 dollars

Let us check if there a relation between the Revenue and Profit

# x-axis
plt.xlabel('Revenue in Dollars')
# y-axis
plt.ylabel('Profit in Dollars')
# Title of the histogram
plt.title('Relationship between revenue and profit')
plt.scatter(tmdb_data['revenue'], tmdb_data['profit'], alpha=0.5)
plt.show()

We can see that there is a strong relationship between profit and revenue, higher the profit, higher the revenue.

Let us check if there a relation between the Budget and Revenue

# x-axis
plt.xlabel('Revenue in Dollars')
# y-axis
plt.ylabel('Budget in Dollars')
# Title of the histogram
plt.title('Relationship between revenue and budget')
plt.scatter(tmdb_data['revenue'], tmdb_data['budget'], alpha=0.5)
plt.show()

Most of the movie have a revenue upto 50M Dollars.

Research Question 2.1 (What is the average budget of the movie w.r.t Profit of movies making more then 25M Dollars?)¶

Now since in all the remaining question we are going to answer them with respect to profit, we will now clean our datset and only incudde data of movies who made profit of more then 25M Dollars.

# Dataframe which has data of movies which made profit of more the 25M Dollars.
tmdb_profit_data = tmdb_data[tmdb_data['profit'] >= 25000000]
# Reindexing the dataframe
tmdb_profit_data.index = range(len(tmdb_profit_data))
#showing the dataset
tmdb_profit_data.head()

# Printing the info of the new dataframe
tmdb_profit_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1791 entries, 0 to 1790
Data columns (total 9 columns):
budget            1791 non-null int64
revenue           1791 non-null int64
original_title    1791 non-null object
cast              1790 non-null object
runtime           1791 non-null int64
genres            1791 non-null object
release_date      1791 non-null datetime64[ns]
release_year      1791 non-null int64
profit            1791 non-null int64
dtypes: datetime64[ns](1), int64(5), object(3)
memory usage: 126.0+ KB

We can see that we have 1791 movies which has profit more then 25M Dollars.

# Finfd the average budget of movies which made profit more then 25M Dollars
tmdb_profit_data['budget'].mean()

51870307.757118925

What is the average budget of the movie w.r.t Profit of movies making more then 25M Dollars?¶
So the average budget of the movies is 51870307.75 Dollars

Research Question 2.2 (What is the average revenue of the movie w.r.t Profit of movies making more then 25M Dollars?)¶

# Finfd the average revenue of movies which made profit more then 25M Dollars
tmdb_profit_data['revenue'].mean()

206359440.87269682

What is the average revenue of the movie w.r.t Profit of movies making more then 25M Dollars?¶
So the average revenue of the movies is 206359440.87 Dollars

Research Question 2.3 (What is the average runtime of the movie w.r.t Profit of movies making more then 25M Dollars?)¶

# Finfd the average runtime of movies which made profit more then 25M Dollars
tmdb_profit_data['runtime'].mean()

112.56672250139587

What is the average runtime of the movie w.r.t Profit of movies making more then 25M Dollars?¶
So the average runtime of the movies is 112.56 Minutes

Research Question 2.4 (Which are the successfull genres w.r.t Profit of movies making more then 25M Dollars?)¶

# This will first concat all the data with | from the whole column and then split it using | and count the number of times it occured. 
genres_count = pd.Series(tmdb_profit_data['genres'].str.cat(sep = '|').split('|')).value_counts(ascending = False)
genres_count

Drama              688
Comedy             645
Action             566
Thriller           542
Adventure          451
Romance            292
Crime              287
Family             265
Science Fiction    250
Fantasy            227
Horror             191
Mystery            150
Animation          136
Music               62
History             59
War                 58
Western             20
Documentary          9
TV Movie             1
Foreign              1
dtype: int64

Which are the successfull genres w.r.t Profit of movies making more then 25M Dollars?¶
So the Top 10 Genres are Drama, Comedy, Action, Thriller, Adventure, Romance, Crime, Family, Scince Fiction, Fantasy

Lets visualize this with a plot

# Initialize the plot
diagram = genres_count.plot.bar(fontsize = 8)
# Set a title
diagram.set(title = 'Top Genres')
# x-label and y-label
diagram.set_xlabel('Type of genres')
diagram.set_ylabel('Number of Movies')
# Show the plot
plt.show()

We can clearly see in the visualization that most movies has drame as a genre which tends to higher profit.

Research Question 2.5 (Which are the most frequent cast involved w.r.t Profit of movies making more then 25M Dollars?)¶

# This will first concat all the data with | from the whole column and then split it using | and count the number of times it occured. 
cast_count = pd.Series(tmdb_profit_data['cast'].str.cat(sep = '|').split('|')).value_counts(ascending = False)
cast_count.head(20)

Tom Cruise               29
Tom Hanks                28
Brad Pitt                27
Robert De Niro           26
Bruce Willis             25
Cameron Diaz             24
Samuel L. Jackson        23
Eddie Murphy             23
Sylvester Stallone       22
Johnny Depp              22
Mark Wahlberg            22
Robin Williams           20
Jim Carrey               20
Harrison Ford            20
Matt Damon               20
George Clooney           20
Adam Sandler             20
Denzel Washington        20
Arnold Schwarzenegger    19
Owen Wilson              19
dtype: int64

Which are the most frequent cast involved w.r.t Profit of movies making more then 25M Dollars?¶
So the Top 5 cast are Tom Cruise, Tom Hanks, Brad Pitt, Robert De Niro, Bruce Willis

Lets visualize this with a plot

# Initialize the plot
diagram = cast_count.head(20).plot.barh(fontsize = 8)
# Set a title
diagram.set(title = 'Top Cast')
# x-label and y-label
diagram.set_xlabel('Number of Movies')
diagram.set_ylabel('List of cast')
# Show the plot
plt.show()

We can clearly see in the visualization that most movies have Tom Cruise as a cast which tends to higher profit.

Conclusions¶

So the conclusion is, that if we want to create movies which can give us a profit of more then 25M Dollars then¶
The average budget of the movies can be arround 51870307.75 Dollars

The average runtime of the movies can be arround 112.56 Minutes

The Top 10 Genres we should focus on should be Drama, Comedy, Action, Thriller, Adventure, Romance, Crime, Family, Scince Fiction, Fantasy

The Top 5 cast we should focus on should be Tom Cruise, Tom Hanks, Brad Pitt, Robert De Niro, Bruce Willis

The average revenue of the movies will be arround 206359440.87 Dollars

The limitations associated with the conclusions are:¶
The conclusion is not full proof that given the above requirement the movie will be a big hit but it can be.

Also, we also lost some of the data in the data cleaning steps where we dont know the revenue and budget of the movie, which has affected our analysis.

This conclusion is not error proof.

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	tagline	...	overview	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
0	135397	tt0369610	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	http://www.jurassicworld.com/	Colin Trevorrow	The park is open.	...	Twenty-two years after the events of Jurassic ...	124	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	6/9/15	5562	6.5	2015	1.379999e+08	1.392446e+09
1	76341	tt1392190	28.419936	150000000	378436354	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	http://www.madmaxmovie.com/	George Miller	What a Lovely Day.	...	An apocalyptic story set in the furthest reach...	120	Action\|Adventure\|Science Fiction\|Thriller	Village Roadshow Pictures\|Kennedy Miller Produ...	5/13/15	6185	7.1	2015	1.379999e+08	3.481613e+08
2	262500	tt2908446	13.112507	110000000	295238201	Insurgent	Shailene Woodley\|Theo James\|Kate Winslet\|Ansel...	http://www.thedivergentseries.movie/#insurgent	Robert Schwentke	One Choice Can Destroy You	...	Beatrice Prior must confront her inner demons ...	119	Adventure\|Science Fiction\|Thriller	Summit Entertainment\|Mandeville Films\|Red Wago...	3/18/15	2480	6.3	2015	1.012000e+08	2.716190e+08
3	140607	tt2488496	11.173104	200000000	2068178225	Star Wars: The Force Awakens	Harrison Ford\|Mark Hamill\|Carrie Fisher\|Adam D...	http://www.starwars.com/films/star-wars-episod...	J.J. Abrams	Every generation has a story.	...	Thirty years after defeating the Galactic Empi...	136	Action\|Adventure\|Science Fiction\|Fantasy	Lucasfilm\|Truenorth Productions\|Bad Robot	12/15/15	5292	7.5	2015	1.839999e+08	1.902723e+09
4	168259	tt2820852	9.335014	190000000	1506249360	Furious 7	Vin Diesel\|Paul Walker\|Jason Statham\|Michelle ...	http://www.furious7.com/	James Wan	Vengeance Hits Home	...	Deckard Shaw seeks revenge against Dominic Tor...	137	Action\|Crime\|Thriller	Universal Pictures\|Original Film\|Media Rights ...	4/1/15	2947	7.3	2015	1.747999e+08	1.385749e+09

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	tagline	...	overview	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
0	135397	tt0369610	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	http://www.jurassicworld.com/	Colin Trevorrow	The park is open.	...	Twenty-two years after the events of Jurassic ...	124	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	6/9/15	5562	6.5	2015	1.379999e+08	1.392446e+09
1	76341	tt1392190	28.419936	150000000	378436354	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	http://www.madmaxmovie.com/	George Miller	What a Lovely Day.	...	An apocalyptic story set in the furthest reach...	120	Action\|Adventure\|Science Fiction\|Thriller	Village Roadshow Pictures\|Kennedy Miller Produ...	5/13/15	6185	7.1	2015	1.379999e+08	3.481613e+08
2	262500	tt2908446	13.112507	110000000	295238201	Insurgent	Shailene Woodley\|Theo James\|Kate Winslet\|Ansel...	http://www.thedivergentseries.movie/#insurgent	Robert Schwentke	One Choice Can Destroy You	...	Beatrice Prior must confront her inner demons ...	119	Adventure\|Science Fiction\|Thriller	Summit Entertainment\|Mandeville Films\|Red Wago...	3/18/15	2480	6.3	2015	1.012000e+08	2.716190e+08
3	140607	tt2488496	11.173104	200000000	2068178225	Star Wars: The Force Awakens	Harrison Ford\|Mark Hamill\|Carrie Fisher\|Adam D...	http://www.starwars.com/films/star-wars-episod...	J.J. Abrams	Every generation has a story.	...	Thirty years after defeating the Galactic Empi...	136	Action\|Adventure\|Science Fiction\|Fantasy	Lucasfilm\|Truenorth Productions\|Bad Robot	12/15/15	5292	7.5	2015	1.839999e+08	1.902723e+09
4	168259	tt2820852	9.335014	190000000	1506249360	Furious 7	Vin Diesel\|Paul Walker\|Jason Statham\|Michelle ...	http://www.furious7.com/	James Wan	Vengeance Hits Home	...	Deckard Shaw seeks revenge against Dominic Tor...	137	Action\|Crime\|Thriller	Universal Pictures\|Original Film\|Media Rights ...	4/1/15	2947	7.3	2015	1.747999e+08	1.385749e+09

	id	popularity	budget	revenue	runtime	vote_count	vote_average	release_year	budget_adj	revenue_adj
count	10866.000000	10866.000000	1.086600e+04	1.086600e+04	10866.000000	10866.000000	10866.000000	10866.000000	1.086600e+04	1.086600e+04
mean	66064.177434	0.646441	1.462570e+07	3.982332e+07	102.070863	217.389748	5.974922	2001.322658	1.755104e+07	5.136436e+07
std	92130.136561	1.000185	3.091321e+07	1.170035e+08	31.381405	575.619058	0.935142	12.812941	3.430616e+07	1.446325e+08
min	5.000000	0.000065	0.000000e+00	0.000000e+00	0.000000	10.000000	1.500000	1960.000000	0.000000e+00	0.000000e+00
25%	10596.250000	0.207583	0.000000e+00	0.000000e+00	90.000000	17.000000	5.400000	1995.000000	0.000000e+00	0.000000e+00
50%	20669.000000	0.383856	0.000000e+00	0.000000e+00	99.000000	38.000000	6.000000	2006.000000	0.000000e+00	0.000000e+00
75%	75610.000000	0.713817	1.500000e+07	2.400000e+07	111.000000	145.750000	6.600000	2011.000000	2.085325e+07	3.369710e+07
max	417859.000000	32.985763	4.250000e+08	2.781506e+09	900.000000	9767.000000	9.200000	2015.000000	4.250000e+08	2.827124e+09

	budget	revenue	original_title	cast	runtime	genres	release_date	release_year
0	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	124	Action\|Adventure\|Science Fiction\|Thriller	6/9/15	2015
1	150000000	378436354	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	120	Action\|Adventure\|Science Fiction\|Thriller	5/13/15	2015
2	110000000	295238201	Insurgent	Shailene Woodley\|Theo James\|Kate Winslet\|Ansel...	119	Adventure\|Science Fiction\|Thriller	3/18/15	2015
3	200000000	2068178225	Star Wars: The Force Awakens	Harrison Ford\|Mark Hamill\|Carrie Fisher\|Adam D...	136	Action\|Adventure\|Science Fiction\|Fantasy	12/15/15	2015
4	190000000	1506249360	Furious 7	Vin Diesel\|Paul Walker\|Jason Statham\|Michelle ...	137	Action\|Crime\|Thriller	4/1/15	2015

	budget	revenue	original_title	cast	runtime	genres	release_date	release_year
0	150000000.0	1.513529e+09	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	124	Action\|Adventure\|Science Fiction\|Thriller	2015-06-09	2015
1	150000000.0	3.784364e+08	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	120	Action\|Adventure\|Science Fiction\|Thriller	2015-05-13	2015
2	110000000.0	2.952382e+08	Insurgent	Shailene Woodley\|Theo James\|Kate Winslet\|Ansel...	119	Adventure\|Science Fiction\|Thriller	2015-03-18	2015
3	200000000.0	2.068178e+09	Star Wars: The Force Awakens	Harrison Ford\|Mark Hamill\|Carrie Fisher\|Adam D...	136	Action\|Adventure\|Science Fiction\|Fantasy	2015-12-15	2015
4	190000000.0	1.506249e+09	Furious 7	Vin Diesel\|Paul Walker\|Jason Statham\|Michelle ...	137	Action\|Crime\|Thriller	2015-04-01	2015

	budget	revenue	runtime	release_year
count	3.854000e+03	3.854000e+03	3854.000000	3854.000000
mean	3.720370e+07	1.076866e+08	109.220291	2001.261028
std	4.220822e+07	1.765393e+08	19.922820	11.282575
min	1.000000e+00	2.000000e+00	15.000000	1960.000000
25%	1.000000e+07	1.360003e+07	95.000000	1995.000000
50%	2.400000e+07	4.480000e+07	106.000000	2004.000000
75%	5.000000e+07	1.242125e+08	119.000000	2010.000000
max	4.250000e+08	2.781506e+09	338.000000	2015.000000