
Exploratory Data Analysis of the IMDb Dataset in R


exploratory data analysis of movies

A dataset of movies listed on IMDb is available at kaggle.com. If you want to explore it, first download the file "movie_metadata.csv" from that web page. Then load the file into R, for example with read.csv(), into a data frame called "movie".
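A minimal sketch of the loading step. It assumes the file was saved as "movie_metadata.csv" in R's working directory; the path and the stringsAsFactors setting are assumptions, not taken from the original post.

```r
# Assumed location: the CSV sits in the current working directory.
path <- "movie_metadata.csv"

if (file.exists(path)) {
  # Read the file into a data frame called "movie"
  movie <- read.csv(path, stringsAsFactors = FALSE)
} else {
  # Nothing is loaded if the file has not been downloaded yet
  message("File not found: ", path)
}
```

The file.exists() guard only keeps the sketch from stopping with an error when the download step was skipped.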

Using names() we can see the column names of the data frame "movie".

The dimensions of the data frame can be obtained with dim().


So, the data frame contains 5043 rows and 28 columns.

The column names (variables) and the dimensions can also be obtained with str().


We can also use summary() to see a bit more information about the variables in the table.
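The four inspection functions mentioned so far can be tried on any data frame. The sketch below uses a tiny made-up stand-in for "movie" (three columns, three rows) so that it runs without the CSV; with the real data the calls are the same.

```r
# Made-up stand-in for "movie" with three of its columns
movie <- data.frame(
  movie_title = c("A", "B", "C"),
  title_year  = c(1999, 2005, NA),
  imdb_score  = c(7.1, 6.4, 8.0),
  stringsAsFactors = FALSE
)

names(movie)    # column names
dim(movie)      # number of rows, then number of columns
str(movie)      # names, types and first values of every column
summary(movie)  # quartiles and mean of numeric columns, plus NA counts
```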



The function qplot() from the ggplot2 package plots the distribution of movies over the years, as the figure below shows.
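A similar picture can be drawn without extra packages. The sketch below is a base-R counterpart with hist(), not the qplot() call used for the original figure, and the years are made-up stand-ins for movie$title_year.

```r
# Made-up release years standing in for movie$title_year
title_year <- c(1994, 1999, 2001, 2001, 2005, 2010, 2010, 2010, 2014, NA)

# Histogram of movies per 5-year bin; hist() ignores the NA value
h <- hist(title_year, breaks = seq(1990, 2015, by = 5),
          main = "Movies by year", xlab = "title_year")

h$counts  # number of movies that fell into each bin
```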


What is the relationship between "movie_facebook_likes" and "imdb_score"?

Judging by the plot, most movies with a score above 5 have between 0 and 30-40 thousand Facebook likes.

The dataset has the variable "cast_total_facebook_likes", which is calculated by summing the Facebook popularity of all available cast members. We add its values to the previous plot.

We get a new scatter plot, where the "cast_total_facebook_likes" values are shown as blue points.
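One way to build such a two-layer scatter plot in base R is plot() followed by points(). This is a sketch on made-up rows, and the original post may have used different plotting calls.

```r
# Made-up stand-in rows for "movie"
movie <- data.frame(
  imdb_score                = c(5.2, 6.8, 7.5, 8.1, 4.3),
  movie_facebook_likes      = c(1000, 25000, 40000, 120000, 0),
  cast_total_facebook_likes = c(3000, 15000, 60000, 90000, 500)
)

# Common y-range so that both sets of points fit on the plot
ylim <- range(movie$movie_facebook_likes, movie$cast_total_facebook_likes)

# First layer: movie likes against score
plot(movie$imdb_score, movie$movie_facebook_likes, ylim = ylim,
     xlab = "imdb_score", ylab = "Facebook likes")

# Second layer: cast likes as blue points
points(movie$imdb_score, movie$cast_total_facebook_likes, col = "blue")
```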

If we want to look at scatter plots of the relationships between other variables, we can ask R for a scatter plot matrix. R then produces a group of scatter plots.
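In base R a group of scatter plots is usually produced with pairs(). The sketch below assumes that this is the kind of call meant here; the choice of columns and the data are made up.

```r
# Made-up stand-in rows for "movie"
movie <- data.frame(
  imdb_score                = c(5.2, 6.8, 7.5, 8.1, 4.3),
  movie_facebook_likes      = c(1000, 25000, 40000, 120000, 0),
  cast_total_facebook_likes = c(3000, 15000, 60000, 90000, 500)
)

# Keep only the numeric columns we want to compare
vars <- movie[, c("imdb_score", "movie_facebook_likes",
                  "cast_total_facebook_likes")]

# One scatter plot for every pair of columns
pairs(vars, main = "Scatter plot matrix")
```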

Scatter Plot

This section is about a bubble plot, in which each point shows how many users voted for a movie and how many of them reviewed it, and the point size is proportional to the number of likes on Facebook. The bubble plot is created with a few lines of R code.
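A bubble plot of this kind can be sketched with the base function symbols(). The column names num_voted_users and num_user_for_reviews are our reading of "how many users voted" and "how many reviewed"; they and the data below are assumptions.

```r
# Made-up stand-in rows for "movie"
movie <- data.frame(
  num_voted_users      = c(1200, 45000, 230000, 890000),
  num_user_for_reviews = c(15, 210, 640, 2100),
  movie_facebook_likes = c(500, 5000, 30000, 110000)
)

# Make the circle area (not the radius) proportional to the likes
radius <- sqrt(movie$movie_facebook_likes / pi)

# inches = 0.3 scales the largest circle to a radius of 0.3 inch
symbols(movie$num_voted_users, movie$num_user_for_reviews,
        circles = radius, inches = 0.3,
        fg = "white", bg = "steelblue",
        xlab = "num_voted_users", ylab = "num_user_for_reviews")
```

Scaling the area rather than the radius keeps a movie with four times the likes from looking sixteen times bigger.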





Consider how movies can be grouped by the size of their budget and gross. We represent the location of these groups graphically with a "heat map" (hexbin plot), in which observations are distributed over hexagonal cells. The cells differ in shade, gray by default or another color, according to the number of observations in each cell.

The summary() function shows that the smallest budget is 218 and the largest is 12215500000. The minimum gross is 162 and the maximum is 760505847. For the diagram, only films with both budget and gross above $10 million and below $500 million are selected. Both variables are measured in millions of dollars.
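The selection step can be sketched as a logical filter followed by a unit conversion. The rows below are made up, although they reuse the extreme values quoted above, and the names lo, hi and sel are ours.

```r
# Made-up rows; the extremes are the ones reported by summary()
movie <- data.frame(
  budget = c(218, 2.5e7, 1.5e8, 12215500000, NA),
  gross  = c(162, 6.0e7, 760505847, 2.2e6, 1.0e8)
)

lo <- 10e6    # lower limit, $10 million
hi <- 500e6   # upper limit, $500 million

# Keep rows where both variables are present and inside the limits
keep <- !is.na(movie$budget) & !is.na(movie$gross) &
  movie$budget > lo & movie$budget < hi &
  movie$gross  > lo & movie$gross  < hi
sel <- movie[keep, ]

# Convert dollars to millions of dollars
sel$budget <- sel$budget / 1e6
sel$gross  <- sel$gross  / 1e6
sel
```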


To get a color chart, additional arguments are passed to the function plot().

Here is the color hexagonal binning plot

There are alternative color ramps on perceptually linear scales


as well as alternative color palettes, which create a vector of n contiguous colors.

To display hexagonally binned data as shown in the figure above, we can apply the function hexbinplot().

To display the movies by budget and gross as points with transparent markers, we write


Another way is to define the color by point density.





If we want to explore the data frame using only complete cases, we must check it for missing data. To check whether "movie" contains NA values, we enter one of the following commands into R

and / or

and / or

and / or
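Several base-R calls answer the question of whether a data frame contains NAs. The four below are common choices and not necessarily the ones used in the original post; the data frame is a made-up stand-in for "movie".

```r
# Made-up stand-in for "movie" with some missing values
movie <- data.frame(
  movie_title = c("A", "B", "C", "D"),
  gross       = c(1e6, NA, 3e6, NA),
  budget      = c(NA, 2e6, 5e6, 1e6),
  stringsAsFactors = FALSE
)

anyNA(movie)                 # is there at least one NA anywhere?
sum(is.na(movie))            # total number of NA cells
colSums(is.na(movie))        # NA count per column
sum(!complete.cases(movie))  # number of rows with at least one NA
```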


Indeed, "movie" contains NAs. Therefore we create a new data frame based on it and call it "imdbdf".

The new data frame has the same head and dimensions.

Next we remove the instances that have at least one NA value and look at the dimensions.

Now the cleaned "imdbdf" has 3801 rows

Moreover, it makes sense to check for duplicated instances by title.

Thus, the data frame has been reduced to 3700 rows.
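The cleaning steps described above (copy, drop rows with NAs, drop repeated titles) can be sketched as follows. The data are made up; na.omit() is the function the post names later, while the use of duplicated() on "movie_title" is our assumption.

```r
# Made-up stand-in for "movie"
movie <- data.frame(
  movie_title = c("A", "B", "B", "C", "D"),
  imdb_score  = c(7.0, 6.5, 6.5, NA, 8.2),
  stringsAsFactors = FALSE
)

imdbdf <- movie              # work on a copy, keep "movie" untouched

imdbdf <- na.omit(imdbdf)    # drop rows that have at least one NA
dim(imdbdf)

# Keep only the first row for every title
imdbdf <- imdbdf[!duplicated(imdbdf$movie_title), ]
dim(imdbdf)
```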


It may happen that we do not need all variables (columns). Then we choose the desired columns, e.g.

Check how many rows and columns this data frame has.

Let's see what rating each actor takes.

Load the add-on package "plyr" and use its function "ddply" to calculate the average rating and standard error (SE) for each main actor.

"ddply" splits "imdbdf" into subsets by the variable "actor_1_name". It then calculates the average IMDb score, the SE, and the number of observations (N) for each subset (one per actor).

Note,

summarise - the function that computes the summary values for each subset;

na.rm - the argument that removes missing values;

sd - the standard deviation (the SE is sd divided by the square root of N);

sqrt - square root.

Finally, the function "ddply" creates the new data frame "ratingdat" by bringing those tables together.

The data frame "ratingdat" consists of 1456 rows and 4 columns, but there are NAs in it. Applying head(ratingdat) or ratingdat[1:50, ], we can see NAs in the column "SE" wherever N = 1, because the standard deviation of a single value is undefined.
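The same per-actor quantities can be cross-checked without plyr. The sketch below is a base-R counterpart to the ddply() step, not the original call, and it runs on made-up actors and scores; it also shows why SE is NA when N = 1.

```r
# Made-up stand-in for "imdbdf"
imdbdf <- data.frame(
  actor_1_name = c("Actor A", "Actor A", "Actor A",
                   "Actor B", "Actor B", "Actor C"),
  imdb_score   = c(7.0, 8.0, 9.0, 6.0, 7.0, 5.5),
  stringsAsFactors = FALSE
)

# Split the scores into one vector per actor
by_actor <- split(imdbdf$imdb_score, imdbdf$actor_1_name)

# N, mean and standard deviation for every actor
ratingdat <- data.frame(
  actor_1_name = names(by_actor),
  N    = sapply(by_actor, length),
  mean = sapply(by_actor, mean, na.rm = TRUE),
  sd   = sapply(by_actor, sd, na.rm = TRUE),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Standard error of the mean; sd() of one value is NA, so SE is NA too
ratingdat$SE <- ratingdat$sd / sqrt(ratingdat$N)
ratingdat
```

With plyr loaded, a call of the form ddply(imdbdf, "actor_1_name", summarise, ...) with the same N, mean, sd and SE expressions should give a matching table.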

Select the instances with N >= 15.

Make the actor name an ordered factor, ordered by mean rating.
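These two steps can be sketched on a made-up "ratingdat" as below. The name top is ours, and building the factor with explicit levels is one of several ways to fix the plotting order.

```r
# Made-up stand-in for "ratingdat"
ratingdat <- data.frame(
  actor_1_name = c("Actor A", "Actor B", "Actor C", "Actor D"),
  N    = c(20, 8, 15, 31),
  mean = c(6.9, 7.8, 7.4, 6.2),
  stringsAsFactors = FALSE
)

# Keep actors with at least 15 movies
top <- ratingdat[ratingdat$N >= 15, ]

# Factor levels follow the mean rating, lowest first,
# so plots list the actors in that order
top$actor_1_name <- factor(top$actor_1_name,
                           levels = top$actor_1_name[order(top$mean)])
levels(top$actor_1_name)
```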

The next step is to create a plot of main actors ordered by their mean ratings. It requires add-on packages such as

And now we can get the plot by the command line

See the plot of actor ratings.

A new data frame is created based on "movie".

Thus, we have just created "imdbdf2", a data frame with NAs removed by na.omit() and without duplicated instances. It consists of 3701 rows and 6 columns ("actor_1_name", "movie_title", "title_year", "gross", "imdb_score", "plot_keywords").


The figure below depicts the distribution of movie IMDb scores by year.

To obtain this figure it is necessary to run


We determine in how many of the movies collected in the table "imdbdf2" each actor played the main role. The resulting values are put into a separate table "actors" with columns named "Actor" and "No_movies", i.e. the name of the actor and the number of movies in which the actor played the main role:

The created table consists of 1457 rows, and the sum of the 2nd column, which gives the total number of movies, is 3701, the same as the number of rows in "imdbdf2":

Then we arrange the actors in "actors" by the number of movies in descending order:

We select the 10 actors with the largest number of main roles and call this sample "actorsTop10":

The first three rows of "actorsTop10" look like this:

The top ten actors account for 294 movies, which is 7.9% of the total number of movies in "imdbdf2":
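The counting, sorting and share calculation can be sketched with table(), order() and head(). The data are made up, and the sketch takes the top 2 instead of the top 10 because the toy table has only three actors; the names actorsTop and share are ours.

```r
# Made-up stand-in for "imdbdf2" (only the actor column is needed)
imdbdf2 <- data.frame(
  actor_1_name = c("Actor A", "Actor B", "Actor A",
                   "Actor C", "Actor A", "Actor B"),
  stringsAsFactors = FALSE
)

# Count movies per main actor and name the columns as in the text
actors <- as.data.frame(table(imdbdf2$actor_1_name),
                        stringsAsFactors = FALSE)
names(actors) <- c("Actor", "No_movies")

# Sanity check: the counts add up to the number of rows
sum(actors$No_movies) == nrow(imdbdf2)

# Sort by number of movies, largest first, and take the top rows
actors <- actors[order(actors$No_movies, decreasing = TRUE), ]
actorsTop <- head(actors, 2)

# Share of all movies covered by the selected actors, in percent
share <- 100 * sum(actorsTop$No_movies) / nrow(imdbdf2)
share
```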



Next we select from "imdbdf2" the instances whose "actor_1_name" matches chosen actors from "actorsTop10", for example Robert De Niro and Johnny Depp:


The obtained table consists of 80 rows and 6 columns.
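A selection like this is typically written with %in%. The sketch below uses made-up titles and a placeholder third actor; the names chosen and two_actors are ours.

```r
# Made-up stand-in for "imdbdf2"
imdbdf2 <- data.frame(
  actor_1_name = c("Robert De Niro", "Johnny Depp",
                   "Actor X", "Robert De Niro"),
  movie_title  = c("Film 1", "Film 2", "Film 3", "Film 4"),
  stringsAsFactors = FALSE
)

# Actors whose movies we want to keep
chosen <- c("Robert De Niro", "Johnny Depp")

# Rows where the main actor is one of the chosen names
two_actors <- imdbdf2[imdbdf2$actor_1_name %in% chosen, ]

dim(two_actors)                  # rows and columns of the selection
table(two_actors$actor_1_name)   # movies per selected actor
```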

Let's compare these two actors by the dynamics of the gross of the movies they starred in:


To see the dynamics of IMDb ratings (imdb_score) for movies with Robert De Niro or Johnny Depp in the main role, we write:



To get more information, for example, about Robert De Niro, just enter:

The table below contains information about Robert De Niro and Johnny Depp, who occupy the 1st and 2nd positions in the Top 10 ranked by the number of main roles played. The table shows that Robert De Niro has played in 42 movies and Johnny Depp in 34. The movie with the maximum score, equal to 9, belongs to De Niro. In terms of movie income, the advantage belongs to Depp. In particular, the average gross for movies De Niro starred in is just over $50 million, while movies starring Depp grossed almost $94.4 million on average.




We now select the movies of all the actors in the Top 10 ranked by the total number of main roles they played:


Let's check if "proba" has duplicate instances, and exclude them from this table:



There is a plot of IMDb movie scores for ten actors by year:

Graphical display of the ratings dynamics by years for each actor is set by commands:


Perhaps it would be useful to study the total rating of the selected movies by year, taking into account the contribution of each actor from the Top 10:

Based on the average rating of the movies in which an actor played the main role (counting only actors with at least 15 such movies), Leonardo DiCaprio leads, followed by Tom Hanks.




Below is the list of the Top 10 actors by average rating:




At the same time, the Top 10 actors who most often starred in the main role are as follows:







So, we have obtained a list of 18 candidates for the title of best actor. The next question is which method to use to choose the best.

