My friend Randy Olson and I have gotten into the habit of arguing about the relative merits of our favorite languages for data analysis and visualization. I am an enthusiastic R user (www.r-project.org) while Randy is a fan of Python (www.python.org). One thing we agree on, however, is that our discussions are meaningless unless we actually put R and Python through a series of tests to showcase their relative strengths and weaknesses. Essentially, we will set a common goal (e.g., perform a particular type of data analysis or draw a particular type of graph) and write the R and Python code to achieve it. And since Randy and I are all about sharing, open source and open access, we decided to make the results of our friendly challenges public so that you can help us decide between R and Python and, hopefully, also learn something along the way.
Today’s challenge: where we learn that Hollywood’s cemetery is full
1 - Introduction
For this first challenge, we will use data collected by Randy for his recent post on the “Top 25 most violence packed
films” in the history of the movie industry. For his post,
Randy generated a simple horizontal barchart showing the top 25 most violent films ordered by number of on-screen deaths
per minute. In the rest of this document, we will show you how to reproduce this graph using Python and how to achieve a
similar result with R. We will detail the different steps of the process and provide the corresponding code for each
step. You will also find the complete code at the end of this document.
And now without further ado, let’s get started!
2 - Step by step process
First things first, let’s set up our working environment by loading the necessary libraries.
For each movie, the data frame contains a column for the total number of on-screen deaths (“Body_Count”) and a column for
the duration (“Length_Minutes”). We will now create an extra column for the number of on-screen deaths per minute of each
movie (“Deaths_Per_Minute”).
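As a rough illustration of this step, here is a minimal, library-free Python sketch. The three rows are made-up placeholders (not values from Randy's data set), and the “Film” and “Year” field names are assumptions; only “Body_Count”, “Length_Minutes” and “Deaths_Per_Minute” come from the text above.

```python
# Placeholder rows for illustration only: names and numbers are invented,
# and the "Film" and "Year" keys are assumed field names.
films = [
    {"Film": "Film A", "Year": 2001, "Body_Count": 120, "Length_Minutes": 100},
    {"Film": "Film B", "Year": 1995, "Body_Count": 300, "Length_Minutes": 120},
    {"Film": "Film C", "Year": 2010, "Body_Count": 45, "Length_Minutes": 90},
]

# Add the derived field: on-screen deaths divided by running time in minutes.
for film in films:
    film["Deaths_Per_Minute"] = film["Body_Count"] / film["Length_Minutes"]
```

With a data frame library, the same idea is usually a single column-wise division rather than an explicit loop.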
Now we will reorder the data frame by (descending) number of on-screen deaths per minute, and select the top 25 most
violent movies according to this criterion.
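The sort-then-truncate logic can be sketched in plain Python as follows. The rows are invented placeholders, the “Film” key is an assumed field name, and `TOP_N` is a hypothetical constant introduced for this example.

```python
# Placeholder rows for illustration only (not real data).
films = [
    {"Film": "Film A", "Deaths_Per_Minute": 1.2},
    {"Film": "Film B", "Deaths_Per_Minute": 2.5},
    {"Film": "Film C", "Deaths_Per_Minute": 0.5},
]

TOP_N = 25  # hypothetical name for the number of films to keep

# Sort from most to fewest deaths per minute, then keep the first TOP_N rows.
ranked = sorted(films, key=lambda f: f["Deaths_Per_Minute"], reverse=True)
top_films = ranked[:TOP_N]
```

Slicing with `[:TOP_N]` is safe even when there are fewer than 25 rows, as in this toy example.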
In Randy’s graph, the “y” axis shows the film title with the release date. We will now generate the full title for each
movie following a “Movie name (year)” format, and append it to the data frame.
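A minimal sketch of building the “Movie name (year)” label, again in plain Python. The rows are placeholders, and the “Film”, “Year” and “Full_Title” field names are assumptions made for this example.

```python
# Placeholder rows for illustration only; field names are assumed.
films = [
    {"Film": "Film A", "Year": 2001},
    {"Film": "Film B", "Year": 1995},
]

# Build a "Movie name (year)" label and store it alongside the other fields.
for film in films:
    film["Full_Title"] = f'{film["Film"]} ({film["Year"]})'
```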
Now we are ready to generate the barchart. We’re going to start with the default options and then we will make this thing
look pretty.
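For the Python side, a default horizontal barchart might look like the sketch below. It assumes matplotlib is installed, uses invented placeholder values, and is not necessarily the code Randy used for his graph.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Placeholder labels and values for illustration only (not real data).
titles = ["Film B (1995)", "Film A (2001)", "Film C (2010)"]
rates = [2.5, 1.2, 0.5]

# One horizontal bar per film, with default styling.
fig, ax = plt.subplots()
bars = ax.barh(titles, rates)
ax.set_xlabel("Deaths per minute")
```

Calling `fig.savefig(...)` with a file name of your choice would write the figure to disk.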
Ok, now let’s make this pretty.
Finally, the last thing we want to add to our graph is the number of deaths per minute and the duration of each movie on
the right of the graph.
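One way to prepare these annotations is to format one text label per movie before drawing it. The sketch below only builds the strings; the label layout is an assumption for illustration, not the format used in the original graph, and the rows are placeholders.

```python
# Placeholder rows for illustration only (not real data).
films = [
    {"Deaths_Per_Minute": 1.2, "Length_Minutes": 100},
    {"Deaths_Per_Minute": 2.5, "Length_Minutes": 120},
]

# Format one annotation per movie: rate rounded to two decimals, then duration.
labels = [
    f'{f["Deaths_Per_Minute"]:.2f} deaths/min | {f["Length_Minutes"]} min'
    for f in films
]
```

With matplotlib, each string could then be placed next to its bar with `Axes.text`.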
3 - R bonus
Just for fun, I decided to add a little accessory to the R graph, related to the general theme of this data set.
For F# fans, Terje Tyldum has written his own version of the code in F# here.
Randy and I also recommend that you check out this post by Ramiro Gómez (@yaph), where he does a more in-depth analysis
of the data set we used for today’s challenge.