In this blog, we are going to learn how to create graphs in R using the ggplot2 library. Before going further, let us first discuss some fundamentals of R. R is an open-source language that you can use for data analysis and statistical computing. There are about 7,800 packages available for R at the time of writing.
How to install R on Windows?
To get started, visit https://cran.r-project.org. You can download R for both Mac and Windows, so select the option that matches your system. This takes you to the R download page, where you should pick the latest release, which at the time of writing is R-3.2.3.pkg.
To download RStudio, visit https://www.rstudio.com. Install both setup files, and you are ready to start working with R.
The working environment of R
Open RStudio on your system. You will see that the screen is divided into four sections. Let us discuss each section one by one.
- RConsole – This area shows the output of code you run.
- RScript – Here you write your code. To run it, select the code and press CTRL + Enter. The result appears in the console window.
- REnvironment – Look at this area to check whether your data has been loaded into R properly.
- Graphical Output – This space shows graphs created during exploratory data analysis.
Packages in R
Packages are collections of R functions, data and compiled code in a well-defined format. The directory where packages are stored is called the library. In this blog, we are going to use the ggplot2 package, so install it first to make your work easier.
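To see where packages live on your own machine, a quick check like the one below may help. It is a minimal sketch that uses only base R functions. The variable names lib_dirs and has_ggplot2 are made up for illustration, and the paths printed depend on your installation.

```r
# List the library directories R searches for installed packages
lib_dirs <- .libPaths()
print(lib_dirs)

# Check whether ggplot2 is already installed (TRUE or FALSE)
has_ggplot2 <- "ggplot2" %in% rownames(installed.packages())
print(has_ggplot2)
```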
What are Data Frames?
R has several basic data structures, but we are only discussing data frames here. We are not going into the basic details of R. If you are interested in learning the basics of R, then click this link https://www.loginworks.com/blogs/how-to-begin-in-r-programming/.
A data frame is a table of rows and columns in which each column is a vector. Different columns can hold values of different classes, for example text in one column and numbers in another. Let's understand this with an example. We have a CSV file named DemographicData.csv in our working directory. First, import this file and save it into a variable using the code below
stats <- read.csv("DemographicData.csv")
stats
head(stats)
tail(stats)
Use the head function to see the first 6 rows and the tail function to see the last 6 rows.
Basic operations with a data frame
To add a column to a data frame, use the $ sign. Write this code in your script.
stats$Mycalc <- stats$birthrate * stats$InternetUsers
This code adds a new column, Mycalc, that holds the product of the two columns above. You can also filter the data. Let's see how. Suppose there is a column named InternetUsers and we need the rows where its value is greater than 2.
filter <- stats$InternetUsers > 2
stats[filter, ]
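The two operations above can be tried without the CSV file. The sketch below is only an illustration: the data frame demo and all of its values are hypothetical, and the column names are assumed to match the ones used in this blog.

```r
# Hypothetical data; the real DemographicData.csv has its own values
demo <- data.frame(
  Country       = c("A", "B", "C", "D"),
  birthrate     = c(10.5, 35.2, 12.0, 28.7),
  InternetUsers = c(78.0, 1.5, 54.3, 9.8)
)

# Add a derived column with the $ sign
demo$Mycalc <- demo$birthrate * demo$InternetUsers

# Build a logical vector, then use it to keep only the matching rows
keep <- demo$InternetUsers > 2
filtered <- demo[keep, ]
print(filtered)
```

The comparison returns one TRUE or FALSE per row, and putting that vector before the comma in the square brackets selects rows while keeping every column.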
How to create Data Frames?
Suppose you have three vectors c1, c2 and c3 with some data in each vector. To create a data frame from them, use this code
mydf <- data.frame(c1, c2, c3)
head(mydf)
To change the column names use this code
colnames(mydf) <- c("Country","Code","Region")
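As a minimal sketch of the whole step, assuming three small vectors whose values are invented for this example:

```r
# Hypothetical vectors of equal length
c1 <- c("India", "France", "Brazil")
c2 <- c("IND", "FRA", "BRA")
c3 <- c("Asia", "Europe", "South America")

# Combine the vectors column by column into a data frame
mydf <- data.frame(c1, c2, c3)

# Replace the default column names (c1, c2, c3) with readable ones
colnames(mydf) <- c("Country", "Code", "Region")
head(mydf)
```

You can also name the columns while creating the data frame, for example data.frame(Country = c1, Code = c2, Region = c3).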
How to plot a graph using ggplot2 library?
We are going to visualize ratings of Hollywood movies, mostly released between 2007 and 2011, and create several graphs from them. To get the dataset, go to www.superdatascience.com/rcourse.
Go to section 6, Advanced visualizations, and download the movie ratings set. The file has six columns: film title, genre, critic rating, audience rating, budget in millions and year.
Grammar of Graphics
There are seven layers in total, which we are going to follow step by step to create a graph. These are
- Data – You cannot see the data in the chart itself. It lives somewhere else, such as in CSV files or text files. We use this data to create data frames and show them in our graphs.
- Aesthetics – Like the data, aesthetics are not something you see directly. They describe how the data is mapped to the chart: which column goes on the x-axis, which goes on the y-axis, and which column controls the colour or size.
- Geometry – The geometry is the shape you want to draw in your graph. It may be a scatter plot, a bar graph or a histogram.
- Statistics – While creating graphs, we combine geometry and statistics. Statistics means doing something to the existing data to create new variables that you are going to visualize, for example grouping the 500 rows of our dataset by Genre.
- Facets – Assume you have a chart for the movie ratings between the years 2007 and 2011. If you want to show the data year-wise, i.e. one graph for each year, this is done with facets.
- Coordinates – Sometimes charts are very large and we are not able to see the part we care about. Coordinates let us set the exact range shown on the axes. This is simply used to zoom in on the graph.
- Theme – Everything you see in the chart that is not related to data is part of the theme, like the title of the graph, the position of the legend, the size of the text, the font family, etc.
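The layers listed above can be seen together in one short sketch. It assumes ggplot2 is already installed, and the data frame films and its values are hypothetical, made up only to have something to plot.

```r
library(ggplot2)

# Hypothetical data standing in for a real dataset
films <- data.frame(
  Genre  = rep(c("Action", "Comedy"), each = 4),
  Year   = rep(c(2007, 2008), times = 4),
  Budget = c(120, 150, 90, 200, 30, 45, 25, 60),
  Rating = c(70, 65, 80, 55, 60, 75, 50, 68)
)

p <- ggplot(data = films,                                    # 1. data
            aes(x = Budget, y = Rating, colour = Genre)) +   # 2. aesthetics
  geom_point(stat = "identity") +                            # 3. geometry, 4. statistics (values plotted as they are)
  facet_grid(. ~ Year) +                                     # 5. facets: one panel per year
  coord_cartesian(xlim = c(0, 220), ylim = c(40, 90)) +      # 6. coordinates: visible range
  ggtitle("Rating against budget") +                         # 7. theme: non-data elements
  theme(legend.position = "bottom",
        plot.title = element_text(size = 14))
print(p)
```

Each layer is joined to the previous one with +, so you can build the chart one step at a time and leave out the layers you do not need.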
Creating a graph
First, read the CSV file into R using this command and change the column names too
movies <- read.csv("Movie Ratings.csv")
head(movies)
colnames(movies) <- c("Film", "Genre", "CriticRating", "AudienceRating", "BudgetMillions", "Year")
Now, install and load the ggplot2 package
Now that we have the data, move on to the second step and add aesthetics
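One possible way to do this is sketched below. It installs the package only when it is missing, which is a convenience assumed here rather than something the steps above require, and it needs an internet connection the first time.

```r
# Install ggplot2 only if it is not installed yet
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package so that ggplot(), aes() and the geoms are available
library(ggplot2)
```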
# add aesthetics and geometry
p <- ggplot(data = movies, aes(x = BudgetMillions, y = AudienceRating, size = BudgetMillions, color = Genre))
p + geom_point(aes(x = BudgetMillions)) + xlab("BudgetMillions $$$")
Run these lines and you will get a scatter plot. Our data is saved in the movies variable, so we assign it to data. Then we add the aesthetics: the x-axis shows BudgetMillions, the y-axis shows AudienceRating, the colour of each point comes from the Genre column and the size of each point comes from BudgetMillions. The xlab() function is used to change the title of the x-axis.
How to create a histogram?
s <- ggplot(data = movies, aes(x = BudgetMillions))
s + geom_histogram(binwidth = 10, aes(fill = Genre), color = "black")
To create a histogram, we use the histogram geometry and follow the same procedure as above. We put BudgetMillions on the x-axis, use the fill property to colour the bars by Genre, and set color to black for the border of the bars.
How to create a box plot?
u <- ggplot(data = movies, aes(x = Genre, y = AudienceRating, colour = Genre))
u + geom_jitter() + geom_boxplot(size = 1.2, alpha = 0.5)
In this code, we are creating a box plot. We put Genre on the x-axis and AudienceRating on the y-axis, and colour the plot by Genre. Then we add two geometries, jitter and boxplot, to the same graph. The alpha value makes the boxes partly transparent so that we can see both geometries clearly.
Lastly, we are going to discuss facets. In this graph, we count how many movies there are at each budget level and use facets to show each genre separately.
v <- ggplot(data = movies, aes(x = BudgetMillions))
v + geom_histogram(binwidth = 10, aes(fill = Genre), color = "black") + facet_grid(Genre ~ ., scales = "free")
In this chart, the budget of the movies is on the x-axis and the count is on the y-axis, and the facets create a separate graph for each genre. In facet_grid(), the formula puts Genre on the left-hand side, which gives one row per genre, and the dot on the right-hand side means that no column is used to split the graph into columns. We can also specify a column name instead of the dot. Setting the scales to "free" lets each panel use its own axis range, so genres with few movies still show readable bars.
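To compare the different ways of writing the facet formula, here is a minimal sketch. It assumes ggplot2 is installed and uses a small hypothetical data frame, films, in place of the movie ratings file. The object names by_row, by_col and by_both are also made up for illustration.

```r
library(ggplot2)

# Hypothetical data standing in for the movie ratings file
films <- data.frame(
  Genre          = rep(c("Action", "Comedy", "Drama"), times = c(6, 4, 2)),
  Year           = rep(c(2007, 2008), times = 6),
  BudgetMillions = c(120, 150, 90, 200, 180, 110, 30, 45, 25, 60, 15, 20)
)

base <- ggplot(data = films, aes(x = BudgetMillions)) +
  geom_histogram(binwidth = 10, aes(fill = Genre), color = "black")

# Genre on the left of ~ : one row of panels per genre
by_row <- base + facet_grid(Genre ~ ., scales = "free")

# Genre on the right of ~ : one column of panels per genre
by_col <- base + facet_grid(. ~ Genre, scales = "free")

# A column name on both sides: genres in rows, years in columns
by_both <- base + facet_grid(Genre ~ Year, scales = "free")
print(by_both)
```

Printing by_row or by_col instead shows the same histogram split in one direction only.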
This is the complete process of creating graphs using the ggplot2 library. In this blog, we learned about the ggplot2 library and how to install and use it, and then we created different graphs. You can practice on your own using the same dataset. You can create many kinds of geometry in R with just a few lines of code. I hope you all like this. If you have any questions or confusion about the topic, you can share your views in the comment section. Thank you!