Data visualization packages in R - Part I

Data visualization in R

A good understanding of data is one of the key essentials to designing effective Machine Learning (ML) algorithms. Understanding the structure and properties of the data you are working with is crucial when devising new methods to solve your problem. Visualizing data involves the following:

  • Cleaning up the data
  • Structuring and arranging it
  • Plotting it to understand the granularity

R, one of the most widely used programming languages for ML, has many data-visualization libraries.

In this article, we will explore two of the most commonly used packages in R for analyzing data: dplyr and tidyr.

Using dplyr and tidyr

dplyr and tidyr, both created by Hadley Wickham, are maintained by Wickham and the RStudio team. An impressive advantage of these libraries is the pipe operator "%>%", which can be used with any of their functions. It acts as a pipeline by supplying the output of the left-hand expression as the input to the function on its right, giving us the flexibility to apply any number of functions in sequence.
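A minimal sketch of how the pipe works, using only base R functions (dplyr re-exports %>% from the magrittr package):

# without the pipe: calls are nested inside-out
sqrt(sum(c(1, 4, 9)))

# with the pipe: the left-hand value becomes the first argument of the next call
c(1, 4, 9) %>% sum() %>% sqrt()
# both return 3.741657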

 

tidyr

The operations in the tidyr library include the following:

  1. gather() and its complement spread()
  2. separate() and its complement unite()


gather()

This operation makes messy data more structured by gathering the information into key-value pairs. It takes in a data frame, replicates the other columns as required, and reorganizes the selected columns into key-value pairs. This is an important task when information about the same attribute is spread across multiple columns in a data frame.

Usage:

gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)

OR

data %>% gather(key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)

  • key, value: names of the key and value columns to be created in the output
  • ...: specification of the columns to be included while gathering. Single variable names can be used, or a range of columns can be given with the ":" operator; to exclude a column, prefix its name with the "-" operator
  • na.rm: takes a Boolean value indicating whether to discard rows with missing values in the output; if set to TRUE, such rows are discarded
  • convert: takes a Boolean value; if TRUE, type conversion is run on the key column
  • factor_key: if FALSE (the default), the key values are treated as a character vector; if TRUE, they are treated as a factor and the original ordering of the columns is retained
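The console examples that follow operate on a stocks tibble. A minimal sketch to create one like it (modeled on the tidyr documentation example this data appears to follow; the random values will differ from the outputs shown, and tibble() is re-exported by dplyr from the tibble package):

library(dplyr)
library(tidyr)

set.seed(42)  # arbitrary seed, for reproducibility of your own runs
# ten consecutive days of simulated prices for three stocks: X, Y, and Z
stocks <- tibble(
  time = as.Date("2009-01-01") + 0:9,
  X = rnorm(10, 0, 1),  # sd 1
  Y = rnorm(10, 0, 2),  # sd 2
  Z = rnorm(10, 0, 4)   # sd 4
)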
> gather(stocks, stock, price, -time)
# A tibble: 30 x 3
         time stock      price
       <date> <chr>      <dbl>
1  2009-01-01     X  0.1223951
2  2009-01-02     X -1.6043815
3  2009-01-03     X  1.3915413
4  2009-01-04     X -0.3061332
5  2009-01-05     X -0.9642182
6  2009-01-06     X  0.9444028
7  2009-01-07     X  0.7419676
8  2009-01-08     X -1.0001474
9  2009-01-09     X  2.1008638
10 2009-01-10     X  0.5348490
# ... with 20 more rows

spread()

This operation does the opposite of gather(). It breaks the key-value pair structure and spreads them across multiple columns.

Usage:

spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)

  • key, value: meanings remain the same as in gather()
  • fill: the value with which both explicit and implicit missing values are filled in the final output
  • drop: takes a Boolean value; when set to FALSE, factor levels with missing values are retained and filled with the "fill" value
  • sep: when NULL, it has no special effect on the column names; otherwise each output column is named "key_name" + "sep" + "key_value"
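A quick sketch of the sep parameter in action (not from the original session; with sep = "_" the key column's name is prefixed onto each value):

gather(stocks, stock, price, -time) %>% spread(stock, price, sep = "_")
# the spread columns are now named stock_X, stock_Y, and stock_Z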
> gather(stocks, stock, price, X:Z) %>% spread(stock, price)
# A tibble: 10 x 4
         time          X          Y          Z
*      <date>      <dbl>      <dbl>      <dbl>
1  2009-01-01  0.1223951 -0.3321443 -2.4538854
2  2009-01-02 -1.6043815 -0.6105896  7.8490067
3  2009-01-03  1.3915413  1.0549784 -3.9146167
4  2009-01-04 -0.3061332  1.8282398  8.6643264
5  2009-01-05 -0.9642182  2.6306931  5.1099932
6  2009-01-06  0.9444028 -0.8407392  1.2753129
7  2009-01-07  0.7419676 -1.4761008  5.1857155
8  2009-01-08 -1.0001474  1.2178356 -1.9425926
9  2009-01-09  2.1008638  1.1833792 -0.2438280
10 2009-01-10  0.5348490  0.6174628 -0.1822071

separate()

This operation is used when a single column contains multiple pieces of information. It splits that column into two or more independent columns for easier access to the data.

unite()

This operation is the inverse of separate(): it merges multiple columns into one.
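For reference, their core signatures look like the following (abridged; both drop the original columns unless remove = FALSE):

separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE)
unite(data, col, ..., sep = "_", remove = TRUE)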

> unite(stocks,time_X, time, X, sep = " // ") %>% separate(time_X, c("time", "X"), sep = " // ")
# A tibble: 10 x 4
         time                  X          Y          Z
*       <chr>              <chr>      <dbl>      <dbl>
1  2009-01-01  0.122395115962805 -0.3321443 -2.4538854
2  2009-01-02  -1.60438146101809 -0.6105896  7.8490067
3  2009-01-03   1.39154128073248  1.0549784 -3.9146167
4  2009-01-04 -0.306133216903522  1.8282398  8.6643264
5  2009-01-05 -0.964218188445317  2.6306931  5.1099932
6  2009-01-06  0.944402751900942 -0.8407392  1.2753129
7  2009-01-07  0.741967646727019 -1.4761008  5.1857155
8  2009-01-08  -1.00014741958864  1.2178356 -1.9425926
9  2009-01-09   2.10086382498772  1.1833792 -0.2438280
10 2009-01-10   0.53484896281827  0.6174628 -0.1822071

dplyr

The operations and helper methods of the dplyr library include the following:

1. select()

This method selects a specific subset of the variables (columns) present in a data frame. A smaller set of the original data can be extracted in several ways:

  1. Using a regular expression: select(df, matches(reg))
  2. Using a character string "str" that forms part of the column name: select(df, fun(str)), where fun can take the following values:
    • contains(str): selects all columns whose names contain the string str
    • ends_with(str): selects columns whose names end with str
    • starts_with(str): selects columns whose names start with str
    • num_range("c", 1:N): selects all columns named c1, c2, c3, ... cN
  3. Using a set of column names: select(df, one_of(c("column1", "column2", ...))); a specific column can be removed by mentioning it with a negation: select(df, -column1)
  4. To select all columns of the table, use everything()
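A couple of quick sketches of the helpers not exercised in the console session below (column names taken from the stocks data):

select(stocks, starts_with("t"))         # just the time column
select(stocks, one_of(c("time", "X")))   # the columns named in the set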
> select(stocks,X,time)
# A tibble: 10 x 2
            X       time
        <dbl>     <date>
1   0.1223951 2009-01-01
2  -1.6043815 2009-01-02
3   1.3915413 2009-01-03
4  -0.3061332 2009-01-04
5  -0.9642182 2009-01-05
6   0.9444028 2009-01-06
7   0.7419676 2009-01-07
8  -1.0001474 2009-01-08
9   2.1008638 2009-01-09
10  0.5348490 2009-01-10
>
>
> select(stocks,X,contains('i'))
# A tibble: 10 x 2
            X       time
        <dbl>     <date>
1   0.1223951 2009-01-01
2  -1.6043815 2009-01-02
3   1.3915413 2009-01-03
4  -0.3061332 2009-01-04
5  -0.9642182 2009-01-05
6   0.9444028 2009-01-06
7   0.7419676 2009-01-07
8  -1.0001474 2009-01-08
9   2.1008638 2009-01-09
10  0.5348490 2009-01-10
> 
>
>
> select(stocks,-Z)
# A tibble: 10 x 3
         time          X          Y
       <date>      <dbl>      <dbl>
1  2009-01-01  0.1223951 -0.3321443
2  2009-01-02 -1.6043815 -0.6105896
3  2009-01-03  1.3915413  1.0549784
4  2009-01-04 -0.3061332  1.8282398
5  2009-01-05 -0.9642182  2.6306931
6  2009-01-06  0.9444028 -0.8407392
7  2009-01-07  0.7419676 -1.4761008
8  2009-01-08 -1.0001474  1.2178356
9  2009-01-09  2.1008638  1.1833792
10 2009-01-10  0.5348490  0.6174628


2. Extracting subsets of observations (rows) through the following:

  1. distinct(df): Removes duplicate rows
  2. sample_frac(df, f, replace): f is a fraction between 0 and 1; replace takes a Boolean value, and when set to TRUE, the fraction f of rows is selected randomly with replacement. sample_n() is similar but selects n rows randomly.
  3. top_n(df, n, par): Selects the top n rows based on the parameter par.
  4. slice(df, N1:N2): Slices the rows from N1 to N2 out of the original data.
  5. filter(df, req): Serves the similar purpose of reducing the number of observations by keeping only the rows that satisfy the condition req (see the sketch after this list).
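filter() does not appear in the console session below, so here is a minimal sketch (keeping only the days on which stock X closed positive):

filter(stocks, X > 0)
# returns only the rows of stocks where column X is positive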
> 
> sample_frac(stocks,0.5,replace = TRUE)
# A tibble: 5 x 4
        time          X          Y         Z
      <date>      <dbl>      <dbl>     <dbl>
1 2009-01-06  0.9444028 -0.8407392  1.275313
2 2009-01-06  0.9444028 -0.8407392  1.275313
3 2009-01-01  0.1223951 -0.3321443 -2.453885
4 2009-01-08 -1.0001474  1.2178356 -1.942593
5 2009-01-01  0.1223951 -0.3321443 -2.453885
>
> sample_frac(stocks,0.5,replace = FALSE)
# A tibble: 5 x 4
        time          X          Y         Z
      <date>      <dbl>      <dbl>     <dbl>
1 2009-01-04 -0.3061332  1.8282398  8.664326
2 2009-01-09  2.1008638  1.1833792 -0.243828
3 2009-01-06  0.9444028 -0.8407392  1.275313
4 2009-01-02 -1.6043815 -0.6105896  7.849007
5 2009-01-03  1.3915413  1.0549784 -3.914617
>
> sample_frac(stocks,0.5,replace = TRUE) %>% distinct()
# A tibble: 3 x 4
        time          X          Y          Z
      <date>      <dbl>      <dbl>      <dbl>
1 2009-01-10  0.5348490  0.6174628 -0.1822071
2 2009-01-07  0.7419676 -1.4761008  5.1857155
3 2009-01-04 -0.3061332  1.8282398  8.6643264
> sample_frac(stocks,0.5,replace = TRUE) %>% slice(1:3) %>% distinct()
# A tibble: 3 x 4
        time         X          Y          Z
      <date>     <dbl>      <dbl>      <dbl>
1 2009-01-10 0.5348490  0.6174628 -0.1822071
2 2009-01-06 0.9444028 -0.8407392  1.2753129
3 2009-01-01 0.1223951 -0.3321443 -2.4538854


3. group_by(df, variable): Groups the data based on the given variable. This is a silent function: it has no visible effect on the data by itself, but it is effective when applied in a pipeline along with other functions.

4. summarise(): This operation is used to compute summaries of the data frame.

  1. To apply a summary function over all the columns, use summarise_each(df, funs(sum_func)), where sum_func can take the following values (see the sketch after this list):
    • first(), last(), and nth() denote a particular value of each vector; n() and n_distinct() denote the number of values and the number of distinct values in each vector, respectively.
    • Other functions are min, max, median, sd, var, and mean.
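A minimal sketch of one summary function applied across all columns (summarise_each() and funs() match the dplyr version current at the time of writing; recent dplyr replaces them with summarise(across(...))):

select(stocks, -time) %>% summarise_each(funs(mean))
# a one-row tibble holding the mean of X, Y, and Z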

Usage:

summarise(df, col_name = sum_func(column_set))
> gather(stocks, stock, price, -time) %>% group_by(time)
Source: local data frame [30 x 3]
Groups: time [10]

         time stock      price
       <date> <chr>      <dbl>
1  2009-01-01     X  0.1223951
2  2009-01-02     X -1.6043815
3  2009-01-03     X  1.3915413
4  2009-01-04     X -0.3061332
5  2009-01-05     X -0.9642182
6  2009-01-06     X  0.9444028
7  2009-01-07     X  0.7419676
8  2009-01-08     X -1.0001474
9  2009-01-09     X  2.1008638
10 2009-01-10     X  0.5348490
# ... with 20 more rows
> gather(stocks, stock, price, -time) %>% group_by(time) %>% summarise(avg = mean(price))
# A tibble: 10 x 2
         time        avg
       <date>      <dbl>
1  2009-01-01 -0.8878782
2  2009-01-02  1.8780119
3  2009-01-03 -0.4893657
4  2009-01-04  3.3954776
5  2009-01-05  2.2588227
6  2009-01-06  0.4596588
7  2009-01-07  1.4838608
8  2009-01-08 -0.5749682
9  2009-01-09  1.0134717
10 2009-01-10  0.3233682


5. mutate() is used to compute a new value and add it as an additional variable, or simply to add a new variable to the data frame.

Usage:

mutate(df, new_col = col1/col2)

There are various window functions in dplyr and in base R that can be used along with mutate() to compute a new value for each element of a vector in the data frame; a quick sketch follows.
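A minimal sketch using two common window functions (lag() from dplyr shifts a vector by one position; cumsum() is base R's cumulative sum):

mutate(stocks, X_prev = lag(X), X_cum = cumsum(X))
# adds the previous day's X and the running total of X as new columns

Below, the grouped averages from the previous section are shown again, followed by mutate() computing the same per-row average directly: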

> gather(stocks, stock, price, -time) %>% group_by(time) %>% summarise(avg = mean(price))
# A tibble: 10 x 2
         time        avg
       <date>      <dbl>
1  2009-01-01 -0.8878782
2  2009-01-02  1.8780119
3  2009-01-03 -0.4893657
4  2009-01-04  3.3954776
5  2009-01-05  2.2588227
6  2009-01-06  0.4596588
7  2009-01-07  1.4838608
8  2009-01-08 -0.5749682
9  2009-01-09  1.0134717
10 2009-01-10  0.3233682
>
>
> mutate(stocks, avg = (X + Y+ Z)/3)
# A tibble: 10 x 5
         time          X          Y          Z        avg
       <date>      <dbl>      <dbl>      <dbl>      <dbl>
1  2009-01-01  0.1223951 -0.3321443 -2.4538854 -0.8878782
2  2009-01-02 -1.6043815 -0.6105896  7.8490067  1.8780119
3  2009-01-03  1.3915413  1.0549784 -3.9146167 -0.4893657
4  2009-01-04 -0.3061332  1.8282398  8.6643264  3.3954776
5  2009-01-05 -0.9642182  2.6306931  5.1099932  2.2588227
6  2009-01-06  0.9444028 -0.8407392  1.2753129  0.4596588
7  2009-01-07  0.7419676 -1.4761008  5.1857155  1.4838608
8  2009-01-08 -1.0001474  1.2178356 -1.9425926 -0.5749682
9  2009-01-09  2.1008638  1.1833792 -0.2438280  1.0134717
10 2009-01-10  0.5348490  0.6174628 -0.1822071  0.3233682
> 
> 


6. dplyr also provides methods to perform various types of joins.

Usage:

left_join(tab_1, tab_2, by = "col_1")

Function list

  1. left_join(): keeps all the rows of tab_1 and joins on the matching rows of tab_2
  2. right_join(): keeps all the rows of tab_2 and joins on the matching rows of tab_1
  3. inner_join(): joins only the rows present in both tab_1 and tab_2
  4. full_join(): joins all the rows present in either table, irrespective of matching

The four join operations mentioned above are known as mutating joins (they add data to the existing table). semi_join() and anti_join() are known as filtering joins, as they filter the rows of the first table based on the presence or absence of matches in the second. Other merging functions are also available, such as intersect(), union(), setdiff(), bind_rows(), bind_cols(), and combine(). A small mutating-join sketch precedes the console session below.
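A minimal sketch of left_join(), splitting stocks into two tables and rejoining them on the date column (prices and others are names invented here for illustration):

prices <- select(stocks, time, X)
others <- select(stocks, time, Y, Z)
left_join(prices, others, by = "time")
# every row of prices, with the matching Y and Z values attached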

> semi_join(sample_frac(select(stocks, time, X), 0.5, replace = TRUE), sample_frac(select(stocks, time, Y), 0.5, replace = TRUE), by = "time")
# A tibble: 2 x 2
        time         X
      <date>     <dbl>
1 2009-01-02 -1.604381
2 2009-01-02 -1.604381
> anti_join(sample_frac(select(stocks, time, X), 0.5, replace = TRUE), sample_frac(select(stocks, time, Y), 0.5, replace = TRUE), by = "time")
# A tibble: 3 x 2
        time          X
      <date>      <dbl>
1 2009-01-02 -1.6043815
2 2009-01-07  0.7419676
3 2009-01-07  0.7419676
> setdiff(sample_frac(select(stocks, time, X), 0.5, replace = TRUE), sample_frac(select(stocks, time, X), 0.5, replace = TRUE))
# A tibble: 3 x 2
        time          X
      <date>      <dbl>
1 2009-01-05 -0.9642182
2 2009-01-04 -0.3061332
3 2009-01-06  0.9444028


7. Another important feature of dplyr is the arrange() function, which lets us sort the data in ascending order based on a specific column's value. If required, the data can be sorted in descending order using the desc() function.

  1. Implementation: arrange(df, column_name) for ascending order
  2. arrange(df, desc(column_name)) for descending order
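A quick sketch on the stocks data:

arrange(stocks, X)        # rows ordered by ascending X
arrange(stocks, desc(X))  # rows ordered by descending X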

Using functions from the ggplot2 package will help us visualize the data more effectively by plotting it.

The examples below show how raw data can be processed using the functions mentioned above and then plotted to visualize the average stock price for each date in the sample.

Example

For a bar graph visualization with time as the color-coding parameter:

ggplot(data = gather(stocks, stock, price, -time) %>% group_by(time) %>% summarise(avg = mean(price)),
       aes(x = time, y = avg, fill = time)) +
  geom_bar(stat = "identity")


[Bar graph: average price per date]

For a scatter plot of the same data:

qplot(time, avg,
      data = gather(stocks, stock, price, -time) %>% group_by(time) %>% summarise(avg = mean(price)),
      colour = 'red', main = "Avg stock price for each date", xlab = "date", ylab = "avg price")

[Scatter plot: average price per date]


Regression models are designed to uncover patterns that cannot be observed easily in the data. R equips you with libraries like the ones discussed in this article. Looked at more deeply, they are not only data-visualization tools but also tools for sharpening your understanding of the granular aspects of your data.

In the next article, we’ll explore more data-visualization libraries.
