Introduction
“The best way to learn a new skill is by doing it!”
This article is meant to help R users enhance their set of skills and learn Python for data science (from scratch). After all, R and Python are the most important programming languages a data scientist must know.
Python is a powerful, multi-purpose programming language whose popularity has grown phenomenally in the last few years. It is used for web development, game development, and now data analysis and machine learning, which are relatively new areas for Python.
For a beginner in data science, learning Python for data analysis can be really painful. Why?
Try Googling "learn python" and you'll get tons of tutorials meant only for learning Python for web development. How do you find your way then?
In this tutorial, we'll explore the basics of Python for performing data manipulation tasks. Alongside, we'll look at how you do the same things in R. This parallel comparison will help you relate the tasks you already do in R to how you do them in Python! In the end, we'll take up a data set and practice our newly acquired Python skills.
Note: This article is best suited for people who have a basic knowledge of the R language.
Table of Contents
- Why learn Python (even if you already know R)
- Understanding Data Types and Structures in Python vs. R
- Writing Code in Python vs. R
- Practicing Python on a Data Set
Why learn Python (even if you already know R)
No doubt, R is tremendously good at what it does. In fact, it was originally designed for statistical computing and data manipulation. Its incredible community support allows a beginner to learn R quickly.
But Python is catching up fast. Established companies and startups have embraced Python at a much larger scale than R.
According to indeed.com, between January and November 2016 the number of job postings for "machine learning python" grew much faster (by approximately 123%) than postings for "machine learning in R". Why? Because:
- Python supports the entire spectrum of machine learning in a much better way.
- Python not only supports model building but also supports model deployment.
- Powerful deep learning libraries such as keras, convnet, theano, and tensorflow are better supported in Python than in R.
- Unlike in R, you don't need to juggle several packages to locate a function in Python. Python has relatively few libraries, each offering most of the functions a data scientist needs.
Understanding Data Types and Structures in Python vs. R
These programming languages understand the complexity of a data set based on its variables and data types. Yes! Let's say you have a data set with one million rows and 50 columns. How would these programming languages understand the data?
Both R and Python have predefined data types. The dependent and independent variables are classified among these data types, and based on the data type, the interpreter allocates memory. Python supports the following data types:
- Numbers – Store numeric values. These numeric values can be stored in 4 types: integer, long, float, and complex (in Python 3, long has been merged into integer).
  - Integer – Whole numbers such as 10, 13, 91, 102. Same as R's integer type.
  - Long – Long integers, which can also be written in octal and hexadecimal. R uses the bit64 package for 64-bit integers.
  - Float – Decimal values like 1.23, 9.89. Equivalent to R's numeric type.
  - Complex – Numbers like 2 + 3i, 5i. Rarely used in data analysis.
- Boolean – Stores two values (True and False). R's equivalent is the logical type. Note the difference in case: R uses TRUE/FALSE; Python uses True/False.
- Strings – Store text like "elephant", "lotus". Same as R's character type.
- Lists – Like R's list, store multiple data types in one structure.
- Tuples – Like lists, but immutable: once created, their elements cannot be changed. R has no direct equivalent.
- Dictionary – Key-value pair structure. Think of keys as column names, values as data entries.
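To make the types above concrete, here is a minimal sketch using only built-in Python. The variable names and sample values are made up for illustration; note that Python writes the imaginary unit as j rather than i.

```python
# Each literal below maps to one of the built-in types described above.
examples = {
    "integer": 102,
    "float": 9.89,
    "complex": 2 + 3j,       # Python writes the imaginary unit as j, not i
    "boolean": True,
    "string": "elephant",
    "list": ["monday", 24, True],
    "tuple": ("monday", 24, True),
    "dictionary": {"Name": "Sam", "Score": 45},
}

# Print the type name Python assigns to each value
for label, value in examples.items():
    print(label, "->", type(value).__name__)

# Lists can be changed in place ...
my_list = examples["list"]
my_list[0] = "tuesday"

# ... while assigning to a tuple element raises a TypeError
my_tuple = examples["tuple"]
try:
    my_tuple[0] = "tuesday"
    tuple_changed = True
except TypeError:
    tuple_changed = False
```

The last few lines show the practical difference between a list and a tuple: the list accepts the assignment, the tuple rejects it.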
Since R is a statistical computing language, all the functions for manipulating data and reading variables are built in. Python, on the other hand, gets its data analysis, manipulation, and visualization functions from external libraries. Python has several libraries for data manipulation and machine learning. The most important ones are:
- Numpy – Used for numerical computing. Offers math functions and array support. Its arrays are similar to R's vector and array.
- Scipy – Scientific computing in Python.
- Matplotlib – For data visualization. R uses ggplot2.
- Pandas – Main tool for data manipulation. R uses dplyr and data.table.
- Scikit Learn – Core library for machine learning algorithms in Python.
In a way, Python for a data scientist is largely about mastering the libraries stated above. However, there are many more advanced libraries which people have started using. Therefore, for practical purposes you should remember the following things:
- Array – Similar to R's array, supports multidimensional data and coerces values to a common type when data types differ.
- List – Equivalent to R's list.
- Data Frame – Two-dimensional structure composed of lists. R uses data.frame; Python uses DataFrame from pandas.
- Matrix – Two-dimensional structure holding data of a single class. In R: matrix(); in Python: a NumPy array, built for example with numpy.column_stack().
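The coercion behaviour of arrays and the "keys as column names" idea can be checked quickly. This is a minimal sketch, assuming numpy and pandas are installed; the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Mixing numbers and text in a NumPy array coerces every element to a string
mixed = np.array([1, 2.5, "three"])
print(mixed.dtype.kind)   # 'U' stands for a Unicode string dtype

# A plain Python list keeps each element's own type
plain = [1, 2.5, "three"]
print([type(x).__name__ for x in plain])

# A DataFrame built from a dictionary: keys become column names,
# and each column keeps its own data type
scores = pd.DataFrame({"Name": ["Sam", "Paul"], "Score": [45, 89]})
print(scores.dtypes)
```

This is why a list is the right container for mixed data, while an array suits data of a single type.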
By now, I hope you've understood the basics of data types and data structures in R and Python. Now, let's start working with them!
Writing Code in Python vs. R
Let's use the knowledge gained in the previous section and understand its practical implications. But before that, install Python through the Anaconda distribution, which includes Jupyter Notebook. You can download it from the Anaconda website, or use another Python IDE if you prefer. I assume you already have RStudio installed.
1. Creating Lists
In R:
my_list <- list('monday','specter',24,TRUE)
typeof(my_list)
[1] "list"
In Python:
my_list = ['monday','specter',24,True]
type(my_list)
list
Using pandas Series:
import pandas as pd
pd_list = pd.Series(my_list)
pd_list
0 monday
1 specter
2 24
3 True
dtype: object
Python uses zero-based indexing; R uses one-based indexing.
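A short sketch of what zero-based indexing means in practice, reusing the list from above. One trap for R users: a negative index in Python counts from the end, whereas in R it drops elements. The variable names on the left are made up for illustration.

```python
import pandas as pd

my_list = ['monday', 'specter', 24, True]

first = my_list[0]         # first element (R: my_list[[1]])
last = my_list[-1]         # negative indices count from the end
first_two = my_list[0:2]   # the slice end is exclusive: positions 0 and 1

pd_list = pd.Series(my_list)
series_first = pd_list.iloc[0]   # position-based access on a Series
```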
2. Matrix
In R:
my_mat <- matrix(1:10, nrow = 5)
my_mat
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
# Select first row
my_mat[1,]
# Select second column
my_mat[,2]
In Python (using NumPy):
import numpy as np
a = np.array(range(10,15))
b = np.array(range(20,25))
c = np.array(range(30,35))
my_mat = np.column_stack([a, b, c])
# Select first row
my_mat[0,]
# Select second column
my_mat[:,1]
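The NumPy example above builds a different matrix from the R one. If you want the same 5 x 2 matrix that matrix(1:10, nrow = 5) produces, one possible approach (a sketch, not the only way) is to reshape a range in column-major order, since R fills matrices column by column:

```python
import numpy as np

# order='F' fills column by column, like R's matrix(1:10, nrow = 5)
my_mat = np.arange(1, 11).reshape((5, 2), order='F')
print(my_mat.shape)

first_row = my_mat[0, :]     # R: my_mat[1,]
second_col = my_mat[:, 1]    # R: my_mat[,2]
print(first_row, second_col)
```

Without order='F', NumPy fills row by row, so the first row would hold 1 and 2 instead of 1 and 6.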
3. Data Frames
In R:
data_set <- data.frame(Name = c("Sam","Paul","Tracy","Peter"),
Hair_Colour = c("Brown","White","Black","Black"),
Score = c(45,89,34,39))
In Python:
data_set = pd.DataFrame({'Name': ["Sam","Paul","Tracy","Peter"],
'Hair_Colour': ["Brown","White","Black","Black"],
'Score': [45,89,34,39]})
Selecting columns:
In R:
data_set$Name
data_set[["Name"]]
data_set[1]
data_set[c('Name','Hair_Colour')]
data_set[,c('Name','Hair_Colour')]
In Python:
data_set['Name']
data_set.Name
data_set[['Name','Hair_Colour']]
data_set.loc[:,['Name','Hair_Colour']]
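The selection styles above do not all return the same kind of object. A minimal sketch, rebuilding the same data frame so it runs on its own (the variable names on the left are made up for illustration):

```python
import pandas as pd

data_set = pd.DataFrame({'Name': ["Sam", "Paul", "Tracy", "Peter"],
                         'Hair_Colour': ["Brown", "White", "Black", "Black"],
                         'Score': [45, 89, 34, 39]})

as_series = data_set['Name']         # single brackets: a Series (R: data_set$Name)
as_frame = data_set[['Name']]        # double brackets: a one-column DataFrame
by_position = data_set.iloc[:, [0]]  # first column by position (R: data_set[1])

print(type(as_series).__name__, type(as_frame).__name__)
```

Passing a list of names always gives back a DataFrame, even when the list holds a single column.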
Practicing Python on a Data Set
import numpy as np
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version
from sklearn.datasets import load_boston
boston = load_boston()
boston.keys()
['data', 'feature_names', 'DESCR', 'target']
print(boston['feature_names'])
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
print(boston['DESCR'])
bos_data = pd.DataFrame(boston['data'])
bos_data.head()
bos_data.columns = boston['feature_names']
bos_data.head()
bos_data.describe()
# First 10 rows
bos_data.iloc[:10]
# First 5 columns
bos_data.loc[:, 'CRIM':'NOX']
bos_data.iloc[:, :5]
# Filter rows
bos_data.query("CRIM > 0.05 & CHAS == 0")
# Sample
bos_data.sample(n=10)
# Sort
bos_data.sort_values(['CRIM']).head()
bos_data.sort_values(['CRIM'], ascending=False).head()
# Rename column (returns a new DataFrame; bos_data itself is unchanged)
bos_data.rename(columns={'CRIM': 'CRIM_NEW'})
# Column means
bos_data[['ZN','RM']].mean()
# Transform numeric to categorical
bos_data['ZN_Cat'] = pd.cut(bos_data['ZN'], bins=5, labels=['a','b','c','d','e'])
# Grouped sum
bos_data.groupby('ZN_Cat')['AGE'].sum()
# Pivot table
bos_data['NEW_AGE'] = pd.cut(bos_data['AGE'], bins=3, labels=['Young','Old','Very_Old'])
bos_data.pivot_table(values='DIS', index='ZN_Cat', columns='NEW_AGE', aggfunc='mean')
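If you cannot load the Boston data, the same calls can be tried on a tiny stand-in frame. The sketch below uses hypothetical values made up purely for illustration (they are not taken from the Boston data set), and the name toy is an assumption, not part of the tutorial.

```python
import pandas as pd

# Hypothetical values, made up for illustration only
toy = pd.DataFrame({
    'CRIM': [0.02, 0.10, 0.30, 0.07],
    'CHAS': [0, 0, 1, 0],
    'AGE':  [10.0, 50.0, 90.0, 70.0],
})

# Filter rows with the same expression style as the query() call above
filtered = toy.query("CRIM > 0.05 & CHAS == 0")

# rename() returns a new DataFrame; toy keeps its original column names
renamed = toy.rename(columns={'CRIM': 'CRIM_NEW'})

# Bin AGE into three equal-width intervals, then sum AGE within each bin
toy['NEW_AGE'] = pd.cut(toy['AGE'], bins=3, labels=['Young', 'Old', 'Very_Old'])
# observed=True keeps only the bins that actually occur in the data
age_sums = toy.groupby('NEW_AGE', observed=True)['AGE'].sum()
print(age_sums)
```

To keep a renamed column, assign the result back, for example toy = toy.rename(...).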
Summary
While coding in Python, I realized that the amount of code you write is much the same in both languages, although some functions are shorter in R than in Python. However, R has really awesome packages which handle big data quite conveniently. Do let me know if you wish to learn about them!
Overall, learning both languages will give you the confidence to handle any type of data set. In fact, the best part about learning Python is the comprehensive documentation available for the numpy, pandas, and scikit learn libraries, which is sufficient to help you overcome all initial obstacles.
In this article, we just touched on the basics of Python. There's a long way to go. Next week, we'll learn about data manipulation in Python in detail. After that, we'll look into data visualization and the powerful machine learning library in Python.
Do share your experience, suggestions, and questions below while practicing this tutorial!