Data Visualization for Beginners-Part 3

July 9, 2018
8 mins

Bonjour! Welcome to another part of the series on data visualization techniques. In the previous two articles, we discussed different data visualization techniques that can be applied to visualize and gather insights from categorical and continuous variables. You can check out the first two articles here:

In this article, we’ll go through the implementation and use of a bunch of data visualization techniques such as heat maps, surface plots, correlation plots, etc. We will also look at different techniques that can be used to visualize unstructured data such as images, text, etc.

Heatmaps

A heat map (or heatmap) is a two-dimensional graphical representation of data that uses colour to encode the value of each data point. It helps reveal relationships between data values that would be much harder to spot if presented numerically in a table or matrix.
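As a minimal sketch of the idea, the snippet below draws a heatmap of random numbers with seaborn. The grid size, seed, and colour map are arbitrary choices made for illustration, not values taken from the original figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Hypothetical data: a 10 x 12 grid of random values in [0, 1)
rng = np.random.default_rng(seed=0)
data = rng.random((10, 12))

fig, ax = plt.subplots(figsize=(8, 6))
# Each cell is coloured by its value; the colour bar maps colours back to numbers
sns.heatmap(data, cmap="viridis", ax=ax)
ax.set_title("Heatmap of random values")
# Call fig.savefig("heatmap.png") or plt.show() to view the result
```

Passing `annot=True` to `sns.heatmap` additionally prints each value inside its cell, which is handy for small matrices.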

Fig 1. Heatmap using the seaborn library

Let’s understand this using an example. We’ll be using the metadata from the Deep Learning 3 challenge. Link to the dataset. The Deep Learning 3 challenge asked participants to predict the attributes of animals by looking at their images.

We will be analyzing how often an attribute occurs in relationship with the other attributes. To analyze this relationship, we will compute the co-occurrence matrix.
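A co-occurrence matrix can be sketched as follows, assuming the metadata is available as a table with one row per image and one binary column per attribute. The attribute names and values here are made up for illustration; they are not the challenge data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the metadata: 1 means the attribute is present in that image
attributes = pd.DataFrame(
    {
        "furry": [1, 1, 0, 1],
        "tail": [1, 0, 1, 1],
        "stripes": [0, 1, 0, 1],
    }
)

# Entry (i, j) of A^T A counts the rows in which attributes i and j are both 1
cooccurrence = attributes.T.dot(attributes)

# The diagonal holds each attribute's own frequency; zero it out to focus on pairs
pair_counts = cooccurrence.to_numpy().copy()
np.fill_diagonal(pair_counts, 0)
pairs_only = pd.DataFrame(
    pair_counts, index=cooccurrence.index, columns=cooccurrence.columns
)

print(cooccurrence)
```

The resulting matrix is symmetric, so a heatmap of it only needs to be read on one side of the diagonal.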

We can see that the values in the co-occurrence matrix represent how often each attribute occurs together with each of the other attributes. Although the matrix contains all the information, it is hard to interpret visually and to draw inferences from. To counter this problem, we will use heat maps, which show the co-occurrences graphically.

Fig 2. Heatmap of the co-occurrence matrix indicating the frequency of occurrence of one attribute with the others

Since the frequency of the co-occurrence is represented by a colour palette, we can now easily interpret which attributes appear together the most. Thus, we can infer that these attributes are common to most of the animals.

Choropleth

Choropleths are a type of map that provides an easy way to show how some quantity varies across a geographical area or show the level of variability within a region. A heat map is similar but doesn’t include geographical boundaries. Choropleth maps are also appropriate for indicating differences in the distribution of the data over an area, like ownership or use of land or type of forest cover, density information, etc. We will be using the geopandas library to implement the choropleth graph.

We will be using a choropleth graph to visualize the GDP across the globe. Link to the dataset.

   COUNTRY         GDP (BILLIONS)  CODE
0  Afghanistan              21.71  AFG
1  Albania                  13.40  ALB
2  Algeria                 227.80  DZA
3  American Samoa            0.75  ASM
4  Andorra                   4.80  AND
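The mechanics of a choropleth with geopandas can be sketched as below: join the quantity to the region geometries on a shared key, then colour each region by that quantity. The rectangles are placeholder shapes standing in for real country borders (an assumption made to keep the example self-contained); in practice you would load a boundaries file with `geopandas.read_file` and merge it on the country code in the same way.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
from shapely.geometry import box

# Three of the GDP rows shown in the sample above
gdp = pd.DataFrame(
    {
        "COUNTRY": ["Afghanistan", "Albania", "Algeria"],
        "GDP (BILLIONS)": [21.71, 13.40, 227.80],
        "CODE": ["AFG", "ALB", "DZA"],
    }
)

# Placeholder geometry (assumption): unit rectangles instead of real borders
regions = gpd.GeoDataFrame(
    {"CODE": ["AFG", "ALB", "DZA"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
)

# Attach the quantity to each region by joining on the shared country code
merged = regions.merge(gdp, on="CODE", how="left")

fig, ax = plt.subplots(figsize=(8, 3))
# `column` selects the value that drives the colour of each region
merged.plot(column="GDP (BILLIONS)", cmap="OrRd", legend=True, edgecolor="black", ax=ax)
ax.set_axis_off()
```

A left join keeps every region even when no GDP value is available for it, so gaps in the data show up on the map instead of silently removing countries.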

Fig 3. Choropleth graph indicating the GDP according to geographical locations

Surface plot

Surface plots are used for the three-dimensional representation of the data. Rather than showing individual data points, surface plots show a functional relationship between a dependent variable (Z) and two independent variables (X and Y).

It is useful in analyzing relationships between the dependent and the independent variables and thus helps in establishing desirable responses and operating conditions.

One of the main applications of surface plots in machine learning or data science is the analysis of the loss function. By plotting the loss against two hyperparameters, we can see how they affect the loss and choose values that help prevent overfitting of the model.
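The snippet below is a small sketch of a surface plot with matplotlib. The "loss" is a toy bowl-shaped function of two variables, chosen only for illustration; it is not a loss from a real model.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Two independent variables on a regular grid (step 0.1)
x = np.linspace(-3, 3, 61)
y = np.linspace(-3, 3, 61)
X, Y = np.meshgrid(x, y)

# Toy dependent variable (assumption): a bowl with its minimum at (1.0, -0.5)
Z = (X - 1.0) ** 2 + 2.0 * (Y + 0.5) ** 2

fig = plt.figure(figsize=(7, 5))
ax = fig.add_subplot(projection="3d")
surface = ax.plot_surface(X, Y, Z, cmap="viridis")
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.set_zlabel("Z (toy loss)")
fig.colorbar(surface, ax=ax, shrink=0.6)

# Locate the grid point with the lowest value of Z
i, j = np.unravel_index(np.argmin(Z), Z.shape)
best = (X[i, j], Y[i, j])
```

`np.meshgrid` turns the two 1-D axes into 2-D coordinate arrays, which is the form `plot_surface` expects for X, Y, and Z.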

Fig 4. Surface plot visualizing the dependent variable w.r.t the independent variables in 3-dimensions

Visualizing high-dimensional datasets

Dimensionality refers to the number of attributes present in the dataset. For example, consumer-retail datasets can have a vast number of variables (e.g. sales, promos, products, open, etc.). As a result, visually exploring the dataset to find potential correlations between variables becomes extremely challenging.

Therefore, we use a technique called dimensionality reduction to visualize higher-dimensional datasets. Here, we will focus on two such techniques:

• Principal Component Analysis (PCA)
• T-distributed Stochastic Neighbor Embedding (t-SNE)

Principal Component Analysis (PCA)

Before we jump into understanding PCA, let’s review some terms:

• Variance: Variance is simply the measure of the spread or extent of the data. Mathematically, it is the average squared deviation from the mean position.
• Covariance: Covariance is the measure of the extent to which corresponding elements from two sets of ordered data move in the same direction. It is the measure of how two random variables vary together. It is similar to variance, but where variance tells you the extent of one variable, covariance tells you the extent to which the two variables vary together. Mathematically, it is defined as:

$$\operatorname{cov}(X, Y) = E\big[(X - E[X])\,(Y - E[Y])\big]$$

A positive covariance means X and Y are positively related, i.e., if X increases, Y tends to increase, while negative covariance means the opposite relation. Zero covariance means X and Y have no linear relationship.
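The definitions above can be checked numerically. The sketch below computes variance and covariance as averages over the data (the population form, dividing by n) on a small made-up sample and compares the results with NumPy's built-ins.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2.0 * x + 1.0          # rises with x
y_down = -x + 10.0            # falls as x rises
y_curve = (x - 3.0) ** 2      # depends on x, but not linearly


def variance(a):
    """Average squared deviation from the mean."""
    return np.mean((a - a.mean()) ** 2)


def covariance(a, b):
    """Average product of the paired deviations from the two means."""
    return np.mean((a - a.mean()) * (b - b.mean()))


var_x = variance(x)                 # 2.0
cov_up = covariance(x, y_up)        # 4.0  (positive: the two move together)
cov_down = covariance(x, y_down)    # -2.0 (negative: they move in opposite directions)
cov_curve = covariance(x, y_curve)  # 0.0  (zero, although y_curve is a function of x)

# NumPy equivalents; bias=True makes np.cov divide by n instead of n - 1
assert np.isclose(var_x, np.var(x))
assert np.isclose(cov_up, np.cov(x, y_up, bias=True)[0, 1])
```

The last case shows why zero covariance should be read as "no linear relationship": `y_curve` is fully determined by `x`, yet the covariance is zero.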

Fig 5. Different types of covariance

PCA is the orthogonal projection of data onto a lower-dimensional linear space that maximizes the variance (green line) of the projected data and minimizes the mean squared distance between the data points and their projections (blue line). The variance describes the direction of maximum information, while the mean squared distance describes the information lost during projection of the data onto the lower dimension.

Thus, given a set of data points in a d-dimensional space, PCA projects these points onto a lower dimensional space while preserving as much information as possible.
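To make the projection concrete, here is a minimal from-scratch sketch of PCA using the eigendecomposition of the covariance matrix. The 2-D data is synthetic (an elongated, rotated point cloud generated for illustration), and all variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data (assumption): a cloud stretched along one axis, then rotated by 30 degrees
latent = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])
theta = np.deg2rad(30)
rotation = np.array(
    [[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]]
)
X = latent @ rotation.T

# 1. Centre the data and compute its covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# 2. Eigenvectors of the covariance matrix are the principal axes;
#    eigh returns them in ascending order, so sort descending by eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Project onto the first principal axis (2-D -> 1-D)
w = eigvecs[:, 0]
scores = Xc @ w

# 4. Map the 1-D scores back to 2-D and measure what the projection lost
reconstruction = np.outer(scores, w)
mse = np.mean(np.sum((Xc - reconstruction) ** 2, axis=1))

explained = eigvals / eigvals.sum()
print("share of variance per component:", explained)
print("mean squared distance to projections:", mse)
```

The variance of `scores` equals the largest eigenvalue, and the mean squared distance is governed by the remaining eigenvalue, which is why maximizing the projected variance and minimizing the projection error lead to the same axis.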

Fig 6. Illustration of principal component analysis

In the figure, the component along the direction of maximum variance is defined as the first principal component. Similarly, the component along the direction of second maximum variance is defined as the second principal component, and so on. These principal components are referred to as the new dimensions carrying the maximum information.

We can see that approximately 98% of the variance of the data lies along the first principal component, while the second component accounts for only about 1.6% of the variance.
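The percentages quoted above appear consistent with what scikit-learn's built-in breast cancer dataset gives when PCA is applied to the unscaled features. That the article used this dataset is an assumption on my part; the sketch below shows how such numbers and a two-component scatter plot can be produced.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

# Assumed dataset: 569 samples, 30 numeric features, binary label
cancer = load_breast_cancer()

# Reduce the 30 features to 2 principal components
pca = PCA(n_components=2)
components = pca.fit_transform(cancer.data)

# Share of the total variance carried by each component
ratio = pca.explained_variance_ratio_
print("explained variance ratio:", ratio)

# Scatter plot of the two components, coloured by the class label
fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(components[:, 0], components[:, 1], c=cancer.target, cmap="coolwarm", s=10)
ax.set_xlabel("First principal component")
ax.set_ylabel("Second principal component")
```

Because the features are left unscaled here, the few features with the largest numeric range dominate the first component. Standardizing the features first (for example with `StandardScaler`) usually spreads the variance over more components.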

Fig 7. Visualizing the distribution of cancer across the data

Thus, with the help of PCA, we can get a visual perception of how the labels are distributed across the given data (see Fig 7).

T-distributed Stochastic Neighbour Embedding (t-SNE)

T-distributed Stochastic Neighbour Embedding (t-SNE) is a non-linear dimensionality reduction technique that is well suited to visualizing high-dimensional data. It was developed by Laurens van der Maaten and Geoffrey Hinton. In contrast to PCA, which is a deterministic linear projection, t-SNE adopts a probabilistic approach.

PCA can capture the global structure of high-dimensional data but fails to describe the local structure within it. t-SNE, by contrast, captures the local structure of high-dimensional data very well while also revealing global structure such as the presence of clusters at several scales. t-SNE converts the similarities between data points into joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and those of the high-dimensional data. In doing so, it preserves the neighbourhood structure of the original data.
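A minimal sketch of t-SNE on the iris dataset with scikit-learn follows. The perplexity, initialization, and seed are illustrative choices, not settings taken from the article.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

iris = load_iris()  # 150 samples, 4 features, 3 species

# Embed the 4-D feature space into 2-D; random_state makes the run repeatable
tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0)
embedding = tsne.fit_transform(iris.data)

# Final value of the Kullback-Leibler divergence that the optimizer minimized
kl = tsne.kl_divergence_
print("KL divergence after optimization:", kl)

# Scatter plot of the embedding, coloured by species
fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(embedding[:, 0], embedding[:, 1], c=iris.target, cmap="viridis", s=15)
ax.set_xlabel("t-SNE dimension 1")
ax.set_ylabel("t-SNE dimension 2")
```

The result depends noticeably on `perplexity`, which roughly sets how many neighbours each point considers, so it is worth trying a few values before reading much into the picture.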

Fig 8. Visualizing the feature space of the iris dataset using t-SNE

Thus, by reducing the dimensions using t-SNE, we can visualize the distribution of the labels over the feature space. We can see in the figure that each label forms its own little cluster. So, if we were to use a clustering algorithm to generate clusters from the new features/components, we could accurately assign points to a label.
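One way to check how well clusters in the embedding line up with the labels is sketched below: run k-means on the t-SNE output and score the agreement with the adjusted Rand index. The choice of k-means, three clusters, and this score are my assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score

iris = load_iris()

# 2-D t-SNE embedding of the 4-D iris features
embedding = TSNE(n_components=2, random_state=0).fit_transform(iris.data)

# Cluster the embedded points into three groups, one per species
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(embedding)

# Adjusted Rand index: 1.0 means the clusters match the labels exactly,
# values near 0 mean the agreement is no better than chance
ari = adjusted_rand_score(iris.target, cluster_ids)
print("adjusted Rand index:", ari)
```

Note that scikit-learn's `TSNE` has no `transform` method, so points that were not part of the original fit cannot be placed into an existing embedding; the embedding has to be recomputed with them included.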

Conclusion

Let’s quickly summarize the topics we covered. We started with the generation of heatmaps using random numbers and extended their application to a real-world example. Next, we implemented choropleth graphs to visualize data points with respect to geographical locations. We moved on to implement surface plots to get an idea of how we can visualize data as a three-dimensional surface. Finally, we used two dimensionality reduction techniques, PCA and t-SNE, to visualize high-dimensional datasets.

I encourage you to implement the examples described in this article to get hands-on experience. I hope you enjoyed the article. Do let me know in the comments below if you have any feedback, suggestions, or thoughts on it!
