SQL for Data Analysis in R: Step-by-Step Tutorial

Introduction

Many people are choosing data science as a career these days. With the recent data deluge, companies are voraciously headhunting people who can handle, understand, analyze, and model data.

Be it college graduates or experienced professionals, everyone is busy searching for the best courses or training material to become a data scientist. Some of them even manage to learn Python or R, but still can't land their first analytics job!

What most people fail to understand is that the data science/analytics industry isn't just limited to using Python or R. There are several other coding languages which companies use to run their businesses.

Among these, the most important and widely used language is SQL (Structured Query Language). You must learn it.

I've realized that, for a newbie, learning SQL at home is somewhat difficult. After all, setting up a server-enabled database engine isn't everybody's cup of tea. But don't worry.

In this article, we'll learn all about SQL and how to write its queries.

Note: This article is meant to help R users who want to learn SQL from scratch. Even if you are new to R, you can still follow this tutorial, as the ultimate goal here is to learn SQL.

Why learn SQL?
What is SQL?
Getting Started with SQL
- Data Selection
- Data Manipulation
- Strings & Dates
Practising SQL in R


Why learn SQL?

Good question! When I started learning SQL, I asked this question too. I had no one to answer me, though, so I decided to find out for myself.

SQL is the de facto standard programming language used to handle relational databases.

Let's look at the popularity of SQL in the worldwide analytics and data science industry. According to an online survey conducted by O'Reilly Media in 2016, SQL was the most widely used programming language, used by 70% of the respondents, followed by R and Python. The survey also found that people who know Excel (spreadsheets) tend to get a significant salary boost once they learn SQL.

Also, according to a survey done by Data Science Central, R users tend to get a nice salary boost once they learn SQL. In a way, SQL as a language is meant to complement your current set of skills.

Since the 1970s, SQL has remained an integral part of popular databases such as Oracle, IBM DB2, Microsoft SQL Server, MySQL, etc. Not only will learning SQL alongside R increase your employability, but SQL by itself can also open the way to database management roles.

What is SQL?

SQL (Structured Query Language) is a special-purpose programming language used to manage, extract, and aggregate data stored in large relational database management systems.

In simple words, think of a large machine (rectangular in shape) consisting of many, many boxes (again rectangles). Each box holds a table (dataset). This is a database: an organized collection of data. Now, this database understands only one language, i.e., SQL. No English, Japanese, or Spanish. Just SQL. Therefore, SQL is the language used to interact with databases and retrieve data.

Following are some important features of SQL:

It allows us to create, update, retrieve, and delete data from the database.
It works with popular database programs such as Oracle, DB2, SQL Server, etc.
As databases store humongous amounts of data, SQL is widely known for its speed and efficiency.
It is very simple and easy to learn.
It comes with built-in string and date functions, including date-time conversions.

Currently, businesses worldwide use both open source and proprietary relational database management systems (RDBMS) built around SQL.

Getting Started with SQL

Let's try to understand SQL commands now. Most of these commands are extremely easy to pick up as they are simple "English words." But make sure you get a proper understanding of their meanings and usage in the SQL context. For your ease of understanding, I've categorized the SQL commands into three sections:

Data Selection - These are SQL's core commands, used to retrieve data from tables in a database with the help of logical conditions.
Data Manipulation - These commands allow you to aggregate and join data and generate insights from it.
Strings and Dates - These special commands allow you to work with dates and string variables.

Before we start, you should know that, besides character strings (text), SQL mainly works with four data types. These are:

Integers - This datatype is assigned to variables storing whole numbers, no decimals. For example, 123, 324, 90, 10, 1, etc.
Boolean - This datatype is assigned to variables storing TRUE or FALSE data.
Numeric - This datatype is assigned to variables storing decimal numbers. Internally, it is typically stored as a double-precision number, which can hold up to 15-17 significant digits.
Date/Time - This datatype is assigned to variables storing date-time information. Internally, it is stored as a timestamp.

That's all! If a value doesn't match the type of its column, you can get read errors. For example, if a numeric variable has numbers with a stray comma (like 432,), you'll get errors. SQL as a language is also very particular about the sequence of commands in a query. If the sequence is not followed, it starts to throw errors. Don't worry, I've defined the sequence below. Let's learn the commands first. In the following section, we'll learn to use them with a data set.

Data Selection

SELECT - It specifies which columns to retrieve.
FROM - It specifies the table (dataset) from which the columns are selected.
LIMIT - By default, a query runs on all rows in a table. This command limits the number of rows returned, which often makes queries run faster.
WHERE - This command specifies a filter condition; i.e., only the rows that satisfy the condition are retrieved.
Comparison Operators - Everyone knows these operators as (=, !=, <, >, <=, >=). They are used in conjunction with the WHERE command.
Logical Operators - The famous logical operators (AND, OR, NOT) are also used to specify multiple filtering conditions. Other operators include:
- LIKE - It is used to match values against a pattern rather than an exact value.
- IN - It is used to specify a list of values to keep (or, with NOT IN, to leave out).
- BETWEEN - It keeps values that fall within a range, including both endpoints.
- IS NULL - It keeps the rows where the specified column has a missing value; IS NOT NULL keeps the rows without missing values.
ORDER BY - It is used to sort the result by one or more columns in ascending or descending order.
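To make the filtering operators above concrete, here is a minimal sketch using sqldf on a tiny made-up table. The data frame kids and its values are hypothetical, invented only for illustration; the point is how LIKE, IN, BETWEEN, and IS NULL behave.

```r
library(sqldf)

# Hypothetical toy table; NA in R is treated as NULL by SQLite
kids <- data.frame(
  name = c("Benjamin", "Coleman", "Ada", "Norman", "Bea"),
  year = c(1999, 2005, 2010, 2014, 2020),
  n    = c(120, NA, 45, 80, NA)
)

# LIKE matches a pattern: names that contain 'man' anywhere
has_man <- sqldf("select name from kids where name like '%man%' order by name")

# IN keeps listed values; BETWEEN includes both endpoints (2010 and 2020)
picked <- sqldf("select name from kids
                 where name in ('Ada', 'Bea') and year between 2010 and 2020
                 order by name")

# IS NULL keeps rows where n is missing; IS NOT NULL keeps the rest
rows_missing <- sqldf("select name from kids where n is null order by name")
rows_present <- sqldf("select name from kids where n is not null
                       order by year desc limit 2")
```

Note that "Benjamin" is not matched by '%man%', since the letters m, a, n never appear next to each other in it.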

Data Manipulation

Aggregate Functions - These functions are helpful in generating quick insights from data sets.
- COUNT - It counts the number of observations.
- SUM - It calculates the sum of observations.
- MIN/MAX - It returns the minimum/maximum value of a column; together they give the range of a numerical distribution.
- AVG - It calculates the average (mean).
GROUP BY - For categorical variables, it calculates the above stats for each of their unique levels.
HAVING - It filters groups after aggregation, just as WHERE filters rows before aggregation.
DISTINCT - It returns only the unique values, dropping duplicates.
CASE - It is used to create rules using if/else conditions.
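JOINS - Used to merge individual tables. It can implement:
- INNER JOIN - Returns only the rows of A and B that match on the joining criteria.
- OUTER JOIN - A general term for LEFT, RIGHT, and FULL OUTER joins, which also keep unmatched rows.
- LEFT JOIN - Returns all rows from A along with the matching rows from B; columns from B are NULL where there is no match.
- RIGHT JOIN - Returns all rows from B along with the matching rows from A; columns from A are NULL where there is no match.
- FULL OUTER JOIN - Returns all rows from both tables, with NULLs where there is no match.
ON - Used to specify the condition (usually the key columns) on which tables are joined.
UNION - Similar to rbind() in R. Stacks the results of two queries that have the same number of columns and removes duplicate rows; UNION ALL keeps the duplicates.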

You can write complex join commands using comparison operators, WHERE, or ON to specify conditions.
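The difference between the join types is easiest to see on two tiny tables. The sketch below uses hypothetical data frames (crash_toy and roads_toy, made up for this illustration) and sticks to INNER JOIN, LEFT JOIN, UNION, and UNION ALL, since RIGHT and FULL OUTER joins may not be available in older SQLite versions.

```r
library(sqldf)

# Hypothetical toy tables sharing the key column 'road'
crash_toy <- data.frame(road = c("A1", "A1", "B2", "C3"),
                        crashes = c(10, 12, 7, 3))
roads_toy <- data.frame(road = c("A1", "B2", "D4"),
                        length_km = c(100, 50, 80))

# INNER JOIN: only rows whose road appears in both tables (C3 and D4 drop out)
inner_rows <- sqldf("select crash_toy.road, crash_toy.crashes, roads_toy.length_km
                     from crash_toy inner join roads_toy
                       on crash_toy.road = roads_toy.road
                     order by crash_toy.road, crash_toy.crashes")

# LEFT JOIN: every row of crash_toy; length_km is NA where roads_toy has no match
left_rows <- sqldf("select crash_toy.road, crash_toy.crashes, roads_toy.length_km
                    from crash_toy left join roads_toy
                      on crash_toy.road = roads_toy.road
                    order by crash_toy.road, crash_toy.crashes")

# UNION drops duplicate rows; UNION ALL keeps them
union_rows     <- sqldf("select road from crash_toy union select road from roads_toy")
union_all_rows <- sqldf("select road from crash_toy union all select road from roads_toy")
```

Here the inner join returns three rows and the left join four, with the C3 row carrying a missing length_km.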

Strings and Dates

NOW - Returns current time.
LEFT - Returns a specified number of characters from the left in a string.
RIGHT - Returns a specified number of characters from the right in a string.
LENGTH - Returns the length of the string.
TRIM - Removes whitespace (or other specified characters) from the beginning and end of the string.
SUBSTR - Extracts part of a string, given a start position and a length.
CONCAT - Combines strings.
UPPER - Converts a string to uppercase.
LOWER - Converts a string to lowercase.
EXTRACT - Extracts date components such as day, month, year, etc.
DATE_TRUNC - Truncates a date or timestamp to the specified unit (for example, to the start of the month).
COALESCE - Returns the first non-missing value among its arguments; often used to replace missing values with a default.
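Function names differ between databases. Some of the names above (for example NOW, LEFT, RIGHT, EXTRACT, and DATE_TRUNC) come from databases such as PostgreSQL or MySQL and are not built into SQLite, which sqldf uses by default. As a rough sketch, the example below shows SQLite-friendly equivalents on a hypothetical table called people (invented for this illustration): substr() for taking characters from the left, || for combining strings, and strftime() for pulling out a date component.

```r
library(sqldf)

# Hypothetical toy table with untidy names and one missing nickname
people <- data.frame(
  name     = c("  Benjamin ", "Ada"),
  nickname = c("Ben", NA),
  born     = c("1999-03-15", "2010-11-02"),  # dates stored as ISO text
  stringsAsFactors = FALSE
)

out <- sqldf("select
                trim(name)                        as clean_name,
                length(trim(name))                as name_len,
                upper(substr(trim(name), 1, 3))   as first_3,
                coalesce(nickname, 'none')        as nick,
                trim(name) || ' (' || born || ')' as label,
                strftime('%Y', born)              as birth_year
              from people
              order by born")
```

The first row comes back as "Benjamin", 8, "BEN", "Ben", "Benjamin (1999-03-15)", and "1999". Note that strftime() returns the year as text, not as a number.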

These commands are not case sensitive, but consistency is important. SQL commands follow this standard sequence:

SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY
LIMIT
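Here is a small sketch of one query that uses every clause in the sequence above. The data frame toy and its numbers are hypothetical, made up so the result is easy to check by hand.

```r
library(sqldf)

# Hypothetical toy table of yearly counts by sex
toy <- data.frame(
  year = c(2000, 2000, 2000, 2001, 2001, 2002),
  sex  = c("F", "F", "M", "F", "M", "F"),
  n    = c(50, 30, 20, 40, 10, 90)
)

# The clauses are written in the standard sequence:
#   select   -> columns to return
#   from     -> source table
#   where    -> filter rows (before grouping)
#   group by -> form groups
#   having   -> filter groups (after aggregation)
#   order by -> sort the result
#   limit    -> cap the number of rows
res <- sqldf("select year, sum(n) as total
              from toy
              where sex = 'F'
              group by year
              having sum(n) > 50
              order by total desc
              limit 5")
```

Only the F rows are summed, giving 80 for 2000, 40 for 2001, and 90 for 2002. HAVING then drops 2001, so the result is 2002 followed by 2000. Swapping the order of the clauses, for instance putting WHERE after GROUP BY, produces a syntax error.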

Practising SQL in R

For writing SQL queries, we'll use the sqldf package. It lets you run SQL on R data frames, using SQLite by default, and can be faster than base R for some manipulations. It also supports the H2 Java database, PostgreSQL, and MySQL.

You can easily connect to database servers using this package and query data. For more details, check the GitHub repo by its author.

When using SQL in R, think of R as the database machine. Load datasets using read.csv or read.csv.sql and start querying. Ready? Let’s begin! Code every line as you scroll. Practice builds confidence.

We'll use the babynames dataset. Install and load it, then copy it into a data frame named mydata, which the queries below use:

> install.packages("babynames")
> library(babynames)
> mydata <- babynames
> str(mydata)

This dataset contains 1.8 million observations and 5 variables. The prop variable is the proportion of a name given in a year. Now, load the sqldf package:

> install.packages("sqldf")
> library(sqldf)

Let’s check the number of rows in this data.

> sqldf("select count(*) from mydata")
#1825433

Ignore any warnings here. Next, let's look at the first 10 rows of the data:

> sqldf("select * from mydata limit 10")

* selects all columns. To select specific variables:

> sqldf("select year, sex, name from mydata limit 10")

To rename a column in the output, use AS:

> sqldf("select year, sex as 'Gender' from mydata limit 10")

Filtering data with WHERE and logical conditions:

> sqldf("select year, name, sex as 'Gender' from mydata where sex = 'F' limit 20")
> sqldf("select * from mydata where prop > 0.05 limit 20")
> sqldf("select * from mydata where sex != 'F'")
> sqldf("select year, name, 4 * prop as 'final_prop' from mydata where prop <= 0.40 limit 10")

Ordering data:

> sqldf("select * from mydata order by year desc limit 20")
> sqldf("select * from mydata order by year desc, n desc limit 20")
> sqldf("select * from mydata order by name limit 20")

Filtering with string patterns, lists, and ranges:

> sqldf("select * from mydata where name like 'Ben%'")
> sqldf("select * from mydata where name like '%man' limit 30")
> sqldf("select * from mydata where name like '%man%'")
> sqldf("select * from mydata where name in ('Coleman','Benjamin','Bennie')")
> sqldf("select * from mydata where year between 2000 and 2014")

Multiple filters with logical operators:

> sqldf("select * from mydata where year >= 1980 and prop < 0.5")
> sqldf("select * from mydata where year >= 1980 and prop < 0.5 order by prop desc")
> sqldf("select * from mydata where name not like '%man%' or year > 2000")
> sqldf("select * from mydata where prop > 0.07 and year not between 2000 and 2014")
> sqldf("select * from mydata where n > 10000 order by name desc")

Basic aggregation:

> sqldf("select sum(n) as 'Total_Count' from mydata")
> sqldf("select min(n), max(n) from mydata")
> sqldf("select year, avg(n) as 'Average' from mydata group by year order by Average desc")
> sqldf("select year, count(*) as count from mydata group by year limit 100")
> sqldf("select year, count(*) as 'my_count' from mydata where n > 10000 group by year order by my_count desc limit 100")

Using HAVING instead of WHERE to filter on aggregated values:

> sqldf("select year, name, sum(n) as 'my_sum' from mydata group by year, name having sum(n) > 10000 order by my_sum desc limit 100")
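WHERE and HAVING are easy to mix up, so here is a minimal side-by-side sketch. The data frame births is hypothetical and tiny on purpose, so you can verify the sums by hand.

```r
library(sqldf)

# Hypothetical toy counts: one row per (year, name)
births <- data.frame(
  year = c(2000, 2000, 2001, 2001, 2001),
  name = c("Ada", "Ben", "Ada", "Ben", "Cy"),
  n    = c(8, 3, 6, 9, 2)
)

# WHERE filters individual rows before grouping:
# rows with n <= 5 never reach sum()
by_where <- sqldf("select year, sum(n) as total from births
                   where n > 5
                   group by year
                   order by year")

# HAVING filters whole groups after aggregation:
# every row is summed first, then small groups are dropped
by_having <- sqldf("select year, sum(n) as total from births
                    group by year
                    having sum(n) > 12
                    order by year")
```

With WHERE, the totals are 8 for 2000 and 15 for 2001. With HAVING, the full totals are 11 and 17, and only 2001 survives the filter.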

Counting distinct names:

> sqldf("select count(distinct name) as 'count_names' from mydata")

Creating new columns using CASE (if/else logic):

> sqldf("select year, n, case when year = 2014 then 'Young' else 'Old' end as 'young_or_old' from mydata limit 10")
> sqldf("select *, case when name not like '%man%' then 'Not_a_man' when name like 'Ban%' then 'Born_with_Ban' else 'Un_Ban_Man' end as 'Name_Fun' from mydata")
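To see how CASE picks a label, here is a small sketch on a hypothetical three-row table (toy_names is made up for illustration). Conditions are checked from top to bottom and the first one that matches wins. Pattern matching inside CASE needs LIKE, because = and != compare exact strings.

```r
library(sqldf)

# Hypothetical toy table
toy_names <- data.frame(
  name = c("Coleman", "Bannon", "Ada"),
  year = c(1950, 2014, 2014)
)

labelled <- sqldf("select name, year,
                     case when year = 2014 then 'Young' else 'Old' end as young_or_old,
                     case when name like 'Ban%'  then 'Starts_with_Ban'
                          when name like '%man%' then 'Contains_man'
                          else 'Other' end as name_group
                   from toy_names
                   order by name")
```

Sorted by name, the rows are Ada (Young, Other), Bannon (Young, Starts_with_Ban), and Coleman (Old, Contains_man).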

Joining data sets using a key:

> crash <- read.csv.sql("crashes.csv", sql = "select * from file")
> roads <- read.csv.sql("roads.csv", sql = "select * from file")
> sqldf("select * from crash join roads on crash.Road = roads.Road")
> sqldf("select crash.Year, crash.Volume, roads.* from crash left join roads on crash.Road = roads.Road")

Joining with aggregation and multiple keys:

> sqldf("select crash.Year, crash.Volume, roads.* from crash left join roads on crash.Road = roads.Road order by 1")
> sqldf("select crash.Year, crash.Volume, roads.* from crash left join roads on crash.Road = roads.Road where roads.Road != 'US-36' order by 1")
> sqldf("select Road, avg(roads.Length) as 'Avg_Length', avg(N_Crashes) as 'Avg_Crash' from roads join crash using (Road) group by Road")
> roads$Year <- crash$Year[1:5]
> sqldf("select crash.Year, crash.Volume, roads.* from crash left join roads on crash.Road = roads.Road and crash.Year = roads.Year order by 1")
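If you don't have crashes.csv and roads.csv at hand, the multi-key join can still be tried on stand-in data. The sketch below uses two hypothetical data frames (crash_toy and roads_toy, with made-up values) that only mimic the column names used above.

```r
library(sqldf)

# Hypothetical stand-ins for the crash and roads tables
crash_toy <- data.frame(Road = c("US-36", "US-36", "US-40"),
                        Year = c(2011, 2012, 2011),
                        N_Crashes = c(30, 25, 12))
roads_toy <- data.frame(Road = c("US-36", "US-40"),
                        Year = c(2011, 2011),
                        Length = c(36, 94))

# Rows match only when both Road and Year agree;
# unmatched crash rows keep NA for Length
two_keys <- sqldf("select crash_toy.Road, crash_toy.Year,
                          crash_toy.N_Crashes, roads_toy.Length
                   from crash_toy left join roads_toy
                     on crash_toy.Road = roads_toy.Road
                    and crash_toy.Year = roads_toy.Year
                   order by crash_toy.Road, crash_toy.Year")
```

The US-36 row for 2012 has no partner in roads_toy for that year, so its Length comes back as NA, while the two 2011 rows are matched.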

String operations in sqldf with RSQLite extension:

> library(RSQLite)
> help("initExtension")

> sqldf("select name, leftstr(name, 3) as 'First_3' from mydata order by First_3 desc limit 100")
> sqldf("select name, reverse(name) as 'Rev_Name' from mydata limit 100")
> sqldf("select name, rightstr(name, 3) as 'Back_3' from mydata order by Back_3 desc limit 100")

Summary

The aim of this article was to help you get started writing queries in SQL using a blend of practical and theoretical explanations. Beyond these queries, SQL also allows you to write subqueries aka nested queries to execute multiple commands in one go. We shall learn about those in future tutorials.

As I said above, learning SQL will not only give you a fatter paycheck but also allow you to seek job profiles other than that of a data scientist. As I always say, SQL is easy to learn but difficult to master. Do practice enough.

In this article, we learned the basics of SQL. We learned about data selection, aggregation, and string manipulation commands in SQL. In addition, we looked at industry trends around SQL to help you decide whether it's the language you'll promise to learn in your New Year's resolution. So, will you?

If you get stuck with any query written above, do drop your questions, suggestions, and feedback in the comments below!
