Group by is an unavoidable clause for any data analyst who start analysis in SQL.
Let me explain the concept with an example
We have a cupcake table that contains some records repeating like Vanilla, Chocolate, and Strawberry
What grouping means here is arranging the cupcakes into identical
groups.
All the chocolate together, all the strawberry together
similar the rest of the Flavors.
In short SQL GROUP BY statement is used to arrange identical
data into groups.
The GROUP BY clause combines all those records(row) that
have identical values in a particular column and GROUPING can be done by column
names.
Let's see how the group by works on the cupcake table
Query
select flavor from cupcakes group by flavor;
Result
Here what happened is the group by collapsed identical
records into a single record
Now what are the situations where we can use group by sql
clause?
The first one is, if you examine this sql query we ran earlier
, with a simple group by, we saw that all
the duplicated data is also removed from this given table.Hence group by can also be used to remove
duplicates from a table. This is a common interview question as well. If the
interviewer gives a simple column and asks a question to remove the duplicates
you may use the group by clause to get the result.
The next important function of group by is , it is used with
aggregate functions like sum, count, min, max etc in which case the aggregate
produces a value for each group.
Let's see a query where group by can be used with aggregate
function count.
Query
select flavor, count(flavor)from cupcakes
group by flavor;
Result
In the result, we get the flavor and the aggregated
count.
Now we are going to discuss placing conditions on groups,
here we use group by along with having clause. Lets say in our cupcake table we
want to select only the records whose count is > 1.
Query for Group By with Having
select flavor, count(flavor) from
cupcakes group by flavor having count(flavor)>1;
The query will select flavor and count of flavor according
to groups, and then it checks for records whose count is >1. Remember Always
group by is used before placing the condition with having clause.
Result
One thing that we should keep in mind is that while
grouping, we should include only those values in the SELECT list.
If we check the above example, flavor comes with the select
statement and we group according to flavor.
I hope you understand how group by clause is used in SQL
queries.
We have a cupcake data table with two columns product_id and
flavor. Here product_id is unique, which means it will not duplicate. Some of
the flavors are duplicating. Our objective is to delete duplicated Flavors.
First what we are going to do is give row numbers to each
flavor in a sequential order. Wherever vanilla comes it gives number row number
1 for the first occurrence and for the
second occurrence row number 2 . This is what is meant by sequential order.
We are using a
function called window function -row number () , to give a unique integer value
in a sequential order.
The syntax of row_number is
ROW_NUMBER() OVER(
[PARTITION BY
column_1, column_2,…]
[ORDER BY
column_3,column_4,…]
)
The set of rows on which the ROW_NUMBER() function operates
is called a window. Here the window is the flavor on which the row number
function operates.
The PARTITION BY clause divides the window into smaller sets
or partitions. If we specify the PARTITION BY clause, the row number for each
partition starts with one and increments by one.
Then we can sort the result if required by order by clause.
The query for assigning row number will be like….
select product_id,flavor, row_number() over (partition by
flavor) as rn from cupcakes
order by flavor;
Result
Wherever Chocolate comes it gives number row number 1 for
the first record, row number 2 for the second record and row number 3 for 3rd
occurrence of the record, and wherever vanilla comes it gives number row number
1 for the first record and row number 2 for the second record. We have noticed the
row number has been given sequentially for duplicated records. Other flavors
lemon and strawberry have only one record, so the row number function gives row
number 1 for each of them.
This was our original table. Then we assigned the row number to the table.
To delete only duplicate records we have to select only the
duplicated rows,all we have to do is selectthe row number column where row number >1
We use a subquery to achieve this. A sub-query is a query
inside a query. Sub-query is also called an inner query. The inner query is
kept inside the main query.
Subquery
select product_id,flavor,rn
from (select product_id,flavor,
row_number() over (partition by flavor) as rn from cupcakes
group by product_id,flavor) as x where x.rn>1;
Result
We can see that
the result has only rows with rn>1 which are the duplicates.
We have identified
the duplicates, and now all we do is delete the duplicated columns with by a
delete statement in the query.
Query to delete duplicates
Select * from cupcakes; run the query in the result we saw
all the duplicates are removed.
We will see the use of case statement in sql with the help of an example
If we have cupcake sales data, and if the sales amount > 10000, we want to
set “Big Order” and if the sales order <10000, I want to set “Small Order”.
Example
So the logic will be if sales order >10000, then it will
give “ Big Order” else “Small Order”
Basically, a case statement in SQL allows us to create
conditional expressions similar to an IF-THEN-ELSE statement.
Another feature of the SQL case statement is it adds a
column with values based on the condition.
The syntax for case statement is
SELECT column1, column2,... ,
CASE
WHEN condition THEN
result
END AS alias_name
FROM table;
The syntax of CASE always starts with the CASE keyword and
ends with the END keyword followed by a column name alias.
Let's see this with an example
Query for Table cupcake
CREATE TABLE cupcake (
product_id INT
PRIMARY KEY,
flavor
VARCHAR(50),
sale_amount INT
);
INSERT INTO cupcake (product_id, flavor, sale_amount)
VALUES
(1, 'Vanilla',
10000),
(2, 'Chocolate',
20000),
(3, 'Strawberry',
2750),
(4, 'Red Velvet',
3500),
(5, 'Lemon',
2250),
(6, 'Blueberry',
3250),
(7, 'Caramel',
4000),
(8, 'Pumpkin
Spice', 3750),
(9, 'Mocha',
4250),
(10, 'Raspberry',
2900);
select * from cupcake
CUPCAKE SALES DATA
Query with case statement
select *,
case
when sale_amount >=10000 then 'Big Order'
when sale_amount <10000 then 'Small Order'
end as order_volume
from cupcake;
Result
Here is the result. We can see as per our condition wherever
the sales amount > 10000, we are getting “Big Order” and if the sales order <10000,
we are getting “small order”., in a new column.
Important points to take care of while
using a case statement are
CASE statement always starts with the CASE
keyword
Ends with END keyword
The END keyword is followed by a column
name alias.
A case statement adds a
column based on the given condition
Finding duplicates in SQL involves identifying rows in a
table where certain columns have identical values. Here is a very common method
to achieve this:
Using GROUP BY and HAVING
Let us take an example.
We have cupcake data. In this table, the flavor of the
cupcake is duplicated. Because our data is small, we can say by looking at the
table the duplicate records are 2-Vanilla and 2-Chocolate.
Query for cupcakes table
CREATE TABLE cupcakes (
flavor VARCHAR(50)
);
INSERT INTO cupcakes (flavor) VALUES
('Vanilla'),
('Chocolate'),
('Strawberry'),
('Vanilla'),
('Chocolate'),
('Red Velvet');
select * from cupcakes;
Example Table
Remember we intend to identify only the duplicated flavor of
cupcakes, we are not deleting the records.
First, we need to check the count of each flavor.
Query for finding the count of each flavor
select flavor, count(flavor) from cupcakes
group by flavor;
Result
If the count is one we know the records are not duplicated,
but if the count is more than one that means it has duplicated records.
Now how do we extract only the duplicated records?
For this we give another condition in the query
with having clause, the condition is that it returns the records with count
>1. Along
with our previous query, we add this condition using the having function. When
adding having condition where count>1, it selects only those records whose
count is greater than 1.
Ever wondered why your database feels a bit messy? Meet the
troublemakers called duplicates – they slow
things down, mess up accuracy, and
hog space. In this article, I will share how you can delete duplicate
rows in
SQL.
Method 1- By using distinct function
The DISTNCT keyword removes duplicate rows from the results
of a SELECT statement
or in technical
language we can say a distinct keyword is used to remove redundant data
Query to create the example table
CREATE TABLE cupcakes (flavor VARCHAR(50));
INSERT INTO cupcakes (flavor) VALUES
('Vanilla'),
('Chocolate'),
('Strawberry'),
('Vanilla'),
('Chocolate'),
('Red Velvet');
Example Table
Query using "distinct" function
select distinct flavor from cupcakes;
Result Table
Method 2 -By using group by clause
Group by clause allows us to collapse a field into its
distinct values.
Example Table
Here the vanilla is repeated 2 times,
similarly chocolate is also repeated two times. What group by clause does is it
will collapse the two vanilla into one group and the two chocolate into one
group. Other records remains the same.