Saturday, February 24, 2024

How to delete duplicate records ?- SQL

 We have a cupcake data table with two columns product_id and flavor. Here product_id is unique, which means it will not duplicate. Some of the flavors are duplicating. Our objective is to delete duplicated Flavors.










First what we are going to do is give row numbers to each flavor in a sequential order. Wherever vanilla comes it gives number row number 1 for the first occurrence and for the second occurrence row number 2 . This is what is meant by sequential order.



 







We are using a function called window function -row number () , to give a unique integer value in a sequential order.

The syntax of row_number is

ROW_NUMBER() OVER(

    [PARTITION BY column_1, column_2,…]

    [ORDER BY column_3,column_4,…]

)

The set of rows on which the ROW_NUMBER() function operates is called a window. Here the window is the flavor on which the row number function operates.











The PARTITION BY clause divides the window into smaller sets or partitions. If we specify the PARTITION BY clause, the row number for each partition starts with one and increments by one.

Then we can sort the result if required by order by clause.

The query for assigning row number will be like….

select product_id,flavor, row_number() over (partition by flavor) as rn from cupcakes

order by flavor;

Result


Wherever Chocolate comes it gives number row number 1 for the first record, row number 2 for the second record and row number 3 for 3rd occurrence of the record, and wherever vanilla comes it gives number row number 1 for the first record and row number 2 for the second record. We have noticed the row number has been given sequentially for duplicated records. Other flavors lemon and strawberry have only one record, so the row number function gives row number 1 for each of them.

This was our original table. Then we  assigned the row number to the table.





 





To delete only duplicate records we have to select only the duplicated rows,  all we have to do is select  the row number column where row number >1

We use a subquery to achieve this. A sub-query is a query inside a query. Sub-query is also called an inner query. The inner query is kept inside the main query.

Subquery

 select product_id,flavor,rn

from (select product_id,flavor, 

  row_number() over (partition by flavor) as rn from cupcakes

group by product_id,flavor) as x where x.rn>1;


Result







We can see that the result has only rows with rn>1 which are the duplicates.

We have identified the duplicates, and now all we do is delete the duplicated columns with by a delete statement in the query.

Query to delete duplicates








Select * from cupcakes; run the query in the result we saw all the duplicates are removed.

Result













No comments:

Post a Comment