Pooja Chavan
1 min readOct 5, 2022

--

Data Cleaning with SQL- Identifying Duplicate Rows

Generally speaking, a table should have unique constraints to prevent duplication of values into specific column/s. However, you may find yourself working with a database where duplicate rows have been created through human error, a bug in the application, or uncleaned data from various sources.

So, what makes a row duplicate? Most of the time, it’s the same value in one/more column/s.

Following are commonly used ways to identify duplicate values:
✔ Using GROUP BY with COUNT()
◼ Widely used method
◼ Group by the column/s that may contain duplicate/s and display only those groups having count more than 1

✔ Using window function ROW_NUMBER()
◼ Simplest of all window functions
◼ ROW_NUMBER numbers all ties sequentially (for example 1, 2, 3, 4, 5).

✔ Using window function RANK()
◼ RANK provides the same numeric value for ties

✔ Using CTE in conjunction with window function ROW_NUMBER()
◼ Using a CTE is useful when we need to filter our query using WHERE clause.

--

--

Pooja Chavan

Data Cleaning/Wrangling/Visualization | Excel | SQL |Tableau