Monday, July 8, 2013

SQL - How to delete DUPLICATE ROWS in a table?

Answer collected from the below site.Please refer to the site for details.
blog.sqlauthority.com-sql-server-delete-duplicate-records-rows

Table in example has ID as Identity Column and Columns which have duplicate data are DuplicateColumn1, DuplicateColumn2 and DuplicateColumn3.

DELETE
FROM MyTable
WHERE ID NOT IN
(
SELECT MAX(ID)
FROM MyTable
GROUP BY DuplicateColumn1, DuplicateColumn2, DuplicateColumn3)

------
blog.sqlauthority.com-delete-duplicate-rows-Using CTE
/* Delete Duplicate records */
WITH CTE (COl1,Col2, DuplicateCount)
AS
(
SELECT COl1,Col2,
ROW_NUMBER() OVER(PARTITION BY COl1,Col2 ORDER BY Col1) AS DuplicateCount
FROM DuplicateRcordTable
)
DELETE
FROM CTE
WHERE DuplicateCount > 1
GO

---


...

Another possible way of doing this is

;WITH ct.
     AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3 
                                       ORDER BY ( SELECT 0)) RN
         FROM   #MyTable)
DELETE FROM cte
WHERE  RN > 1

I am using ORDER BY (SELECT 0) above as it is arbitrary which row to preserve in the event of a tie.
To preserve the latest one in RowID order for example you could use ORDER BY RowID DESC


Refer Source:

SQL Query - Delete duplicates if more than 3 dups?


with cte as (
  select row_number() over (partition by dupcol1, dupcol2 order by ID) as rn
     from table)
delete from cte
   where rn > 2; -- or >3 etc
The query is manufacturing a 'row number' for each record, grouped by the (dupcol1, dupcol2) and ordered by ID. In effect this row number counts 'duplicates' that have the same dupcol1 and dupcol2 and assigns then the number 1, 2, 3.. N, order by ID. If you want to keep just 2 'duplicates', then you need to delete those that were assigned the numbers 3,4,.. N and that is the part taken care of by the DELLETE.. WHERE rn > 2;
Using this method you can change the ORDER BY to suit your preferred order (eg. ORDER BY ID DESC), so that the LATEST has rn=1, then the next to latest is rn=2 and so on. The rest stays the same, the DELETE will remove only the oldest ones as they have the highest row numbers.
Unlike this closely related question, as the condition becomes more complex, using CTEs and row_number() becomes simpler. Performance may be problematic still if no proper access index exists.

Refer Source: ROW_NUMBER (Transact-SQL)

Returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition.


Refer Source:

How to use ROW_NUMBER() to enumerate and partition records in SQL Server


Here is our final SQL statement, which achieves the business logic we wanted to implement.
SELECT
       [PersonID]
     [FamilyID]
      ,[FirstName]
      ,[LastName]
      ,[DateOfBirth]
      ,ROW_NUMBER() over(PARTITION BY FamilyID,
                         CONVERT(NVARCHAR(25), DateOfBirth, 111)
                         ORDER BY DateOfBirth ASC) TwinCode

  FROM [People]
ORDER BY PersonID
IIn the ROW_NUMBER function above, I am doing several things. I’m grouping on FamilyID, and also grouping on a converted DateOfBirth. I convert the DateOfBirth to an nvarchar using the 111 conversion code, because that gets results like ‘2009/10/11′ and ‘2009/10/12′ which can easily be grouped by to achieve distinct dates.


------------------------------------------------------------------------------------------------------------
Reference:

What is CTE ( Common table expression) in SQL Server?


------------------------------------------------------------------------------------------------------------



No comments:

Post a Comment