Showing posts with label duplicated. Show all posts
Showing posts with label duplicated. Show all posts

Sunday, March 25, 2012

Deleting million rows after checking duplication

I have a table(9 fields) which contains around 10 million records. Within these 10 million records, don't know how many are duplicated rows. I wrote a cursor which checks duplication on all of the 9 fields within each row, and then returns back with the count of rows that are duplicated. I subtact 1 row and delete all the other rows. Problem is that this thing takes a lot of time to execute. At the current rate(approx. 90 records/hour) this cursor is going to take months to clean the table.

There is no PK or no indexing what so ever on the table, and I HAVE to check each and every field for duplication(all except 1 fields are nvarchar).

Please help me. I need to sort this out

What about doing a grouping on the table and pulling hte data out to another table.

Something like:

Select cola,ColB,Colc (all columns here)
INTO SomeNewTable
FROM SomeTable
Group by cola,ColB,Colc (all columns here)

HTH, Jens K. Suessmeyer.

http://www.sqlserver2005.de

Thursday, March 22, 2012

Deleting Duplicated Rows

Hi,
I have a table named "std_attn", where, by some bad coding, lots of duplicated rows have been created. And the table don't have any PK. So Now tell me the way to remove the duplicaies..............
thnxDROP DATABASE database_name

:)

just kidding

do you want to remove the duplicates, or remove all but one duplicate?

and if all but one, how do you pick which one?

is there some column that is incrementing, like an IDENTITY or a dateadded column or something? if so, you could keep the last one of each group...|||Since there is no PK, insert the duplicate records into a temp table with an identity column. You can then delete the records from the source table and insert a SELECT DISTINCT from the temp table back into your source table. After this is done, put a constraint on the table and fix the application. :)|||derrick, how does this isolate which one of the duplicates to keep?

personally, i like the approach offered in this article:
Advanced SQL Techniques - Part 1: Analyzing Duplicate Records (http://www.15seconds.com/issue/011009.htm)|||? They're duplicates. Who cares which one you keep. This gets rid of all of them, then inserts one back in.|||Hi,
No Derrick, its not that I don't want any of the duplicate rows. I need one. I will tell you, what actually i was trying..........

1. I made a temporary table which was the copy of the original one
2. Then copied all the rows from original to the temporary one.
3. Then deleted all rows from the original one
4. Defind the PK in the original table
5. Added a column in temporary table with allow null property. The data type was boolean. It was made for the reason to update the it with the value "yes" if the rows was having any duplicate row and "no" in the row is unique.
6. Then insert into the original from temporary where the column value is equal to "no"

But unfortunetly I failed at the 5th step only. I could not get the correct SQL stetement...

So it'll be a great help, if any one can get me the correct SQL statement

thnx|||did you try the method in the article i gave you the link for?|||Ok guys, this isn't that hard. The article you referred to is in reference to tables that have an identity column, which means there's something in the row that makes it intrinsically unique even though it's just an ID field. This works great, except for the fact his table does not have an identity column. This means the idea in the article does not work.

That's why he needs to do what I told him to do, which does work. There's really not a "better" way to do it if there's nothing to distinguish between the rows in the table. You were doing fine until you tried to get fancy.

1. Insert the duplicate records into a temp table with an identity column.
2. Delete the duplicate records only from the source table.
3. Insert a SELECT DISTINCT from the temp table back into your source table.
>>This part is really important and you kind of missed it last time around.
4. Put a constraint on the table and fix the application.|||? They're duplicates. Who cares which one you keep. This gets rid of all of them, then inserts one back in.

Wanna bet...

U$1000.00 That they're not......betcha something is different|||Then he needs to ask the right question. I'm answering the one he asked, so I really don't care if something is different. :) How ya' doing Brett. I just keep seeing you on these forums. lol|||But just like the day to day job you're in...is the question asked EVER what the real question is, and what the real answer is suppose to be?

Ever?

OK, sometimes...but rarley....

And yeah...an addiction is an addiction...sorry...I just can't help myself...|||If he did ask the question appropriately though, the answer is correct. If not, then the article suggested should work. We both know the real issue is the fact he got the rows in the first place. The reality is that the application needs fixed and constraints need to be created so it doesn't happen in the first place. :)|||Hi,
where, by some bad coding, lots of duplicated rows have been created. And the table don't have any PK.

Yeah Derrick...I'm not disagreeing with you ...the answers solve the the question asked...

I just don't believe the question...

Some bad coding...puuuuuuleeeeeeeeeeze...

Some bad db design...

A good db design FORCES the developers to do the right thing...

@.@.ERROR <> 0

and to be TOTALLY politically incorrect...

EDIT: ...nope...uncalled for...sorry

Damn...it's Monday again...

Sorry for the foul mood...|||You were doing fine until you tried to get fancy. me? fancy? that'll be the day

1. Insert the duplicate records into a temp table with an identity column.why does the temp table need an identity column?|||Literacy is a wonderful thing. :) You don't. I'm smoking crack and up too late. You do need the rest though.

It's just that Monday thing. (We'll all use that as an excuse today.)|||(We'll all use that as an excuse today.)

i even use it on other days as well

;)|||There is nothing to laugh brett.........

Database can be wroungly designed....................After all its human act n instinct. Else there won't b any difference between u n god. And I assume that, ur nt GOD. Just a human........................who is bound to do the mistakes................n that is wht it hapnd with th table i was talking about.
But it will more more disgusting, if that wrong design is not corrected in time. And that is why we are here to discuss. Don't make fuss of the issue. Even you may had mistakes in your early days of career. And lemme tell you one thing...... the person who has designed the databse is from genetics background. So these are the expected errors. And now I have been assigned to remove the duplicated rows, and make the table perfect, in the sense that, no more dulicated rows should be created.|||Brett ... Should we call that a second one :p

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=35538|||And now I have been assigned to remove the duplicated rows, and make the table perfect, in the sense that, no more dulicated rows should be created.

What's with getting all bent out of shape about?

My POINT...is that you do not have duplicates...Where's the table DDL?

What do you call a duplicate?

Where's the sample data?

Where's ANYTHING you can supply the good people of dbforums

And why are you so offended...can you point out what I said that caused so much angst?

So I'll know to be more careful next time?

And hey...didn't you notice my label?|||After all its human act n instinct. .

Well it was a human that started all of this business...more dba's should understand his concepts

http://en.wikipedia.org/wiki/Edgar_F._Codd

Just a human who is bound to do the mistakes

Sure enough...kinda like answering questions

the person who has designed the databse is from genetics background. So these are the expected errors.

Watch your genome...if he asks you to help in a research project...look out...

And now I have been assigned to remove the duplicated rows, and make the table perfect, in the sense that, no more dulicated rows should be created.

So now back to the real issue.

Please

Give us the DDL of the table

Like

CREATE TABLE myTable99(Col1 int, Col2 char(10), ect...

Some sample "duplicate" data

INSERT INTO myTable99(Col1, Col2, ect..
SELECT data,data,data UNION ALL
SELECT data,data,data UNION ALL
SELECT data,data,data UNION ALL
SELECT data,data,data UNION ALL
...
SELECT data,data,data

And what the expected results should be...

OK?|||this is like pushing a rope

never mind the ddl, brett

ask them merely to define what a duplicate is

remind them that autonumber/identity columns used as the pk merely masks the real issue

the real issue is, what's a duplicate?

let them do the work, not you :cool:|||OK...

Define a Duplicate

(shameful, isn't it?)

:rolleyes:|||Haven't we already defined what a duplicate is. lol I'm sooooo confused now. All these thoughts duplicating in my heads. I mean it's so confusing. All those thoughts duplicating in my headers...my heads...my my my

I'm confused.

Really.

Friday, February 17, 2012

Delete record based on existence of another record in same table?

Hi All,

I have a table in SQL Server 2000 that contains several million member
ids. Some of these member ids are duplicated in the table, and each
record is tagged with a 1 or a 2 in [recsrc] to indicate where they
came from.

I want to remove all member ids records from the table that have a
recsrc of 1 where the same member id also exists in the table with a
recsrc of 2.

So, if the member id has a recsrc of 1, and no other record exists in
the table with the same member id and a recsrc of 2, I want it left
untouched.

So, in a theortetical dataset of member id and recsrc:

0001, 1
0002, 2
0001, 2
0003, 1
0004, 2

I am looking to only delete the first record, because it has a recsrc
of 1 and there is another record in the table with the same member id
and a recsrc of 2.

I'd very much appreciate it if someone could help me achieve this!

Much warmth,

Murray"M Wells" <planetquirky@.planetthoughtful.org> wrote in message
news:io7r60hup1ajlnu3je9onmb7ki2dtkmni3@.4ax.com...
> Hi All,
> I have a table in SQL Server 2000 that contains several million member
> ids. Some of these member ids are duplicated in the table, and each
> record is tagged with a 1 or a 2 in [recsrc] to indicate where they
> came from.
> I want to remove all member ids records from the table that have a
> recsrc of 1 where the same member id also exists in the table with a
> recsrc of 2.
> So, if the member id has a recsrc of 1, and no other record exists in
> the table with the same member id and a recsrc of 2, I want it left
> untouched.
> So, in a theortetical dataset of member id and recsrc:
> 0001, 1
> 0002, 2
> 0001, 2
> 0003, 1
> 0004, 2
> I am looking to only delete the first record, because it has a recsrc
> of 1 and there is another record in the table with the same member id
> and a recsrc of 2.
> I'd very much appreciate it if someone could help me achieve this!
> Much warmth,
> Murray

I think this is what you're looking for:

delete from dbo.MyTable
where recsrc = 1
and exists (
select * from dbo.MyTable m2
where MyTable.MemberID = m2.MemberID
and m2.recsrc = 2)

Simon|||On Fri, 02 Apr 2004 17:19:29 GMT, M Wells
<planetquirky@.planetthoughtful.org> wrote:

And just to show that I am trying, I attempted:

delete from #mw_dupetest as md where recsrc = 1 and exists (select mid
from #mw_dupetest where mid = md.mid and recsrc = 2)

This is obviously wrong, since I can't seem to assign a table alias in
a delete statement and I can't think of any other way of referring to
the mid column in the exists statement.

So, I'm hoping somone can help me understand how to do this the right
way.

Much warmth,

Murray|||On Fri, 2 Apr 2004 19:28:16 +0200, "Simon Hayes" <sql@.hayes.ch> wrote:

>> Murray
>I think this is what you're looking for:
>delete from dbo.MyTable
>where recsrc = 1
>and exists (
>select * from dbo.MyTable m2
>where MyTable.MemberID = m2.MemberID
>and m2.recsrc = 2)

Hi Simon,

Thank you for this! Seems like I was somewhat on the right track, I
just fudged on attempting to alias the table in the delete statement.

Thanks again!

Much warmth,

Murray

Tuesday, February 14, 2012

Delete one of the duplicated row

I have many data in a table in which some rows are duplicated. How can I, for all duplicated rows, delete the extra rows and leave only one? You may assume checking one column is enough to tell if a row is duplicated.

Thanks

Step one: create a second table with identical structure: yourSecondTable;

Step two: CREATE UNIQUE INDEX removedups ON yourSecondTable (col1) WITH IGNORE_DUP_KEY

Step three:

INSERT yourSecondTable

SELECT * FROM yourFirstTable

--Duplicate key was ignored.

|||

DECLARE @.FirstName varchar(20),
@.LastName varchar(20),
@.Age int --declare all fields in table

DECLARE c1 CURSOR FOR

--Find Dupes
SELECT FirstName, LastName, Age
FROM MyTable
GROUP BY FirstName, LastName, Age
HAVING COUNT(FirstName) > 1

OPEN c1
FETCH NEXT FROM c1
INTO @.FirstName, @.LastName, @.Age
WHILE @.@.FETCH_STATUS = 0

BEGIN

--Delete all dupes...the cursor remembers the current record
DELETE FROM MyTable
WHERE FirstName IN (SELECT FirstName FROM MyTable GROUP BY FirstName HAVING COUNT(FirstName) > 1)

--insert the current record back into the table
INSERT INTO MyTable(FirstName, LastName, Age) VALUES(@.FirstName, @.LastName, @.Age)

FETCH NEXT FROM c1
INTO @.FirstName, @.LastName, @.Age
END

CLOSE c1
DEALLOCATE c1

Adamus

|||

Please look at my reply in the thread below for some solutions. You can do this with a single DELETE statement and batch it using SET ROWCOUNT or TOP clause if you have large number of rows to delete. And if you don't have any primary key or unique key on the table then you can use a cursor based approach.

http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=666011&SiteID=1&PageID=1

|||

Umachandar Jayachandran - MS wrote:

Please look at my reply in the thread below for some solutions. You can do this with a single DELETE statement and batch it using SET ROWCOUNT or TOP clause if you have large number of rows to delete. And if you don't have any primary key or unique key on the table then you can use a cursor based approach.

http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=666011&SiteID=1&PageID=1

Question. Why would the absence of a primary key be a determinant in selecting a cursor based approach?

Adamus

|||Because there is no easy way to determine which row to keep out of the duplicates which you can do easily in a single DELETE statement without writing procedural code.