Hello
I was given the task to "filter away" duplicate rows in a 10 million row
table with about 30 columns :S And the table has no primary key, no
constrains, nothing, and I need to find a way to clear that mess up, in a
way... sigh
You guys must think, alright, this is easy, just group by, well well, that
would be too easy, since each row isnt really "unique", lets continue the
mess....
Lets say the table contains 5 columns, A B C D E
In the new table there shall be a CHECK(A,B,C), that combination is unique.
But in the old table there are duplicates of that, and the values of D and E
may be any value on each row. (Headache yet?) Ppl who create these kinds of
heaptables should be lined up and shot :(
Lemme post some test DLL:
CREATE TABLE #Test (
A int,
B int,
C int,
D int,
E int
)
INSERT INTO #Test(A,B,C,D,E)VALUES(1,1,1,9,8)
INSERT INTO #Test(A,B,C,D,E)VALUES(1,1,1,-1,6)
INSERT INTO #Test(A,B,C,D,E)VALUES(1,2,1,1,4)
INSERT INTO #Test(A,B,C,D,E)VALUES(3,2,1,1,4)
INSERT INTO #Test(A,B,C,D,E)VALUES(3,2,1,3,4)
INSERT INTO #Test(A,B,C,D,E)VALUES(3,2,1,3,4)
SELECT * FROM #Test
/*
DESIRED RESULT
A B C D E
1,1,1,9,8
1,2,1,1,4
3,2,1,1,4
*/
DROP TABLE #Test
And everything is so screwed up it doesnt "matter" which values D and E has
(or the other 20 columns) as long as they had the value it had before. And
all columns are like varchar(x) in the table, so keep that in mind :S> And everything is so screwed up it doesnt "matter" which values D and E
> has
> (or the other 20 columns) as long as they had the value it had before.
This doesn't make sense to me, which value did it have before?
Anyway, I'll try, can you do this:
SELECT A,B,C,MAX(D),MAX(E)
FROM table
GROUP BY A,B,C;
?
There is no ANY() function, so you either need to choose an aggregate, or
maybe you could use a correlated subquery with ORDER BY CHECKSUM(NEWID())
but I'm not clear that would work, nor without better requirements am I
inclined to try.
A|||Lasse Edsvik wrote:
> Hello
> I was given the task to "filter away" duplicate rows in a 10 million row
> table with about 30 columns :S And the table has no primary key, no
> constrains, nothing, and I need to find a way to clear that mess up, in a
> way... sigh
> You guys must think, alright, this is easy, just group by, well well, that
> would be too easy, since each row isnt really "unique", lets continue the
> mess....
> Lets say the table contains 5 columns, A B C D E
> In the new table there shall be a CHECK(A,B,C), that combination is unique
.
> But in the old table there are duplicates of that, and the values of D and
E
> may be any value on each row. (Headache yet?) Ppl who create these kinds o
f
> heaptables should be lined up and shot :(
> Lemme post some test DLL:
>
> CREATE TABLE #Test (
> A int,
> B int,
> C int,
> D int,
> E int
> )
> INSERT INTO #Test(A,B,C,D,E)VALUES(1,1,1,9,8)
> INSERT INTO #Test(A,B,C,D,E)VALUES(1,1,1,-1,6)
> INSERT INTO #Test(A,B,C,D,E)VALUES(1,2,1,1,4)
> INSERT INTO #Test(A,B,C,D,E)VALUES(3,2,1,1,4)
> INSERT INTO #Test(A,B,C,D,E)VALUES(3,2,1,3,4)
> INSERT INTO #Test(A,B,C,D,E)VALUES(3,2,1,3,4)
>
> SELECT * FROM #Test
> /*
> DESIRED RESULT
>
> A B C D E
> 1,1,1,9,8
> 1,2,1,1,4
> 3,2,1,1,4
> */
> DROP TABLE #Test
>
> And everything is so screwed up it doesnt "matter" which values D and E ha
s
> (or the other 20 columns) as long as they had the value it had before. And
> all columns are like varchar(x) in the table, so keep that in mind :S
Always tell us what version of SQL Server you are using.
In SQL Server 2005:
WITH T (row_num) AS
(SELECT ROW_NUMBER() OVER
(PARTITION BY a,b,c ORDER BY a,b,c,d,e)
FROM #Test)
DELETE FROM T
WHERE row_num > 1;
Google for "delete duplicates" and you'll find lots of other solutions
in the archives of this group.
Test it out and make sure you have a current backup first :-)
David Portas, SQL Server MVP
Whenever possible please post enough code to reproduce your problem.
Including CREATE TABLE and INSERT statements usually helps.
State what version of SQL Server you are using and specify the content
of any error messages.
SQL Server Books Online:
http://msdn2.microsoft.com/library/ms130214(en-US,SQL.90).aspx
--|||David,
sry, forgot that :) I'm using SQL 2000
"David Portas" <REMOVE_BEFORE_REPLYING_dportas@.acm.org> wrote in message
news:1147360308.771189.249980@.i40g2000cwc.googlegroups.com...
> Lasse Edsvik wrote:
a
that
the
unique.
and E
of
has
And
> Always tell us what version of SQL Server you are using.
> In SQL Server 2005:
> WITH T (row_num) AS
> (SELECT ROW_NUMBER() OVER
> (PARTITION BY a,b,c ORDER BY a,b,c,d,e)
> FROM #Test)
> DELETE FROM T
> WHERE row_num > 1;
> Google for "delete duplicates" and you'll find lots of other solutions
> in the archives of this group.
> Test it out and make sure you have a current backup first :-)
> --
> David Portas, SQL Server MVP
> Whenever possible please post enough code to reproduce your problem.
> Including CREATE TABLE and INSERT statements usually helps.
> State what version of SQL Server you are using and specify the content
> of any error messages.
> SQL Server Books Online:
> http://msdn2.microsoft.com/library/ms130214(en-US,SQL.90).aspx
> --
>|||Hi Aaron,
consider raw data like this:
INSERT INTO #Test(A,B,C,D,E)VALUES(1,1,1,9,6)
INSERT INTO #Test(A,B,C,D,E)VALUES(1,1,1,-1,8)
then the query
SELECT A,B,C,MAX(D),MAX(E)
FROM table
GROUP BY A,B,C;
will produce a row
1,1,1,9,8
which is not present in the original data. Does it make sence?|||alter table #test add f timestamp
go
select * from #test
go
select A,B,C,D,E from #test where not exists(select 1 from #test t1
where t1.a=#test.a
and t1.b=#test.b
and t1.c=#test.c
and t1.f>#test.f)
A B C D E
-- -- -- -- --
1 1 1 -1 6
1 2 1 1 4
3 2 1 3 4
(3 row(s) affected)|||> which is not present in the original data. Does it make sence?
I don't know, I don't think the requirements were specific enough to make
you right or to make me wrong. I was just offering one possible solution.
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment