Saturday, February 25, 2012

Can someone proofread my remove duplicates script?

DELETE
FROM tblContacts
WHERE tblContacts.ID IN(
SELECT F.ID
FROM tblContacts AS F
WHERE Exists (
SELECT email, Count(ID)
FROM tblContacts
WHERE tblContacts.email = F.email
GROUP BY tblContacts.email
HAVING Count(tblContacts.ID) > 1
)
)
AND tblContacts.ID NOT IN(
SELECT Min(ID)
FROM tblContacts AS F
WHERE Exists (
SELECT email, Count(ID)
FROM tblContacts
WHERE tblContacts.email = F.email
GROUP BY tblContacts.email
HAVING Count(tblContacts.ID) > 1
)
GROUP BY email
)

I readily admit that I've shamelessly copied 'n pasted this from a tutorial and then taken a stab at tweaking it for my own ends. But I really don't understand what it's doing.

Really, all I want to know is that it will remove records with duplicate email fields. But I could also do with confirming - looking at the "SELECT Min(ID)" bit - does that mean that if it finds a duplicate, it'll delete the latest-added one? And if so, that changing it to remove the earliest-added one is simply a case of changing MIN to MAX?

Thanks :)A good tip for keeping your sanity is to always run scripts like this on a testing version of your database and then confirm that it has worked before even contemplating running it on prod. As such - the below is based on my best reading of the script.

Yes it will work. Yes it will delete the most recently inserted record(s) (assuming that the ID field is a monotonically increasing value such as an identity and a higher number always indicates a more recently inserted record). And yes - you can change MIN to MAX to retain the most recently inserted record.

Have a read through the script a few times though - even if you don't consider it necesary it would be nice to know what it is doing and why.

HTH|||Yeah, a test run would be advisable. Good point. I've tried reading my way through it and I just get bogged down in "so we get one list that's... and those exist in... and that doesn't exist... and..." and the will to live rapidly leaves me. I think I get it now, though. I guess I just needed someone to tell me it did do what I thought before I tried to figure out exactly how.

Thanks for the help.|||While it lookes to me like the code you posted should work, I'd suggest a simpler approach. It is a lot easier to read (at least for me anyway), and probably easier to understand.

DELETE FROM tblContacts
WHERE ID <> (SELECT Max(z.ID)
FROM tblContacts AS z
WHERE z.email = tblContacts.email))-PatP|||Ooh, that's easier :D Nice one :beer:|||Oh yeah... One thing I ought to mention before you go trundling off, this code snippet will trip over any rows that have an ID column that is NULL. This may make you need to add an ID IS NOT NULL to the outer clause if there is any chance of encountering a legitimate NULL value in the ID column. This is unlikely, but some schemas will permit it, and the results can be catastrophic!

-PatP|||A handy trick I have used in such cases is rewrite the statement as a SELECT, rather than an update or delete. See what records you are about to modify/maim, and if you have no objections, then you can run the actual data modification.|||DELETE FROM tblContacts
WHERE ID <> (SELECT Max(z.ID)
FROM tblContacts AS z
WHERE z.email = tblContacts.email))
Actually - reading this more carefully, I'm a bit confused. It looks like it'll only do one at a time? Is that right? If so, that's not a bad thing - in fact, it'd be good to know that I've managed to guess what a statement will do before I run it :rolleyes: :D

edit:

Well... having read MCrowley's excellent advice: clearly it doesn't. It gives me a big list of duplicates. But I don't understand how? It selects MAX - which is only going to return one record, right? And then it selects (or deletes) WHERE ID <> - not "is not in [a range]", but "does not equal [a value]".

I'm confused again :(|||The key is in the corrolation ;)

DELETE FROM tblContacts
WHERE ID <> (SELECT Max(z.ID)
FROM tblContacts AS z
WHERE z.email = tblContacts.email))
If you remove the bit in bold then yes - you would get one ID returned. However the bit in bold corrolates the inner and outer query so there is one MAX(ID) returned per email address.

It can be rewritten as a select query as a join of two tables that might make it more obvious:


SELECT tblContacts.*
FROM tblContacts INNER JOIN
(SELECT Max(z.ID) AS TheMaxID,
email
FROM tblContacts
GROUP BY email) AS z ON
z.email = tblContacts.email
WHERE tblContacts.ID <> TheMaxID

HTH|||The more I think about this, the harder it gets :D I swear there's a SQL gene.

Anyway - thanks for the help and patience, everyone. I think I need to just go and play with these statements and get my head round them.|||Just practice - SQL is very easy to learn but rather tricky to master.

Run the inner query on its own:
SELECT Max(z.ID) AS TheMaxID,
email
FROM tblContacts
GROUP BY emailThat might illuminate...|||When you have a subquery (one query that is logically "nested inside" of another query), the subquery gets re-evaluated for each row returned by the outer query. If the DELETE query materializes a million rows from the tblContacts table, then the SELECT query would be evaluated a million times, once for each of the rows materialized by the DELETE query.

For each row in tblContacts, that DELETE checks to see if that row has the Max(ID) value for a given email address. If the row does not have the Max(ID) for that email, then the row is deleted.

While this will sometimes confuse people, it does not confuse SQL! ;) The only point that can confuse SQL is NULL values. A NULL email is simply ignored, considered junque and deleted (the explanation of that gets a bit tricky, just take it on faith for now). A NULL ID is also deleted, for a different (but similar) reason.

Once you understand the way this works for non-NULL values, we'll worry about the NULL values. They are almost assuredly garbage anyway, so don't burn much time on them yet.

-PatP

No comments:

Post a Comment