Many times, we do not want duplicate rows or values in our SQL tables. On the other hand, in some situations, it does not matter if there are duplicates present. For whatever reason, suppose duplicates have found their way into one of your tables. How can you find them quickly and easily? The ROW_NUMBER()
Window function is a fantastic tool to use. Continue reading and see example queries you can apply to your own tables and find those duplicates…
OS and DB used:
- Linux Mint 20 “Ulyana”
- MySQL 8.0.22
Self-Promotion:
If you enjoy the content written here, by all means, share this blog and your favorite post(s) with others who may benefit from or like it as well. Since coffee is my favorite drink, you can even buy me one if you would like!
Visit this fantastic YouTube video where I learned a great deal for this blog post.
For the examples in this post, we have this arbitrary table with some test data:
+----+------------+
| id | first_name |
+----+------------+
| 1 | Max |
| 2 | Mary |
| 3 | Charlie |
| 4 | Humpty |
| 5 | Roger |
| 6 | Jim |
| 7 | Jupyter |
| 8 | Max |
| 9 | Mary |
| 10 | Charlie |
| 11 | Humpty |
| 12 | Roger |
| 13 | Jim |
| 14 | Jupyter |
| 15 | Jim |
| 16 | Jim |
| 17 | Max |
| 18 | Max |
| 19 | Mary |
+----+------------+
19 rows in set (0.0011 sec)
You can likely pick out the few duplicates that are present since this is a rather small result set. However, with a larger result set, not so easy. Let’s put ROW_NUMBER()
to work in finding the duplicates.
But first, let’s visit the online window functions documentation on ROW_NUMBER()
and see the syntax and description:
“Returns the number of the current row within its partition. Rows numbers range from 1 to the number of partition rows. ORDER BY affects the order in which rows are numbered. Without ORDER BY, row numbering is nondeterministic.”
ROW_NUMBER() Window Function with empty OVER() clause in MySQL
To start, let’s query the data with ROW_NUMBER()
and no arguments in the OVER()
clause:
-> FROM some_names;
+----+------------+---------------------+
| id | first_name | ROW_NUMBER() OVER() |
+----+------------+---------------------+
| 1 | Max | 1 |
| 2 | Mary | 2 |
| 3 | Charlie | 3 |
| 4 | Humpty | 4 |
| 5 | Roger | 5 |
| 6 | Jim | 6 |
| 7 | Jupyter | 7 |
| 8 | Max | 8 |
| 9 | Mary | 9 |
| 10 | Charlie | 10 |
| 11 | Humpty | 11 |
| 12 | Roger | 12 |
| 13 | Jim | 13 |
| 14 | Jupyter | 14 |
| 15 | Jim | 15 |
| 16 | Jim | 16 |
| 17 | Max | 17 |
| 18 | Max | 18 |
| 19 | Mary | 19 |
+----+------------+---------------------+
19 rows in set (0.0013 sec)
Based on the query results above, we can see that ROW_NUMBER()
provides an incrementing set of numbers in order, starting at 1 and ending with a number equal to that of the total number of rows.
ROW_NUMBER() Window Function with PARTITION BY clause in MySQL
How can you isolate duplicate rows using ROW_NUMBER()
?
The optional PARTITION BY
clause of the OVER
clause will create sub-groups of rows for its specified value. For the purposes of this blog post, we will assume that we want to target the ‘first_name’ column and find any duplicates for that column.
In a nutshell, PARTITION BY first_name
will subdivide the ‘first_name’ column into separate groups of identical values. An example query provides a better, visual understanding:
-> ROW_NUMBER() OVER(PARTITION BY first_name) AS r_num
-> FROM some_names;
+----+------------+-------+
| id | first_name | r_num |
+----+------------+-------+
| 3 | Charlie | 1 |
| 10 | Charlie | 2 |
| 4 | Humpty | 1 |
| 11 | Humpty | 2 |
| 6 | Jim | 1 |
| 13 | Jim | 2 |
| 15 | Jim | 3 |
| 16 | Jim | 4 |
| 7 | Jupyter | 1 |
| 14 | Jupyter | 2 |
| 2 | Mary | 1 |
| 9 | Mary | 2 |
| 19 | Mary | 3 |
| 1 | Max | 1 |
| 8 | Max | 2 |
| 17 | Max | 3 |
| 18 | Max | 4 |
| 5 | Roger | 1 |
| 12 | Roger | 2 |
+----+------------+-------+
19 rows in set (0.0012 sec)
Note that the ‘first_name’ column is now grouped by identical values. Also, each time the ‘first_name’ sub-group changes to a different value, ROW_NUMBER()
restarts counting at 1 again and continues in consecutive order until another change occurs within the ‘first_name’ sub-groups.
The OVER()
clause also accepts an optional ORDER BY
clause that will sort the rows according to the specified value. Here I use the ‘id’ column in ASC
order. The ‘id’ column is the table’s PRIMARY KEY
so this value is UNIQUE
and a solid candidate column to sort by:
-> ROW_NUMBER() OVER(PARTITION BY first_name ORDER BY id ASC) AS r_num
-> FROM some_names;
+----+------------+-------+
| id | first_name | r_num |
+----+------------+-------+
| 3 | Charlie | 1 |
| 10 | Charlie | 2 |
| 4 | Humpty | 1 |
| 11 | Humpty | 2 |
| 6 | Jim | 1 |
| 13 | Jim | 2 |
| 15 | Jim | 3 |
| 16 | Jim | 4 |
| 7 | Jupyter | 1 |
| 14 | Jupyter | 2 |
| 2 | Mary | 1 |
| 9 | Mary | 2 |
| 19 | Mary | 3 |
| 1 | Max | 1 |
| 8 | Max | 2 |
| 17 | Max | 3 |
| 18 | Max | 4 |
| 5 | Roger | 1 |
| 12 | Roger | 2 |
+----+------------+-------+
19 rows in set (0.0177 sec)
ROW_NUMBER() Window function with Derived Table in MySQL
All that remains to do is filter the query results based on the ROW_NUMBER()
value, targeting any value greater than 1 for each sub-group. Any records that match that criteria are duplicates.
Simple enough, right? Just use the WHERE
clause:
-> FROM some_names
-> WHERE ROW_NUMBER() OVER(PARTITION BY first_name ORDER BY id ASC) > 1;
ERROR: 3593: You cannot use the window function 'row_number' in this context.
Uh-oh. Turns out, you cannot use a window function like this in the WHERE
clause predicate.
Now what?
If you wrap the ROW_NUMBER()
query in a derived table, then you can perform the filtering in the WHERE
clause:
-> FROM
-> (
-> SELECT id, first_name, ROW_NUMBER() OVER(PARTITION BY first_name ORDER BY id ASC) AS r_num
-> FROM some_names
-> ) AS t
-> WHERE t.r_num > 1
-> ORDER BY t.first_name ASC;
+----+------------+-------+
| id | first_name | r_num |
+----+------------+-------+
| 10 | Charlie | 2 |
| 11 | Humpty | 2 |
| 13 | Jim | 2 |
| 15 | Jim | 3 |
| 16 | Jim | 4 |
| 14 | Jupyter | 2 |
| 9 | Mary | 2 |
| 19 | Mary | 3 |
| 8 | Max | 2 |
| 17 | Max | 3 |
| 18 | Max | 4 |
| 12 | Roger | 2 |
+----+------------+-------+
12 rows in set (0.0011 sec)
And there are the duplicates, easily found using the ROW_NUMBER()
Window Function. Try it yourself next time you need to locate any duplicates in your tables.
Like what you have read? See anything incorrect? Please comment below and thank you for reading!!!
A Call To Action!
Thank you for taking the time to read this post. I truly hope you discovered something interesting and enlightening. Please share your findings here, with someone else you know who would get the same value out of it as well.
Visit the Portfolio-Projects page to see blog post/technical writing I have completed for clients.
To receive email notifications (Never Spam) from this blog (“Digital Owl’s Prose”) for the latest blog posts as they are published, please subscribe (of your own volition) by clicking the ‘Click To Subscribe!’ button in the sidebar on the homepage! (Feel free at any time to review the Digital Owl’s Prose Privacy Policy Page for any questions you may have about: email updates, opt-in, opt-out, contact forms, etc…)
Be sure and visit the “Best Of” page for a collection of my best blog posts.
Josh Otwell has a passion to study and grow as a SQL Developer and blogger. Other favorite activities find him with his nose buried in a good book, article, or the Linux command line. Among those, he shares a love of tabletop RPG games, reading fantasy novels, and spending time with his wife and two daughters.
Disclaimer: The examples presented in this post are hypothetical ideas of how to achieve similar types of results. They are not the utmost best solution(s). The majority, if not all, of the examples provided, are performed on a personal development/learning workstation-environment and should not be considered production quality or ready. Your particular goals and needs may vary. Use those practices that best benefit your needs and goals. Opinions are my own.
The post ROW_NUMBER() Window Function – find duplicate values. appeared first on Digital Owl's Prose.