Skip to main content

Recipe

Deduplicating Records in Redshift

Remove duplicate ids from any BI Connector table using ROW_NUMBER().

intermediateVerified 2026-05-19
All recipes
intermediateVerified 2026-05-19

Tables

usercourses, quizattempts, userlearningpaths, userbundles

Tags

data-quality
redshift
window-functions

TL;DR

Redshift does not enforce primary keys. Wrap any BI Connector table in a ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedat DESC) CTE and keep row_num = 1.

Problem

High-volume tables (usercourses, quizattempts, userlearningpaths, userbundles) can contain more than one row with the same id. Aggregations like COUNT(*), SUM(...), and joins on id then silently over-count.

usercourses.iduseridstatusupdatedat
abc-123u1started2026-04-01 10:00:00
abc-123u1completed2026-04-02 12:00:00
def-456u2started2026-05-01 09:00:00

Solution

How it works

  • PARTITION BY id groups rows by their (non-unique) primary key.
  • ORDER BY updatedat DESC puts the freshest write first within each group.
  • WHERE row_num = 1 keeps only that freshest row.

Variations

  • **Different "winner":** change ORDER BY to grade DESC (best score), createdat ASC (earliest version), etc.
  • **Different uniqueness key:** if you want one row per (userid, courseid) instead of per id, change PARTITION BY id to PARTITION BY userid, courseid.

Performance

On usercourses (~112M rows) the window function scans the full table. Restrict the CTE with WHERE statusgroup = 'a' or a date filter on updatedat first when possible.

When NOT to use this

Don't apply this blindly to tables with intentionally multi-row keys (e.g. coursecustomfields has many rows per courseid on purpose — that's the EAV shape). Dedup on the table's natural business key, not always id.

Related recipes