Recipe

Deduplicating Records in Redshift

Remove duplicate ids from any BI Connector table using ROW_NUMBER().

intermediateVerified 2026-05-19

Tables

usercourses, quizattempts, userlearningpaths, userbundles

Problem

High-volume tables (usercourses, quizattempts, userlearningpaths, userbundles) can contain more than one row with the same id. Aggregations like COUNT(*), SUM(...), and joins on id then silently over-count.

usercourses.id	userid	status	updatedat
abc-123	u1	started	2026-04-01 10:00:00
abc-123	u1	completed	2026-04-02 12:00:00
def-456	u2	started	2026-05-01 09:00:00

Solution

How it works

PARTITION BY id groups rows by their (non-unique) primary key.
ORDER BY updatedat DESC puts the freshest write first within each group.
WHERE row_num = 1 keeps only that freshest row.

Variations

**Different "winner":** change ORDER BY to grade DESC (best score), createdat ASC (earliest version), etc.
**Different uniqueness key:** if you want one row per (userid, courseid) instead of per id, change PARTITION BY id to PARTITION BY userid, courseid.

Performance

On usercourses (~112M rows) the window function scans the full table. Restrict the CTE with WHERE statusgroup = 'a' or a date filter on updatedat first when possible.

When NOT to use this

Don't apply this blindly to tables with intentionally multi-row keys (e.g. coursecustomfields has many rows per courseid on purpose — that's the EAV shape). Dedup on the table's natural business key, not always id.

Related recipes

Filtering to Active Enrollments Only

Exclude archived, inactive, and deleted enrollments from usercourses using statusgroup.

Handling Clamped 9999 Timestamps

Normalize the 9999-12-01 sentinel that the BI Connector uses for "no expiry" values.

Latest (or Highest-Scoring) Quiz Attempt per Learner

Reduce the quizattempts table to one canonical row per learner per quiz — either the most recent or the highest scoring.