MakeMyStats

Deduplicating CSV data — strategies and pitfalls

A practical guide to finding and removing duplicate rows in CSV files: exact match vs fuzzy, column selection, keep modes, and the gotchas that trip people up.

Duplicate rows in CSV files are one of the most common data quality problems. They show up everywhere: CRM exports with repeated contacts, transaction logs that got appended twice, survey responses submitted multiple times, or merge operations that didn't account for overlap. The fix sounds simple — just remove the duplicates — but the details matter more than you'd expect.

What counts as a duplicate?

The first question isn't "how do I remove duplicates?" It's "what does duplicate mean for this data?"

Consider a CSV of customer orders:

order_id,customer_email,product,amount,date
1001,alice@example.com,Widget A,29.99,2026-03-01
1002,bob@example.com,Widget B,49.99,2026-03-01
1001,alice@example.com,Widget A,29.99,2026-03-01
1003,alice@example.com,Widget A,29.99,2026-03-15

Row 3 is clearly a duplicate of row 1 — every field matches. But row 4 has the same customer, product, and amount with a different date and order ID. Is that a duplicate? It depends on context. Maybe Alice genuinely ordered the same product twice. Maybe someone re-imported March data.

This is why deduplication always starts with deciding which columns to compare.

Choosing comparison columns

Full-row comparison (all columns) is the safest default. Two rows must be identical in every field to be considered duplicates. This catches exact re-imports and copy-paste errors without false positives.

Subset comparison is more powerful but requires judgment. Common approaches:

  • Natural key columns: Compare on fields that should be unique by design — order_id, email + timestamp, sku + warehouse_id. This catches duplicates even when non-key fields differ (maybe someone edited a note or corrected a typo in the second copy).
  • Business logic columns: Compare on the fields that matter for your analysis. If you're counting unique customers, deduplicate on email alone. If you're counting unique transactions, use customer_id + product + date.
  • Excluding volatile columns: Ignore columns like updated_at, row_number, or import_batch_id that change on every export but don't represent meaningful differences.

The danger with subset comparison is false positives. If you deduplicate a customer list on last_name alone, you'll merge every Smith into one record. Always sanity-check by looking at what you're about to remove before committing to the result.
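As a sketch of the difference, here is full-row versus subset comparison in plain Python, using the order data from the example above (with placeholder emails). The `dedupe` helper is illustrative, not any particular tool's API:

```python
import csv
import io

# The order data from the example above, with placeholder emails
data = """order_id,customer_email,product,amount,date
1001,alice@example.com,Widget A,29.99,2026-03-01
1002,bob@example.com,Widget B,49.99,2026-03-01
1001,alice@example.com,Widget A,29.99,2026-03-01
1003,alice@example.com,Widget A,29.99,2026-03-15
"""

def dedupe(rows, key_columns=None):
    """Keep the first occurrence of each key; None means full-row comparison."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[c] for c in key_columns) if key_columns else tuple(row.values())
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result

rows = list(csv.DictReader(io.StringIO(data)))
print(len(dedupe(rows)))                      # full-row: only the exact re-import goes -> 3
print(len(dedupe(rows, ["customer_email"])))  # email only: merges Alice's two real orders -> 2
```

The second call shows the false-positive risk in miniature: comparing on email alone collapses two genuinely distinct purchases into one row.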

What to keep: first, last, or none

When duplicates are found, you have three options for what stays:

Keep first retains the earliest occurrence and discards the rest. This is the right choice when your data is append-only and later entries are accidental re-imports. Most spreadsheet remove-duplicates features and PostgreSQL's DISTINCT ON (given an ORDER BY) work this way.

Keep last retains the most recent occurrence. Use this when later rows represent corrections or updates. If your CRM export appends updated records to the same file, the last occurrence has the most current data.

Keep none removes all rows that have any duplicates — not just the extras, but the original too. This is a diagnostic mode. It answers the question "which records are problematic?" rather than "give me clean data." You'd use this to identify rows that need manual review rather than automated cleanup.

The choice matters more than people realize. I've seen pipelines that defaulted to "keep first" on a dataset where the second entry was actually a correction, silently preserving stale data for months.
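All three keep modes fit in a few lines of Python. This is a sketch, not any specific tool's implementation; `dedupe_keep` and its sample data are made up for illustration:

```python
from collections import Counter

def dedupe_keep(rows, key_columns, keep="first"):
    """Deduplicate a list of dicts on key_columns; keep is 'first', 'last', or 'none'."""
    keys = [tuple(r[c] for c in key_columns) for r in rows]
    if keep == "none":
        # Diagnostic mode: drop every row whose key occurs more than once
        counts = Counter(keys)
        return [r for r, k in zip(rows, keys) if counts[k] == 1]
    seen, out = set(), []
    pairs = list(zip(rows, keys))
    if keep == "last":
        pairs.reverse()  # scan backwards so the most recent occurrence wins
    for r, k in pairs:
        if k not in seen:
            seen.add(k)
            out.append(r)
    if keep == "last":
        out.reverse()  # restore original row order
    return out

orders = [
    {"order_id": "1001", "amount": "29.99"},
    {"order_id": "1002", "amount": "49.99"},
    {"order_id": "1001", "amount": "31.99"},  # later correction to order 1001
]
print(len(dedupe_keep(orders, ["order_id"], "first")))  # 2, keeps the stale 29.99
print(len(dedupe_keep(orders, ["order_id"], "last")))   # 2, keeps the corrected 31.99
print(len(dedupe_keep(orders, ["order_id"], "none")))   # 1, only order 1002 survives
```

The sample data replays the stale-data trap from the paragraph above: "keep first" silently preserves the pre-correction amount.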

Case sensitivity

Here's a gotcha that bites people regularly: ALICE@example.com and alice@example.com are different strings but (usually) the same email address.

If you're deduplicating on text fields, decide whether comparison should be case-sensitive or case-insensitive. Case sensitivity makes sense for things like product SKUs (ABC-100 vs abc-100 might be different products in some systems). Case insensitivity makes sense for emails, names, and most human-entered text.

Most deduplication tools compare case-sensitively by default. If yours does, and you need insensitive matching, you have two options: normalize the column to lowercase before deduplicating, or use a tool that supports case-insensitive comparison as a toggle. MakeMyStats's CSV Dedupe tool performs exact string comparison, so normalize the relevant columns first if case-insensitive matching matters for your use case.
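If you go the normalization route, the preprocessing step is a one-liner per column. A minimal sketch:

```python
rows = [{"email": "ALICE@example.com"}, {"email": "alice@example.com"}]

# Normalize the comparison column before deduplicating. casefold() is like
# lower() but also handles Unicode edge cases (e.g. German eszett).
for row in rows:
    row["email"] = row["email"].casefold()

print(rows[0]["email"] == rows[1]["email"])  # True: the rows now compare equal
```

Keep a copy of the original column if the casing itself carries information you want in the output.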

Whitespace: the invisible duplicate

Whitespace issues cause more deduplication failures than any other single factor. These rows look identical when you view them in a spreadsheet, but they aren't:

"John Smith"
"John Smith "
" John Smith"
"John  Smith"

Trailing spaces, leading spaces, and double spaces within a value all make two otherwise-identical rows compare as different. CSV files exported from different tools handle whitespace differently — Excel trims, some database exports don't, and copy-pasting from web pages often adds invisible characters.

Before deduplicating, trim whitespace from all text fields and collapse multiple internal spaces to single spaces. Do this as a preprocessing step, not as part of the deduplication itself. Otherwise you'll have clean deduplication but dirty data in the output.

Watch out for non-breaking spaces (Unicode \u00A0) too. They look identical to regular spaces but aren't equal. They commonly appear in data copied from web pages or PDFs.
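A small normalizer covers all four whitespace problems above plus non-breaking spaces. The function name is illustrative:

```python
import re

def normalize_ws(value):
    # Replace non-breaking spaces explicitly (Python's \s matches them too,
    # but many other tools don't), then trim the ends and collapse internal runs
    value = value.replace("\u00a0", " ")
    return re.sub(r"\s+", " ", value).strip()

for s in ["John Smith", "John Smith ", " John Smith", "John  Smith", "John\u00a0Smith"]:
    print(repr(normalize_ws(s)))  # every variant -> 'John Smith'
```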

Numeric precision

Another subtle trap: floating-point representation. These values are mathematically equal but not string-equal:

29.99
29.990
29.9900000000001

If you're deduplicating on a column that contains decimal numbers, string comparison will miss matches where values differ only in trailing zeros or floating-point artifacts. Either round the column to a fixed number of decimal places before comparing, or use a numeric-aware comparison.

This comes up most often with financial data exported from different systems, or scientific data where instruments report different precisions.
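Rounding before comparison can be done with the decimal module, which avoids introducing new float artifacts in the process. A sketch, assuming two decimal places is the right precision for your data:

```python
from decimal import Decimal, ROUND_HALF_UP

def normalize_amount(value, places=2):
    """Round a numeric string to a fixed precision for string comparison."""
    return str(Decimal(value).quantize(Decimal(10) ** -places, rounding=ROUND_HALF_UP))

print(normalize_amount("29.99"))             # '29.99'
print(normalize_amount("29.990"))            # '29.99'
print(normalize_amount("29.9900000000001"))  # '29.99'
```

All three of the article's example values now compare equal as strings.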

Date and time formats

Similar to numeric precision, dates cause false negatives when formats vary:

2026-03-01
03/01/2026
March 1, 2026
2026-03-01T00:00:00Z

All represent the same date. If your CSV was assembled from multiple sources, or if a human edited some rows in a spreadsheet, you might have mixed formats in the same column. Standardize to a single format (ISO 8601 YYYY-MM-DD is ideal) before deduplicating.
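A try-each-format parser handles mixed date columns like the one above. The format list here covers only the article's four examples; extend it for whatever your data actually contains:

```python
from datetime import datetime

# Candidate formats, matching the examples above; extend as needed
FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%Y-%m-%dT%H:%M:%S%z"]

def to_iso_date(value):
    """Normalize a date string to ISO 8601 YYYY-MM-DD, trying each known format."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

for v in ["2026-03-01", "03/01/2026", "March 1, 2026", "2026-03-01T00:00:00Z"]:
    print(to_iso_date(v))  # all four -> 2026-03-01
```

Note the ambiguity trap: "03/01/2026" is March 1 under %m/%d/%Y but January 3 under %d/%m/%Y, so you have to know which convention your source used.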

The deduplication workflow

Here's the process that avoids the most mistakes:

  1. Inspect the data first. Look at a sample of rows. Identify which columns should form the uniqueness key. Look for whitespace, case, and format inconsistencies.

  2. Preprocess. Trim whitespace, normalize case on comparison columns if needed, standardize date and number formats. Do this in a separate step so you can verify the normalization before it feeds into deduplication.

  3. Run deduplication on a subset. Start with a small sample. Check what gets removed. If the duplicate count seems too high or too low, revisit your column selection.

  4. Check the discarded rows. Don't just trust the output — look at what was removed. If legitimate records disappeared, your comparison key was too broad.

  5. Keep the original. Never overwrite your source file. Export the deduplicated result to a new file. You want the ability to go back.
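The middle steps of the workflow can be sketched in one function. This is a minimal illustration (the function name and inline sample are made up), with the preprocessing kept separate from the dedupe pass and the discarded rows returned for review rather than thrown away:

```python
import csv
import io

def clean_rows(rows, key_columns):
    """Steps 2-4: trim whitespace, dedupe (keep first), report what was discarded."""
    # 2. Preprocess: trim whitespace on every field
    rows = [{col: val.strip() for col, val in row.items()} for row in rows]
    # 3-4. Dedupe on the key, keeping discarded rows for manual inspection
    seen, kept, discarded = set(), [], []
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        (discarded if key in seen else kept).append(row)
        seen.add(key)
    return kept, discarded

# Step 1 is inspecting your real file; here an inline sample stands in for it.
# Step 5: write `kept` to a NEW file, never back over the source.
sample = "id,name\n1,Alice \n1,Alice\n2,Bob\n"
kept, discarded = clean_rows(list(csv.DictReader(io.StringIO(sample))), ["id", "name"])
print(len(kept), len(discarded))  # 2 1
```

Note that "Alice " and "Alice" only collapse into one row because trimming ran first; skip step 2 and the duplicate survives.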

Common mistakes

Deduplicating too aggressively. Comparing on too few columns or ignoring legitimate differences. If you deduplicate orders on just product + amount, you'll merge separate purchases of the same item.

Deduplicating too late. If you merge two datasets and then deduplicate, you might remove rows that were already present in both sources but represent genuinely distinct records. Deduplicate each source individually first, then merge.

Forgetting about row order. Some datasets carry meaning in row order (time series, ranked lists). Deduplication can reorder or remove rows in ways that break that ordering. If order matters, make sure your tool preserves it and that "keep first" or "keep last" respects the original sequence.

Not validating the result. After deduplication, check that the row count makes sense. If you started with 10,000 rows and ended with 50, something went wrong. A basic sanity check — expected duplicate rate vs. actual — catches configuration errors before they propagate.
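That sanity check is cheap to automate. A sketch, where the 20% threshold is an arbitrary assumption you'd tune to your data's expected duplicate rate:

```python
def sanity_check(rows_before, rows_after, max_dup_rate=0.2):
    """Fail loudly if deduplication removed more rows than expected.

    max_dup_rate is an assumed threshold; set it from what you know of the data.
    """
    removed = rows_before - rows_after
    rate = removed / rows_before if rows_before else 0.0
    if rate > max_dup_rate:
        raise ValueError(
            f"Removed {removed} of {rows_before} rows ({rate:.0%}): check your key columns"
        )
    return rate

print(sanity_check(10000, 9800))  # 2% duplicates: plausible, returns 0.02
# sanity_check(10000, 50) would raise: a 99% duplicate rate means the key is too broad
```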

Try it

MakeMyStats's CSV Dedupe tool handles the core workflow: drop a CSV, select which columns to compare, choose first/last/none keep mode, and download the clean result. It runs entirely in your browser — your data never leaves your machine.

For preprocessing steps like case normalization or whitespace trimming, the CSV Filter tool can help narrow down the rows before you deduplicate. Combining both gives you a full cleanup pipeline without uploading anything.