Speaker diarization, explained: how speaker labels work
May 12, 2026
Speaker diarization is the automatic process of separating a recording into segments by speaker — answering “who spoke when?” before any human reads the transcript. It is the difference between a wall of words and a structured conversation, and it is the reason a transcript of a two-person interview comes back as a readable dialogue instead of one long paragraph.
In plain terms: you upload a recording, the diarization engine finds the points where the voice changes, groups the segments that sound like the same person, and hands back a transcript that reads like a script — one line per turn, labeled by voice. This post explains how that works, why diarization is not the same as recognizing a voice, and how to turn the generic labels into real names.
Speaker diarization, in plain English
Picture transcribing a 90-minute interview by hand. After turning the speech into text, the next thing you’d do is add the attributions: “Interviewer:”, “Subject:”, back and forth, every time the voice changes. That attribution work — deciding who is speaking at any given second — is exactly what diarization does for you.
The output is a dialogue-format transcript:
Speaker A [00:00:12]: The first thing to understand is that the data only exists for a moment.
Speaker B [00:00:27]: Right, and that's the part most people don't realize until it's too late.
Speaker A [00:00:34]: Exactly. So the real question becomes what you do in that window.
Each utterance is tied to a speaker and a timestamp. You can then rename Speaker A and Speaker B to real names; one edit per speaker, and the label updates everywhere it appears.
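As a rough illustration of that structure, here is a minimal Python sketch of one utterance and the line format shown above. The `Utterance` class and its field names are hypothetical, chosen for this example; they are not Hushscript's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str   # generic label such as "Speaker A" (hypothetical field name)
    start: float   # start time in seconds from the beginning of the file
    text: str


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS, matching the transcript lines above."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_line(utterance: Utterance) -> str:
    """Build one dialogue line: label, timestamp, then the spoken text."""
    stamp = format_timestamp(utterance.start)
    return f"{utterance.speaker} [{stamp}]: {utterance.text}"


utterances = [
    Utterance("Speaker A", 12.0, "The first thing to understand is that the data only exists for a moment."),
    Utterance("Speaker B", 27.0, "Right, and that's the part most people don't realize until it's too late."),
]

for utterance in utterances:
    print(format_line(utterance))
```

Because the speaker is stored once per utterance rather than typed into the text, renaming a speaker later is a change to one field, not a search-and-replace through the prose.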
The alternative — a plain transcript with no diarization — is a continuous block of text where you have to replay the audio to work out who said what. For anything past a couple of minutes with two or more voices, that is real manual work, and it is the work the diarization pass removes.
What you need before you start
There are no settings to configure and nothing to install. Diarization runs on the recording itself, so the only real prerequisite is audio where the speakers are reasonably distinct:
- One recording with every voice in it. Diarization works within a single file, so all the speakers have to be present in that file. A round-table where everyone shares one room mic works; two people each recorded on their own separate device, saved as two separate files, does not — combine those into one mix first.
- Clear-enough audio. It doesn’t need to be studio quality. It needs each person to be audible without straining, with limited cross-talk and background noise. The cleaner the separation between voices, the cleaner the labels.
- A rough idea of how many people are talking. You don’t have to supply this, but knowing it helps you sanity-check the result — if you recorded three people and the transcript shows six speakers, you know where to look.
That’s it. No microphone enrollment, no voice samples, no per-speaker setup.
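The head-count sanity check is easy to do by eye, but it can also be sketched in a few lines of Python. This assumes the transcript is available as a list of (label, text) pairs, which is a made-up shape for the example. Sorting labels by how rarely they appear is a heuristic of this sketch, not a product feature: an extra label with only a turn or two is often a split of an existing voice.

```python
from collections import Counter


def check_speaker_count(turns, expected):
    """Compare the number of distinct labels with your own head count.

    `turns` is a list of (speaker_label, text) pairs. Returns a flag saying
    whether the counts match, plus the turn count per label, rarest first.
    """
    counts = Counter(speaker for speaker, _ in turns)
    rarest_first = sorted(counts.items(), key=lambda item: item[1])
    return len(counts) == expected, rarest_first


turns = [
    ("Speaker A", "Okay, let's start."),
    ("Speaker B", "Design is done."),
    ("Speaker A", "Good."),
    ("Speaker C", "Engineering needs two more weeks."),
    ("Speaker A", "That is tight."),
    ("Speaker D", "Right."),  # a fourth label in a three-person meeting
]

matches, rarest_first = check_speaker_count(turns, expected=3)
print(matches)
print(rarest_first)
```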
How automatic speaker identification works
Diarization breaks into a few distinct stages. Knowing them makes the output easier to read — and easier to fix when it gets something wrong.
Feature extraction. The audio is split into short frames, typically 10 to 40 milliseconds each, and the engine measures acoustic features in every frame. The features that matter for telling speakers apart live in the frequency domain — pitch, formants, and spectral patterns that tend to stay consistent within one person’s voice and differ between people.
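To make the framing step concrete, here is a small standard-library Python sketch. It is an illustration under stated assumptions, not the engine's actual code. The 25 ms frame and 10 ms hop are common choices picked for the example, and the "dominant frequency" computed with a naive DFT is a deliberately crude stand-in for the richer spectral features a production system would generally use.

```python
import math


def split_into_frames(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Cut a list of samples into overlapping fixed-length frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop_len)]


def dominant_frequency(frame, sample_rate):
    """Return the frequency (Hz) of the strongest non-DC bin of a naive DFT."""
    n = len(frame)
    best_bin, best_power = 0, -1.0
    for k in range(1, n // 2):
        real = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        imag = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        power = real * real + imag * imag
        if power > best_power:
            best_bin, best_power = k, power
    return best_bin * sample_rate / n


# A synthetic 200 Hz tone, 0.1 seconds at 8 kHz, standing in for real audio.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * i / sample_rate) for i in range(800)]

frames = split_into_frames(tone, sample_rate)
print(len(frames), len(frames[0]))
print(dominant_frequency(frames[0], sample_rate))
```

The naive DFT is quadratic in the frame length, which is fine for a demonstration; real code would use an FFT.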
Segmentation. The engine looks for speaker change points: moments where those voice characteristics shift. This is the hard part. A change can land mid-sentence when someone interrupts, and noise or overlapping speech can hide a real boundary or invent a false one.
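One simple way to picture change-point detection is to slide two adjacent windows along the per-frame features and flag the spots where their averages differ sharply. The sketch below does that for a single number per frame (think of a pitch estimate in Hz). The window size, the threshold, and the feature values are all invented for illustration; real engines compare much richer features, and this is not a description of any particular product's method.

```python
def mean(values):
    return sum(values) / len(values)


def find_change_points(features, window=5, threshold=30.0):
    """Return frame indices where the feature average shifts sharply.

    For each index, compare the mean of the previous `window` frames with
    the mean of the next `window` frames. Consecutive indices above the
    threshold form one run; only the strongest index of each run is kept.
    """
    change_points = []
    run_best = None  # (index, score) of the strongest point in the current run
    for i in range(window, len(features) - window + 1):
        score = abs(mean(features[i:i + window]) - mean(features[i - window:i]))
        if score > threshold:
            if run_best is None or score > run_best[1]:
                run_best = (i, score)
        elif run_best is not None:
            change_points.append(run_best[0])
            run_best = None
    if run_best is not None:
        change_points.append(run_best[0])
    return change_points


# Hypothetical per-frame pitch values: a low voice, a higher one, then low again.
pitch = [120.0] * 20 + [210.0] * 20 + [125.0] * 20

print(find_change_points(pitch))
```

Lowering the threshold makes the detector more sensitive, which is the trade-off described above: it catches more real boundaries but also starts inventing false ones when noise moves the features.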
Clustering. Segments that sound like the same voice are grouped together. The engine isn’t told how many people are in the room — it infers the speaker count from the audio. If you know the count ahead of time, some setups let you supply it; if not, the engine estimates it, which is where most “split one person into two” and “merged two people into one” errors come from.
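The link between the estimated speaker count and those two error types can be shown with a toy clustering routine. The sketch below is a simple greedy threshold clusterer, chosen for brevity rather than because any particular engine uses it: each segment joins the nearest existing cluster if it is close enough, and otherwise starts a new one. The 2-D "voice vectors" and thresholds are made up for the example.

```python
import math


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_segments(vectors, threshold):
    """Assign each segment vector to a cluster id without knowing the count.

    A segment joins the nearest cluster centroid when it lies within
    `threshold`; otherwise it opens a new cluster.
    """
    centroids, sizes, assignments = [], [], []
    for vec in vectors:
        best, best_dist = None, None
        for idx, centroid in enumerate(centroids):
            d = distance(vec, centroid)
            if best_dist is None or d < best_dist:
                best, best_dist = idx, d
        if best is not None and best_dist <= threshold:
            n = sizes[best]
            # Update the running mean of the chosen cluster.
            centroids[best] = [(c * n + x) / (n + 1) for c, x in zip(centroids[best], vec)]
            sizes[best] = n + 1
            assignments.append(best)
        else:
            centroids.append(list(vec))
            sizes.append(1)
            assignments.append(len(centroids) - 1)
    return assignments


# Six segments from three hypothetical voices.
segments = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.9), (9.0, 1.0), (5.1, 4.8), (1.1, 1.3)]

for threshold in (0.1, 1.0, 6.0):
    labels = cluster_segments(segments, threshold)
    print(threshold, len(set(labels)), labels)
```

With a well-chosen threshold the routine finds three voices. Set it too tight and every segment becomes its own speaker (one person split into several); set it too loose and distinct voices collapse together (two people merged into one).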
Labeling. Each cluster gets a label — Speaker A, Speaker B, Speaker C, or Speaker 0, Speaker 1. The labels are arbitrary placeholders. There is no identity behind them, just a stable handle for one voice across the whole recording.
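The labeling step is mostly bookkeeping. A minimal sketch, assuming cluster ids arrive as integers and that letters are handed out in order of first appearance (an assumption of this example, not a documented rule):

```python
import string


def label_clusters(assignments):
    """Map arbitrary cluster ids to "Speaker A", "Speaker B", and so on.

    Labels are assigned in order of first appearance. Past 26 speakers the
    sketch falls back to numbered labels.
    """
    labels = {}
    for cluster_id in assignments:
        if cluster_id not in labels:
            position = len(labels)
            if position < len(string.ascii_uppercase):
                labels[cluster_id] = f"Speaker {string.ascii_uppercase[position]}"
            else:
                labels[cluster_id] = f"Speaker {position + 1}"
    return [labels[cluster_id] for cluster_id in assignments]


print(label_clusters([2, 0, 2, 1]))
```

The cluster id itself carries no meaning, which is why cluster 2 can end up as "Speaker A": the label only records that the same voice came back.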
The whole pipeline runs as a single automated pass alongside the transcription. For a 60-minute recording it adds seconds, not minutes, to the total time.
Diarization vs voice recognition
These two get conflated constantly, but they answer different questions.
Diarization asks: who is speaking at this moment, relative to the other voices in this file? It separates voices but has no idea who they belong to. It needs no prior knowledge and keeps no record of any voice after the job is done.
Voice recognition, known in the research literature as speaker recognition, asks: is this the same person as in a known reference recording? It needs an enrolled voice sample to compare against, and it is the technology behind “verify your identity by voice” systems.
Hushscript uses diarization, not voice recognition. There is no database of enrolled voices, and your recording is never compared against anyone else’s. The speaker labels are derived purely from the voice characteristics inside that one file.
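The contrast can be sketched in code. In the hypothetical Python below, both questions use the same similarity measure; what differs is where the comparison vector comes from. The function names, the 0.9 threshold, and the three-number "voice vectors" are all invented for illustration.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker_within_file(segment_a, segment_b, threshold=0.9):
    """Diarization-style question: do two segments of ONE file match each other?"""
    return cosine_similarity(segment_a, segment_b) >= threshold


def matches_enrolled_voice(segment, enrolled_reference, threshold=0.9):
    """Recognition-style question: does a segment match a stored reference?

    This needs `enrolled_reference`, a voice sample collected ahead of time.
    Diarization has no such input.
    """
    return cosine_similarity(segment, enrolled_reference) >= threshold


# Hypothetical voice vectors for three segments of the same recording.
seg1 = [0.90, 0.10, 0.20]
seg2 = [0.88, 0.12, 0.18]
seg3 = [0.10, 0.90, 0.30]

print(same_speaker_within_file(seg1, seg2))
print(same_speaker_within_file(seg1, seg3))
```

Nothing in the first function can produce a name: it only ever says "same as that other segment" or "different".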
This distinction is the reason diarization can’t name people for you. It can confidently say “the same person spoke these 200 turns across the hour,” but it cannot say who that person is — that step is yours, and it takes one click per speaker. If you want a deeper look at how Hushscript handles multi-speaker recordings end to end, the speaker identification page covers it.
A worked example: a three-person meeting
Say you record a 40-minute project meeting with three people in one room, sharing a single laptop mic, saved as a single .m4a file. You drop it in, and after transcription the output looks like this:
Speaker A [00:00:04]: Okay, let's start with the launch date. Where are we?
Speaker B [00:00:09]: Design is done. We handed the final screens over on Monday.
Speaker C [00:00:14]: Engineering needs about two more weeks for the payment flow.
Speaker A [00:00:21]: Two weeks puts us past the date we promised marketing.
Speaker B [00:00:27]: Can we ship without the new payment flow and follow up?
Speaker C [00:00:33]: Not cleanly. The old flow breaks on the new pricing.
Three voices, separated and labeled, with timestamps on every turn. Now you relabel: Speaker A becomes “Priya” (she’s running the meeting), Speaker B becomes “Tom”, Speaker C becomes “Dana”. After three clicks, every line carries a real name, and the transcript is ready to paste into the meeting notes with correct attribution throughout.
The same shape holds for a two-person interview, a four-person podcast, or a two-sided phone call — the engine separates what it hears, and you put the names on.
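The relabeling in this example amounts to a lookup table applied to every turn. A minimal Python sketch, assuming the transcript is held as (label, timestamp, text) tuples, a shape made up for illustration:

```python
def rename_speakers(turns, names):
    """Replace generic labels with real names everywhere they appear.

    Labels missing from `names` are left unchanged.
    """
    return [(names.get(speaker, speaker), timestamp, text)
            for speaker, timestamp, text in turns]


meeting = [
    ("Speaker A", "00:00:04", "Okay, let's start with the launch date. Where are we?"),
    ("Speaker B", "00:00:09", "Design is done. We handed the final screens over on Monday."),
    ("Speaker C", "00:00:14", "Engineering needs about two more weeks for the payment flow."),
    ("Speaker A", "00:00:21", "Two weeks puts us past the date we promised marketing."),
]

names = {"Speaker A": "Priya", "Speaker B": "Tom", "Speaker C": "Dana"}

for speaker, timestamp, text in rename_speakers(meeting, names):
    print(f"{speaker} [{timestamp}]: {text}")
```

Three entries in the table correspond to the three clicks: one per speaker, no matter how many turns each person took.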
How to relabel speakers in Hushscript
The labels arrive generic on purpose; naming them is a quick editor step. To get there, drop in your file (a 30-second speaker-labeled preview renders with no account needed), sign up to transcribe the rest, then open the finished transcript. From there:
- Click any speaker label in the transcript editor, for example Speaker B.
- Type the real name, like “Tom”.
- Every instance of that speaker updates at once, from the first line to the last.
The rename is global, so you never edit the same label twice. For a 90-minute interview with two speakers, a full relabeling pass takes under a minute, and you end up with a quote-ready transcript that attributes every line correctly. You can then export it to TXT, SRT, DOCX, or JSON with the names baked in.
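For a sense of what "names baked in" looks like in a subtitle file, here is a hedged Python sketch that writes named turns in the standard SRT layout (a counter, a time range, the text, a blank line). It is not Hushscript's exporter. Two things are assumptions of the example: each turn is taken to end when the next one starts, and the last turn gets a fixed five-second duration.

```python
def srt_time(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    millis = int(round(seconds * 1000))
    hours, rest = divmod(millis, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(turns, last_duration=5.0):
    """Render (name, start_seconds, text) turns as SRT cues."""
    cues = []
    for index, (name, start, text) in enumerate(turns):
        if index + 1 < len(turns):
            end = turns[index + 1][1]      # assumed: ends when the next turn begins
        else:
            end = start + last_duration    # assumed duration for the final turn
        cues.append(f"{index + 1}\n{srt_time(start)} --> {srt_time(end)}\n{name}: {text}\n")
    return "\n".join(cues)


named_turns = [
    ("Priya", 4.0, "Okay, let's start with the launch date. Where are we?"),
    ("Tom", 9.0, "Design is done. We handed the final screens over on Monday."),
]

print(to_srt(named_turns))
```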
Troubleshooting common diarization issues
Diarization is reliable on clean recordings and gets harder as the audio gets messier. The common problems, and what to do about them:
Overlapping speech. When two people talk at once, the engine has to assign overlapping audio to a single speaker, and it can mislabel the crossover or drop a few words. There’s no fix in software — the audio genuinely contains two voices in the same moment. Prevention is the lever: a recording where people take turns diarizes far more cleanly than one full of interruptions. Where overlaps do happen, a quick read against the audio catches the handful of affected lines.
Similar-sounding voices. Two speakers with close pitch and accent — say, two men of similar age, or siblings — are harder to separate than a contrasting pair, because their voice characteristics genuinely overlap. The engine may occasionally swap a turn between them. Scan the transcript for lines that read oddly for the assigned speaker; those are the ones to reassign by hand.
One person split into two labels. If someone’s audio changes mid-recording — they move closer to the mic, switch from a headset to speakerphone, or the connection degrades — the engine can read the second half as a new voice and create an extra label. Merge the two by relabeling both to the same name; the duplicates collapse into one speaker.
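In data terms, that merge is a rename onto an existing label. The Python sketch below shows the idea on (label, text) pairs, a hypothetical shape for the example. Joining back-to-back turns that end up with the same speaker is an optional extra added here for readability, not something the editor is documented to do.

```python
def merge_labels(turns, duplicate, keep):
    """Fold the `duplicate` label into `keep`.

    Consecutive turns that now share a speaker are joined into one turn.
    """
    merged = []
    for speaker, text in turns:
        if speaker == duplicate:
            speaker = keep
        if merged and merged[-1][0] == speaker:
            merged[-1] = (speaker, merged[-1][1] + " " + text)
        else:
            merged.append((speaker, text))
    return merged


# "Speaker C" is really Speaker B after switching to speakerphone.
call = [
    ("Speaker A", "Can you hear me now?"),
    ("Speaker B", "Yes, one second."),
    ("Speaker C", "Okay, I'm on speakerphone."),
    ("Speaker A", "Much better."),
]

for speaker, text in merge_labels(call, duplicate="Speaker C", keep="Speaker B"):
    print(f"{speaker}: {text}")
```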
Two people merged into one label. The reverse happens when two quiet or distant voices sound too alike for the engine to split. This is hardest to repair after the fact, so it’s worth preventing: get the mic closer to the speakers, and avoid recording from across a large room.
Far-field or noisy audio. A laptop mic at the end of a conference table, a phone on the table during a group chat, or heavy room echo all blur the boundaries between voices. Closer mic placement helps every speaker, and reducing background noise — fans, traffic, music — sharpens the separation. Diarization quality tracks audio quality closely.
Phone calls recorded as two separate files. If your recorder saved each side of a call as its own mono file, diarization on either file alone sees only one voice. Combine the two into a single mixed recording first, so both voices are present in one file, then transcribe that.
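If you would rather script the mix than use an audio editor, the sketch below shows one way to do it with Python's standard `wave` and `array` modules. It assumes both sides are 16-bit mono WAV files with the same sample rate that start at the same moment; recordings in other formats would need converting first, and the function names are made up for this example.

```python
import array
import sys
import wave


def mix_samples(side_a, side_b):
    """Average two sample sequences, padding the shorter one with silence."""
    length = max(len(side_a), len(side_b))
    a = list(side_a) + [0] * (length - len(side_a))
    b = list(side_b) + [0] * (length - len(side_b))
    # Averaging (rather than summing) keeps 16-bit samples in range.
    return [(x + y) // 2 for x, y in zip(a, b)]


def read_mono_16bit(path):
    """Return (sample_rate, samples) for a 16-bit mono WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError(f"{path} is not 16-bit mono")
        rate = wav.getframerate()
        samples = array.array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return rate, samples


def mix_files(path_a, path_b, out_path):
    """Write a single mono WAV containing both sides of the call."""
    rate_a, a = read_mono_16bit(path_a)
    rate_b, b = read_mono_16bit(path_b)
    if rate_a != rate_b:
        raise ValueError("sample rates differ; resample one side first")
    mixed = array.array("h", mix_samples(a, b))
    if sys.byteorder == "big":
        mixed.byteswap()
    with wave.open(out_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(rate_a)
        out.writeframes(mixed.tobytes())


print(mix_samples([100, -100, 50], [300]))
```

A call such as `mix_files("caller.wav", "callee.wav", "call_mixed.wav")` (file names hypothetical) would then produce the single recording to transcribe.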
Accuracy tips that actually move the needle
A few habits make diarization noticeably better, and most of them are about the recording, not the software:
- Let people take turns. Cross-talk is the single biggest source of diarization errors. A moderated discussion beats a free-for-all.
- Get the mic closer. Distance and room echo blur voices together. A close mic per person is ideal; a shared mic in the middle of the table beats one in the corner.
- Cut obvious background noise. Music, a running fan, or a busy café all compete with the voices and muddy the boundaries.
- Keep everyone in one recording. Diarization works within a single file, so a recording that already contains every voice generally diarizes more reliably than separate tracks stitched together afterward.
None of this needs to be perfect. Clear turns and a reasonable mic position get most recordings to clean labels, and the editor handles the rest.
Why it matters for interviews and meetings
Without diarization, a transcript of a multi-person conversation is close to unusable for its most common jobs:
- Direct quotes. You can’t quote someone if you don’t know which lines are theirs.
- Summary attribution. A meeting summary that doesn’t say who said what is barely more useful than no summary.
- Searching by speaker. If you’re hunting for something one person said, you can’t filter a single undifferentiated block of text.
With diarization, each of these is straightforward, and the transcript becomes a document rather than raw text. For anything that will be published or cited — journalism, research, podcast show notes — accurate attribution is not optional, and getting it from the transcript instead of reconstructing it from the audio is where the time is saved. For internal work like meeting notes, interview research, or HR records, the labeled structure is just as valuable: scanning a dialogue is far faster than reading a wall of text.
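Once every line carries a speaker, pulling quotes or searching by person becomes a simple filter. A minimal Python sketch, again assuming (speaker, timestamp, text) tuples as a hypothetical transcript shape:

```python
def lines_by_speaker(turns, name, keyword=None):
    """Return (timestamp, text) for one speaker, optionally filtered by keyword."""
    matches = []
    for speaker, timestamp, text in turns:
        if speaker != name:
            continue
        if keyword is not None and keyword.lower() not in text.lower():
            continue
        matches.append((timestamp, text))
    return matches


transcript = [
    ("Priya", "00:00:04", "Okay, let's start with the launch date. Where are we?"),
    ("Tom", "00:00:09", "Design is done. We handed the final screens over on Monday."),
    ("Dana", "00:00:14", "Engineering needs about two more weeks for the payment flow."),
    ("Tom", "00:00:27", "Can we ship without the new payment flow and follow up?"),
    ("Dana", "00:00:33", "Not cleanly. The old flow breaks on the new pricing."),
]

# Everything Dana said about the payment flow, ready to quote with a timestamp.
for timestamp, text in lines_by_speaker(transcript, "Dana", keyword="payment"):
    print(f"Dana [{timestamp}]: {text}")
```

The same search on an unlabeled block of text would also return Tom's line about the payment flow, with nothing to tell the two apart.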
For a full walkthrough in context, see how to transcribe an interview, which covers recording, transcription, and relabeling for journalism and research, and how to transcribe a podcast for labeling hosts and guests across an episode.
Speaker labels are free on every Hushscript transcript
Speaker diarization is included on every Hushscript transcript at no extra cost. There is no diarization add-on, no separate plan tier for it, and no cap on how many speakers a recording can have. Hushscript itself is pay-as-you-go with no subscription — there are 30 free minutes to try, and you pay only for the minutes you use after that; see the pricing page for the details.
Speaker separation comes free because a transcript of a conversation without it is only half a transcript — readable text with no idea of who said it. Drop in a recording, get the labels automatically, rename them in a click, and export the result. More on how it all fits together is on the audio to text page.