Speaker diarization, explained: how speaker labels work

May 12, 2026

Speaker diarization is the automatic process of separating a recording into segments by speaker — answering “who spoke when?” before any human reads the transcript. It is the difference between a wall of words and a structured conversation, and it is the reason a transcript of a two-person interview comes back as a readable dialogue instead of one long paragraph.

In plain terms: you upload a recording, the diarization engine finds the points where the voice changes, groups the segments that sound like the same person, and hands back a transcript that reads like a script — one line per turn, labeled by voice. This post explains how that works, why diarization is not the same as recognizing a voice, and how to turn the generic labels into real names.

Speaker diarization, in plain English

Picture transcribing a 90-minute interview by hand. After turning the speech into text, the next thing you’d do is add the attributions: “Interviewer:”, “Subject:”, back and forth, every time the voice changes. That attribution work — deciding who is speaking at any given second — is exactly what diarization does for you.

The output is a dialogue-format transcript:

Speaker A [00:00:12]: The first thing to understand is that the data only exists for a moment.
Speaker B [00:00:27]: Right, and that's the part most people don't realize until it's too late.
Speaker A [00:00:34]: Exactly. So the real question becomes what you do in that window.

Each utterance is tied to a speaker and a timestamp. You can then rename Speaker A and Speaker B to real names; one edit per speaker, and the label updates everywhere it appears.
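To make the shape of that output concrete, here is a minimal sketch of how a diarized turn could be modeled and rendered in the layout above. The `Turn` class and both helper functions are hypothetical names chosen for illustration; they are not Hushscript's export schema.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One diarized utterance: who spoke, when it started, and what was said."""
    speaker: str        # generic label such as "Speaker A"
    start_seconds: int  # offset from the start of the recording
    text: str


def format_timestamp(seconds: int) -> str:
    """Render a second offset as HH:MM:SS."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_turn(turn: Turn) -> str:
    """Render one turn as 'Speaker [HH:MM:SS]: text'."""
    return f"{turn.speaker} [{format_timestamp(turn.start_seconds)}]: {turn.text}"


turns = [
    Turn("Speaker A", 12, "The first thing to understand is that the data only exists for a moment."),
    Turn("Speaker B", 27, "Right, and that's the part most people don't realize until it's too late."),
]
for turn in turns:
    print(format_turn(turn))
```

Keeping the speaker as its own field, rather than baking it into the text, is what makes a later rename a one-step operation.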

The alternative — a plain transcript with no diarization — is a continuous block of text where you have to replay the audio to work out who said what. For anything past a couple of minutes with two or more voices, that is real manual work, and it is the work the diarization pass removes.

What you need before you start

There are no settings to configure and nothing to install. Diarization runs on the recording itself, so the only real prerequisite is audio where the speakers are reasonably distinct.

That’s it. No microphone enrollment, no voice samples, no per-speaker setup.

How automatic speaker diarization works

Diarization breaks into a few distinct stages. Knowing them makes the output easier to read — and easier to fix when it gets something wrong.

Feature extraction. The audio is split into short frames, typically 10 to 40 milliseconds each, and the engine measures acoustic features in every frame. The features that matter for telling speakers apart live in the frequency domain — pitch, formants, and spectral patterns that tend to stay consistent within one person’s voice and differ between people.
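As a rough illustration of the framing step, the sketch below cuts a recording into overlapping frames and computes one toy measurement per frame. The 25 ms window and 10 ms hop are common speech-processing defaults that fall inside the range mentioned above; they are assumptions, not the engine's actual settings. Frame energy is only a stand-in here: it does not tell speakers apart, and real systems measure spectral features instead.

```python
def frame_bounds(num_samples: int, sample_rate: int,
                 frame_ms: float = 25.0, hop_ms: float = 10.0) -> list[tuple[int, int]]:
    """Return (start, end) sample indices for every full analysis frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    bounds = []
    start = 0
    while start + frame_len <= num_samples:
        bounds.append((start, start + frame_len))
        start += hop_len
    return bounds


def frame_energy(samples: list[float], start: int, end: int) -> float:
    """Mean squared amplitude of one frame (a toy feature, not speaker-specific)."""
    frame = samples[start:end]
    return sum(x * x for x in frame) / len(frame)


# One second of audio at 16 kHz yields 98 overlapping 25 ms frames.
print(len(frame_bounds(num_samples=16000, sample_rate=16000)))
```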

Segmentation. The engine looks for speaker change points: moments where those voice characteristics shift. This is the hard part. A change can land mid-sentence when someone interrupts, and noise or overlapping speech can hide a real boundary or invent a false one.
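One simple way to picture change-point detection: compare each short window of audio with the one before it and flag the places where the voice characteristics jump. The sketch below does that with cosine distance over made-up two-dimensional voice vectors. The vectors, the 0.5 threshold, and the function names are all illustrative assumptions; a production engine works with much richer features and is not necessarily built this way.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """0.0 for vectors pointing the same way, larger as they diverge."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def find_change_points(vectors: list[list[float]], threshold: float = 0.5) -> list[int]:
    """Return each index i where window i differs sharply from window i - 1."""
    return [
        i for i in range(1, len(vectors))
        if cosine_distance(vectors[i - 1], vectors[i]) > threshold
    ]


# Hypothetical per-window voice vectors: one voice, a second voice, then the first again.
windows = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [1.0, 0.0]]
print(find_change_points(windows))  # [2, 4]
```

Noise and overlapping speech distort the vectors on either side of a boundary, which is how a real change can slip under the threshold or a false one can rise above it.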

Clustering. Segments that sound like the same voice are grouped together. By default the engine isn’t told how many people are in the room, so it estimates the speaker count from the audio, and that estimate is where most “split one person into two” and “merged two people into one” errors come from. Some setups let you supply the count when you know it ahead of time.
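The link between the speaker-count estimate and those two error types is easier to see in a toy example. The sketch below uses a greedy threshold clustering, which is a deliberate simplification chosen for brevity and not a description of what Hushscript runs. Each segment joins the nearest existing cluster if it is close enough; otherwise it starts a new one. The vectors and thresholds are invented for illustration.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cluster_segments(embeddings: list[list[float]], threshold: float = 0.3) -> list[int]:
    """Assign each segment a cluster id; the number of clusters is not given up front."""
    centroids: list[list[float]] = []  # running mean vector of each cluster
    counts: list[int] = []             # number of segments in each cluster
    assignments: list[int] = []
    for vec in embeddings:
        distances = [cosine_distance(vec, c) for c in centroids]
        best = min(range(len(centroids)), key=distances.__getitem__, default=None)
        if best is None or distances[best] > threshold:
            # Nothing close enough: treat this segment as a new voice.
            centroids.append(list(vec))
            counts.append(1)
            assignments.append(len(centroids) - 1)
        else:
            # Close enough: join the cluster and update its running mean.
            n = counts[best]
            centroids[best] = [(c * n + x) / (n + 1) for c, x in zip(centroids[best], vec)]
            counts[best] = n + 1
            assignments.append(best)
    return assignments


# Hypothetical segment vectors for two voices taking alternate turns.
segments = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]
print(cluster_segments(segments))                   # [0, 1, 0, 1]
print(cluster_segments(segments, threshold=0.001))  # [0, 1, 2, 3]
print(cluster_segments(segments, threshold=0.9))    # [0, 0, 0, 0]
```

With a sensible threshold the two voices separate cleanly. Make the threshold too strict and each person is split into extra speakers; make it too loose and both people collapse into one.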

Labeling. Each cluster gets a label — Speaker A, Speaker B, Speaker C, or Speaker 0, Speaker 1. The labels are arbitrary placeholders. There is no identity behind them, just a stable handle for one voice across the whole recording.
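A short sketch shows how arbitrary the labels are. It assumes cluster ids arrive as plain integers and hands out letters in order of first appearance; the function name and the lettering rule are illustrative choices, not a documented behavior of any particular engine.

```python
import string


def label_clusters(assignments: list[int]) -> list[str]:
    """Map cluster ids to 'Speaker A', 'Speaker B', ... in order of first appearance.

    This toy version supports at most 26 distinct clusters.
    """
    labels: dict[int, str] = {}
    for cluster_id in assignments:
        if cluster_id not in labels:
            labels[cluster_id] = f"Speaker {string.ascii_uppercase[len(labels)]}"
    return [labels[cluster_id] for cluster_id in assignments]


print(label_clusters([2, 0, 2, 1]))
# ['Speaker A', 'Speaker B', 'Speaker A', 'Speaker C']
```

Whoever speaks first becomes Speaker A under this rule, whatever their role in the conversation.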

The whole pipeline runs as a single automated pass alongside the transcription. For a 60-minute recording it adds seconds, not minutes, to the total time.

Diarization vs voice recognition

These two get conflated constantly, but they answer different questions.

Diarization asks: who is speaking at this moment, relative to the other voices in this file? It separates voices but has no idea who they belong to. It needs no prior knowledge and keeps no record of any voice after the job is done.

Voice recognition, also called speaker identification, asks: is this the same person as in a known reference recording? It needs an enrolled voice sample to compare against, and it is the technology behind “verify your identity by voice” systems.

Hushscript uses diarization, not voice recognition. There is no database of enrolled voices, and your recording is never compared against anyone else’s. The speaker labels are derived purely from the voice characteristics inside that one file.

This distinction is the reason diarization can’t name people for you. It can confidently say “the same person spoke these 200 turns across the hour,” but it cannot say who that person is — that step is yours, and it takes one click per speaker. If you want a deeper look at how Hushscript handles multi-speaker recordings end to end, the speaker identification page covers it.

A worked example: a three-person meeting

Say you record a 40-minute project meeting with three people in one room, sharing a single laptop mic, saved as a single .m4a file. You drop it in, and after transcription the output looks like this:

Speaker A [00:00:04]: Okay, let's start with the launch date. Where are we?
Speaker B [00:00:09]: Design is done. We handed the final screens over on Monday.
Speaker C [00:00:14]: Engineering needs about two more weeks for the payment flow.
Speaker A [00:00:21]: Two weeks puts us past the date we promised marketing.
Speaker B [00:00:27]: Can we ship without the new payment flow and follow up?
Speaker C [00:00:33]: Not cleanly. The old flow breaks on the new pricing.

Three voices, separated and labeled, with timestamps on every turn. Now you relabel: Speaker A becomes “Priya” (she’s running the meeting), Speaker B becomes “Tom”, Speaker C becomes “Dana”. After three clicks, every line carries a real name, and the transcript is ready to paste into the meeting notes with correct attribution throughout.

The same shape holds for a two-person interview, a four-person podcast, or a two-sided phone call — the engine separates what it hears, and you put the names on.

How to relabel speakers in Hushscript

The labels arrive generic on purpose; naming them is a quick editor step. To get there, drop in your file to see a 30-second speaker-labeled preview (no account needed), sign up to transcribe the rest, then open the finished transcript. From there:

  1. Click any speaker label in the transcript editor — for example, Speaker B.
  2. Type the real name, like “Tom”.
  3. Every instance of that speaker updates at once, from the first line to the last.

The rename is global, so you never edit the same label twice. For a 90-minute interview with two speakers, a full relabeling pass takes under a minute, and you end up with a quote-ready transcript that attributes every line correctly. You can then export it to TXT, SRT, DOCX, or JSON with the names baked in.
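If you are working with an exported TXT file rather than the editor, the same global rename can be approximated in a few lines. The sketch below assumes each line follows the `Label [HH:MM:SS]: text` layout shown in this post; the regular expression and the function name are illustrative and not part of any Hushscript tooling.

```python
import re

# Matches the label at the start of a line, e.g. "Speaker B [00:00:09]: ".
LABEL_PATTERN = re.compile(
    r"^(?P<label>[^\[\n]+?) \[(?P<time>\d{2}:\d{2}:\d{2})\]: ",
    re.MULTILINE,
)


def rename_speakers(transcript: str, names: dict[str, str]) -> str:
    """Replace generic labels at the start of each line; leave the spoken text alone."""
    def swap(match: re.Match) -> str:
        label = match.group("label")
        return f"{names.get(label, label)} [{match.group('time')}]: "
    return LABEL_PATTERN.sub(swap, transcript)


transcript = (
    "Speaker A [00:00:04]: Okay, let's start with the launch date. Where are we?\n"
    "Speaker B [00:00:09]: Design is done. We handed the final screens over on Monday.\n"
    "Speaker A [00:00:21]: Two weeks puts us past the date we promised marketing."
)
print(rename_speakers(transcript, {"Speaker A": "Priya", "Speaker B": "Tom"}))
```

Because the pattern is anchored to the start of each line, a label mentioned inside someone's sentence is left untouched. Mapping two generic labels to the same name merges them, which is also how a speaker who was split in two gets stitched back together.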

Troubleshooting common diarization issues

Diarization is reliable on clean recordings and gets harder as the audio gets messier. The common problems, and what to do about them:

Overlapping speech. When two people talk at once, the engine has to assign overlapping audio to a single speaker, and it can mislabel the crossover or drop a few words. There’s no fix in software — the audio genuinely contains two voices in the same moment. Prevention is the lever: a recording where people take turns diarizes far more cleanly than one full of interruptions. Where overlaps do happen, a quick read against the audio catches the handful of affected lines.

Similar-sounding voices. Two speakers with close pitch and accent — say, two men of similar age, or siblings — are harder to separate than a contrasting pair, because their voice characteristics genuinely overlap. The engine may occasionally swap a turn between them. Scan the transcript for lines that read oddly for the assigned speaker; those are the ones to reassign by hand.

One person split into two labels. If someone’s audio changes mid-recording — they move closer to the mic, switch from a headset to speakerphone, or the connection degrades — the engine can read the second half as a new voice and create an extra label. Merge the two by relabeling both to the same name; the duplicates collapse into one speaker.

Two people merged into one label. The reverse happens when two quiet or distant voices sound too alike for the engine to split. This is the hardest error to repair after the fact, because a global rename cannot split one label into two and each misattributed line has to be reassigned by hand, so it’s worth preventing: get the mic closer to the speakers, and avoid recording from across a large room.

Far-field or noisy audio. A laptop mic at the end of a conference table, a phone on the table during a group chat, or heavy room echo all blur the boundaries between voices. Closer mic placement helps every speaker, and reducing background noise — fans, traffic, music — sharpens the separation. Diarization quality tracks audio quality closely.

Phone calls recorded as two separate files. If your recorder saved each side of a call as its own mono file, diarization on either file alone sees only one voice. Combine the two into a single mixed recording first, so both voices are present in one file, then transcribe that.
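For the two-file case, here is a minimal sketch of the mixing step using only Python's standard library. It assumes both files are uncompressed 16-bit mono WAV at the same sample rate and that both recordings start at the same moment; other formats would need converting first. The function names are illustrative.

```python
import struct
import wave


def mix_pcm16_mono(left: bytes, right: bytes) -> bytes:
    """Mix two 16-bit little-endian mono PCM streams into one mono stream.

    The shorter stream is padded with silence so no audio is cut off.
    """
    n_left, n_right = len(left) // 2, len(right) // 2
    a = struct.unpack(f"<{n_left}h", left[: n_left * 2])
    b = struct.unpack(f"<{n_right}h", right[: n_right * 2])
    length = max(n_left, n_right)
    a += (0,) * (length - n_left)
    b += (0,) * (length - n_right)
    # Average each pair of samples so the result stays inside the 16-bit range.
    mixed = [(x + y) // 2 for x, y in zip(a, b)]
    return struct.pack(f"<{length}h", *mixed)


def mix_wav_files(path_a: str, path_b: str, out_path: str) -> None:
    """Read two mono 16-bit WAV files and write one mixed mono WAV file."""
    with wave.open(path_a, "rb") as wav_a, wave.open(path_b, "rb") as wav_b:
        for wav in (wav_a, wav_b):
            if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
                raise ValueError("expected mono 16-bit PCM input")
        if wav_a.getframerate() != wav_b.getframerate():
            raise ValueError("sample rates differ; resample one file first")
        rate = wav_a.getframerate()
        mixed = mix_pcm16_mono(
            wav_a.readframes(wav_a.getnframes()),
            wav_b.readframes(wav_b.getnframes()),
        )
    with wave.open(out_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(mixed)
```

Averaging halves the level of each side, which is a safe default against clipping. Calling `mix_wav_files("caller.wav", "callee.wav", "call_mixed.wav")` (hypothetical file names) would produce the single file to upload.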

Accuracy tips that actually move the needle

A few habits make diarization noticeably better, and most of them are about the recording, not the software:

None of this needs to be perfect. Clear turns and a reasonable mic position get most recordings to clean labels, and the editor handles the rest.

Why it matters for interviews and meetings

Without diarization, a transcript of a multi-person conversation is close to unusable for its most common jobs:

With diarization, each of these is straightforward, and the transcript becomes a document rather than raw text. For anything that will be published or cited — journalism, research, podcast show notes — accurate attribution is not optional, and getting it from the transcript instead of reconstructing it from the audio is where the time is saved. For internal work like meeting notes, interview research, or HR records, the labeled structure is just as valuable: scanning a dialogue is far faster than reading a wall of text.

For a full walkthrough in context, see how to transcribe an interview, which covers recording, transcription, and relabeling for journalism and research, and how to transcribe a podcast for labeling hosts and guests across an episode.

Speaker labels are free on every Hushscript transcript

Speaker diarization is included on every Hushscript transcript at no extra cost. There is no diarization add-on, no separate plan tier for it, and no cap on how many speakers a recording can have. Hushscript itself is pay-as-you-go with no subscription — there are 30 free minutes to try, and you pay only for the minutes you use after that; see the pricing page for the details.

Speaker separation comes free because a transcript of a conversation without it is only half a transcript — readable text with no idea of who said it. Drop in a recording, get the labels automatically, rename them in a click, and export the result. More on how it all fits together is on the audio to text page.

Frequently asked questions

What is speaker diarization?

Speaker diarization is the process of segmenting an audio recording into parts that correspond to distinct speakers. The result is a transcript where each line is attributed to the speaker who said it — 'Speaker A said this, Speaker B said that' — instead of one undifferentiated block of text. It answers the question 'who spoke when?' automatically, before anyone reads the transcript.

Is speaker diarization the same as voice recognition?

No. Diarization separates voices within one recording and can tell you that the same person spoke a given set of segments, but it doesn't know who that person is. Voice recognition (speaker identification) compares a voice against a known reference sample to put a name to it. Diarization needs no reference and builds no voice profile; that's why it labels speakers 'Speaker A' and 'Speaker B' rather than by name.

How many speakers can diarization handle?

Two to four speakers is the reliable sweet spot for most AI engines, and Hushscript places no cap on the speaker count. Accuracy degrades gradually as more voices are added, so large group recordings such as panels, town halls, and audience Q&A are harder than a one-on-one conversation. Clear, separated audio matters more than the raw number of people.

Why did the transcript split one person into two speakers?

Usually a big change in how that person sounds partway through: they moved closer to or further from the mic, switched from a headset to speakerphone, or the line quality changed. The engine clusters by voice characteristics, so a large shift in those characteristics can read as a new voice. Merging the two labels in the editor fixes it in one pass.

Can diarization tell me who a speaker is by name?

No. Diarization separates speakers but doesn't identify them by name — it labels them 'Speaker A', 'Speaker B', and so on. You assign real names yourself in the editor after the transcript is generated, and each rename applies to every line that speaker said.

Does speaker diarization work on phone call recordings?

Yes, when both sides of the call are captured. If the recording is a single mixed track with both voices, diarization separates them like any other two-speaker file. If each side was saved as its own separate mono file, combine them into one mix first so both voices are present in the same recording.

Is speaker diarization included with every Hushscript transcript?

Yes. Every transcript includes speaker labels at no extra cost, with no separate add-on and no higher tier required for them. You can rename the labels to real names in one click after transcription, and the new name updates throughout the whole transcript.