How do I label the host and each guest in the transcript?

Hushscript separates speakers automatically with labels like 'Speaker A' and 'Speaker B'. After transcription, click any label in the editor, type the real name, and it updates everywhere that speaker appears. A two-person episode takes about a minute to relabel; a panel with four voices takes a few minutes.

Are speaker labels an extra cost?

No. Speaker labels are included free on every transcript Hushscript produces, with no per-speaker fee and no cap on the number of voices. Many tools gate this behind a higher tier; here it is part of the standard output.

Can I transcribe a video recording of the podcast?

Yes. If you have a video file such as MP4, MOV, or MKV, the audio is extracted in your browser before anything uploads, so the video stays on your device. Only the extracted audio reaches the server, which also keeps a large video file from clogging your connection.

How long an episode can I transcribe in one go?

Up to 10 hours or 2 GB per file, with no daily caps. A 60-minute episode, a two-hour deep dive, or a four-hour live recording all go through as a single file, so you relabel the speakers once rather than stitching parts together.

What export formats does Hushscript produce?

TXT, DOCX, SRT, and JSON, with no watermark. For show notes, DOCX gives you a formatted document to edit directly. SRT gives you timed captions if you publish a video version. JSON gives you utterance-level data for a podcast host that accepts transcript uploads.

Do I need to transcribe the whole episode to write show notes?

For accurate timestamps and verbatim quotes, yes. You can sketch rough notes from memory, but chapter markers, exact quotes, and a searchable full transcript all come from the complete text. The 30-second preview shows you the quality before you commit to the full episode.

Will background music or an intro jingle confuse the transcription?

Music between segments is usually skipped or left untranscribed, and a short jingle under a voice rarely causes problems. If a music bed runs under continuous speech, accuracy on that stretch can dip. Recording voices clean and adding music in your editor afterward gives the best transcript.

How accurate is the transcript for a clearly recorded podcast?

On a clean two-mic recording with little crosstalk, the text is accurate enough to publish after a quick proofread. Accuracy depends mostly on recording quality, accents, and how often guests talk over each other, not on the file format you upload.

How to Transcribe a Podcast and Make Show Notes

June 6, 2026

A podcast transcript earns its keep twice. It makes the episode searchable and indexable, which helps people find the show and makes it accessible to listeners who read rather than listen. It also gives you the raw material for show notes, pull quotes, and captions without re-listening to the recording. This guide walks through transcribing an episode, labeling every host and guest, and turning the output into a usable show-notes document.

The part that saves the most time is speaker separation. A transcript that reads as one undivided block of text is nearly as much work to use as the audio itself. Hushscript labels each voice automatically and includes that for free on every transcript, so the conversation comes back as a conversation.

What you need before you start

A recording of the episode, in any common audio or video format. If you exported a plain audio file from your editor (MP3, WAV, M4A), that works directly. If your only copy is the video recording from Zoom, Riverside, or Ecamm, that works too, and the audio is pulled from it in your browser so the video never uploads.

Clean source audio helps more than anything else. Speaker labels are most accurate when each voice is distinct and people are not constantly talking over each other. If you recorded each participant on a separate track, you already have the cleanest possible input. A single mixed track still works; it just leans harder on the diarization to tell voices apart. If you can, record voices dry and add the music bed afterward in your editor, since music running under continuous speech reliably nudges accuracy down.

You will need an account to transcribe a full episode. The 30-second preview is the no-account step, so you can check the speaker labels and the text quality on your actual file before signing up for anything.

Transcribe the episode

The flow runs in three steps: preview first, then sign up, then transcribe the rest.

Drop the file at /podcast-transcription. Hushscript clips the first 30 seconds in your browser and returns a speaker-labeled preview. No account is needed for this step. You see exactly how the host and guest get split and how clean the text reads on your own audio.
Sign up to transcribe the rest. Once you enter your email and sign in, you upload the full episode. New accounts get 30 free minutes to try, granted once. The quickest way to claim them is a $1 card check that is authorized and then released right away, never charged. If you prefer an alternative payment method available in your country, the 30 minutes arrive with your first purchase instead; a card is not required.
Let the transcription run, then export. The transcript comes back with a label and a timestamp on every utterance. Export DOCX for a document you will edit into show notes, TXT for a clean block to paste into your CMS, SRT for captions on a video version, or JSON if your podcast host accepts utterance-level transcript uploads.

An episode of up to 30 minutes fits inside the free minutes, which is enough to take one short episode from recording to published show notes before you spend anything.
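If you take the JSON export, a few lines of Python can turn it back into readable, labeled text. This is a minimal sketch under a stated assumption: the field names `speaker`, `start` (in seconds), and `text` are hypothetical and used only for illustration, so check a real export before relying on them.

```python
import json

# Hypothetical sample of a JSON export. The field names ("speaker",
# "start", "text") are assumptions for illustration, not a documented schema.
SAMPLE_EXPORT = """
[
  {"speaker": "Speaker A", "start": 192.0, "text": "So you were running the company at that point."},
  {"speaker": "Speaker B", "start": 199.4, "text": "Honestly, it was the numbers."}
]
"""


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as [HH:MM:SS], dropping fractions."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"[{hours:02d}:{minutes:02d}:{secs:02d}]"


def render_transcript(raw_json: str) -> list[str]:
    """Turn the assumed utterance list into 'Speaker [HH:MM:SS]: text' lines."""
    utterances = json.loads(raw_json)
    return [
        f"{u['speaker']} {format_timestamp(u['start'])}: {u['text']}"
        for u in utterances
    ]


if __name__ == "__main__":
    for line in render_transcript(SAMPLE_EXPORT):
        print(line)
```

If the real export nests utterances under a key or stores times in milliseconds, only `render_transcript` needs adjusting; the timestamp helper stays the same.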

Label every host and guest

Speaker diarization runs automatically, so you do not turn anything on. For a two-person episode (one host, one guest), the transcript comes back cleanly split, with every line attributed to one of two labels. For a panel, Hushscript separates as many distinct voices as it hears, with no speaker cap.

To replace the generic labels with real names:

Click any instance of Speaker A in the editor.
Type the name, such as “Alex” or “Host”.
Every instance of that label updates throughout the transcript at once.

You relabel each voice once, not line by line. A two-person episode takes about a minute; a four-person panel takes a few. After relabeling, a stretch of the transcript reads like this:

Alex [00:03:12]: So you were running the company at that point. What were the first signs?
Jamie [00:03:19]: Honestly, it was the numbers. We had a 40% drop in one quarter.
Alex [00:03:27]: And you couldn't explain it at the time?
Jamie [00:03:35]: Not for weeks. That was the part that kept me up.

That is the structure you need for pulling quotes and writing timestamped chapters. Because the labels carry through the whole file, a long episode is labeled exactly once. For the mechanics behind voice separation, see how speaker diarization works.
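The editor handles renaming for you, but if you ever need to apply the same rename to a text file you have already exported, the idea is easy to sketch. The example below assumes each line follows the shape of the excerpt above, `Name [HH:MM:SS]: text`; the actual TXT layout may differ, and the function names here are made up for illustration.

```python
import re

# Matches lines shaped like the excerpt: "Name [HH:MM:SS]: text".
# This line shape is an assumption about the exported text, not a spec.
LINE_PATTERN = re.compile(
    r"^(?P<speaker>.+?) \[(?P<ts>\d{2}:\d{2}:\d{2})\]: (?P<text>.*)$"
)


def parse_line(line: str) -> dict:
    """Split one transcript line into speaker, timestamp, and text."""
    match = LINE_PATTERN.match(line)
    if match is None:
        raise ValueError(f"Unrecognized transcript line: {line!r}")
    return match.groupdict()


def relabel(lines: list[str], names: dict[str, str]) -> list[str]:
    """Swap generic labels for real names on every line.

    Labels missing from `names` are left unchanged.
    """
    result = []
    for line in lines:
        parts = parse_line(line)
        speaker = names.get(parts["speaker"], parts["speaker"])
        result.append(f"{speaker} [{parts['ts']}]: {parts['text']}")
    return result


if __name__ == "__main__":
    raw = [
        "Speaker A [00:03:12]: So you were running the company at that point.",
        "Speaker B [00:03:19]: Honestly, it was the numbers.",
    ]
    for line in relabel(raw, {"Speaker A": "Alex", "Speaker B": "Jamie"}):
        print(line)
```

Because the mapping is applied per label rather than per line, the work stays the same whether the episode has forty utterances or four thousand.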

When the labels need a fix

Diarization is accurate on clean recordings but not infallible, and a couple of patterns are worth knowing so you can correct them quickly.

If two guests have very similar voices, an occasional line can land on the wrong speaker. Scan the transcript with the audio open and reassign the few misattributed lines by hand. This is far faster than typing speaker names from scratch, because you are only fixing exceptions.

If a quiet co-host who barely speaks gets merged into another label, give the main speakers their real names first, then sweep for the handful of lines that should belong to the quiet voice. On a well-separated recording you rarely hit either case, which is why a separate track per person is worth the small extra setup at record time.

Common issues, and what actually causes them

Most transcription problems trace back to the recording, not the tool. A few are worth planning around before you publish a show.

Crosstalk and people talking over each other. This is the hardest thing for any speaker-separation system, because two voices in the same instant cannot be cleanly split. The text usually stays readable, but a moment of heavy overlap may land on one speaker or blur at the seam. Good podcast hygiene already helps here: a host who lets guests finish produces a cleaner transcript as a side effect. Where it does happen, fix those few lines by hand against the audio.

Remote guests recorded on one track. A call recorded as a single mixed file is harder to separate than the same call captured as a track per participant. If your tool supports local per-participant recording, use it; if you only have the mixed file, the transcript still comes through, it just relies more on the diarization. Either way the audio is the same upload, so there is nothing different to do at transcription time.

Strong accents and code-switching. Hushscript detects the language automatically and transcribes around 99 languages, with top-tier accuracy in eighteen of them, so a guest with a pronounced accent in a supported language is handled well. A guest who switches between two languages mid-sentence is the genuinely hard case; expect to clean up the switched phrases by hand. For the full list of supported languages, see the languages page.

Intro music and stingers. A music bed between segments is generally skipped rather than transcribed as nonsense. Trouble starts only when music runs under continuous speech for a long stretch. The clean fix is editorial: record voices dry and lay the music in afterward, so the transcription only ever sees speech.

Names, jargon, and brand spellings. A specialist guest will use terms and product names the model may render phonetically. These are quick to fix because they repeat, and a search-and-replace on the exported document handles a recurring misspelling in one pass.
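For recurring misspellings, a small corrections table does the search-and-replace in one pass over an exported text file. This is only a sketch: the misheard terms in the example are invented, and you would fill the table from whatever your own transcript actually gets wrong.

```python
import re


def apply_corrections(text: str, corrections: dict[str, str]) -> str:
    """Replace each misheard term with its correct spelling.

    Matching is case-insensitive and limited to whole words, so a
    correction for "hush script" does not touch unrelated longer words.
    """
    for wrong, right in corrections.items():
        pattern = re.compile(rf"\b{re.escape(wrong)}\b", flags=re.IGNORECASE)
        # A function replacement keeps any backslashes in `right` literal.
        text = pattern.sub(lambda _match, r=right: r, text)
    return text


if __name__ == "__main__":
    # Hypothetical phonetic renderings of product and guest names.
    fixes = {"hush script": "Hushscript", "jamey": "Jamie"}
    print(apply_corrections("Jamey tried hush script last week.", fixes))
```

Keeping the table in a file next to your show notes means the same fixes carry over to the next episode with the same guest or sponsor.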

A few accuracy tips

The single biggest lever is recording quality, so the effort spent on microphones and a quiet room pays back at transcription time. Beyond that, a track per speaker gives both the cleanest text and the cleanest speaker labels, since the system never has to untangle two voices from one waveform. Keep a guest’s mic close and consistent rather than drifting, and you will spend your editing time on substance instead of repairs.

When you do proofread, read with the audio open and fix in passes: speaker labels first, then misheard names and jargon, then the occasional overlap. Working by category is faster than reading top to bottom, and it leaves you with a document clean enough to publish.

From transcript to show notes

Show notes are not a summary of everything said. They are a navigation aid for listeners and a search surface for people who have not pressed play yet. A structure that works for most interview and conversation shows:

Episode summary, two to four sentences. Pull the core topic from the transcript. What is the central argument, story, or takeaway? Write it for someone deciding whether to listen.

Timestamped chapters. Scan the transcript for topic shifts. Each time the conversation turns to a new subject, note the timestamp and write a one-line description. The timestamps come straight from the transcript, so you never re-listen to find them.

[00:02:15] Launching the company with no marketing budget
[00:14:30] The 40% revenue drop, and what caused it
[00:28:44] Rebuilding: what changed internally
[00:51:00] Advice for founders in the same position

Key quotes. Pull two or three lines that capture the episode’s main claims or its most quotable moments. Quote verbatim from the transcript and attribute each to the speaker by name. Because the text is searchable, finding the exact wording of a line you half-remember takes seconds.

Guest links and resources. Add anything the guest mentioned. To find spoken URLs fast, search the transcript for fragments like “dot com” or “slash”; people say web addresses out loud more often than you would expect, and a verbatim transcript captures them.

Full transcript, optional but valuable. Paste the full text below the notes or link to a separate transcript page. This is where the search value lives, because the spoken content becomes indexable. If your host platform accepts transcript uploads, the JSON export carries utterance-level data; otherwise the plain text is enough.
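Two of the steps above, formatting chapter lines and searching the text for quotes or spoken URLs, are mechanical enough to script once you have the transcript as plain lines. The sketch below is one way to do it; the function names are illustrative, and it assumes you note chapter start times in seconds yourself while scanning.

```python
def to_timestamp(seconds: int) -> str:
    """Render whole seconds as [HH:MM:SS]."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"[{hours:02d}:{minutes:02d}:{secs:02d}]"


def chapter_lines(chapters: list[tuple[int, str]]) -> list[str]:
    """Format (start_seconds, title) pairs as chapter lines in time order."""
    return [f"{to_timestamp(start)} {title}" for start, title in sorted(chapters)]


def find_lines(transcript: list[str], fragments: list[str]) -> list[str]:
    """Return transcript lines containing any fragment, ignoring case."""
    lowered = [fragment.lower() for fragment in fragments]
    return [
        line
        for line in transcript
        if any(fragment in line.lower() for fragment in lowered)
    ]


if __name__ == "__main__":
    chapters = [
        (870, "The 40% revenue drop, and what caused it"),
        (135, "Launching the company with no marketing budget"),
    ]
    print("\n".join(chapter_lines(chapters)))

    transcript = [
        "Jamie [00:52:10]: You can find it at example dot com slash founders.",
        "Alex [00:52:18]: Great, we will link that.",
    ]
    print(find_lines(transcript, ["dot com", "slash"]))
```

The same `find_lines` call works for half-remembered quotes: pass the two or three words you are sure of and read the matching lines with their speaker and timestamp.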

A worked example

Say you recorded a 58-minute founder interview as two tracks and mixed them down to one MP3. You drop the MP3 at /podcast-transcription, watch the 30-second preview split the host and guest correctly, sign up, and upload the full file. The transcript returns with Speaker A and Speaker B; you rename them to “Alex” and “Jamie” in two clicks. From there, the summary comes from the opening two minutes, the four chapters above come from scanning for topic shifts, and the two pull quotes come from searching the text for the moments you remember. Getting from upload to finished show notes takes roughly fifteen minutes, most of it your own editing rather than waiting or transcribing.

Add captions or subtitles

If you publish a video version of the episode to a video host, Spotify video, or LinkedIn, the SRT export is a timed subtitle file ready to upload. There is no extra step, because the transcript you already generated carries the timing data. For the broader workflow of getting captions onto a clip, including burned-in versus sidecar files, see how to add subtitles to a video.

If your starting point is a video recording rather than an audio export, the process is identical; the audio is extracted in your browser first. The video to text page covers that path in full.
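The SRT export is ready to use as-is, but it helps to know what the file looks like if you ever edit a caption by hand. An SRT file is a sequence of numbered cues, each with a `start --> end` time line in `HH:MM:SS,mmm` form, the caption text, and a blank line. The sketch below builds that shape from cues you supply yourself; the `(start, end, text)` tuple input is an assumption for illustration, not how Hushscript stores its data.

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(cues: list[tuple[float, float, str]]) -> str:
    """Build SRT text from (start_seconds, end_seconds, text) cues.

    Cues are numbered from 1 and separated by a blank line.
    """
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(f"{index}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([
        (192.5, 199.0, "So you were running the company at that point."),
        (199.4, 205.0, "Honestly, it was the numbers."),
    ]))
```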

Long and multi-part episodes

For a feature-length episode or a live recording, Hushscript takes up to 10 hours or 2 GB per file. A four-hour panel or a documentary-format episode goes through as one file rather than being split into chunks, which matters for two reasons. The speaker labels stay consistent across the whole file, so you relabel once. And the timestamps are continuous, so a chapter at the three-hour mark reads [03:12:40] rather than restarting from zero in part two.

If you record a multi-part series in one session, transcribe the whole session as a single file and split the show notes by part afterward. Cutting the audio first only multiplies the relabeling work.
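Splitting the show notes by part afterward comes down to sorting chapters into parts by their continuous timestamps. The sketch below is one possible approach, assuming you know the second at which each part begins; it also re-bases each chapter to the start of its part, which is only useful if you publish the parts as separate episodes.

```python
from bisect import bisect_right


def split_by_part(
    chapters: list[tuple[int, str]], part_starts: list[int]
) -> list[list[tuple[int, str]]]:
    """Group (start_seconds, title) chapters by the part they fall in.

    `part_starts` lists the second at which each part begins, in
    ascending order, with the first part starting at 0. Each returned
    chapter carries its offset from the start of its own part.
    """
    if not part_starts or part_starts[0] != 0:
        raise ValueError("part_starts must begin with 0")
    if part_starts != sorted(part_starts):
        raise ValueError("part_starts must be in ascending order")

    parts: list[list[tuple[int, str]]] = [[] for _ in part_starts]
    for start, title in sorted(chapters):
        # The last part boundary at or before this chapter's start.
        index = bisect_right(part_starts, start) - 1
        parts[index].append((start - part_starts[index], title))
    return parts


if __name__ == "__main__":
    chapters = [(135, "Intro"), (3700, "Part two opener"), (3600, "Recap")]
    for number, part in enumerate(split_by_part(chapters, [0, 3600]), start=1):
        print(f"Part {number}: {part}")
```

If you publish the session as one long episode instead, skip the re-basing and keep the continuous timestamps from the transcript.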

Why this gets fast

For a regular publishing cadence, weekly or twice weekly, the transcribe-to-show-notes routine compresses to minutes per episode once you have done it a few times: drop the file, relabel the speakers, scan for chapters, pull quotes, export. Having the speaker labels handle attribution at no extra cost is what removes the tedious part. You spend your time on the editorial judgment that show notes actually need, not on figuring out who said what.