How to Transcribe an Interview (Privately)
May 23, 2026
Transcribing an interview well means two things at once: accurate text and clear attribution. A wall of words where you cannot tell who said what is almost as much work to use as the raw recording. And when the conversation is sensitive (a legal matter, a confidential source, an HR situation), how the audio is handled matters as much as the transcript itself. This guide covers the whole arc: getting clean audio before you record, running the transcription, keeping interviewer and subject apart, quoting accurately with timestamps, and keeping the whole thing private.
What you need before you start
The list is short. You need a recording of the interview as an audio or video file. Hushscript accepts any common format, so a phone voice memo, a Zoom recording, a dedicated recorder’s WAV, or an MP4 screen capture all work without conversion. You need the consent of everyone recorded, which is covered below and is not optional. And you need a way to view and edit the transcript, which is the Hushscript app in your browser. No installs, no plugins.
You do not need to pick a language. Hushscript detects it automatically across around 99 languages, so a bilingual interview or a subject who switches mid-sentence is handled without you setting anything. If you want to see the full list, the languages page has it.
Before you record
Audio quality at recording time is the single biggest factor in transcript accuracy. No transcription engine, automatic or human, can recover a word that the microphone never captured clearly. A few specifics pay off more than anything you do later.
Microphone placement. For an in-person interview, a small recorder placed on the table equidistant from both speakers beats a phone left in your pocket. If you can give each person their own microphone, do it: two clean inputs are easier to separate than one mixed room. For remote interviews on a video call, record each side as its own track when the platform allows, because two tracks separate more cleanly than a single down-mixed stream.
A quiet environment. HVAC hum, traffic through a window, a café’s background chatter, and bare-room reverb all chip away at accuracy. A small carpeted room, or even a closet of hanging clothes, sounds better than an open office. In the field, a directional microphone pointed at your subject helps lift their voice above the surroundings.
Consent, before anything else. Recording a person without their knowledge is illegal in many places and unacceptable everywhere. Tell your subject you are recording before you begin. Some jurisdictions are one-party consent, but several, including a number of US states and many countries, require everyone present to agree. When in doubt, get an explicit spoken yes on the recording itself, right after you start, so the agreement is part of the file.
Sensible format settings. You do not need studio quality. WAV, or MP3 at 128 kbps or higher, is plenty, and a sample rate of 16 kHz or above is comfortable. The one thing to avoid is dropping to telephone quality, a sample rate of around 8 kHz, where consonants smear together and accuracy drops noticeably.
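If you want to confirm a recording clears the 16 kHz mark before uploading, a minimal sketch using Python's standard `wave` module might look like the following. It only reads uncompressed PCM WAV files, and the function name `check_wav` and the returned keys are illustrative choices, not part of Hushscript.

```python
import wave

MIN_SAMPLE_RATE_HZ = 16_000  # the "comfortable" threshold from the guidance above


def check_wav(path: str) -> dict:
    """Return basic facts about a PCM WAV file and whether it clears 16 kHz."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        channels = wav.getnchannels()
    return {
        "sample_rate_hz": rate,
        "channels": channels,
        "duration_s": frames / rate if rate else 0.0,
        "ok": rate >= MIN_SAMPLE_RATE_HZ,
    }
```

Calling `check_wav("interview.wav")` on a hypothetical file returns a small dictionary. If `ok` is false, the recorder was probably set to a telephone-grade sample rate, and it is worth changing that setting before the next session.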
Transcribe the interview, step by step
The flow is built so you can see the quality before you commit. Here is the real order.
- Drop your recording at /audio-to-text. If it is a video, the audio is extracted in your browser first, so the video file itself never leaves your device. Only the audio is involved from here.
- Read the 30-second preview. Hushscript transcribes the first half-minute and shows it back to you with speaker labels already applied, before you create an account. This is the no-account step: you get to judge the speaker separation and the wording on your actual file, not a demo.
- Sign up to transcribe the rest. Creating an account unlocks the full transcription. New accounts get 30 free minutes to try, enough for a short interview or a slice of a long one. Hushscript is pay-as-you-go after that with no subscription, so you buy minutes only when you need them.
- Relabel and export. The finished transcript comes back as one line per utterance, each tagged with a speaker label and a timestamp. Rename the labels to real names, then export.
A note on the free minutes, since people ask how the “free” part works. The 30 minutes are granted once. The fastest way to unlock them is a card check: a $1 hold is authorized to confirm the card and then released right away, never charged. If you would rather pay another way, any alternative method available in your country works too, and the 30 minutes arrive with your first purchase. Either way, a card is not required.
Keep the interviewer and subject separate
This is where an interview transcript becomes usable rather than a transcript you have to re-listen to anyway. The speaker identification feature assigns every distinct voice its own label, with no setup and no extra cost. For a two-person interview that means dialogue like this:
Speaker A [00:00:07]: Can you take me back to when you first realised the project was in trouble?
Speaker B [00:00:13]: It was early 2019. I was running the rollout, and the numbers just stopped adding up.
Speaker A [00:00:28]: And what did you do about it?
Speaker B [00:00:31]: Honestly, at first, nothing. That is the part I regret.
To put real names on it:
- Click any instance of Speaker A in the transcript.
- Type the name you want, for example “Interviewer” or your subject’s name.
- Every line by that speaker updates at once.
You do this once per speaker. For a 90-minute, two-person interview the whole pass takes under a minute, and you end with a quote-ready document where every line is attributed to a named person. If you want the detail of how this separation works under the hood, the guide on how speaker diarization works walks through it in plain English.
A worked example: a one-hour research interview
Say you have recorded a 58-minute research interview on your phone as an .m4a file. The room was quiet, both of you were close to the phone, and there is the usual bit of overlap when your subject got animated.
You drop the file at /audio-to-text. The 30-second preview comes back almost immediately, showing your opening question as Speaker A and your subject’s first answer as Speaker B, which tells you the separation is clean. You sign up and let the full file transcribe. Your 30 free minutes cover just over half of the 58-minute recording and a small top-up covers the remaining 28, and the transcription finishes in a few minutes rather than the hours hand-transcription would take.
The transcript reads as a clean back-and-forth. You click the first Speaker A, type your own name, click the first Speaker B, type your subject’s name, and the whole document updates. Skimming through, you notice three short stretches where the engine split your subject into a second label because they leaned away from the phone, so you merge those back. You export to DOCX for your notes and to JSON so your analysis script can pull every utterance with its start time, end time, and speaker. Total hands-on time after the upload: a few minutes.
Quote accurately with timestamps
Every utterance in the transcript carries a timestamp, which is what makes the output safe to quote from. When you pull a quote for an article, a report, or a paper, note the timestamp beside it. That gives you a precise point to listen back to during fact-checking, gives an editor a way to verify the quote, and gives you something to point to if a source later disputes what they said.
The DOCX export keeps the timestamps inline. If you need the raw structure, the JSON export hands you utterance-level data, each entry with a start time, end time, speaker, and text, which you can process programmatically. The TXT export is the clean reading copy, and SRT gives you timed caption lines if the interview is going to sit under a video.
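As an illustration of processing the JSON export, here is a minimal Python sketch that pulls one speaker's lines with their timestamps. The schema is an assumption: it treats the export as a list of objects with `start` and `end` in seconds plus `speaker` and `text` keys. Check a real export and adjust the key names if they differ.

```python
import json


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS, the style used in the transcript lines."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def quotes_by_speaker(utterances: list[dict], speaker: str) -> list[str]:
    """Return 'Speaker [HH:MM:SS]: text' lines for a single speaker."""
    return [
        f"{u['speaker']} [{format_timestamp(u['start'])}]: {u['text']}"
        for u in utterances
        if u["speaker"] == speaker
    ]


# Hypothetical sample in the assumed schema; a real export may be shaped differently.
sample = json.loads("""
[
  {"start": 7.0, "end": 12.5, "speaker": "Speaker A",
   "text": "Can you take me back to when you first realised the project was in trouble?"},
  {"start": 13.0, "end": 27.0, "speaker": "Speaker B",
   "text": "It was early 2019. I was running the rollout, and the numbers just stopped adding up."}
]
""")

for line in quotes_by_speaker(sample, "Speaker B"):
    print(line)
```

For a real file, replace the inline sample with `json.load(open(path, encoding="utf-8"))`. Keeping the timestamp in each extracted line means every quote in your notes still points back to the audio.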
One limit worth knowing: timestamps are at the utterance level, not the individual word. For most interview work that is exactly what you want, since you are citing a sentence, not a syllable. If you are burning subtitles onto footage, the line-level timing in the SRT export is close enough for nearly every purpose.
Common problems and how to fix them
Most interview recordings transcribe cleanly. When something is off, it is almost always one of a handful of issues, and each has a straightforward fix.
One speaker split into two labels. The most common surprise. It usually means that person’s sound changed partway through: they moved nearer or further from the mic, switched from a headset to speakerphone, or the line quality shifted on a call. The engine clusters by how a voice sounds, so a big shift can read as a new voice. Merge the two labels in the editor and it is fixed in one pass, as in the worked example above.
Two people merged into one label. The reverse. This happens when two voices are genuinely similar, or when the recording is quiet and the differences are hard to hear. It is harder to fix cleanly after the fact, which is why recording two separate inputs, one per person, is the single best insurance against it.
Crosstalk and people talking over each other. When both speak at once, any transcriber, automatic or human, struggles, because the words are physically overlapping in the audio. Expect the busiest moments of a heated interview to need a manual listen-back. The timestamps make finding those moments quick.
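One way to shortlist crosstalk moments for a listen-back is to scan the utterance timings for places where one speaker starts before the previous speaker has finished. The sketch below assumes a list of utterances with `start` and `end` in seconds and a `speaker` key. Whether the exported times overlap during crosstalk is also an assumption, so treat an empty result as "nothing flagged", not "no crosstalk".

```python
def find_overlaps(utterances: list[dict]) -> list[tuple[float, float]]:
    """Return (start, end) spans where a new speaker begins before the previous utterance ends.

    Only adjacent utterances (ordered by start time) are compared.
    """
    ordered = sorted(utterances, key=lambda u: u["start"])
    spans = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur["speaker"] != prev["speaker"] and cur["start"] < prev["end"]:
            spans.append((cur["start"], min(prev["end"], cur["end"])))
    return spans
```

Each returned pair is a stretch of audio worth replaying by ear before you quote from it.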
Strong accents or specialist terms. Accuracy holds up well across accents, but unfamiliar proper nouns, technical jargon, and names are where automatic transcription is most likely to slip. Do a search-and-replace pass for the handful of names and terms specific to your interview, and confirm any you intend to quote.
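If you prefer to run the search-and-replace pass on an exported TXT file instead of by hand, a small Python sketch could look like this. The function name and the example corrections are hypothetical, and matching is whole-word and case-insensitive so that a short term does not rewrite the inside of a longer word.

```python
import re


def apply_corrections(text: str, corrections: dict[str, str]) -> str:
    """Replace each misheard term with its correct form, matching whole words only."""
    for wrong, right in corrections.items():
        pattern = re.compile(rf"\b{re.escape(wrong)}\b", flags=re.IGNORECASE)
        # A function replacement inserts the correction literally,
        # so backslashes in names or terms are never treated as escapes.
        text = pattern.sub(lambda _match, r=right: r, text)
    return text


# Hypothetical mishearing and correction, for illustration only.
fixed = apply_corrections(
    "We moved to cooper netties in 2019.",
    {"cooper netties": "Kubernetes"},
)
print(fixed)
```

This only speeds up the mechanical part. Any corrected name or term you intend to quote still deserves a listen at its timestamp.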
A quiet or distant recording. If the subject was far from the mic or spoke softly, accuracy drops across the board, and there is no software fix after the fact. This is the case the “before you record” section exists to prevent. If you are stuck with a weak recording, transcribe it anyway as a draft, then correct against the audio.
A very long or large file. Files up to 10 hours and 2 GB go in as a single upload, so a long oral-history session does not need splitting. Only if a recording runs past those limits do you split it, at a natural pause, and transcribe each part.
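To estimate in advance whether a recording needs splitting, and into roughly how many parts, the two limits can be checked together. This is a rough sketch: it assumes “2 GB” means decimal gigabytes and that file size grows roughly in proportion to duration, and the function name is illustrative. Where to cut is still your call, at a natural pause.

```python
import math

MAX_HOURS = 10
MAX_BYTES = 2 * 1000**3  # assumption: "2 GB" read as decimal gigabytes


def parts_needed(duration_s: float, size_bytes: int) -> int:
    """Return how many roughly equal parts keep a recording within both limits."""
    by_duration = math.ceil(duration_s / (MAX_HOURS * 3600))
    by_size = math.ceil(size_bytes / MAX_BYTES)
    return max(1, by_duration, by_size)
```

A result of 1 means the file goes in as a single upload. Anything higher is the minimum number of pieces to aim for.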
Audio source or video source: which to upload
If you have both an audio file and a video of the same interview, upload either. The transcript is identical, because Hushscript works from the audio in both cases. The practical difference is privacy. When you give it a video, the audio is extracted in your browser and only that audio is sent, so the video, often the more identifying and more sensitive artifact, never leaves your machine. For a sensitive interview that is the better default. If you already have a clean audio-only file, uploading that is simplest.
Keeping sensitive interviews private
For interviews touching legal matters, confidential sources, HR cases, or personal disclosures, how the audio is handled matters beyond transcript quality. Hushscript is built for exactly this.
The audio is deleted from the server the moment the transcript is ready. There is no retention window and no backup copy of the recording sitting somewhere afterwards. If you uploaded a video, only the extracted audio was ever sent, and that is gone too. The speech engine does not train on your recordings, so nothing you submit feeds a model.
The transcripts themselves are encrypted at rest. In plain terms, if the storage were ever breached, what an attacker would find is unreadable ciphertext, not your subject’s words. This is leak protection, not a claim that we cannot read your transcripts: the key is held on our side so the product can show you your own work. It does mean that a data leak, the realistic threat for any cloud tool, does not hand over readable interview content.
For the strongest cases (source protection in journalism, privileged legal material), take the standard precaution: keep your own offline copy of the audio and the finished transcript on a device you control, and treat no cloud service, this one included, as the only copy. If you want the full picture of the data handling, the private transcription page lays it out.
If your recording is a multi-host podcast rather than a one-on-one interview, the workflow is the same, and the guide on how to transcribe a podcast covers the multi-voice specifics.