How to Transcribe Any Audio File to Text

May 8, 2026

An MP3 from a podcast editor, an M4A from your iPhone, a WAV from a field recorder, an obscure CAF from a voice-recorder app — to transcribe any of them, you do the same thing. Drop the file, get a transcript with the speakers already separated, export it in the format you need. The container the audio came in doesn’t change the steps.

What the format does affect is a little further down: how big the file is, how cleanly it transcribes, and what to do on the rare occasion a file won’t load. This guide covers which audio formats work, the one method that handles all of them, how to pick the best format when you have the choice, and how to fix the files that come out rough.

What you need

The 30-second preview needs no account: you can drop a file and see the speaker-labeled output before signing up, so you can check that it fits before you commit to anything. New accounts get 30 free minutes. Beyond those, transcription is pay-as-you-go: you buy minutes only when you need them, with no subscription. Costs are on the pricing page.

Which audio formats work

Hushscript takes all the standard audio formats. Here’s where each tends to come from, so you can recognise what you’ve got:

| Format | Common source |
|---|---|
| MP3 | Podcasts, phone recordings, exports from most recording apps |
| M4A / AAC | iPhone Voice Memos, WhatsApp voice notes, Apple exports |
| WAV | Digital recorders, DAW exports, Windows Voice Recorder |
| FLAC | Archival recordings, DAW exports, lossless rips |
| OGG | Android voice notes, open-source recording tools |
| AIFF / CAF | Mac and iOS audio, GarageBand |

Video files work the same way. Drop an MP4, MOV, WebM, or MKV and the audio track is extracted in your browser before anything uploads, so the video itself never leaves your device — only the audio reaches the server.

The list above is not a fence. The promise is simpler than a supported-formats table: just drop your file and it gets converted and transcribed. If you have something genuinely unusual — an AMR file off an old Nokia, a .3gp from a decade-old Android, a RealAudio clip — try uploading it anyway. The browser can read most standard containers, and if it can’t, you have two fallbacks: convert it to MP3 or WAV with the free in-browser tools at /tools and upload that, or email support@hushscript.com and we’ll add the format. You don’t have to figure out in advance whether your file qualifies.

Transcribe an audio file in three steps

The flow is identical no matter which format you start with. The first step happens before you have an account at all.

  1. Drop your file to preview it. Go to /audio-to-text and drag the file onto the drop zone, or click to browse. The first 30 seconds transcribe right there in the browser, speaker-labeled, with no account needed. This is your check that the output reads the way you want before you sign up for anything.
  2. Sign up to transcribe the rest. Enter your email to create an account. Your 30 free minutes unlock when you add and validate a card (a $1 hold that releases right away and is never charged) or with your first purchase if you pay another way. The full file uploads from here. The per-file ceiling is 10 hours and 2 GB, and there’s no daily cap on how many files you run.
  3. Get and export the transcript. The speech engine processes the file and returns a transcript with speakers already separated. Rename the speakers if you like, then download as TXT, SRT, DOCX, or JSON. Every export is included — no paywall, no watermark.

Processing runs in the background and scales with the audio’s length — roughly a few minutes per hour of recording — so you don’t have to keep the tab open while a long file finishes.
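
If you want to confirm a long recording fits under the per-file ceiling before you start the upload, a quick local check is enough. The sketch below is illustrative only: the function names are hypothetical, "2 GB" is read as decimal gigabytes (the stricter interpretation, which is an assumption), and the duration helper only handles WAV files via Python's standard `wave` module.

```python
import os
import wave

# Limits taken from the guide. Reading "2 GB" as 2 * 10**9 bytes is an
# assumption; it is the stricter of the two common interpretations.
MAX_BYTES = 2 * 10**9
MAX_SECONDS = 10 * 60 * 60  # 10 hours


def within_limits(size_bytes: int, duration_seconds: float) -> bool:
    """Return True if a file of this size and length fits the per-file ceiling."""
    return size_bytes <= MAX_BYTES and duration_seconds <= MAX_SECONDS


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file: frame count divided by frames per second."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def wav_within_limits(path: str) -> bool:
    """Check a WAV file on disk against both limits."""
    return within_limits(os.path.getsize(path), wav_duration_seconds(path))
```

For compressed formats such as MP3 or M4A you would need a different way to read the duration; the size check works for any file.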

A worked example: one recording, two formats

Say you recorded a 38-minute conversation and your recorder saved it twice — a meeting.wav at full quality and a meeting.m4a your phone made at the same time. Both go through the identical three steps, and the transcript comes back segmented by speaker turn, with a timestamp on each:

[00:00:06] Speaker A: Before we start, did everyone get the figures I sent over?
[00:00:10] Speaker B: Got them this morning. The Q3 number is the one I want to
           dig into.
[00:00:17] Speaker A: Right, so that's the line that moved. Let me pull it up.
[00:00:21] Speaker C: While you do — can we come back to hiring after?

The two files produce essentially the same transcript. The WAV is a much larger upload and the M4A a small one, but the words and the speaker labels land the same way, because both carry the same underlying speech. From here the editing is light: you click Speaker A once to rename it, and every line that speaker has updates; you skim for the handful of proper nouns the engine guessed at — a surname, a product name — and fix them with find-and-replace, which corrects every instance in one pass.
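
The editor handles renaming and find-and-replace for you, but the same two cleanups are easy to apply to an exported TXT file as well. This is a minimal sketch that assumes the text export looks like the sample above (`[HH:MM:SS] Speaker X: text`, with wrapped lines indented); the function names and the replacement names are hypothetical.

```python
import re

# Matches lines shaped like "[00:00:06] Speaker A: some text".
LINE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] (Speaker [A-Z]): (.*)$")


def rename_speakers(transcript: str, names: dict[str, str]) -> str:
    """Swap generic speaker labels for real names, one mapping per speaker."""
    out = []
    for line in transcript.splitlines():
        match = LINE.match(line)
        if match:
            stamp, speaker, text = match.groups()
            line = f"[{stamp}] {names.get(speaker, speaker)}: {text}"
        out.append(line)  # wrapped continuation lines pass through unchanged
    return "\n".join(out)


def fix_term(transcript: str, wrong: str, right: str) -> str:
    """Replace every whole-word occurrence of a misheard term."""
    pattern = rf"\b{re.escape(wrong)}\b"
    return re.sub(pattern, lambda _: right, transcript)
```

Matching on whole words keeps a fix for one name from altering longer words that happen to contain it.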

That is the practical upshot of one method for every format. You don’t keep a different tool or a different mental checklist for WAV versus M4A versus MP3. Whatever your recorder, your phone, or someone else’s export handed you, it goes through the same drop zone and comes out as the same kind of transcript.

Picking the best format for accuracy

Most of the time you don’t get to choose — you transcribe the file you were given, and that’s fine. But if you control how the audio is recorded or exported, a few choices set the ceiling on how clean the result can be.

Recording quality comes first, by a wide margin. A 128 kbps MP3 from a good close microphone will beat a lossless WAV recorded across the room on a laptop’s built-in mic every time. Before you think about format, think about getting the mic close to whoever is speaking. The container is a distant second.

Avoid very low-bitrate MP3. MP3 is lossy: encoding discards parts of the audio to keep the file small, and the lower the bitrate, the more it throws away. Speech recognition leans on the high-frequency detail in consonants — the gap between “fifteen” and “sixteen,” or “can” and “can’t,” lives there — and aggressive compression eats exactly that detail first. Below about 64 kbps, word errors climb. At 128 kbps and above you’re already past the point where the number matters; a higher bitrate won’t make the transcript better.

Lossless removes format as a variable. WAV, FLAC, and AIFF hand the engine uncompressed audio. For a clean recording the accuracy gain over a good MP3 is marginal, but it takes compression off the table entirely — useful when the recording is precious and you don’t want to wonder later whether the format cost you anything.

Mono is fine; you don’t need stereo. The engine mixes stereo down to mono internally for speech, so a mono file transcribes just as well and uploads faster on a slow connection. The one nuance: if your setup recorded each person on a separate stereo channel (a “double-ender”), mixing it down to a single mono track before uploading tends to help the speaker labels, because the engine hears one blended conversation rather than two isolated streams.
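
If you do want to mix a stereo recording down yourself, any audio editor will do it. As a rough illustration of what a mixdown is, the sketch below averages the left and right samples of a 16-bit PCM WAV using only Python's standard library. It is not how the engine does its internal mixdown, and the function names are hypothetical.

```python
import array
import sys
import wave


def mix_to_mono(interleaved):
    """Average interleaved left/right 16-bit samples into one mono channel."""
    return array.array(
        "h",
        (
            (interleaved[i] + interleaved[i + 1]) // 2
            for i in range(0, len(interleaved) - 1, 2)
        ),
    )


def stereo_wav_to_mono(src_path: str, dst_path: str) -> None:
    """Read a 16-bit stereo WAV and write a mono WAV at the same sample rate."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 2 or src.getsampwidth() != 2:
            raise ValueError("expected a 16-bit stereo PCM WAV")
        rate = src.getframerate()
        samples = array.array("h")
        samples.frombytes(src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    mono = mix_to_mono(samples)
    if sys.byteorder == "big":
        mono.byteswap()
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(mono.tobytes())
```

Averaging rather than summing keeps the result inside the 16-bit range, so the mixdown cannot clip.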

Sample rate above 16 kHz buys nothing for speech. Anything from 16 kHz to 48 kHz works, and a 48 kHz WAV from a video editor transcribes the same as a 16 kHz mono file from a phone app, given the same underlying speech. The model is trained on voice, not music, so there’s no accuracy gain from a higher rate.

Lossless vs lossy — and when converting helps

A common question: would converting your MP3 up to WAV first improve accuracy? It won’t. Once audio has been saved as a 128 kbps MP3, the discarded detail is gone for good — wrapping that same audio in a WAV just makes a bigger file with no extra information. Converting upward only ever helps as a workaround for a format that won’t load, never as an accuracy boost.

Converting down is the genuinely useful case. A multi-hour WAV can be a very large file; re-encoding it to a 128 kbps MP3 before uploading shrinks it dramatically with no meaningful accuracy cost, which speeds the upload on a slow link. The free in-browser converter does this without the audio ever leaving your device. The honest rule: if your file already loads and isn’t tiny-bitrate, transcribe it as-is; only convert to solve a real problem — a file that won’t open, or one too large to upload comfortably.
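
To put numbers on "shrinks it dramatically", here is the arithmetic as a small sketch. It assumes a CD-style WAV (44.1 kHz, 16-bit, stereo) and a constant-bitrate MP3, and it ignores file headers; the two-hour recording is just an example.

```python
def pcm_bytes(seconds: int, sample_rate: int = 44_100,
              bit_depth: int = 16, channels: int = 2) -> int:
    """Size of uncompressed PCM audio: samples per second x bytes x channels."""
    return seconds * sample_rate * (bit_depth // 8) * channels


def encoded_bytes(seconds: int, kbps: int = 128) -> int:
    """Size of constant-bitrate audio: bits per second over 8, times length."""
    return seconds * kbps * 1000 // 8


two_hours = 2 * 60 * 60
wav_mb = pcm_bytes(two_hours) / 1e6      # 1270.08
mp3_mb = encoded_bytes(two_hours) / 1e6  # 115.2
print(f"WAV about {wav_mb:.0f} MB, MP3 about {mp3_mb:.0f} MB")
```

Under those assumptions the MP3 is roughly a tenth of the upload, which is where the time saving on a slow link comes from.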

Speaker labels, included free

Every transcript comes with automatic speaker separation — diarization — at no extra cost. The engine picks out the distinct voices and labels them Speaker A, Speaker B, and so on, and you rename them to real names in the editor with one click per speaker. Most tools gate this behind a paid tier or cap the number of speakers; here it’s on by default, on every transcript, regardless of format.

How clean the labels come out depends on the recording, not the file type.

For most interviews, podcasts, and meetings the labels come out usable with at most a handful of reassignments. If you want the detail on how diarization works under the hood, see what speaker diarization is, or the speaker identification page for the feature overview.

Fixing files that come out rough

Most rough transcripts trace back to the recording, not the format or the tool. The usual culprits and what to do about them:

A file that won’t load. Rare, but it happens with an unusual or corrupted container. Try the free in-browser converter to re-wrap it as MP3 or WAV first, which fixes most stubborn files. If it still won’t go, email support@hushscript.com — adding format support is something we do.

Low volume. If the recording is very quiet, the engine has less signal to work with and accuracy slips. Normalize or amplify the audio in any editor before uploading — boosting a weak recording’s level is one of the highest-return fixes there is.
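
Most editors have a one-click normalize command, and that is the easiest route. For the curious, peak normalization boils down to one multiplication per sample, sketched below for 16-bit samples. The function name and the 0.9 target are illustrative choices, not settings taken from any particular tool.

```python
def peak_normalize(samples, target: float = 0.9, full_scale: int = 32767):
    """Scale 16-bit samples so the loudest one sits at `target` of full scale."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    gain = target * full_scale / peak
    return [
        max(-full_scale - 1, min(full_scale, round(s * gain)))
        for s in samples
    ]
```

Keep in mind that gain lifts everything equally, background noise included, so it helps quiet recordings far more than noisy ones.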

Background noise. Steady noise — an air conditioner, road hum — is handled reasonably well. Sudden noise that overlaps speech — a door, a cough, cutlery — is what drops words. A light noise-reduction pass before uploading can help, but don’t overdo it: heavy denoising adds artifacts that hurt accuracy more than the noise did.

Heavy accents or specialist vocabulary. Modern engines handle a wide range of accents well. The reliable misses are domain terms the engine has no reason to expect — drug names, legal citations, niche brand names. Plan on one skim after export and lean on find-and-replace; if a name lands wrong the same way every time, one correction sweeps the whole file.

Very large or very long files. The 10-hour, 2 GB ceiling covers nearly anything, so there’s no need to chop a long recording into pieces. If a lossless file is awkwardly large to upload, re-encode it to 128 kbps MP3 first (no real accuracy cost) to shrink it. For the longer-session specifics, the long-recording guide goes deeper.
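
If you would rather re-encode on your own machine than in the browser, one common option is the ffmpeg command-line tool. This is an assumption on our part: ffmpeg is not part of Hushscript and has to be installed separately, and the helper names below are hypothetical. The sketch builds the command as a list so you can inspect it before running it.

```python
import subprocess


def mp3_command(src: str, dst: str, kbps: int = 128, mono: bool = False):
    """Build an ffmpeg command that re-encodes `src` to an MP3 at `kbps`."""
    cmd = ["ffmpeg", "-i", src, "-vn"]  # -vn drops any video stream
    if mono:
        cmd += ["-ac", "1"]             # mix down to one audio channel
    cmd += ["-b:a", f"{kbps}k", dst]    # target audio bitrate
    return cmd


def reencode(src: str, dst: str, **options) -> None:
    """Run the command; raises CalledProcessError if ffmpeg reports failure."""
    subprocess.run(mp3_command(src, dst, **options), check=True)
```

For example, `reencode("meeting.wav", "meeting.mp3")` would produce a 128 kbps MP3 next to the original, leaving the WAV untouched.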

What happens to your file

The audio is deleted from the server the moment the transcript is ready. It is not kept in cloud storage, not used for model training, and not retained for any period afterwards. If you dropped a video file, only the extracted audio ever reached the server and the video stayed on your device. You can delete the transcript itself from your dashboard in one click whenever you want.

Exporting your transcript

After transcription, pick the format that fits what you’re doing with the text:

| Format | What’s included | Best for |
|---|---|---|
| TXT | Speaker labels, plain text | Notes, search, feeding into another tool |
| SRT | Per-utterance timestamps, speaker labels | Video captions, navigating a long recording by time |
| DOCX | Formatted for Word or Google Docs | Editing, sharing, annotation |
| JSON | { speaker, start, end, text } per item | Pipelines and custom tooling |

Every export carries the speaker labels, with no watermark on any of them, and all four are included with every transcript, including those made on the 30 free minutes. Timestamps in the SRT sit at the utterance level (one entry per speaker turn), which is the right granularity for captioning and quoting. The JSON is also per turn rather than per word: if you need structured data to process in code, it gives you the start and end time in milliseconds for each turn.
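
As an example of what the JSON export makes possible, the sketch below totals talk time per speaker and formats a turn's start time. It assumes the export is a top-level array of `{ speaker, start, end, text }` objects with times in milliseconds; the array wrapper, the function names, and the sample values are assumptions for illustration, so check them against a real export first.

```python
import json
from collections import defaultdict


def talk_time_seconds(items):
    """Sum the length of each speaker's turns, in seconds."""
    totals = defaultdict(float)
    for item in items:
        totals[item["speaker"]] += (item["end"] - item["start"]) / 1000
    return dict(totals)


def stamp(ms: int) -> str:
    """Format milliseconds as HH:MM:SS."""
    total = ms // 1000
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


# Made-up sample in the assumed shape, not real output.
sample = json.loads("""
[
  {"speaker": "Speaker A", "start": 6000,  "end": 9500,  "text": "Did everyone get the figures?"},
  {"speaker": "Speaker B", "start": 10000, "end": 16000, "text": "Got them this morning."},
  {"speaker": "Speaker A", "start": 17000, "end": 20500, "text": "Let me pull it up."}
]
""")

for turn in sample:
    print(f"[{stamp(turn['start'])}] {turn['speaker']}: {turn['text']}")
print(talk_time_seconds(sample))
```

The same loop is a reasonable starting point for feeding turns into a search index or a summarizer.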

Languages

Hushscript transcribes around 99 languages, detected automatically — you don’t set the language by hand. Accuracy is strongest in 18, including four English variants (Global, British, Australian, US), Spanish, French, German, Japanese, Mandarin, Hindi, and Arabic. The full list is on the languages page.


That’s the whole job — drop, transcribe, export — and it reads the same whatever format you started with. For format-specific notes, the sibling guides cover the most common cases: MP3, WAV, and M4A / Apple Voice Memo. The audio-to-text page has the full feature overview, and every format lands there the same way — just drop your file.

Frequently asked questions

Which audio formats does Hushscript accept?

MP3, M4A, WAV, FLAC, AAC, OGG, CAF, AIFF, and the other common audio formats. Video files (MP4, MOV, WebM, MKV) work too — the audio is extracted in the browser before upload, so the video stays on your device. If a file you have isn't supported, email support@hushscript.com and we'll add it.

Is there a format that gives the best accuracy?

Lossless formats (WAV, FLAC, AIFF) hand the engine the cleanest signal, but a 128 kbps or higher MP3 of the same recording transcribes about the same. The recording itself — mic distance, noise, how clearly people speak — matters far more than the container format.

What if I have a format that isn't in the list?

Try uploading it first. The browser can read most standard containers, even old or unusual ones. If it won't load, convert it to MP3 or WAV with the free in-browser converter at /tools, then upload that. As a fallback, email support@hushscript.com and we'll add the format.

Do I need a credit card to start?

No. A card is just the fastest way to the 30 free minutes — a $1 hold validates it, then releases right away and is never charged. If you'd rather use another payment method available in your country, the 30 minutes arrive with your first purchase instead.

Are speaker labels included for every format?

Yes. Speaker separation runs on every transcript, whatever the format or file size, at no extra cost. Labels come back as Speaker A, Speaker B, and so on, and you can rename them.

How are minutes counted across formats?

Against the audio duration, not the file size. A 45-minute FLAC and a 45-minute MP3 of the same recording cost the same number of minutes, even though the FLAC is a much larger file.

Can it transcribe a file that isn't in English?

Yes. The language is detected automatically across around 99 languages — you don't set it manually. A clean single-language recording transcribes best; heavy switching between languages mid-sentence is harder for any engine.

What happens to my file after it's transcribed?

It's deleted from the server the moment the transcript is ready — no storage, no model training. If your file is a video, only the extracted audio ever reaches the server, and you can delete the transcript itself from your dashboard in one click.