How to Convert MP3 to Text (Free, with Speakers)
April 20, 2026
Converting an MP3 to text is one of the most common transcription jobs there is, and the steps don’t change with the source: a podcast episode, a recorded interview, a lecture, or a voice memo off your phone all go through the same flow. Upload the file, wait for the transcript, export it in the format you need.
What does change — and what most guides skip — is how the MP3 itself affects the result. An MP3 is a lossy, compressed file, and the bitrate it was saved at sets a ceiling on how much detail the speech engine has to work with. This guide covers the three steps, then goes deeper: a worked example with real speaker-labeled output, what bitrate actually means for accuracy, and how to fix the recordings that come out rough.
What you need
- An MP3 file. Any bitrate will load. For accuracy, 64 kbps is the practical floor and 128 kbps is comfortably past the point where the number matters — more on that below.
- A Hushscript account. You get 30 free minutes to try. The quickest way to unlock them is to add a card: a $1 hold validates it, then releases right away and is never charged. If you’d rather not use a card, an alternative payment method works too, and the 30 minutes land with your first purchase. A card isn’t required either way.
There’s nothing to install — no desktop app, no browser extension. Everything runs in the browser and the account.
Beyond the free minutes, transcription is pay-as-you-go: you buy minutes only when you need them, with no subscription. Costs are on the pricing page.
Convert an MP3 to text in three steps
- Upload your file. Open /audio-to-text, then drag the MP3 onto the drop zone or click to browse. Files up to 2 GB and 10 hours are accepted. The first 30 seconds preview right in the browser before you sign in, so you can see the speaker-labeled style before committing.
- Let it transcribe. Once you’re signed in and the file uploads, the speech engine processes it and returns a transcript with speakers already separated. A 60-minute MP3 typically finishes in a few minutes; it runs in the background, so you don’t have to keep the tab open.
- Export. Download the result as TXT, SRT, DOCX, or JSON. Every format is included — there’s no paywall on exports and no watermark on any of them.
The finished transcript opens in your dashboard. Speakers appear as Speaker A, Speaker B, and so on, and you can click any label to rename it to a real name — the change applies to every line that speaker has.
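If you are queuing up several files, it can save a failed upload to check them against the stated limits first. The snippet below is a minimal sketch of that check, not part of the product. It assumes you already know each file's size and duration, and because the guide does not say whether "2 GB" is decimal or binary, it assumes the stricter decimal reading.

```python
# Limits quoted in the upload step. Whether "2 GB" means 2 * 10**9 or
# 2 * 1024**3 bytes is not stated, so the stricter decimal value is assumed.
MAX_BYTES = 2 * 10**9
MAX_SECONDS = 10 * 60 * 60


def upload_problems(size_bytes: int, duration_seconds: float) -> list[str]:
    """Return the reasons a file would exceed the stated limits (empty if none)."""
    problems = []
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes / 10**9:.2f} GB, over the 2 GB limit")
    if duration_seconds > MAX_SECONDS:
        problems.append(
            f"recording is {duration_seconds / 3600:.1f} h, over the 10 hour limit"
        )
    return problems


# A 42-minute, 128 kbps MP3 is roughly 40 MB, well inside both limits.
print(upload_problems(40_320_000, 42 * 60))
```

In practice `os.path.getsize(path)` gives the size, and the duration can come from whatever player or editor you already use.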
A worked example: a two-person interview
Say you have founder-interview.mp3 — a 42-minute recording, 128 kbps, one interviewer and one guest, recorded over a call. After you upload it, the transcript comes back segmented by speaker turn, with a timestamp on each:
[00:00:04] Speaker A: Thanks for making the time. Let's start at the beginning —
where did the idea actually come from?
[00:00:11] Speaker B: Honestly, it came out of a problem we had ourselves. We were
drowning in recorded calls and none of them were searchable.
[00:00:23] Speaker A: So you built the thing you needed.
[00:00:25] Speaker B: Pretty much. The first version was held together with tape.
From there the editing is light. You rename Speaker A to the interviewer and Speaker B to the guest once, and every line updates. You skim for the handful of proper nouns the engine guessed at — a product name, a person’s surname — and fix them with find-and-replace, which corrects every instance in one pass. For a clean 128 kbps interview like this, that skim is usually all the cleanup needed before the transcript is quotable.
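The rename and find-and-replace described here happen in the editor, but the same two passes are easy to picture in code. The sketch below assumes an exported TXT file uses the same `[HH:MM:SS] Speaker X: text` layout as the example output, which is an assumption about the export rather than a documented format. The function names are hypothetical.

```python
import re

# Matches the start of a speaker turn, e.g. "[00:00:04] Speaker A: Thanks..."
# The layout is assumed from the example output, not from a documented spec.
TURN_START = re.compile(r"^(\[\d{2}:\d{2}:\d{2}\]) (Speaker [A-Z]+): ", re.MULTILINE)


def rename_speakers(transcript: str, names: dict[str, str]) -> str:
    """Swap generic labels for real names, but only at the start of a turn."""
    def swap(match: re.Match) -> str:
        timestamp, label = match.group(1), match.group(2)
        return f"{timestamp} {names.get(label, label)}: "
    return TURN_START.sub(swap, transcript)


def fix_term(transcript: str, wrong: str, right: str) -> str:
    """Whole-word find-and-replace for a term that was misheard consistently."""
    return re.sub(rf"\b{re.escape(wrong)}\b", lambda _: right, transcript)


sample = (
    "[00:00:23] Speaker A: So you built the thing you needed.\n"
    "[00:00:25] Speaker B: Pretty much. The first version was held together with tape.\n"
)
renamed = rename_speakers(sample, {"Speaker A": "Interviewer", "Speaker B": "Guest"})
print(renamed)
```

Anchoring the rename to the start of a turn keeps it from touching a sentence where someone happens to say "Speaker A" out loud.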
This is where transcription that labels speakers for free earns its place. A wall of undivided text from an interview is nearly useless until you’ve gone back through and marked who said what; a transcript that arrives already split into turns is something you can quote from straight away.
What bitrate means for accuracy
MP3 is a lossy format: encoding throws away parts of the audio judged least audible to keep the file small. Bitrate is the number of kilobits per second the encoder is allowed to spend describing what remains, and the lower it is, the more the encoder has to discard.
Speech recognition leans heavily on the high-frequency detail in consonants — the difference between “fifteen” and “sixteen,” or “can” and “can’t,” lives there. Aggressive compression is exactly where that detail goes first. So bitrate doesn’t degrade a transcript evenly; it eats the hardest-to-hear words at the edges of speech.
In practice:
- 32 kbps and below: noticeably compressed even to the human ear, and the engine reacts the same way. Word errors climb. Avoid this for anything you care about transcribing.
- 64 kbps: the realistic floor for reliable speech-to-text. Fine for clear, close-mic voice.
- 128 kbps and above: comfortably enough for transcription. Going higher won’t improve the transcript — the engine has all the detail it can use well before then.
A couple of clarifications that save needless re-encoding. Variable bitrate (VBR) is fine; it has no accuracy penalty over constant bitrate (CBR) for speech, so don’t convert a VBR file just to “fix” it. And mono works as well as stereo for transcription, since voice doesn’t need two channels. If anything, mono can be the better choice: see the note on stereo double-enders below.
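If you don't know what bitrate a file was saved at, you can estimate it from the file size and the duration, and the estimate holds for VBR files too because it is an average. The sketch below does that arithmetic and maps the result onto the tiers in this section. The labels for the gaps between the listed tiers (above 32 but under 64, and 64 up to 128) are an interpolation made for this example, and the file size includes tags and headers, so the figure reads very slightly high.

```python
def average_kbps(size_bytes: int, duration_seconds: float) -> float:
    """Average bitrate in kbps: total bits divided by duration in seconds."""
    return size_bytes * 8 / duration_seconds / 1000


def bitrate_tier(kbps: float) -> str:
    """Map a bitrate onto the tiers in this guide (in-between bands are assumed)."""
    if kbps <= 32:
        return "avoid: word errors climb"
    if kbps < 64:
        return "below the practical floor"
    if kbps < 128:
        return "reliable for clear, close-mic voice"
    return "comfortably enough; going higher will not help"


# A 42-minute file of 40,320,000 bytes works out to 128 kbps.
kbps = average_kbps(40_320_000, 42 * 60)
print(f"{kbps:.0f} kbps -> {bitrate_tier(kbps)}")
```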
MP3 vs lossless — and when to re-record instead
A natural question is whether converting your MP3 to a lossless format like WAV first would help. It won’t. Once audio has been saved as a 128 kbps MP3, the discarded detail is gone — wrapping that same audio in a WAV container just makes a bigger file with no extra information. If you’re recording fresh and have the choice, a lossless or high-bitrate source is the better starting point (see the WAV-to-text guide for that case), but converting an existing MP3 upward buys nothing.
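To put a number on "a bigger file with no extra information", here is the size arithmetic for the 42-minute interview from the worked example. The WAV parameters (44.1 kHz, 16-bit, stereo) are an assumption chosen because they are a common default; your editor may export something different, and the small WAV header is ignored.

```python
def mp3_bytes(kbps: int, seconds: int) -> int:
    """Approximate size of a constant-bitrate MP3, ignoring tags and headers."""
    return kbps * 1000 * seconds // 8


def wav_bytes(seconds: int, sample_rate: int = 44_100,
              channels: int = 2, bytes_per_sample: int = 2) -> int:
    """Size of uncompressed PCM audio, ignoring the WAV header."""
    return sample_rate * channels * bytes_per_sample * seconds


seconds = 42 * 60
mp3 = mp3_bytes(128, seconds)
wav = wav_bytes(seconds)
print(f"MP3: {mp3 / 10**6:.1f} MB, WAV: {wav / 10**6:.1f} MB, ratio {wav / mp3:.1f}x")
```

The converted WAV is roughly eleven times larger, and every extra byte describes audio the MP3 encoder had already simplified.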
The honest decision rule: if your only copy is a reasonable-bitrate MP3, transcribe it as-is. If the recording is genuinely bad — clipped, buried in noise, or saved at 32 kbps — and you can capture it again, re-recording will help far more than any conversion. When you do record fresh, two habits matter more than the bitrate: get the microphone close to whoever is speaking, and give each person their own mic if you can, since clean separation at the source is what makes both the words and the speaker labels come out right.
Fixing common problems
Most rough transcripts trace back to the recording, not the tool. Here’s what to do about the usual culprits.
Low volume. If the waveform looks flat — a very quiet recording — the engine has less signal to work with and accuracy slips. Normalize or amplify the audio in any audio editor first, then upload the louder version. Boosting level is one of the highest-return fixes for a weak recording.
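For a sense of what "normalize" does, here is a minimal sketch of peak normalization. It works on already-decoded 16-bit PCM samples held in a plain list, so decoding the MP3 is assumed to have happened elsewhere (an audio editor does all of this for you). The -1 dBFS target is an illustrative choice, not a recommendation from the tool.

```python
import math

FULL_SCALE = 32767  # largest positive value of a signed 16-bit sample


def peak_dbfs(samples: list[int]) -> float:
    """Peak level relative to full scale, in dBFS (0 is the loudest possible)."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")
    return 20 * math.log10(peak / FULL_SCALE)


def normalize(samples: list[int], target_dbfs: float = -1.0) -> list[int]:
    """Apply one gain to every sample so the peak lands at target_dbfs."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)  # silence or empty input: nothing to amplify
    gain = FULL_SCALE * 10 ** (target_dbfs / 20) / peak
    # Clamp to the 16-bit range in case rounding pushes a sample past it.
    return [max(-32768, min(32767, round(s * gain))) for s in samples]


quiet = [0, 1000, -2000, 500]
loud = normalize(quiet)
print(f"{peak_dbfs(quiet):.1f} dBFS -> {peak_dbfs(loud):.1f} dBFS")
```

Because every sample gets the same gain, this raises the level without changing the balance between loud and soft passages. It also raises any background noise by the same amount.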
Background noise. Steady noise (an air conditioner, road hum) is handled reasonably well; sudden noise that overlaps speech — a door, a cough, cutlery — is what causes dropped words. If you can run a light noise-reduction pass before uploading, do it, but don’t over-process: heavy denoising introduces artifacts that hurt accuracy more than the noise did.
Heavy accents or specialist vocabulary. Modern engines handle a wide range of English accents well. The reliable misses are domain terms the engine has no reason to expect — drug names, legal citations, niche brand names. Plan on one skim after export to catch these, and lean on find-and-replace: if a company name lands wrong the same way every time, one correction sweeps the whole file.
Very long MP3s. A 6-hour recording is no problem on its own — the 10-hour, 2 GB ceiling covers nearly anything, and there’s no need to chop a long file into pieces first. The thing that actually degrades a long file is drift in recording quality across it: a speaker who wanders away from the mic, a level that sags after the first hour, a venue that fills up and gets noisier. Where you can, keep the source steady; where you can’t, expect the rougher stretches to need more editing than the clean ones. The timestamps make this easy to manage — you can jump straight to the patch that reads badly instead of re-listening to the whole thing.
Overlapping speakers. When two people talk at once, the engine assigns the words to the louder voice — a limitation of speaker separation generally, not of any one tool. There’s no upload-side fix; budget editing time for the cross-talk sections of a lively group recording.
Keeping the speakers cleanly separated
Speaker separation (diarization) runs on every transcript at no extra cost, and a few things at the MP3 level affect how clean it comes out. If you want the detail on how it works under the hood, see what speaker diarization is; the practical points here:
- Stereo double-enders. Some podcast and call setups record each speaker on a separate stereo channel. Counterintuitively, mixing that down to a single mono track before uploading tends to help — the engine hears one blended conversation rather than two isolated streams. Most editors export a mono “mix-down.”
- Rapid back-and-forth. Very short exchanges (sub-two-second turns) can occasionally merge into one labeled utterance. It rarely hurts readability, and you can split a turn by hand if a specific quote needs it.
- Clean turns help most. The cleaner the hand-offs between speakers — fewer interruptions, less crosstalk — the more reliable the labels. There’s nothing to configure; it’s purely a property of the recording.
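The mono mix-down mentioned for stereo double-enders is simple enough to sketch. The version below assumes decoded 16-bit PCM with left and right samples interleaved (L, R, L, R, ...), which is how stereo PCM is normally laid out; any audio editor's mono export does the equivalent.

```python
def mix_down(interleaved: list[int]) -> list[int]:
    """Average each left/right pair of interleaved stereo samples into one mono sample."""
    if len(interleaved) % 2:
        raise ValueError("stereo data must contain an even number of samples")
    left, right = interleaved[0::2], interleaved[1::2]
    # Integer floor division keeps the result in the 16-bit range.
    return [(l + r) // 2 for l, r in zip(left, right)]


# Three stereo frames: speaker on the left only, on the right only, then both.
print(mix_down([1000, 0, 0, -2000, 300, 300]))
```

Note that averaging halves the level of a voice that sits on only one channel, so a double-ender mixed this way may need its volume brought back up before uploading.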
Exporting your transcript
The right format depends on what you’re doing with the text:
| Format | Best for |
|---|---|
| TXT | Notes, copy-paste, feeding the text into another tool |
| SRT | Captions on a video, or navigating by timestamp — also the right pick if your source is a video file like MP4 |
| DOCX | Sharing with editors or annotators who work in Word |
| JSON | Programmatic processing and custom tooling |
Every export includes the speaker labels and timestamps, with no watermark on any of them.
Timestamps and editing
The editor shows each utterance with its start time, speaker label, and text. Click any line to play back that segment of the audio against the text — handy for checking a tricky passage without scrubbing the whole file by hand.
Timestamps in the SRT export sit at the utterance level: one entry per speaker turn, not per word. For captioning, navigation, and quoting with a timestamp, that’s the right granularity. If you need structured data to process in code, export JSON: each entry gives you the speaker label, the start and end time in milliseconds, and the text for that turn.
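As a rough illustration of processing the JSON export, the sketch below totals talk time per speaker and formats a millisecond offset the way SRT writes it. The key names (`speaker`, `start`, `end`, `text`) and the array-of-turns shape are assumptions made for this example, so check a real export before relying on them. The end times in the sample data are invented.

```python
import json
from collections import defaultdict

# Hypothetical export: the key names and the end times are made up for this sketch.
raw = """[
  {"speaker": "Speaker A", "start": 4000, "end": 10500, "text": "Thanks for making the time."},
  {"speaker": "Speaker B", "start": 11000, "end": 22500, "text": "Honestly, it came out of a problem we had ourselves."},
  {"speaker": "Speaker A", "start": 23000, "end": 24800, "text": "So you built the thing you needed."}
]"""


def srt_time(ms: int) -> str:
    """Format milliseconds as an SRT timestamp: HH:MM:SS,mmm."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def talk_time_ms(turns: list[dict]) -> dict[str, int]:
    """Sum the duration of every turn, grouped by speaker label."""
    totals: dict[str, int] = defaultdict(int)
    for turn in turns:
        totals[turn["speaker"]] += turn["end"] - turn["start"]
    return dict(totals)


turns = json.loads(raw)
print(srt_time(turns[0]["start"]))
print(talk_time_ms(turns))
```

Keeping times as integer milliseconds until the last moment avoids the rounding surprises that come with floating-point seconds.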
That’s the whole job — upload, transcribe, export — with the MP3-specific details that decide how clean the result is. The mp3-to-text page has the full feature overview, and if you’re working with a recording in another container, the audio-to-text page covers every format the same way. Doing a whole show? The podcast transcription guide walks through turning the transcript into show notes, and if your file is a WAV rather than an MP3, the WAV-to-text guide has notes on handling large lossless files.