How to Convert a WAV File to Text (Free to Try)
April 27, 2026
A WAV file is uncompressed audio, which makes it a strong starting point for an accurate transcript. The trade-off is size: WAV files are large, and a long session can run past the upload limit. This guide covers the whole job: turning a WAV file into text, getting clean speaker labels, handling big files, and exporting the transcript in the format you actually need.
WAV turns up wherever quality matters: field recorders saving to an SD card, audio mixed and exported from a DAW, Windows Voice Recorder output, and broadcast interview equipment. The transcription itself works the same as any other format — upload, transcribe, export. What is worth knowing first is why the lossless audio helps, and how to keep the file size manageable.
Why WAV is good for accuracy
WAV is an uncompressed container. Unlike MP3 or M4A, there is no lossy compression step, so the audio the engine receives is identical to what was recorded: no encoding artifacts have been added and no detail has been thrown away.
In practice this matters most in two cases:
- The source recording is already borderline — a quiet speaker, a noisy room, a cheap microphone. Every bit of signal counts, and a lossy codec would discard some of it.
- You have processed the audio in a DAW and re-exported it. Each MP3 pass adds its own artifacts; keeping the file as WAV between edits avoids stacking them.
For a clean studio recording or a well-set-up interview, a 128 kbps MP3 from the same source gives near-identical accuracy. The WAV advantage is real but modest unless the conditions were difficult. So the honest answer to “should I record in WAV for a better transcript” is: it helps a little, and it helps more the worse your recording environment is.
What the engine actually uses
Speech models are trained mostly on 16 kHz mono audio. When you upload a 48 kHz stereo WAV, the engine downsamples and mixes to mono before it does anything else. The useful range for speech sits roughly between 300 Hz and 8 kHz, and a 16 kHz sample rate captures frequencies up to 8 kHz. Capturing that range cleanly is what matters, and any decent microphone recording at 16 kHz or above already does.
The practical result: a 48 kHz WAV and a 16 kHz WAV of the same speech produce the same transcript. High sample rates do not hurt, but they do not help the words either — they just make the file bigger. If you control the recording settings and want the smallest upload without giving anything up, 16 kHz mono is the floor.
What you need
- A WAV file, up to 2 GB and 10 hours. Sample rates above 16 kHz and higher bit depths do not improve accuracy, so whatever your recorder produced is fine.
- A Hushscript account. After you sign up and add a payment method you get 30 free minutes — granted instantly when you add a card (a $1 hold validates it, then releases right away and is never charged), or with your first minute pack if you use another payment method available in your country. A card is not required; it is just the quicker route to the free minutes.
- Nothing to install. The whole job runs in the browser, so there is no app, plugin, or extension to set up.
Transcription is billed by audio duration, not file size. A one-hour stereo WAV at 44.1 kHz and 16-bit is about 600 MB but costs the same as a one-hour MP3. The free minutes are enough to transcribe a short sample and check accuracy on your own recording before you commit to a full batch.
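The size figures in this guide follow from simple arithmetic: sample rate × bytes per sample × channels × seconds. A minimal sketch, with `wav_size_bytes` as an illustrative helper name (not part of any tool mentioned here) and the small WAV header ignored:

```python
def wav_size_bytes(sample_rate_hz: int, bit_depth: int, channels: int, seconds: float) -> int:
    """Approximate size of uncompressed PCM audio data.

    The WAV header adds only a few dozen bytes, so it is ignored here.
    """
    bytes_per_sample = bit_depth // 8
    return int(sample_rate_hz * bytes_per_sample * channels * seconds)


# One hour of 44.1 kHz / 16-bit stereo versus one hour of 16 kHz / 16-bit mono.
one_hour_stereo = wav_size_bytes(44_100, 16, 2, 3600)
one_hour_16k_mono = wav_size_bytes(16_000, 16, 1, 3600)

print(f"{one_hour_stereo / 1e6:.0f} MB")    # 635 MB
print(f"{one_hour_16k_mono / 1e6:.0f} MB")  # 115 MB
```

The same hour of speech is billed identically either way; only the upload gets smaller.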
Convert a WAV file to text in three steps
- Open the converter and sign in. Go to /audio-to-text and sign in. New accounts get 30 free minutes once a payment method is added, which is enough to test a recording end to end.
- Upload the WAV file. Drag it onto the upload zone or click to browse. Files up to 2 GB are accepted directly — no need to split or compress a normal-length recording first.
- Download the transcript. When processing finishes, the transcript lands in your dashboard with speaker labels and timestamps already applied. Export it as TXT, SRT, DOCX, or JSON.
Speaker labels appear as Speaker A, Speaker B, and so on. Click a label to rename it — once you set “Speaker A” to a real name, every line from that speaker updates.
A worked example: a field-recorded interview
Say you recorded a 40-minute interview on a handheld field recorder, two people, saved as a 44.1 kHz / 16-bit stereo WAV. That file is roughly 400 MB. Here is how the job actually goes.
The file is well under the 2 GB limit, so there is no need to compress or convert it. Because the room had some air-conditioning hum, you first run a high-pass filter at 80 Hz in your audio editor and re-export; that removes the low rumble the engine would otherwise read as noise, and re-exporting to WAV keeps the audio lossless. You upload the cleaned file.
A 40-minute recording transcribes in a few minutes. The transcript comes back split into Speaker A (you) and Speaker B (your subject), with a timestamp on each turn. You rename the two speakers, skim once for the handful of proper nouns the engine guessed at — a company name, a place — and fix them with find-and-replace. Then you export: DOCX for the readable interview, and SRT if you also need timestamps lined up to a video cut. Total hands-on time is a few minutes of cleanup on top of the automatic transcript.
That pattern — light pre-processing, upload, relabel, skim, export — holds for almost any interview or meeting WAV. The only thing that changes with a much longer or larger file is whether you compress it first.
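Whether a file needs compressing can be checked locally before you upload. The sketch below reads the WAV header with Python's standard `wave` module and compares the file against the 2 GB and 10-hour limits. The function name `preflight`, the `link_mbps` parameter, and the reading of "2 GB" as decimal gigabytes are assumptions made for illustration, not part of Hushscript.

```python
import os
import wave

MAX_BYTES = 2 * 1000**3   # assumption: "2 GB" read as decimal gigabytes
MAX_SECONDS = 10 * 3600   # the 10-hour limit


def preflight(path: str, link_mbps: float = 10.0) -> dict:
    """Report size, duration, and a rough upload time for a PCM WAV file.

    The standard-library wave module reads plain PCM WAV files; other
    encodings (for example 32-bit float) may raise wave.Error.
    """
    size_bytes = os.path.getsize(path)
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        duration_s = wav.getnframes() / sample_rate
    return {
        "size_bytes": size_bytes,
        "duration_s": duration_s,
        "channels": channels,
        "sample_rate_hz": sample_rate,
        "within_limits": size_bytes <= MAX_BYTES and duration_s <= MAX_SECONDS,
        # Ideal-case estimate: ignores protocol overhead and link variation.
        "upload_s": size_bytes * 8 / (link_mbps * 1_000_000),
    }
```

Calling `preflight("interview.wav")` on the 40-minute example above would report a file of roughly 400 MB that is within both limits; the file name is a placeholder.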
Handling large WAV files
A 44.1 kHz / 16-bit stereo WAV runs about 10 MB per minute. A two-hour session is around 1.2 GB, and a long multi-track export can be larger. Three things are worth planning for.
The 2 GB limit. If a file is over 2 GB, shrink it before uploading. For speech, the simplest route is to convert it to MP3 — the quality cost is negligible — with the free in-browser WAV to MP3 converter; nothing uploads during the conversion itself. To stay bit-for-bit lossless instead, export the file to FLAC from an audio editor (identical quality to WAV, and typically 40–60% smaller).
Upload time. A 1 GB WAV on a 10 Mbps connection takes roughly fifteen minutes just to upload. On a slow link, compressing before upload cuts that wait significantly. If you do not need bit-for-bit lossless, the compress-audio tool shrinks the file further with a small, usually inaudible quality cost — for speech that is rarely a problem.
Stereo carrying mono content. Plenty of recordings are saved as stereo but actually contain the same audio on both channels — a single mic duplicated across left and right. Converting that to a mono WAV or FLAC halves the file size with no accuracy loss, since the engine mixes to mono anyway. It is the single easiest size win when it applies.
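The duplicated-stereo case can be detected and fixed with a short script. This is a sketch under stated assumptions: it handles 16-bit PCM stereo only, treats the channels as duplicates only when they are bit-for-bit identical, and loads the whole file into memory, so a very large file would need chunked reading. The function name is illustrative.

```python
import wave
from array import array


def stereo_to_mono_if_duplicated(src: str, dst: str) -> bool:
    """Write a mono copy of `src` to `dst` if left and right are identical.

    Returns True if a mono file was written, False otherwise
    (not 16-bit stereo, or the channels differ).
    """
    with wave.open(src, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            return False
        rate = wav.getframerate()
        # Samples are interleaved: L, R, L, R, ...
        samples = array("h", wav.readframes(wav.getnframes()))

    # Byte order does not matter here: samples are only compared and copied.
    left, right = samples[0::2], samples[1::2]
    if left != right:
        return False

    with wave.open(dst, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(left.tobytes())
    return True
```

Real recordings sometimes carry channels that are nearly but not exactly identical; in that case this check returns False and a mono mix in an audio editor is the safer route.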
Pre-processing for difficult recordings
Two quick edits in an audio editor pay off before you upload a borderline file:
- Normalize quiet audio. If the peaks do not reach above roughly -12 dBFS in a waveform viewer, the recording is too quiet and accuracy suffers. Normalizing lifts the level and usually improves the transcript noticeably.
- High-pass a humming room. Consistent low-frequency noise — HVAC, a laptop fan, traffic rumble — confuses the engine. A high-pass filter at 80 Hz removes the bass hum and leaves speech untouched. Most editors offer it as a one-click filter.
Neither step is required for a clean recording. They are the targeted fixes for the recordings that need help, and re-exporting to WAV after either edit keeps the audio lossless.
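If no waveform viewer is handy, the peak level can be measured directly. The sketch below handles 16-bit PCM only and loads the whole file into memory. The helper names and the -1 dBFS default target are assumptions for illustration; -1 dBFS is a common normalization target, not a Hushscript requirement.

```python
import math
import sys
import wave
from array import array


def peak_dbfs(path: str) -> float:
    """Peak level of a 16-bit PCM WAV in dBFS (0 dBFS is full scale)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch handles 16-bit PCM only")
        samples = array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")  # digital silence
    return 20 * math.log10(peak / 32768)


def gain_to_reach(path: str, target_dbfs: float = -1.0) -> float:
    """Gain in dB that would lift the file's peak to `target_dbfs`."""
    return target_dbfs - peak_dbfs(path)
```

A file whose peak sits near -12 dBFS would need about 11 dB of gain to reach a -1 dBFS target, which is the amount to apply in your editor's normalize or amplify step.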
WAV vs MP3 for transcription: does lossless actually help?
This is the question behind most WAV decisions, so here is the plain version.
For accuracy, lossless helps a little, and it helps more the worse the source. On a clean recording the difference between a WAV and a good 128 kbps MP3 is hard to detect in the transcript. On a quiet or noisy recording, the WAV’s extra fidelity gives the engine slightly more to work with. Either way, the format is rarely the thing standing between you and a usable transcript — recording quality is.
For workflow, MP3 wins on size and upload speed, and WAV wins as a working master you keep between edits. A common middle path: keep your edited master as WAV or FLAC, and if you need to upload over a slow connection, send a high-bitrate MP3 — you lose almost nothing in the transcript and save a lot of upload time. The MP3-to-text guide covers that path in full.
The short version: record and archive in WAV if quality matters, but do not feel obliged to upload a 1 GB file when a 100 MB one transcribes just as well.
Accuracy tips for WAV
- Capture the speech band cleanly. Sample rates above 16 kHz are wasted on transcription, but a clean 300 Hz–8 kHz band is everything. A decent mic close to the speaker beats a high sample rate every time.
- Mic per speaker where you can. Two people on one shared mic are harder to separate than two on their own tracks. If your field recorder supports it, record each speaker to a channel.
- Keep one language per file. Detection runs per file; a recording that switches languages mid-way is harder than two clean single-language files would be.
- Skim once after export. Proper nouns — names, places, brands, technical terms — are where any engine guesses. A single find-and-replace pass cleans most of them in a couple of minutes.
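The find-and-replace pass from the last tip can be scripted when the same names recur across many transcripts. A minimal sketch that works on an exported TXT transcript loaded as a string; `fix_terms` and the example names are hypothetical.

```python
import re


def fix_terms(text: str, corrections: dict[str, str]) -> str:
    """Replace each mis-heard term with its correction.

    Matches whole words only, case-insensitively, so "acme" does not
    match inside a longer word.
    """
    for wrong, right in corrections.items():
        pattern = re.compile(rf"\b{re.escape(wrong)}\b", re.IGNORECASE)
        # A callable replacement keeps backslashes in `right` literal.
        text = pattern.sub(lambda _match, r=right: r, text)
    return text


corrections = {"jon smyth": "John Smith", "acme": "Acme"}
print(fix_terms("We spoke to jon smyth at acme.", corrections))
# We spoke to John Smith at Acme.
```

Keeping the corrections in one dictionary means the list grows with each project and the skim gets shorter over time.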
Speaker labels and timestamps
Every WAV transcript includes automatic speaker separation — a free benefit on every job, not a paid tier or a per-speaker upsell. A two-person recording, such as an interviewer and a subject or a two-sided call captured to a stereo track, comes out cleanly split. Larger group recordings — four or more people around a conference table — are harder for any diarization engine, so expect to do some label cleanup on a noisy room recording.
The SRT export is especially useful for WAV files that came from video production. If you recorded audio separately — a boom mic or a lav track synced in post — the SRT timestamps let you re-line the transcript to the picture without scrubbing through the audio by hand.
Export options
| Format | Notes |
|---|---|
| TXT | Plain transcript with speaker labels, no timestamps. Good for search, quoting, and feeding into another tool. |
| SRT | Timestamps on every utterance. Standard subtitle and caption format — opens in VLC, video editors, and caption tools. |
| DOCX | Word-compatible, with speaker labels and timestamps in a readable layout for sharing or annotating. |
| JSON | Structured output — { speaker, start, end, text } per utterance — for custom processing pipelines. |
No watermark on any format, and exports are included with every transcript. There is nothing gated behind a higher tier.
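As an example of the custom processing the JSON export allows, the sketch below totals speaking time per speaker and renames labels along the way. Only the four field names come from the table above. The top-level shape (a plain list of utterances) and the unit of `start` and `end` (seconds) are assumptions, so check a real export before relying on them.

```python
import json
from collections import defaultdict


def talk_time(utterances, names=None):
    """Total speaking time per speaker, with optional label renaming.

    Assumes each utterance has "speaker", "start", and "end" keys and
    that start/end are in seconds.
    """
    names = names or {}
    totals = defaultdict(float)
    for u in utterances:
        speaker = names.get(u["speaker"], u["speaker"])
        totals[speaker] += u["end"] - u["start"]
    return dict(totals)


# Hypothetical sample data in the assumed shape.
sample = json.loads("""
[
  {"speaker": "Speaker A", "start": 0.0,  "end": 4.0,  "text": "Thanks for joining."},
  {"speaker": "Speaker B", "start": 4.0,  "end": 10.5, "text": "Happy to be here."},
  {"speaker": "Speaker A", "start": 10.5, "end": 12.0, "text": "Let's begin."}
]
""")

print(talk_time(sample, {"Speaker A": "Interviewer"}))
# {'Interviewer': 5.5, 'Speaker B': 6.5}
```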
If your WAV is a borderline recording and accuracy is the priority, the MP3-to-text guide covers normalization and other pre-processing steps that apply equally here. If you would rather convert it to MP3 and transcribe that, the free converter runs in your browser without uploading the file. And for the full list of accepted formats and features in one place, the audio-to-text page has everything.