Does the MP4 video file get uploaded to the server?

No. Your browser extracts the audio track from the video before anything leaves your device. Only the audio is sent for transcription — the original MP4 stays on your machine. You can confirm it yourself in the browser's network tab: you'll see the small audio upload, not the multi-gigabyte video.

What's the maximum video size I can use?

Up to 2 GB or 10 hours per file. Because the audio is extracted in your browser first, the actual upload is just the audio track — typically 10–20x smaller than the original video — so a long recording uploads faster than its file size suggests.

How much does it cost to transcribe an MP4?

You get 30 free minutes after a $1 card check (the hold is authorized, then released right away, never charged), or with your first purchase if you pay another way. After that it's pay-as-you-go, billed against the audio's duration, not the video's file size. See the pricing page for the pack costs.

Can I get SRT subtitles from an MP4?

Yes. After transcription, export as SRT. The SRT file includes a start and end timestamp for every utterance and can be sideloaded into most video players or uploaded as a caption track. It keeps speaker labels, which some other caption formats drop.

What other video formats work the same way?

MOV, WebM, MKV, and most container formats that carry a standard audio codec (AAC, MP3, Opus, FLAC). The in-browser extraction step handles the demuxing, so the process is the same as for MP4.

My video has no spoken audio — can I still transcribe it?

No. Transcription works from speech, so a video with no audio track, or a silent screen recording, has nothing to transcribe. If there's background music but no speech, the result will be empty or noise — the engine only returns words.

Is the audio deleted after transcription?

Yes. The extracted audio is deleted from the server the moment your transcript is ready, and the speech engine retains nothing. The transcript itself is encrypted at rest, so even a storage leak would expose unreadable ciphertext rather than your words.

How to Convert an MP4 Video to Text, Privately

May 1, 2026

When you transcribe an MP4 video with most tools, the whole file uploads — video, audio, metadata, all of it. A one-hour 1080p recording can be 4–8 GB. That’s slow on any connection, and it means your footage now lives on a server you don’t control.

Hushscript takes a different path. Your browser extracts the audio from the video before the upload starts, so the original MP4 stays on your device. Only the audio is transcribed, and it’s deleted the moment your transcript is ready. This guide walks through the full process — the steps, a worked example on a recorded webinar, how to pull SRT subtitles out, the other video formats that work the same way, and what to do when something goes wrong.

Why your video shouldn’t have to upload

The video track is irrelevant to transcription. The speech engine only needs the audio. Uploading the picture is pure waste — network time, server storage, and a privacy footprint you don’t need.

How in-browser extraction works, in plain English

An MP4 is a container. Inside it sit separate streams: the video (the picture), the audio (the sound), and some metadata. Transcription needs only the audio stream.

Hushscript runs a small audio toolkit (ffmpeg, compiled to run inside the browser) right on the page. When you drop your MP4, that toolkit opens the container locally, copies out the audio stream, and passes only that audio to the uploader; the page never sends the video anywhere. The work happens in the same place a video plays: on your own machine.

So the practical result is:

The MP4 file never leaves your device. The picture stays with you.
The upload size is the size of the audio, not the video — usually 10–20x smaller.
The server, and the speech engine behind it, never have access to your video content.

This matters for meetings, recorded interviews, legal depositions, and anything where the footage itself is sensitive. It also matters when you’re on a slow connection with a multi-gigabyte file: extracting the audio first turns a half-hour upload into a couple of minutes.
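
For readers who know desktop ffmpeg, the "copy out the audio stream" step is roughly what a stream copy does on the command line. The sketch below is not Hushscript’s in-browser code; it only builds and runs the desktop equivalent. The function names and the `.m4a` output name (which assumes the audio is AAC) are illustrative assumptions.

```python
import subprocess
from pathlib import Path


def build_extract_command(video_path: str, audio_path: str) -> list[str]:
    """Build an ffmpeg command that copies the audio stream out of a video.

    -vn drops the video stream; -c:a copy copies the audio without re-encoding.
    """
    return ["ffmpeg", "-i", video_path, "-vn", "-c:a", "copy", audio_path]


def extract_audio(video_path: str) -> Path:
    """Run the command locally (needs ffmpeg on PATH).

    The .m4a suffix is an assumption that suits AAC audio; other codecs
    need a matching container.
    """
    audio_path = Path(video_path).with_suffix(".m4a")
    subprocess.run(build_extract_command(video_path, str(audio_path)), check=True)
    return audio_path
```

Because the audio is copied rather than re-encoded, the step is quick and the sound quality is unchanged.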

What you need

Three things, and nothing to install:

An MP4 with spoken audio. A recorded meeting, webinar, interview, lecture, or talking-head video — anything where people are speaking. A silent clip or a music-only video has nothing to transcribe.
A current browser. Chrome, Edge, Firefox, or Safari, reasonably up to date. The audio extraction runs in the browser, so an older or stripped-down browser may not support it.
A Hushscript account. You get 30 free minutes after a $1 card check — the hold is authorized, then released right away, never charged — or with your first purchase if you use another payment method available in your country. A card isn’t required; it’s just the fastest route to the free minutes. After that it’s pay-as-you-go, with no subscription. Costs are on the pricing page.

You don’t need a separate audio extractor, a converter, or any video software. Dropping the MP4 is the whole input.

Convert an MP4 in three steps

1. Drop your MP4 on the page at /mp4-to-text. Your browser extracts the audio track and uploads only the audio; the video itself stays on your device. The first 30 seconds come back as a speaker-labeled preview, no account needed.
2. Sign up to transcribe the rest. Enter your email to create an account. New here? The 30 free minutes unlock after a $1 card check (authorized, then released right away, never charged), or with your first minute pack if you pay another way. A card isn’t required. Transcription is billed against the audio duration, not the file size.
3. Export the transcript. Once it’s ready, download as TXT, SRT, DOCX, or JSON. Speaker labels are included free.

The audio deletion happens automatically when the transcript is delivered. There’s no cleanup step to remember.

What “billed against audio duration” means

A one-hour MP4 at 1080p might be 6 GB. The audio extracted from it — typically AAC at 128–256 kbps — is around 60–120 MB. That’s what uploads, and the minute cost is based on the one hour of audio, not the 6 GB file.
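
The audio figures above follow from bitrate times duration. A minimal sketch of that arithmetic, assuming a constant bitrate and ignoring container overhead (the function name is illustrative):

```python
def audio_size_mb(duration_minutes: float, bitrate_kbps: float) -> float:
    """Approximate audio stream size in megabytes (1 MB = 1,000,000 bytes)."""
    bits = bitrate_kbps * 1000 * duration_minutes * 60
    return bits / 8 / 1_000_000


# One hour of audio at the two ends of the typical AAC range.
for kbps in (128, 256):
    print(f"{kbps} kbps for 60 min: about {audio_size_mb(60, kbps):.1f} MB")
# prints:
# 128 kbps for 60 min: about 57.6 MB
# 256 kbps for 60 min: about 115.2 MB
```

The size only affects how long the upload takes. The charge depends on the 60 minutes, which is the same in both rows.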

The upshot: a two-hour 4K documentary costs the same number of minutes as a two-hour MP3, because both carry the same amount of audio. Resolution, bitrate, and file size have no effect on what you’re charged.

A worked example: a recorded webinar

Say you ran a 52-minute webinar with two presenters and recorded it to a single MP4 — q3-product-webinar.mp4, about 3.4 GB at 1080p. You want a clean transcript for the recap email, plus subtitles to attach to the replay.

Here’s how it actually plays out:

1. You open /mp4-to-text and drop q3-product-webinar.mp4. The page extracts the audio locally: that 3.4 GB video becomes roughly 70 MB of audio. Only those 70 MB upload.
2. The first 30 seconds come back labeled, so you can sanity-check the audio quality before transcribing the full 52 minutes.
3. A few minutes later the full transcript is ready. It’s split into utterances, each tagged Speaker A or Speaker B. You click Speaker A, type “Priya,” click Speaker B, type “Marcus”; both labels update everywhere at once.
4. You export TXT for the recap email and SRT for the replay’s caption track. The webinar’s original MP4 never left your laptop, and the audio is already deleted from the server.

The part that surprises people the first time is step 1: a multi-gigabyte upload that simply doesn’t happen. The picture stays with you; only the words travel.

Speaker labels for multi-person videos

Every transcript includes automatic speaker diarization at no extra charge. For an interview, a panel, or a meeting recording, each speaker is labeled Speaker A, Speaker B, and so on, and you can rename them to real names in the editor by clicking the label and typing.

Diarization works best with clean separation. Camera audio in a quiet room, or a video call recorded with per-participant tracks, gives the engine the most to work with. A crowded room with people talking over each other is harder for any diarization engine — see the troubleshooting section below for what helps.

Get subtitles (SRT / VTT) from the video

If the reason you’re transcribing the video is captions, export the transcript as SRT. Every utterance arrives with a start and end timestamp in standard SRT format:

1
00:00:04,210 --> 00:00:07,880
Speaker A: Thanks for joining us today.

2
00:00:08,100 --> 00:00:12,340
Speaker B: Happy to be here.

Sideload the SRT into a desktop player (VLC, QuickTime, and most editors accept it), or upload it as a caption track on a platform that supports one. The Hushscript SRT export keeps speaker labels, which is useful for multi-person video: some caption formats strip the names, but this export preserves them.

For web players that use WebVTT (.vtt), the difference from SRT is tiny: a WEBVTT header line and a . instead of , in the timestamps. You can convert an SRT to VTT in the browser with the free tool at /tools/srt-to-vtt — nothing uploads.
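
If you would rather script the conversion, the two differences described above fit in a few lines. This is a minimal sketch, not the code behind the /tools/srt-to-vtt page; the function name is illustrative. It leaves the cue numbers in place, since WebVTT allows an identifier line before each timing line.

```python
import re

# Matches an SRT timestamp such as 00:00:04,210 and captures both halves.
_SRT_TIME = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT: add the header, use . in timestamps."""
    lines = []
    for line in srt_text.strip().splitlines():
        # Only touch timing lines, so commas in the spoken text are kept.
        if "-->" in line:
            line = _SRT_TIME.sub(r"\1.\2", line)
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"
```

Run on the two-cue sample shown earlier, the first timing line becomes `00:00:04.210 --> 00:00:07.880` and the caption text is unchanged.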

Two sibling guides go deeper than there’s room for here: how to create SRT subtitles from a video covers editing cue timing and line length, and how to add subtitles to a video covers attaching the SRT or burning it in. Burning subtitles permanently into the picture is a separate encoding step; a free tool like HandBrake can do the burn-in once you have the SRT.

Troubleshooting

Most MP4s transcribe without a hitch. When one doesn’t, it’s almost always one of these.

The video is very large (over 2 GB). The per-file limit is 2 GB or 10 hours, and that limit applies to the audio, not the video. A few cases still run into trouble. If the extraction stalls on a huge 4K file, your browser may be low on memory; close other tabs and try again, or extract the audio first. The free extract-audio tools at /tools pull the audio out in your browser and produce a small file you can upload directly. The 10-hour cap on audio is generous: it covers almost any single meeting, webinar, or lecture.

No audio track, or no speech. A silent screen recording or a music-only clip has nothing for the engine to transcribe — you’ll get an empty result. Confirm the video actually has spoken audio by playing it with the volume up. If it’s a screen recording, check that the recorder was set to capture system or microphone audio; many default to video only.

An unsupported or obscure audio codec. The in-browser extractor handles the common codecs (AAC, MP3, Opus, FLAC). A container using an unusual or proprietary audio codec may not demux cleanly, and the audio won’t extract. The fix is to remux or re-encode the video’s audio to a standard codec first — most converters can output a standard MP4 or a plain audio file. If you hit a format that won’t go through, email support@hushscript.com and we’ll look at adding it.
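
If you have ffmpeg installed locally, its companion tool ffprobe can tell you which of the two cases above you are in. The sketch below assumes ffprobe is on your PATH. The helper names are illustrative, and treating a file as fine when any one track uses a common codec is an assumption, not documented Hushscript behavior.

```python
import subprocess

# Codecs the article lists as handled; ffprobe reports them in lowercase.
COMMON_CODECS = {"aac", "mp3", "opus", "flac"}


def parse_codecs(ffprobe_output: str) -> list[str]:
    """Turn ffprobe's one-codec-per-line output into a list of names."""
    return [line.strip() for line in ffprobe_output.splitlines() if line.strip()]


def diagnose(codecs: list[str]) -> str:
    """Map the codec list onto the troubleshooting cases above."""
    if not codecs:
        return "no audio track"
    if any(codec in COMMON_CODECS for codec in codecs):
        return "ok"
    return "uncommon codec: " + ", ".join(codecs)


def probe_audio_codecs(path: str) -> list[str]:
    """Ask ffprobe for the codec name of every audio stream in the file."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "a",
         "-show_entries", "stream=codec_name",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    )
    return parse_codecs(result.stdout)
```

Calling `diagnose(probe_audio_codecs("recording.mp4"))` on a hypothetical file returns one of the three messages. An "ok" result still says nothing about whether the track contains speech.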

Multiple speakers, messy separation. Diarization needs distinguishable voices. Overlapping speech, a single far-field mic in a large room, or heavy background noise all blur the boundaries between speakers. You’ll still get an accurate transcript; the speaker labels are just less precise. If you control the recording, a per-participant track (as most video-call tools can export) or a mic closer to each speaker gives the cleanest separation. Either way, you can correct any mislabeled lines in the editor before exporting.

The transcript has consistent misspellings. A recurring proper noun or piece of jargon that’s always transcribed the same wrong way is a one-time fix: a single find-and-replace in the exported text corrects every instance. This is faster than re-running anything.
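
Any text editor can do that replacement. If you have many exported files, a short script does the same thing. This is a minimal sketch with an illustrative function name; it assumes the misspelled term starts and ends with a letter or digit, so that the word-boundary match behaves as expected.

```python
import re


def fix_term(text: str, wrong: str, right: str) -> str:
    """Replace every whole-word occurrence of `wrong` with `right`, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
    # A function replacement keeps any backslashes in `right` literal.
    return pattern.sub(lambda _match: right, text)


transcript = "Speaker A: Welcome to hush script. Speaker B: I use Hush Script daily."
print(fix_term(transcript, "hush script", "Hushscript"))
# prints: Speaker A: Welcome to Hushscript. Speaker B: I use Hushscript daily.
```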

Export formats at a glance

Format	What you get
TXT	Plain text, speaker labels, no timestamps
SRT	Timestamped captions, ready to sideload into most video players
DOCX	Speaker-labeled transcript for Word or Google Docs
JSON	Structured data: `{ speaker, start, end, text }` per utterance

All four are included with every plan, including the 30 free minutes, and none carries a watermark.
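
The JSON export is the one to pick when a script will read the transcript. As an illustration, the sketch below totals how long each speaker talked. The four field names come from the table above; the rest is assumed and may differ from the actual file: that the top level is a list of utterances, and that `start` and `end` are numbers in seconds.

```python
import json
from collections import defaultdict


def talk_time_by_speaker(utterances: list[dict]) -> dict[str, float]:
    """Sum (end - start) per speaker; times are assumed to be in seconds."""
    totals = defaultdict(float)
    for utterance in utterances:
        totals[utterance["speaker"]] += utterance["end"] - utterance["start"]
    return dict(totals)


# Hypothetical sample in the assumed layout.
sample = json.loads("""[
  {"speaker": "Speaker A", "start": 4.21, "end": 7.88, "text": "Thanks for joining us today."},
  {"speaker": "Speaker B", "start": 8.1, "end": 12.34, "text": "Happy to be here."}
]""")

for speaker, seconds in talk_time_by_speaker(sample).items():
    print(f"{speaker}: {seconds:.2f} s")
```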

For any video where privacy matters — a meeting, an interview, a deposition — the point holds: your video never uploads. You can verify it yourself. Open the browser’s network tab before you drop the file and watch what’s sent; you’ll see the audio go up, and the picture stay put.