Convert M4A / Apple Voice Memo to Text, Privately
May 5, 2026
M4A is the default format for Apple Voice Memos, and it is what you get when you export from GarageBand, record audio with QuickTime, or save a voice note in many apps on the App Store. Turning one into text is quick. The part that deserves more attention is privacy, because the things people record on a phone tend to be more personal than the things they type.
This guide covers where M4A files come from, how to convert one to text with speaker labels, how to keep a sensitive recording private, and how to handle the awkward cases: two-sided calls, quiet memos, and accents.
Where M4A files come from (iPhone and Mac)
M4A is MPEG-4 audio, almost always AAC-encoded. It is Apple’s house format, so you meet it constantly across the ecosystem and a few places outside it:
- Apple Voice Memos save every recording as .m4a, on both iPhone and Mac.
- iPhone call recorders: third-party apps that capture a phone or FaceTime call usually write M4A.
- QuickTime audio recordings on a Mac save as M4A.
- WhatsApp and Signal voice notes export as OGG or M4A, depending on the app and platform.
- GarageBand exports are M4A or AAC, which is the same AAC audio in a different wrapper.
So the file in front of you might be a quick note to self, an interview, a recorded call, or a forwarded voice message. The conversion is identical for all of them. What changes is how careful you want to be with the contents, which is the thread running through the rest of this guide.
Getting a Voice Memo off your iPhone
Open the Voice Memos app, tap the recording you want, tap the three-dot menu (…), and choose Share. From there you can AirDrop it to your Mac, or send it to yourself by Mail, Messages, or Notes. However you send it, it lands as a .m4a file.
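On a Mac, Voice Memos keeps its recordings in ~/Library/Application Support/com.apple.voicememos/Recordings/. On newer macOS versions the folder may instead be ~/Library/Group Containers/group.com.apple.VoiceMemos.shared/Recordings/. If you would rather skip the share sheet, you can copy the .m4a files straight out of that folder.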
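If you have a lot of memos to pull out of that folder, a short script can do the copying. This is a minimal sketch, not anything Hushscript ships: the function name copy_m4a_files and the destination folder are made up for illustration, and the source path is simply the one given in this guide, which can differ between macOS versions.

```python
import shutil
from pathlib import Path

# Folder named in this guide; the real location can vary by macOS version.
VOICE_MEMOS_DIR = (
    Path.home() / "Library/Application Support/com.apple.voicememos/Recordings"
)


def copy_m4a_files(source: Path, destination: Path) -> list[Path]:
    """Copy every .m4a directly inside `source` into `destination`.

    Returns the paths of the copies, sorted by file name.
    """
    destination.mkdir(parents=True, exist_ok=True)
    copied = []
    for recording in sorted(source.glob("*.m4a")):
        target = destination / recording.name
        shutil.copy2(recording, target)  # copy2 also keeps the modification time
        copied.append(target)
    return copied
```

Calling copy_m4a_files(VOICE_MEMOS_DIR, Path.home() / "Desktop" / "memos") would leave the originals where they are and put copies on the Desktop, ready to drag into the browser.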
What you need
There is very little to set up:
- The .m4a file, on whatever device has a browser.
- A modern browser. Nothing to install on the iPhone or the Mac: transcription runs in the browser and in the cloud.
- An account, but only when you are ready to transcribe the whole file. The 30-second preview needs no account.
A note on cost, kept short: Hushscript is not free, because it runs on top-tier transcription AI that has a real per-minute cost. It is pay-as-you-go with no subscription, and you get 30 free minutes to try it. Those minutes arrive instantly when you validate a card with a $1 hold that is authorized and then released right away, never charged. If you would rather pay another way, the 30 minutes are granted once, after your first purchase of minutes, and no card is required. Exact pack prices are on the pricing page.
Convert an M4A in three steps
The flow is built so you can hear the quality before you commit to anything.
- Drop the file and watch the preview. Go to voice recording to text and drag your .m4a onto the upload area, or click to browse. Hushscript transcribes the first 30 seconds and shows it back to you with speaker labels, with no account, no payment, and nothing to fill in. This preview is where you check that the audio is clear enough and that the speakers are being separated the way you expect.
- Sign up to transcribe the rest. If the preview looks right, enter your email to create an account. This is the step that enables full transcription, and it is also where your 30 free minutes come from. Files up to 10 hours or 2 GB are accepted, with no daily caps.
- Get, relabel, and export the transcript. Upload the full file and let it process. When it is done, the transcript appears in your dashboard with speakers marked. Rename “Speaker A” to a real name in a click, then export as TXT, SRT, DOCX, or JSON.
The moment your transcript is ready, the audio is deleted from the server. There is no setting to retain it and it is never used to train anything. You keep the transcript; we do not keep the recording.
How long it takes
Most Voice Memos are short — a thought captured on the way somewhere, a couple of minutes of notes — and those transcribe in well under 30 seconds. Longer recordings scale roughly with their length. A 30-minute memo of meeting notes finishes in a minute or two; a full hour takes a few minutes. You do not have to sit on the page while it runs. The transcript lands in your dashboard and you can come back to it.
Which languages work
Hushscript transcribes around 99 languages, detected automatically. A memo you recorded in Spanish, French, or Japanese transcribes without you setting anything — there is no language dropdown to get wrong. Accuracy is strongest in the most common languages and still solid in the long tail.
A worked example: a recorded interview
Say you recorded a 25-minute interview on your iPhone (you and one other person, both audible), saved by Voice Memos as interview-2026-06.m4a. You AirDrop it to your Mac, drop it onto the preview, and the first 30 seconds come back labeled. You sign up, transcribe the full file, and a couple of minutes later export a TXT that, with the default speaker labels left in place, reads like this:
Speaker A 00:00:04 Thanks for making the time. Can we start with how
the project actually began?
Speaker B 00:00:11 Sure. It started as a side experiment, honestly. We
didn't expect it to turn into the main thing.
Speaker A 00:00:19 And when did that shift happen?
Speaker B 00:00:23 Around the second prototype. That's when people
outside the team started using it daily.
Each turn carries a speaker tag and a timestamp, so quoting someone accurately is a matter of finding the line, not scrubbing back and forth through audio. If you export SRT instead, you get the same content cut into timed caption blocks, which is what you want if you ever pair the transcript with a video or audio player.
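To make the relationship between the two exports concrete, here is a rough sketch that reads the TXT layout from the sample and writes SRT-style caption blocks. It is illustrative only: Hushscript already exports SRT for you, the line layout ("speaker, HH:MM:SS, text") is assumed from the sample above rather than from any published spec, and the names Turn, parse_transcript, and to_srt are hypothetical. Because the TXT carries only start times, each caption is assumed to end when the next one starts, and the last one is given an arbitrary four seconds.

```python
import re
from dataclasses import dataclass

# Assumed line shape, taken from the sample above: "<speaker> HH:MM:SS <text>".
TURN_START = re.compile(
    r"^(?P<speaker>.+?) (?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2}) (?P<text>.*)$"
)


@dataclass
class Turn:
    speaker: str
    start: int  # seconds from the start of the recording
    text: str


def parse_transcript(raw: str) -> list[Turn]:
    """Split a TXT export into turns, folding wrapped lines into their turn."""
    turns: list[Turn] = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TURN_START.match(line)
        if match:
            start = int(match["h"]) * 3600 + int(match["m"]) * 60 + int(match["s"])
            turns.append(Turn(match["speaker"], start, match["text"]))
        elif turns:
            # No speaker and timestamp: treat it as a continuation line.
            turns[-1].text += " " + line
    return turns


def srt_time(seconds: int) -> str:
    """Format whole seconds as the HH:MM:SS,mmm stamp SRT uses."""
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d},000"


def to_srt(turns: list[Turn], last_turn_seconds: int = 4) -> str:
    """Render turns as numbered SRT blocks separated by blank lines."""
    blocks = []
    for i, turn in enumerate(turns):
        # Assumption: a caption ends when the next turn begins.
        end = turns[i + 1].start if i + 1 < len(turns) else turn.start + last_turn_seconds
        blocks.append(
            f"{i + 1}\n{srt_time(turn.start)} --> {srt_time(end)}\n"
            f"{turn.speaker}: {turn.text}\n"
        )
    return "\n".join(blocks)
```

One limitation worth knowing: a wrapped line that happens to contain something shaped like a timestamp would be mistaken for a new turn, so treat this as a reading aid rather than a robust parser.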
Keep sensitive memos private
Voice Memos tend to hold things you would never put in an email: interview notes, a doctor’s observations, minutes from a closed meeting, a private conversation you wanted to remember exactly. The contents are often more sensitive than a document, which is why how a tool treats the file matters here more than it would for, say, a podcast episode you are about to publish anyway.
Two things protect the recording with Hushscript. First, the audio is deleted the moment the transcript is ready, so there is no lingering copy sitting in storage. Second, the transcript that remains is encrypted at rest. If our storage were ever leaked, the contents would surface as unreadable ciphertext rather than your words. That is leak protection, stated plainly — it is not a claim that no one on our side can ever read a transcript, because the key is held on the server, not by you alone. The honest version is the useful one: a breach would expose ciphertext, and you can delete any transcript yourself in one click.
If you are converting a legal consultation, a patient note, or a confidential meeting, that combination — delete the audio, encrypt what is left, let you erase it — is the reason to choose an upload-based tool that discards rather than one that quietly keeps your files. The fuller explanation of how the whole pipeline handles your audio is at audio to text.
Two-sided call recordings
A lot of M4A files are recorded calls, and calls are where speaker separation earns its keep. If both parties are audible in the file, Hushscript separates them automatically — Speaker A and Speaker B map to the two voices — so you can read who said what instead of untangling one merged block.
A few things shape how well that works:
- Both sides in one track. This is what most call recorder apps produce: a stereo or mono mix carrying both parties. Speaker separation works cleanly on it.
- One-sided recordings. Some iOS call recorders only capture the device microphone, so only your side is on the file. The transcript will be accurate, but there is just one speaker to label.
- Call-audio quirks. Calls often carry compression artifacts, background noise, and the volume ducking that VoIP codecs apply when both people talk at once. Accuracy is usually a little lower than a face-to-face recording, so plan to skim proper nouns and any quiet stretches.
For the full walk-through of recording, consent, and cleanup specific to calls, see how to transcribe a phone call. For more on how the speaker tags themselves are produced, speaker identification goes deeper.
Troubleshooting common issues
Most M4A files transcribe without any fuss. When the result is not what you hoped, the cause is almost always one of these, and most are fixable.
The recording is too quiet
A memo recorded with the phone in a pocket or across a table comes out faint, and faint audio means more guessed words. The fix that helps most is at the source: hold the phone closer to whoever is speaking next time, and keep it off soft surfaces that muffle it. For a recording you already have, the transcript will still be largely right — read it against the audio for the quiet passages rather than trusting it blind.
Strong accents or fast speech
Accents are handled well, but heavy accents combined with fast, overlapping speech are the hardest case for any transcription engine. Expect the occasional swapped word or run-on. Names and technical terms are the usual casualties, so those are the first thing to proofread.
Two people talking at once
When speakers overlap, the boundary between them blurs and a few words can land under the wrong speaker tag. There is no perfect fix for crosstalk in a recording that already exists. If you control the recording, asking people not to talk over each other does more for the final transcript than any setting.
The file is large or long
Long memos are fine — up to 10 hours or 2 GB per file, with no daily limit — but a big file takes longer to process and you do not need to wait on the page. Start it, leave, and collect the transcript from your dashboard later. If a file refuses to upload, it is usually a flaky connection rather than the file itself; retrying on a stable network normally clears it.
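If you want to rule out the limits before you blame the connection, the check is simple arithmetic. The sketch below makes two assumptions the guide does not settle: that "2 GB" means 2 GiB (2 × 1024³ bytes), and that a file has to be under both limits at once. The function names are hypothetical, and you supply the duration yourself (Voice Memos shows it next to each recording), since reading it from the file would need an audio library.

```python
from pathlib import Path

MAX_BYTES = 2 * 1024**3     # "2 GB" read as 2 GiB; an assumption
MAX_SECONDS = 10 * 60 * 60  # 10 hours


def limit_problems(size_bytes: int, duration_seconds: float) -> list[str]:
    """Return a human-readable reason for each limit the file exceeds."""
    problems = []
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes / 1024**3:.2f} GiB, over the 2 GB limit")
    if duration_seconds > MAX_SECONDS:
        problems.append(
            f"recording runs {duration_seconds / 3600:.1f} h, over the 10 hour limit"
        )
    return problems


def check_file(path: Path, duration_seconds: float) -> list[str]:
    """Same check, taking the size from a file on disk."""
    return limit_problems(path.stat().st_size, duration_seconds)
```

An empty list means the file fits, in which case a failed upload points back at the network.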
An unusual Apple format
If you have something other than plain M4A — a CAF file, an AIFF, the audio inside an MP4 — you do not need to convert it first. Drop it as-is. If a format genuinely is not accepted, email support@hushscript.com and we will add it.
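File extensions are not always trustworthy, especially on forwarded voice notes, so it can help to look at what a file actually is before you write to support. The sketch below guesses the container from the first 12 bytes, using the well-known signatures for MPEG-4 family files (an ftyp box), CAF, AIFF, and Ogg. The function names are hypothetical, the check is a heuristic rather than a full parser, and it says nothing about which formats Hushscript accepts.

```python
def sniff_audio_container(header: bytes) -> str:
    """Guess the container format from the first 12 bytes of a file."""
    # MPEG-4 family (M4A, MP4, MOV-like): a 4-byte size, then "ftyp", then a brand.
    if header[4:8] == b"ftyp":
        brand = header[8:12].decode("ascii", errors="replace").strip()
        return f"MPEG-4 ({brand})"
    # Core Audio Format files start with "caff".
    if header[:4] == b"caff":
        return "CAF"
    # AIFF is an IFF "FORM" chunk whose form type is AIFF or AIFC.
    if header[:4] == b"FORM" and header[8:12] in (b"AIFF", b"AIFC"):
        return "AIFF"
    # Ogg pages (used for Opus and Vorbis voice notes) start with "OggS".
    if header[:4] == b"OggS":
        return "Ogg"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of the file to classify it."""
    with open(path, "rb") as handle:
        return sniff_audio_container(handle.read(12))
```

A typical Voice Memo comes back as "MPEG-4 (M4A)". An "unknown" result does not mean the file is broken, only that it is not one of the four shapes this sketch recognizes.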
Preview first, or just transcribe?
The 30-second preview is free and needs no account, so the simple rule is: when the audio quality is in any doubt — a recorded call, something captured at a distance, a noisy room — preview first. You will see in 30 seconds whether the speakers separate and whether the words are landing, before you spend any of your minutes. For a clean memo you recorded close to the mic, you can reasonably skip straight to signing up and transcribing the whole thing.
Accuracy tips
A few habits raise the quality of the transcript more than any single setting:
- Record close to the speaker. Distance is the biggest single factor. A phone near the person who is talking beats an expensive setup across the room.
- Cut obvious noise at the source. Fans, traffic, and a TV in the background all cost you words. Recording away from them helps more than anything you can do afterward.
- Let the preview vet the file. It is the cheapest accuracy check you have — free, no account, 30 seconds.
- Proofread proper nouns first. Names, places, and jargon are where errors cluster. Skim those before you trust the rest.
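For the last tip, a few lines of code can pull out the words most worth checking in an exported TXT. This is a crude heuristic, offered as a sketch: it counts capitalized words that do not open a sentence, so it misses names at the start of a sentence and knows nothing about jargon written in lowercase. The function name is hypothetical.

```python
import re
from collections import Counter


def proper_noun_candidates(text: str) -> Counter:
    """Count capitalized words that are not the first word of a sentence."""
    counts: Counter = Counter()
    # Split after sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[A-Za-z]+", sentence)
        for word in words[1:]:  # the opener is capitalized regardless
            # Single letters are skipped so that "I" is not reported.
            if word[0].isupper() and len(word) > 1:
                counts[word] += 1
    return counts
```

Sorting the result with most_common() puts frequently repeated names first, which is usually the order in which a misspelling does the most damage.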
Wrapping up
M4A is M4A whatever created it, so the steps here work the same for an Android voice recorder or a forwarded WhatsApp note as they do for an Apple Voice Memo. Drop the file, check the 30-second preview, sign up to transcribe the rest, and export a speaker-labeled transcript — while the audio gets deleted and the transcript stays encrypted at rest.
If you are working through a pile of recordings in different formats, how to transcribe any audio file lays out the full format list and the best approach for each.