Voice AI Systems Are Vulnerable to Hidden Audio Attacks

Michael Sintim-Koree · May 2026

Voice AI is everywhere now. Call center automation, smart speakers, car infotainment, hospital dictation systems, building access control. Organizations are deploying voice-driven interfaces faster than they're thinking through what happens when someone decides to attack them.

The attack class I want to walk through is adversarial audio: sounds specifically crafted to manipulate a speech recognition or voice command system, often in ways a human listener cannot detect. This is not theoretical. The research has been public for years, and the gap between academic demonstrations and practical exploitation keeps closing.


How speech recognition actually works

Before getting into the attacks, it helps to understand the pipeline they're targeting. Modern automatic speech recognition systems (the ones powering Alexa, Google Assistant, Whisper, Azure Speech) almost all use deep neural networks operating on spectral representations of audio rather than raw waveforms.

An input audio clip gets converted into a mel-frequency spectrogram: a 2D representation of how energy is distributed across frequency bands over time, weighted to match how human hearing perceives pitch. That spectrogram feeds into a neural network, typically an encoder-decoder architecture with attention, which outputs a probability distribution over tokens at each time step. A language model layer then constrains the output to plausible sequences. The final transcript is the highest-probability token sequence across the full input.
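The front end described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any vendor's actual feature extractor: the 16 kHz sample rate, 25 ms window, 10 ms hop, and 40 mel bands are assumed values chosen for the example, and the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(hz):
    # Common mel-scale formula: roughly linear at low pitch, logarithmic above.
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sample_rate, n_fft, n_mels):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        lo, mid, hi = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_mel_spectrogram(wave, sample_rate=16000, n_fft=400, hop=160, n_mels=40):
    """Frame the waveform, take the power spectrum, pool into mel bands, take the log."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2          # (n_frames, n_fft // 2 + 1)
    mel_energy = power @ mel_filterbank(sample_rate, n_fft, n_mels).T
    return np.log(mel_energy + 1e-10)                          # (n_frames, n_mels)

# One second of a 440 Hz tone stands in for real speech.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
wave = np.sin(2 * np.pi * 440.0 * t)
spec = log_mel_spectrogram(wave, sample_rate)
peak_band = int(np.argmax(spec.mean(axis=0)))
```

The array `spec` is what the neural network sees: time frames by mel bands, with no notion of how the clip sounds to a listener.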

The critical detail: the model is operating on the spectrogram, not on what the audio sounds like to a human. Human hearing and mel-spectrogram analysis share a lot of overlap (that's by design), but they are not identical. The gap between them is exactly where adversarial attacks live.


Four ways the attack surface actually works

Adversarial perturbations: inaudible noise, wrong transcript

The oldest and most studied variant. Take an audio clip (any audio clip: music, ambient noise, a spoken sentence) and add a carefully computed noise signal to it. The added noise is meant to go unnoticed by a human: the perturbation is kept far quieter than the underlying audio and, in psychoacoustically tuned variants, concentrated in frequency bands where the ear is least sensitive. The modified audio sounds essentially identical to the original. The speech recognition model transcribes it as something else entirely.

The 2018 Carlini and Wagner paper demonstrated this against DeepSpeech: starting from an arbitrary audio waveform, they could produce a perturbation that caused the model to output any target transcription, achieving a 100% success rate in their white-box evaluation. The attack required white-box access to the model (knowing the weights and gradients), which limited practical relevance at the time. Subsequent work on transferability and black-box variants has narrowed that limitation considerably. That narrowing is what makes the Carlini and Wagner result matter now in a way it didn't five years ago.
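To make the white-box mechanics concrete, here is a toy sketch. It is not Carlini and Wagner's method, which optimized against DeepSpeech's CTC loss. It swaps in a random linear classifier as a stand-in model and a projected sign-gradient loop as the optimizer. Every name here (`W`, `targeted_perturbation`, `epsilon`) is hypothetical, and a perturbation at 10% of the signal's amplitude would not be inaudible in practice. The point is only the shape of the attack: use the model's gradients to push the output toward a chosen target while keeping the change bounded.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_classes = 16000, 5

# Hypothetical stand-in "model": a fixed random linear classifier over raw samples.
W = rng.normal(size=(n_classes, n_samples)) / np.sqrt(n_samples)

def predict_logits(x):
    return W @ x

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def targeted_perturbation(x, target, epsilon, step=0.01, n_iters=200):
    """Projected sign-gradient descent on the cross-entropy toward `target`.

    White-box: the gradient is computed directly from the model weights W.
    The perturbation is clipped so that no sample changes by more than epsilon.
    """
    delta = np.zeros_like(x)
    onehot = np.eye(n_classes)[target]
    for _ in range(n_iters):
        probs = softmax(predict_logits(x + delta))
        grad = W.T @ (probs - onehot)          # d(cross-entropy)/d(input)
        delta = np.clip(delta - step * np.sign(grad), -epsilon, epsilon)
    return delta

x = rng.normal(size=n_samples)                 # stands in for a clean one-second clip
original = int(np.argmax(predict_logits(x)))
target = (original + 1) % n_classes            # any label other than the current one
delta = targeted_perturbation(x, target, epsilon=0.1)
adversarial = int(np.argmax(predict_logits(x + delta)))
```

Black-box and transfer attacks replace the exact gradient with an estimate or with gradients from a substitute model, which is why losing access to the weights narrows the attack rather than stopping it.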

Hidden voice commands: audio that sounds like noise, acts like an instruction

Rather than perturbing existing audio, these attacks craft audio that is inherently unintelligible to humans but that a voice recognition system interprets as a valid command. The 2016 Hidden Voice Commands paper from researchers at Georgetown and UC Berkeley showed this was possible against Google Now and Apple Siri using obfuscated audio that sounded like noise or static. The practical scenario: an attacker embeds a hidden command in audio playing in a public space, a video, a podcast ad, or a website's background audio. The user's smart speaker or phone hears it and executes. The user hears nothing that sounds like a command. This requires no sophisticated hardware and no physical access.

Ultrasonic attacks: commands injected above the range of human hearing

DolphinAttack, published in 2017 and extended substantially since, takes the hiding one step further. Voice commands are modulated onto an ultrasonic carrier: frequencies above 20 kHz that human hearing simply cannot perceive. When that ultrasonic signal hits a microphone, the hardware nonlinearity of the microphone circuits demodulates it, recovering an audible baseband signal that the speech recognition model sees as normal voice input. The attack works against commodity microphones because the nonlinearity is a hardware property, not a software flaw. Original research demonstrated the attack at distances up to roughly one meter in controlled conditions, though practical range varies by hardware setup.
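The demodulation step is easy to reproduce numerically. The sketch below is a simplified simulation, not a model of any real microphone: a 1 kHz tone stands in for the voice command, the carrier is 30 kHz, and the microphone is assumed to have a small squared term in its response (`quadratic_gain` is a made-up value).

```python
import numpy as np

sample_rate = 192_000                              # high enough to represent a 30 kHz carrier
t = np.arange(sample_rate // 10) / sample_rate     # 100 ms of signal

voice = np.sin(2 * np.pi * 1_000 * t)              # 1 kHz tone stands in for a spoken command
carrier = np.cos(2 * np.pi * 30_000 * t)           # ultrasonic carrier
transmitted = (1.0 + 0.8 * voice) * carrier        # amplitude modulation, depth 0.8

def microphone(signal, quadratic_gain=0.1):
    """Hypothetical mic model: linear response plus a small squared (nonlinear) term."""
    return signal + quadratic_gain * signal ** 2

def amplitude_at(signal, freq_hz):
    """Amplitude of the sinusoidal component nearest freq_hz."""
    spectrum = np.abs(np.fft.rfft(signal)) / (len(signal) / 2)
    freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)
    return spectrum[np.argmin(np.abs(freqs - freq_hz))]

sent_at_1khz = amplitude_at(transmitted, 1_000)
received_at_1khz = amplitude_at(microphone(transmitted), 1_000)
```

In the air, the signal has energy only around 29 to 31 kHz, so `sent_at_1khz` is effectively zero. After the squared term, a 1 kHz component appears with amplitude `quadratic_gain` times the modulation depth (0.08 here), which is the audible "command" the recognizer then processes.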

SurfingAttack, a 2020 paper presented at NDSS, showed that ultrasonic commands can propagate through solid surfaces: a phone lying on a table can receive commands injected through the table itself via a hidden piezoelectric transducer underneath. That result reframes the threat model: the attack surface isn't just airborne audio.

Over-the-network attacks: surviving compression and transmission

Most demonstrations assume the attacker and the target device are in the same physical space. Several more recent papers have shown that adversarial audio can survive compression and transmission through phone networks, VoIP, and streaming pipelines. The perturbation degrades through lossy compression, so attacks designed for studio-quality audio don't always transfer. But attack optimization targeting the specific codec in use (MP3 compression characteristics, Opus codec behavior) can produce perturbations that survive. A voice authentication system behind a phone interface is potentially reachable from anywhere.


Who is actually at risk

Consumer smart speakers get the press coverage. They're not where the serious risk is.

Voice-authenticated systems in financial services are the category that warrants the most concern. Several banks use voice biometrics for customer authentication over the phone: voiceprints compared against a stored model. Voice cloning tools are now good enough to produce convincing audio from a short reference sample, and that's a separate but related problem. The adversarial angle compounds it: an attacker who can manipulate the recognition model's interpretation doesn't even need a perfect clone. They need audio that scores above the threshold.

Healthcare dictation is another high-stakes context. An AI transcription system that takes physician dictation and writes directly into an EHR is an obvious target. Manipulating a dosage, altering a diagnosis code, inserting a medication name into a transcription: these are not hypothetical attack goals. There are no widely documented clinical incidents, but the attack surface is real and a mistranscription in a medication order doesn't need to be dramatic to be dangerous.

Building access and physical security systems running voice authentication add a physical dimension to what was otherwise a data integrity problem. Voice-controlled locks, elevator systems, secure facility intercoms: all have been targeted in research contexts. The gap between research and real-world exploitation tends to be smaller than vendors admit.


Why defense is genuinely hard

Adversarial robustness: still unsolved after a decade of trying

Neural network robustness against adversarial examples is an active research area and nobody has a complete solution. Adversarial training (augmenting training data with adversarial examples) improves robustness against known attack types but degrades performance on benign inputs. The tradeoff doesn't fully resolve. Certified defenses that provide formal guarantees against perturbations within a specific bound exist for image classification, but the speech domain is harder because the perturbation space and the perceptual distance metric are both more complex. An L-infinity bound in pixel space has a rough human perceptual analog. The equivalent in spectrogram space is less clear, which makes "imperceptible perturbation" harder to define technically and harder to reason about for standards and compliance purposes.

Microphone hardware nonlinearity: no patch for physics

The ultrasonic attack vector works because of physical properties of microphone circuits. Low-pass or bandpass filtering at the hardware or signal processing level can reduce susceptibility: filtering out ultrasonic frequencies before the nonlinearity stage can block the carrier. Some newer devices incorporate this. Most deployed hardware doesn't, and it won't be retrofitted.
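A small simulation shows why filter placement matters. This is a sketch under simplifying assumptions: the microphone nonlinearity is modeled as a squared term with a made-up gain, a 300 Hz tone stands in for legitimate speech, and a digital windowed-sinc FIR filter stands in for what would need to be an analog filter ahead of the nonlinear stage in real hardware. All function names are hypothetical.

```python
import numpy as np

sample_rate = 192_000
t = np.arange(sample_rate // 10) / sample_rate

speech = 0.5 * np.sin(2 * np.pi * 300 * t)                      # legitimate audible input
attack = (1.0 + 0.8 * np.sin(2 * np.pi * 1_000 * t)) * np.cos(2 * np.pi * 30_000 * t)
incoming = speech + attack                                       # what reaches the microphone

def lowpass_fir(cutoff_hz, n_taps=255):
    """Windowed-sinc low-pass coefficients, normalized to unity gain at DC."""
    n = np.arange(n_taps) - (n_taps - 1) / 2
    taps = np.sinc(2 * cutoff_hz / sample_rate * n) * np.hamming(n_taps)
    return taps / taps.sum()

def nonlinear_stage(signal, quadratic_gain=0.1):
    """Assumed nonlinearity: linear response plus a small squared term."""
    return signal + quadratic_gain * signal ** 2

def amplitude_at(signal, freq_hz):
    spectrum = np.abs(np.fft.rfft(signal)) / (len(signal) / 2)
    freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)
    return spectrum[np.argmin(np.abs(freqs - freq_hz))]

# Without filtering, the squared term demodulates the hidden 1 kHz "command".
unfiltered = nonlinear_stage(incoming)

# With a 16 kHz low-pass ahead of the nonlinearity, the carrier never reaches it.
filtered = nonlinear_stage(np.convolve(incoming, lowpass_fir(16_000), mode="same"))

injected_without_filter = amplitude_at(unfiltered, 1_000)
injected_with_filter = amplitude_at(filtered, 1_000)
speech_with_filter = amplitude_at(filtered, 300)
```

The injected 1 kHz component drops to a negligible level with the filter in place, while the 300 Hz speech tone passes through essentially unchanged. Applying the same filter after `nonlinear_stage` would not help, because by then the command is ordinary audible-band audio.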

Third-party models: inheriting someone else's security posture

Organizations deploying voice AI are usually integrating a third-party model via API or a packaged SDK. They don't control the model architecture, the training data, or the inference pipeline. When a vulnerability is found in the underlying model, the organization's options are: wait for the vendor to patch and update, or stop using the system. The security posture of the voice AI stack is largely inherited, not owned. That constraint, not the attack techniques themselves, is what will derail most serious security programs here. Teams can do everything right on their side of the integration and still be exposed.


What actually helps

Liveness detection and multi-modal verification

Voice authentication systems that add liveness detection (verifying that the audio is coming from a live human in real time, not a replay or a generated signal) are meaningfully harder to attack. Challenge-response liveness (say this specific phrase now) helps against replay and cloning. Acoustic liveness detection that analyzes microphone characteristics and room acoustics can flag playback from a speaker. Neither is bulletproof, but both raise the cost of attack.
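The challenge-response idea reduces to a few rules: the phrase is random, it expires quickly, and it can be used once. A minimal sketch follows, assuming the transcript comes from whatever speech recognizer the system already uses and that speaker verification happens separately. The word list, time-to-live, and function names are all illustrative.

```python
import secrets
import time

# Illustrative word list; a real deployment would use a much larger vocabulary.
WORDS = ["amber", "canyon", "delta", "harbor", "lantern", "meadow", "orbit", "willow"]

def issue_challenge(n_words=3, ttl_seconds=10.0, now=None):
    """Create an unpredictable phrase the caller must speak before it expires."""
    now = time.monotonic() if now is None else now
    phrase = " ".join(secrets.choice(WORDS) for _ in range(n_words))
    return {"phrase": phrase, "expires_at": now + ttl_seconds, "used": False}

def normalize(text):
    """Lowercase, drop punctuation, collapse whitespace."""
    kept = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
    return " ".join(kept.split())

def verify_response(challenge, transcript, now=None):
    """Accept only a fresh, first-time, exact match of the challenge phrase."""
    now = time.monotonic() if now is None else now
    if challenge["used"] or now > challenge["expires_at"]:
        return False
    challenge["used"] = True          # single use, even if this attempt fails
    return normalize(transcript) == challenge["phrase"]

challenge = issue_challenge(now=100.0)
first_attempt = verify_response(challenge, challenge["phrase"].upper() + "!", now=105.0)
replay_attempt = verify_response(challenge, challenge["phrase"], now=106.0)
```

A pre-recorded sample cannot contain a phrase chosen after the recording was made, which is what defeats simple replay. A real-time cloning system can still speak the phrase, which is why this raises cost rather than closing the hole.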

Multi-factor authentication that doesn't rely solely on voice is the cleaner answer for anything high-stakes. Voice biometrics as one factor alongside device binding, a PIN, or a push notification to a registered device limits the blast radius of a compromised voice channel substantially.

Input validation and anomaly detection

Voice command systems that execute high-privilege actions benefit from a confirmation step the adversarial audio can't also manipulate. A smart home system that reads back a command and requires a spoken confirmation before executing forces the attacker to land a second, correctly timed injection, and the audible readback gives anyone nearby a chance to notice. For systems with a human operator in the loop (call center AI that routes calls before a human picks up), anomaly detection on transcription confidence scores can flag inputs generating unusually high uncertainty in the model, which can be a signature of adversarial examples.
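One simple way to act on confidence scores is to compare each utterance against a baseline built from known-benign traffic. The sketch below assumes the recognizer exposes per-token log-probabilities, which not every API does. The class name, the calibration numbers, and the threshold of three standard deviations are illustrative, not recommendations.

```python
from statistics import mean, stdev

def utterance_confidence(token_logprobs):
    """Average per-token log-probability: closer to 0 means more confident."""
    return mean(token_logprobs)

class ConfidenceMonitor:
    """Flags utterances whose confidence falls far below a benign baseline."""

    def __init__(self, benign_scores, z_threshold=3.0):
        self.baseline_mean = mean(benign_scores)
        self.baseline_std = stdev(benign_scores)
        self.z_threshold = z_threshold

    def z_score(self, token_logprobs):
        score = utterance_confidence(token_logprobs)
        return (score - self.baseline_mean) / self.baseline_std

    def is_anomalous(self, token_logprobs):
        # One-sided check: only unusually LOW confidence is flagged.
        return self.z_score(token_logprobs) < -self.z_threshold

# Made-up calibration scores from utterances known to be benign.
monitor = ConfidenceMonitor([-0.21, -0.18, -0.25, -0.20, -0.23, -0.19])

typical = monitor.is_anomalous([-0.20, -0.22, -0.19])     # close to the baseline
suspicious = monitor.is_anomalous([-1.4, -0.9, -2.1])     # far less confident than usual
```

Flagged inputs would be routed to a human or to a stricter verification path rather than rejected outright. An attacker who optimizes for high confidence can evade a check like this, so it works as one signal among several, not as a gate on its own.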

Physical environment controls

For systems where the physical environment can be controlled: directional microphone arrays configured to reject audio from directions other than the expected speaker, acoustic shielding for high-security voice-authenticated rooms, bandpass filtering to reject ultrasonic frequencies before they reach the speech recognition pipeline. This is standard audio engineering applied to a security context. The reason it gets overlooked is that voice AI deployments are usually managed by software teams who don't think in terms of microphone signal chains.

Vendor transparency and model provenance

When evaluating voice AI vendors, ask specific questions about adversarial robustness testing. What attack classes are they testing against? Is there published evaluation against known adversarial audio benchmarks? How quickly is the model updated when new attack techniques are published? Vendors who haven't thought about this will make that obvious in how they answer. That's useful information before deploying something in a clinical or financial context.


The deployment decisions that matter most

The risk profile of voice AI is directly tied to what actions the system can take. A voice interface that controls a playlist carries different stakes than one that authenticates a wire transfer or unlocks a server room door. That's obvious, but organizations don't always follow it through to the architecture.

High-consequence actions should not be voice-only. Voice-controlled systems that execute anything with significant downstream effects (financial transactions, access control, medical record writes) need a secondary confirmation channel that is out-of-band from the voice input. If voice is the only authentication factor and the only confirmation mechanism, any voice channel attack is a full compromise. The implementation varies; the principle doesn't.

Least privilege applies here the same way it applies to service accounts and API tokens. A voice assistant with access to everything an authenticated user can do is a much larger target than one scoped to read-only operations or a specific set of approved commands. Constrain what the voice channel can actually authorize, and the adversarial audio attack surface shrinks proportionally.
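Both principles, scoping the voice channel and requiring out-of-band confirmation for high-consequence actions, fit in a small authorization gate. This is a sketch under assumptions: the command names, the two risk tiers, and the `confirmed_out_of_band` flag (which would be set by a push approval, PIN entry, or similar second channel) are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class Risk(Enum):
    LOW = 1
    HIGH = 2

@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str

# Explicit allowlist: anything not named here cannot be triggered by voice at all.
VOICE_SCOPE = {
    "play_music": Risk.LOW,
    "read_balance": Risk.LOW,
    "transfer_funds": Risk.HIGH,
    "unlock_door": Risk.HIGH,
}

def authorize(command, confirmed_out_of_band=False):
    """Decide whether a command recognized on the voice channel may execute."""
    risk = VOICE_SCOPE.get(command)
    if risk is None:
        return Decision(False, "command is outside the voice channel's scope")
    if risk is Risk.HIGH and not confirmed_out_of_band:
        return Decision(False, "requires confirmation on a second channel")
    return Decision(True, "ok")

playlist = authorize("play_music")
transfer_voice_only = authorize("transfer_funds")
transfer_confirmed = authorize("transfer_funds", confirmed_out_of_band=True)
unscoped = authorize("delete_account")
```

With a gate like this, a successful audio injection can at most trigger low-risk commands or start a high-risk request that stalls until someone approves it on a channel the attacker's audio cannot reach.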


Voice AI is being deployed into regulated and high-stakes contexts at a pace that outstrips both the security research community's ability to characterize risks and vendors' willingness to communicate them clearly. That's not a new pattern in tech, but it's one that tends to produce bad outcomes discovered in production rather than in a test environment.

The attacks are real. The defenses are incomplete. Deploy voice AI with a clear view of what the attack surface actually is, rather than treating the voice channel as implicitly trustworthy because it sounds like a human interaction.


If you're running voice authentication in a financial or clinical environment and your vendor hasn't given you a straight answer on adversarial robustness testing, I'd genuinely like to hear how that conversation went.