Publications | Malek Itani

2024

Hearable devices with sound bubbles

Tuochao Chen, Malek Itani, Sefik Emre Eskimez, and 2 more authors

Nature Electronics, 2024

The human auditory system has a limited ability to perceive distance and distinguish speakers in crowded settings. A headset technology that can create a sound bubble in which all speakers within the bubble are audible but speakers and noise outside the bubble are suppressed could augment human hearing. However, developing such technology is challenging. Here, we report an intelligent headset system capable of creating sound bubbles. The system is based on real-time neural networks that use acoustic data from up to six microphones integrated into noise-cancelling headsets and are run on the device, processing 8 ms audio chunks in 6.36 ms on an embedded central processing unit. Our neural networks can generate sound bubbles with programmable radii between 1 m and 2 m, and with output signals that reduce the intensity of sounds outside the bubble by 49 dB. With previously unseen environments and wearers, our system can focus on up to two speakers within the bubble, with one to two interfering speakers and noise outside the bubble.
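
The timing and attenuation figures in this abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the 49 dB figure is a power (intensity) ratio, so it is converted with 10^(dB/10).

```python
def real_time_factor(processing_ms: float, chunk_ms: float) -> float:
    """Fraction of each chunk's duration spent computing (below 1 keeps up with real time)."""
    return processing_ms / chunk_ms


def db_to_power_ratio(db: float) -> float:
    """Convert a level change in decibels to a linear power (intensity) ratio."""
    return 10 ** (db / 10)


CHUNK_MS = 8.0        # chunk duration from the abstract
PROCESSING_MS = 6.36  # per-chunk processing time from the abstract

rtf = real_time_factor(PROCESSING_MS, CHUNK_MS)
headroom_ms = CHUNK_MS - PROCESSING_MS
attenuation = db_to_power_ratio(-49.0)  # assumption: dB refers to a power ratio

print(f"real-time factor: {rtf:.3f}")
print(f"headroom per chunk: {headroom_ms:.2f} ms")
print(f"power ratio at -49 dB: {attenuation:.2e}")
```

A real-time factor below 1 means each chunk is processed before the next one arrives, which is the condition the abstract's timing numbers satisfy.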

Target conversation extraction: Source separation using turn-taking dynamics

Tuochao Chen, Qirui Wang, Bohan Wu, and 4 more authors

In Interspeech, 2024

Extracting the speech of participants in a conversation amidst interfering speakers and noise presents a challenging problem. In this paper, we introduce the novel task of target conversation extraction, where the goal is to extract the audio of a target conversation based on the speaker embedding of one of its participants. To accomplish this, we propose leveraging temporal patterns inherent in human conversations, particularly turn-taking dynamics, which uniquely characterize speakers engaged in conversation and distinguish them from interfering speakers and noise. Using neural networks, we show the feasibility of our approach on English and Mandarin conversation datasets. In the presence of interfering speakers, our results show an 8.19 dB improvement in signal-to-noise ratio for 2-speaker conversations and a 7.92 dB improvement for 2-4-speaker conversations. Code, dataset available at this https URL.

Knowledge boosting during low-latency inference

Vidya Srinivas, Malek Itani, Tuochao Chen, and 3 more authors

In Interspeech, 2024

Models for low-latency, streaming applications could benefit from the knowledge capacity of larger models, but edge devices cannot run these models due to resource constraints. A possible solution is to transfer hints during inference from a large model running remotely to a small model running on-device. However, this incurs a communication delay that breaks real-time requirements and does not guarantee that both models will operate on the same data at the same time. We propose knowledge boosting, a novel technique that allows a large model to operate on time-delayed input during inference, while still boosting small model performance. Using a streaming neural network that processes 8 ms chunks, we evaluate different speech separation and enhancement tasks with communication delays of up to six chunks or 48 ms. Our results show larger gains where the performance gap between the small and large models is wide, demonstrating a promising method for large-small model collaboration for low-latency applications. Code, dataset, and audio samples available at this https URL.
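
The idea of a small on-device model consuming hints that a large remote model computed on older input can be sketched as a toy streaming loop. This is not the authors' implementation: `small_model`, `large_model`, and `stream_with_delayed_hints` are hypothetical stand-ins, and the way real hints are fused into the small model is not described in the abstract. Only the 8 ms chunk size and the six-chunk (48 ms) maximum delay come from the text.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, Optional

CHUNK_MS = 8  # chunk duration from the abstract


def delay_ms(delay_chunks: int, chunk_ms: int = CHUNK_MS) -> int:
    """Communication delay in milliseconds for a delay expressed in chunks."""
    return delay_chunks * chunk_ms


def stream_with_delayed_hints(
    chunks: Iterable[int],
    small_model: Callable[[int, Optional[int]], tuple],
    large_model: Callable[[int], int],
    delay_chunks: int,
) -> Iterator[tuple]:
    """Yield small-model outputs; the hint used at step t comes from chunk t - delay_chunks."""
    in_flight = deque()  # hints already computed remotely but not yet delivered
    for chunk in chunks:
        in_flight.append(large_model(chunk))
        # A hint is delivered only once it has aged by delay_chunks steps.
        hint = in_flight.popleft() if len(in_flight) > delay_chunks else None
        yield small_model(chunk, hint)


# Hypothetical toy models: the "hint" is the chunk index times ten.
outputs = list(
    stream_with_delayed_hints(
        range(5),
        small_model=lambda chunk, hint: (chunk, hint),
        large_model=lambda chunk: chunk * 10,
        delay_chunks=2,
    )
)
print(outputs)
print(delay_ms(6), "ms")
```

With a delay of two chunks, the first two steps run without any hint, and every later step pairs the current chunk with a hint derived from the chunk two steps earlier.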

Look Once to Hear: Target Speech Hearing with Noisy Examples

Bandhav Veluri, Malek Itani, Tuochao Chen, and 2 more authors

In Proceedings of the CHI Conference on Human Factors in Computing Systems, 2024

In crowded settings, the human brain can focus on speech from a target speaker, given prior knowledge of how they sound. We introduce a novel intelligent hearable system that achieves this capability, enabling target speech hearing that ignores all interfering speech and noise except the target speaker. A naïve approach is to require a clean speech example to enroll the target speaker. This is, however, not well aligned with the hearable application domain, since obtaining a clean example is challenging in real-world scenarios, creating a unique user interface problem. We present the first enrollment interface where the wearer looks at the target speaker for a few seconds to capture a single, short, highly noisy, binaural example of the target speaker. This noisy example is used for enrollment and subsequent speech extraction in the presence of interfering speakers and noise. Our system achieves a signal quality improvement of 7.01 dB using less than 5 seconds of noisy enrollment audio and can process 8 ms audio chunks in 6.24 ms on an embedded CPU. Our user studies demonstrate generalization to real-world static and mobile speakers in previously unseen indoor and outdoor multipath environments. Finally, our enrollment interface for noisy examples does not cause performance degradation compared to clean examples, while being convenient and user-friendly. More broadly, this paper takes an important step towards enhancing human auditory perception with artificial intelligence.

2023

Semantic Hearing: Programming Acoustic Scenes with Binaural Hearables

Bandhav Veluri, Malek Itani, Justin Chan, and 2 more authors

In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023

Imagine being able to listen to the birds chirping in a park without hearing the chatter from other hikers, or being able to block out traffic noise on a busy street while still being able to hear emergency sirens and car honks. We introduce semantic hearing, a novel capability for hearable devices that enables them to, in real-time, focus on, or ignore, specific sounds from real-world environments, while also preserving the spatial cues. To achieve this, we make two technical contributions: 1) we present the first neural network that can achieve binaural target sound extraction in the presence of interfering sounds and background noise, and 2) we design a training methodology that allows our system to generalize to real-world use. Results show that our system can operate with 20 sound classes and that our transformer-based network has a runtime of 6.56 ms on a connected smartphone. In-the-wild evaluation with participants in previously unseen indoor and outdoor scenarios shows that our proof-of-concept system can extract the target sounds and generalize to preserve the spatial cues in its binaural output. Project page with code: https://semantichearing.cs.washington.edu

Creating speech zones with self-distributing acoustic swarms

Malek Itani, Tuochao Chen, Takuya Yoshioka, and 1 more author

Nature Communications, Sep 2023

Imagine being in a crowded room with a cacophony of speakers and having the ability to focus on or remove speech from a specific 2D region. This would require understanding and manipulating an acoustic scene, isolating each speaker, and associating a 2D spatial context with each constituent speech. However, separating speech from a large number of concurrent speakers in a room into individual streams and identifying their precise 2D locations is challenging, even for the human brain. Here, we present the first acoustic swarm that demonstrates cooperative navigation with centimeter-resolution using sound, eliminating the need for cameras or external infrastructure. Our acoustic swarm forms a self-distributing wireless microphone array, which, along with our attention-based neural network framework, lets us separate and localize concurrent human speakers in the 2D space, enabling speech zones. Our evaluations showed that the acoustic swarm could localize and separate 3-5 concurrent speech sources in real-world unseen reverberant environments with median and 90-percentile 2D errors of 15 cm and 50 cm, respectively. Our system enables applications like mute zones (parts of the room where sounds are muted), active zones (regions where sounds are captured), multi-conversation separation and location-aware interaction.

Wireless Earbuds for Low-Cost Hearing Screening

Justin Chan, Antonio Glenn, Malek Itani, and 5 more authors

In Proceedings of the 21st Annual International Conference on Mobile Systems, Applications and Services, Jun 2023

We present the first wireless earbud hardware that can perform hearing screening by detecting otoacoustic emissions. The conventional wisdom has been that detecting otoacoustic emissions, which are the faint sounds generated by the cochlea, requires sensitive and expensive acoustic hardware. Thus, medical devices for hearing screening cost thousands of dollars and are inaccessible in low- and middle-income countries. We show that by designing wireless earbuds using low-cost acoustic hardware and combining them with wireless sensing algorithms, we can reliably identify otoacoustic emissions and perform hearing screening. Our algorithms combine frequency-modulated chirps with wideband pulses emitted from a low-cost speaker to reliably separate otoacoustic emissions from in-ear reflections and echoes. We conducted a clinical study with 50 ears across two healthcare sites. Our study shows that the low-cost earbuds detect hearing loss with 100% sensitivity and 89.7% specificity, which is comparable to the performance of an $8000 medical device. By developing low-cost and open-source wearable technology, our work may help address global health inequities in hearing screening by democratizing these medical devices. Open-source hardware and code can be found here: https://github.com/uw-x/OAEbuds

Real-Time Target Sound Extraction

Bandhav Veluri, Justin Chan, Malek Itani, and 3 more authors

In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun 2023

We present the first neural network model to achieve real-time and streaming target sound extraction. To accomplish this, we propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder, and a transformer decoder layer as the decoder. This hybrid architecture uses dilated causal convolutions for processing large receptive fields in a computationally efficient manner, while also leveraging the generalization performance of transformer-based architectures. Our evaluations show as much as 2.2–3.3 dB improvement in SI-SNRi compared to the prior models for this task while having a 1.2–4x smaller model size and a 1.5–2x lower runtime. We provide code, dataset, and audio samples: https://waveformer.cs.washington.edu/.
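
To illustrate why a stack of dilated causal convolutions covers a large receptive field cheaply, here is a minimal pure-Python sketch. It is not the Waveformer code: the kernel size of 3, the dilations 1, 2, 4, 8, and the all-ones weights are hypothetical values chosen only to make the receptive field easy to verify.

```python
def causal_dilated_conv(x, weights, dilation):
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation], treating x as zero before t = 0."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # causal: only current and past samples contribute
                acc += w * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output of the stacked layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


KERNEL_SIZE = 3           # hypothetical
DILATIONS = [1, 2, 4, 8]  # hypothetical, doubling per layer

# Push a unit impulse through the stack; the length of the response
# equals the receptive field when all weights are positive.
signal = [1.0] + [0.0] * 39
for d in DILATIONS:
    signal = causal_dilated_conv(signal, [1.0] * KERNEL_SIZE, d)

support = sum(1 for v in signal if v != 0.0)
print(receptive_field(KERNEL_SIZE, DILATIONS), support)
```

Four layers with three taps each reach 31 past samples, whereas four undilated layers of the same size would reach only 9, which is the efficiency argument the abstract makes for this kind of encoder.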