How LLMs are learning to differentiate spatial sounds


February 12, 2024 4:26 PM

Humans have unique sensory abilities, among them binaural hearing — we can identify a type of sound, tell what direction it is coming from and how far away it is, and differentiate multiple sources of sound occurring at once. 
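To make the direction part of this concrete: one classic binaural cue is the interaural time difference (ITD) — a sound from the left reaches the left ear a fraction of a millisecond before the right. This is a minimal sketch of how that cue can be estimated from a two-channel recording by cross-correlating the ear signals; it illustrates the underlying acoustics only, not how BAT itself works.

```python
import numpy as np

def estimate_itd(left: np.ndarray, right: np.ndarray, sr: int) -> float:
    """Estimate the interaural time difference in seconds.

    Returns t_left - t_right: a negative value means the sound
    reached the left ear first, i.e. the source is toward the left.
    """
    # Cross-correlate the two ear channels and find the lag (in samples)
    # at which they align best.
    corr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(corr)) - (len(right) - 1)
    return lag / sr

# Synthetic example: the same click arrives 20 samples later at the
# right ear, as if the source were off to the listener's left.
sr = 44_100
click = np.zeros(1000)
click[100] = 1.0
left = click
right = np.roll(click, 20)  # delayed copy of the click

itd = estimate_itd(left, right, sr)  # negative: source is to the left
```

Real localization systems combine ITD with level differences and spectral cues, but even this single cue is enough to lateralize a source — which is part of what makes "in-the-wild" spatial audio a rich signal for a model to reason over.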

While large language models (LLMs) are impressive in their ability to perform audio question answering as well as speech recognition, translation and synthesis, they have yet to handle such “in-the-wild” spatial audio input. 

A group of researchers is now starting to crack that code with BAT, which they call the first spatial-audio-based LLM that can reason about sounds in a 3D environment. 

The model shows impressive precision in classifying the type of a sound (such as laughter, a heartbeat or splashing water), its direction (right, left, below) and its distance (anywhere from 1 to 10 feet). It also shows strong spatial reasoning in scenarios where two different sounds overlap. 


“The integration of spatial audio into LLMs represents a significant step towards truly multimodal AI systems,” the researchers write. 

The complexities of spatial audio

Spatial audio — sometimes referred to as ‘virtual surround sound’ — creates the illusion ...
