You can be sure that a problem has been almost completely solved when researchers begin working on issues on its periphery. That is what has been happening in the areas of automatic speech recognition and speech synthesis in recent years, where advances in artificial intelligence (AI) have almost perfected these tools. The next frontier, according to a team at MIT’s CSAIL, is imitating sounds, in much the same way that humans copy a bird’s song or a dog’s bark.
Imitating sounds with our voice is an intuitive and practical way to convey ideas when words fall short. This practice, comparable to sketching a quick picture to illustrate a concept, uses the vocal tract to mimic sounds that defy explanation. Inspired by this natural ability, the researchers have created an AI system that can produce human-like vocal imitations without prior training or exposure to human vocal impressions.
A schematic of the model of the vocal tract (📷: M. Caren et al.)
This may seem like a silly or unimportant topic to tackle at first blush, but the more one considers it, the more the power of sound imitation becomes clear. If everything under the hood of your car is a mystery to you, then how do you explain a problem to a mechanic over the phone? Words won’t help when you do not know the words to use, but a series of booms, bangs, and clicks might speak volumes to a mechanic. And if we want to have similar conversations with AI tools in the future, they will need to understand how to imitate, and interpret, these types of imperfect sound reproductions that we make.
The team's system models the human vocal tract, simulating how the voice box, throat, tongue, and lips shape sounds. An AI algorithm inspired by cognitive science controls this model, producing imitations that reflect the ways humans adapt sounds for communication. The AI can replicate diverse real-world sounds, from rustling leaves to an ambulance siren, and can even work in reverse, interpreting human vocal imitations to identify the original sounds, such as distinguishing between a cat’s meow and its hiss.
To reach this point, the researchers developed three progressively more advanced versions of the model. The first aimed simply to replicate real-world sounds but did not align well with human behavior. The second, “communicative” model focused on the distinctive features of sounds, prioritizing the characteristics listeners would find most recognizable, such as imitating a motorboat’s rumble rather than its water splashes. The third version added a layer of effort-based reasoning, avoiding overly rapid, loud, or extreme sounds, which resulted in more human-like imitations that closely mirrored human decision-making during vocal mimicry.
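The difference between the three versions is easier to see as three scoring rules applied to the same candidate imitations. The sketch below is only an illustration of that idea, not the researchers' implementation: the two-number feature vectors (rumble, splash), the softmax "listener," the squared-magnitude effort cost, and the `effort_weight` value are all assumptions made up for this example.

```python
import math

def distance(a, b):
    """Euclidean distance between two (hypothetical) sound-feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def baseline_score(candidate, target):
    # Version 1: reward acoustic closeness to the target sound only.
    return -distance(candidate, target)

def communicative_score(candidate, target, distractors):
    # Version 2: log-probability that an assumed softmax listener picks
    # the target, rather than a distractor, after hearing the candidate.
    logits = [-distance(candidate, s) for s in [target] + distractors]
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return logits[0] - log_norm

def effort(candidate):
    # Assumed effort cost: larger (louder, more extreme) features cost more.
    return sum(x * x for x in candidate)

def full_score(candidate, target, distractors, effort_weight=0.1):
    # Version 3: communicative value minus a penalty for vocal effort.
    return communicative_score(candidate, target, distractors) - effort_weight * effort(candidate)

def best(candidates, score):
    """Return the name of the highest-scoring candidate."""
    return max(candidates, key=lambda name: score(candidates[name]))

# Toy features: (rumble, splash). All values are invented for illustration.
motorboat = (0.6, 1.0)
waves = (0.0, 1.0)  # a sound the listener might confuse with the motorboat
candidates = {
    "splashy": (0.2, 1.0),
    "rumbly": (1.0, 0.2),
    "extreme_rumble": (3.0, 0.2),
}

v1 = best(candidates, lambda c: baseline_score(c, motorboat))
v2 = best(candidates, lambda c: communicative_score(c, motorboat, [waves]))
v3 = best(candidates, lambda c: full_score(c, motorboat, [waves]))
print(v1, v2, v3)
```

With these made-up numbers, the baseline rule picks the splashy imitation because it is acoustically closest, the communicative rule picks the exaggerated rumble because it is least confusable with waves, and the effort-aware rule settles on the moderate rumble.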
A series of experiments revealed that human judges favored the AI-generated imitations in many cases, with the artificial sounds preferred by up to 75 percent of participants. Given this success, the researchers hope the model will enable sound designers, musicians, and filmmakers to interact with computational systems in creative ways, such as searching sound databases through vocal imitation. It may also deepen our understanding of language development, imitation behaviors in animals, and how humans abstract sounds.
However, the current model has limitations. It struggles with certain consonants like “z” and cannot yet replicate speech, music, or culturally specific imitations. Despite these challenges, this work is an important step toward understanding how physical and social factors shape vocal imitations and the evolution of language. It could lay the groundwork for both practical applications and deeper insights into human communication.