New automatic speaker tracking technique doesn't require prior indexing

CSAIL’s Spoken Language Systems Group has unveiled a new technique for automatically tracking speakers in audio recordings. The technique tackles the task of speaker diarization, or computationally determining how many speakers are present in a recording and when each of them speaks. Traditional approaches to building automatic speaker diarization tools rely on supervised machine learning, in which the program learns from annotated conversations that indicate when different speakers enter a conversation and how many are speaking at a given time.
In a paper to be published in the October issue of IEEE Transactions on Audio, Speech, and Language Processing, researchers from the Spoken Language Systems Group describe a new technique for speaker diarization that operates without any supervised machine learning. Additionally, the technique allows for differentiation between individual speakers’ voices.
“You can know something about the identity of a person from the sound of their voice, so this technology is keying in to that type of information,” said Jim Glass, a senior research scientist at CSAIL and head of the Spoken Language Systems Group. “In fact, this technology could work in any language. It’s insensitive to that.”