PhD student Benett Axtell is researching user speech interactions and is a member of the Technologies for Aging Gracefully lab at U of T's computer science department (photo by Ryan Perez)

As the saying goes, a picture is worth a thousand words. A tool under development in the University of Toronto’s Technologies for Aging Gracefully lab is synchronizing memories by attaching audio stories to digital family photos.

“Pictures are one of the best ways to bring memories to mind,” says Benett Axtell, a PhD student in the department of computer science. “As older adults move into smaller living spaces, having a digital way to browse through photos is really important, especially if they’re separated from their children or their grandchildren. It also means that staff working in assisted living homes can interact with older adults through this tool to help them share their memories.”

The tabletop app allows users to swipe and select pictures spread across their tablet while they talk about the memories associated with those pictures. The tool records the audio, attaches it to the grouping and uses natural language processing to cluster the photos.

Axtell and the lab’s co-director Cosmin Munteanu will demo the current prototype, Frame of Mind: Using Storytelling for Speech-Based Clustering of Family Pictures, at the Association for Computing Machinery’s international conference on intelligent user interfaces, being held this week in Tokyo. Munteanu is an assistant professor at U of T Mississauga’s Institute for Communication, Culture, Information and Technology, the Faculty of Information and the department of computer science.

Axtell says there’s currently no digital replacement for how people interact with albums or boxes of printed photos. The digital tool’s tabletop design of pictures spread across a table, or in this case, tablet, was inspired by visits to homes of older adults in Toronto. Participants were asked to show Axtell their family pictures, whether in print or in digital format, and to share memories from their photo collection.

“We have this idea that our memories are stored like our photos. We get a roll of film and then we put them in the album: This happened, this happened – turn the page – this happened,” says Axtell. “That’s not how our memories work. And that’s not how people go through [pictures]. They’ll go page by page and then, ‘Oh yeah, that reminds me,’ and flip back five pages.”

Axtell says they’re focused, at least initially, on a very naïve approach to clustering the pictures, using just the language shared in the app. If “cat” was said for eight photos, then those eight photos would be close together. But family descriptions tend to be a lot more casual. There can be a “cat” and a person named “Cat” and so Axtell says their current method of organizing speech is being used as a stepping stone until they can delve into semantic variances.
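The word-matching idea described above can be illustrated with a minimal sketch. It is not the lab's implementation: the function names, the stopword list and the sample transcripts are all hypothetical, and it assumes the speech for each photo has already been transcribed to text.

```python
from collections import defaultdict

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "is", "was", "this", "that",
             "of", "in", "on", "my", "our", "with", "at"}

def keywords(transcript):
    """Lowercase the words, strip punctuation and drop stopwords."""
    words = (w.strip(".,!?'\"").lower() for w in transcript.split())
    return {w for w in words if w and w not in STOPWORDS}

def group_by_shared_word(transcripts):
    """Map each word to the photos it was said for, keeping only
    words that were said for more than one photo."""
    groups = defaultdict(set)
    for photo_id, text in transcripts.items():
        for word in keywords(text):
            groups[word].add(photo_id)
    return {word: sorted(ids) for word, ids in groups.items() if len(ids) > 1}

# Hypothetical transcripts: photo id -> what was said while it was selected.
transcripts = {
    "p1": "This is our cat on the sofa",
    "p2": "The cat and my sister Cat",
    "p3": "Grandpa at the lake",
}
print(group_by_shared_word(transcripts))  # {'cat': ['p1', 'p2']}
```

Because the sketch lowercases every word, the pet “cat” and the person “Cat” land in the same group, which is the kind of ambiguity a purely word-based approach cannot resolve.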

“One of the things we're really focusing on is making sure that the whole process of how it makes these clusters is really transparent – keeping the human in the loop. You don’t want to give your family photos to a computer and have it go: This is how you should group them into different groups.”

Axtell and Munteanu will also present a paper published with co-authors Carrie Demmans Epp, Yomna Aly and Frank Rudzicz, an assistant professor of computer science and a rehabilitation scientist at the University Health Network, on touch-supported voice recording to facilitate forced alignment of text and speech in e-readers. Forced alignment determines where each part of a text transcript falls within the corresponding speech audio.

Axtell says one of the goals of this interactive project is to have a younger family member read a book aloud for an older adult with low vision, a common problem amongst the elderly. The older adult can later listen to their family member read while they follow the large-print, highlighted text.

Forced alignment, as Axtell explains, is built upon existing machine learning tools and works very well when people follow a script, such as closed captioning for newscasts, but a family member reading aloud is much more informal, from skipping words to mispronouncing them entirely.

“The text we gave [readers] was Anne of Green Gables. It has really long sentences and phrases [such as] ‘The beautiful, capricious, reluctant Canadian spring.’”

The current tool for forced alignment was too strict and didn’t account for the mistakes and pauses the reader would make, or the reader going entirely off-script, adding their own comments to the story. Although, as Axtell says, the project was inspired by testing forced alignment for their dataset, it has practical applications too, from reducing pronunciation errors in language learning to accurately aligning the e-reader’s highlighted text with the audio playback.
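The difference between a strict and a tolerant alignment can be shown with a small text-level sketch. It is only an illustration, not the method in the paper: real forced alignment works on the audio itself, whereas this example assumes a speech recognizer has already produced a list of spoken words, and it uses Python's standard difflib module to match them against the script. The function names and the spoken sample are hypothetical.

```python
from difflib import SequenceMatcher

def normalize(words):
    """Lowercase the words and strip surrounding punctuation."""
    return [w.strip(".,!?;:'\"").lower() for w in words]

def align_spoken_to_script(script_words, spoken_words):
    """Return (script_index, spoken_index) pairs for words that match.

    Script words the reader skipped and spoken words that are not in
    the script (fillers, side comments) are simply left unpaired.
    """
    matcher = SequenceMatcher(a=normalize(script_words),
                              b=normalize(spoken_words),
                              autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            pairs.append((block.a + offset, block.b + offset))
    return pairs

script = "The beautiful, capricious, reluctant Canadian spring".split()
# Hypothetical reading: one word skipped, a filler and a comment added.
spoken = "the beautiful um reluctant Canadian spring oh I love this part".split()

print(align_spoken_to_script(script, spoken))
# [(0, 0), (1, 1), (3, 3), (4, 4), (5, 5)]
```

Here the skipped word “capricious” (script index 2) has no partner, and the reader's “um” and closing comment are ignored rather than forced onto the text, so the highlighting in an e-reader could still follow the words that were actually read.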

For their doctoral studies, Axtell’s human-computer interaction research will look closely at speech interactions that are implicit to the task at hand but don’t feel like speech interactions to the user.

The tabletop photo prototype needs more external testing, especially when introducing additional storytellers to the memory-storing process.

“How is a computer going to deal with that? How are conversations handled? That would be really interesting to see where that goes.”

This article was first published on U of T News