A team from Rochester Institute of Technology has developed a system that translates American Sign Language into words that can be read on a screen.
Communication between deaf and hearing people is rarely easy: deaf people can’t hear, and most hearing people don’t understand sign language.
At the recently held GPU Technology Conference 2017, Syed Tousif Ahmed, a research assistant at the Rochester Institute of Technology's Future Everyday Technology Lab, explained how AI can bridge this communication gap.
Figure 1: Ahmed at GTC 2017 talking about the complete video captioning system focused on ASL that he and his colleagues developed.
Using computer vision, machine learning, and embedded systems, Ahmed and his colleagues are able to transform American Sign Language (ASL) into words that can be read on a video screen.
"Bridging this gap means a hearing person can interview a deaf person, or someone who is hard of hearing, via Skype or Google Hangout," Ahmed said. "They can hold a meeting or do job interview, and just communicate in a natural way."
Ahmed detailed how to build a complete video captioning system focused on ASL using deep neural networks. The goal: a messaging app that would let a hearing person reply through automatic speech recognition, and a deaf person reply through a video captioning system.
"Another application could be an ASL learning app, where those using American Sign Language would be able to evaluate their proficiency through the video captioning," Ahmed said. "Wouldn’t it be great to get a score so you know that your sign language is acceptable?"
Using TensorFlow, Ahmed built a sequence-to-sequence network that learns a representation of a sequence of video frames and decodes it into a sentence describing the event in the video. The frames are first encoded into feature vectors, which the decoder then turns into a caption.
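To make the encoder-decoder idea concrete, here is a minimal sketch of a video-captioning sequence-to-sequence model in TensorFlow's Keras API. The layer choices, dimensions, and variable names are illustrative assumptions, not details from the talk; the team built their models with the open-source Seq2Seq framework rather than this hand-rolled version.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Hypothetical dimensions -- not taken from the talk.
NUM_FRAMES = 32        # frames sampled per clip
FRAME_FEAT_DIM = 2048  # per-frame feature vector (e.g., from a CNN)
VOCAB_SIZE = 5000      # caption vocabulary size
EMBED_DIM = 256
HIDDEN_DIM = 512
MAX_CAPTION_LEN = 20

# Encoder: summarize the sequence of frame features into a final state.
frame_features = layers.Input(shape=(NUM_FRAMES, FRAME_FEAT_DIM))
_, enc_h, enc_c = layers.LSTM(HIDDEN_DIM, return_state=True)(frame_features)

# Decoder: generate the caption tokens, conditioned on the encoder state.
caption_in = layers.Input(shape=(MAX_CAPTION_LEN,))
embedded = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(caption_in)
dec_out = layers.LSTM(HIDDEN_DIM, return_sequences=True)(
    embedded, initial_state=[enc_h, enc_c])
logits = layers.Dense(VOCAB_SIZE)(dec_out)

model = Model(inputs=[frame_features, caption_in], outputs=logits)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.summary()
```

Training such a model amounts to feeding it encoded frame sequences alongside the target captions shifted by one token, so the decoder learns to predict the next word at each step.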
The system also includes caption generation, a data input pipeline, and models built with the open-source Seq2Seq encoder-decoder framework. It is then deployed on embedded platforms for real-time captioning of live video.
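The deployment side might look roughly like the loop below: grab frames from a live camera, maintain a sliding window of frame features, and run the trained captioning model on each full window. This is only a sketch under assumed names; `extract_frame_feature` and `caption_clip` stand in for the team's actual feature extractor and trained model, which were not published.

```python
import collections

import cv2          # OpenCV video capture, assumed available on the embedded board
import numpy as np

NUM_FRAMES = 32     # hypothetical window size, matching the model sketch above


def extract_frame_feature(frame):
    """Placeholder for a CNN feature extractor producing a fixed-size vector."""
    resized = cv2.resize(frame, (224, 224))
    return resized.astype(np.float32).reshape(-1)[:2048]  # stand-in feature


def caption_clip(feature_window):
    """Placeholder for running the trained seq2seq model on a window of features."""
    return "<caption would appear here>"


window = collections.deque(maxlen=NUM_FRAMES)  # sliding window of frame features
cap = cv2.VideoCapture(0)                      # live camera on the device

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    window.append(extract_frame_feature(frame))
    if len(window) == NUM_FRAMES:
        # Caption the most recent clip and display it to the hearing participant.
        print(caption_clip(np.stack(window)))

cap.release()
```

On an embedded platform, the feature extraction and decoding steps would need to run fast enough to keep pace with the incoming frames for the captions to feel live.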
Each aspect of the system, from interpreting lip movements to physical motions, is layered on the others to help make future communication effortless for everyone.