Apple's latest entry in its online Machine Learning Journal focuses on the personalization process that users partake in when activating "Hey Siri" features on iOS devices. Across all Apple products, "Hey Siri" invokes the company's AI assistant, and can be followed up by questions like "How is the weather?" or "Message Dad I'm on my way."
"Hey Siri" was introduced in iOS 8 on the iPhone 6, and at that time it could only be used while the iPhone was charging. Afterwards, the trigger phrase could be used at all times thanks to a low-power and always-on processor that fueled the iPhone and iPad's ability to continuously listen for "Hey Siri."
In the new Machine Learning Journal entry, Apple's Siri team breaks down its technical approach to the development of a "speaker recognition system." The team created deep neural networks and "set the stage for improvements" in future iterations of Siri, all motivated by the goal of creating "on-device personalization" for users.
Apple's team says that "Hey Siri" was chosen as the trigger phrase because of its "natural" phrasing, and describes three scenarios in which unintended activations prove troubling for "Hey Siri" functionality. These include "when the primary user says a similar phrase," "when other users say 'Hey Siri,'" and "when other users say a similar phrase." According to the team, the last scenario is "the most annoying false activation of all."
To lessen these accidental activations of Siri, Apple leverages techniques from the field of speaker recognition. Importantly, the Siri team says that it is focused on "who is speaking" and less on "what was spoken."
The overall goal of speaker recognition (SR) is to ascertain the identity of a person using his or her voice. We are interested in “who is speaking,” as opposed to the problem of speech recognition, which aims to ascertain “what was spoken.” SR performed using a phrase known a priori, such as “Hey Siri,” is often referred to as text-dependent SR; otherwise, the problem is known as text-independent SR.
The journal entry then explains how users enroll in personalized "Hey Siri" through explicit and implicit enrollment. Explicit enrollment begins when users speak the trigger phrase a few times, while the implicit profile is "created over a period of time" and made during "real-world situations."
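To make the two enrollment modes more concrete, here is a minimal Python sketch of how a speaker profile might be built and checked. It is an illustration under stated assumptions, not Apple's implementation: the SpeakerProfile class, the cosine-similarity scoring, the 0.8 threshold, and the toy three-number "voice vectors" are all hypothetical stand-ins for whatever the real system computes from audio.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SpeakerProfile:
    """Hypothetical profile: a list of voice vectors from one user."""

    def __init__(self, threshold=0.8):
        # The threshold is an assumed value chosen for illustration only.
        self.vectors = []
        self.threshold = threshold

    def enroll(self, vector):
        """Explicit enrollment: store a vector from a prompted utterance."""
        self.vectors.append(list(vector))

    def score(self, vector):
        """Average similarity of a new utterance to the stored vectors."""
        if not self.vectors:
            return 0.0
        return sum(cosine(v, vector) for v in self.vectors) / len(self.vectors)

    def accepts(self, vector):
        """True if the utterance sounds enough like the enrolled user."""
        return self.score(vector) >= self.threshold

    def observe(self, vector):
        """Implicit enrollment: keep accepted real-world utterances."""
        accepted = self.accepts(vector)
        if accepted:
            self.enroll(vector)
        return accepted


profile = SpeakerProfile()
profile.enroll([1.0, 0.0, 0.0])   # prompted "Hey Siri" utterances
profile.enroll([0.9, 0.1, 0.0])

same_user = profile.observe([0.95, 0.05, 0.0])  # similar voice, accepted
other_user = profile.observe([0.0, 1.0, 0.0])   # different voice, rejected
```

In this sketch, an accepted utterance is added to the profile, so the profile grows through everyday use, while a rejected one leaves it unchanged. A real system would derive its vectors from audio with a trained model rather than from hand-written numbers.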
The Siri team says that the remaining challenge for speaker recognition is achieving quality performance in reverberant (large room) and noisy (car) environments. You can check out the full Machine Learning Journal entry on "Hey Siri" right here.
Since the Machine Learning Journal launched last summer, Apple has shared numerous entries on complex topics, including "Hey Siri," face detection, and more. All past entries can be seen on Apple.com.