With its uncompromising focus on user privacy, Apple has faced challenges collecting enough data to train the large language models that power Apple Intelligence features and will ultimately improve Siri.
To improve Apple Intelligence, Apple has had to develop privacy-preserving approaches to AI training, and some of the methods the company is using are outlined in a new Machine Learning Research blog post.
Apple needs user data to improve summarization, writing tools, and other Apple Intelligence features, but it doesn't want to collect data from individual users. Instead, Apple has worked out a way to understand usage trends using differential privacy and data that isn't linked to any one person. Apple creates synthetic data that is representative of aggregate trends in real user data, then uses on-device comparisons to gain insight without ever accessing sensitive information.
It works like this: Apple generates multiple synthetic emails on topics that are common in user emails, such as an invitation to play tennis at 3:00 p.m. Apple then creates an "embedding" from each email that captures its language, topic, and length. Apple might create several variants of a message, with embeddings that differ in length and content.
Those embeddings are sent to a small number of iPhones belonging to users who have Device Analytics turned on. Each participating iPhone selects a sample of actual user emails and computes embeddings for them, then compares Apple's synthetic embeddings to the embedding of each real email and decides which synthetic embedding is closest to the actual sample.
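Apple hasn't published the on-device code, but the selection step amounts to a nearest-neighbor search over embedding vectors. Here's a minimal Swift sketch of the idea using cosine similarity; the toy four-dimensional vectors, function names, and similarity metric are all illustrative assumptions, not Apple's implementation.

```swift
import Foundation

// Cosine similarity between two embedding vectors.
func cosineSimilarity(_ a: [Double], _ b: [Double]) -> Double {
    let dot = zip(a, b).map(*).reduce(0, +)
    let normA = sqrt(a.map { $0 * $0 }.reduce(0, +))
    let normB = sqrt(b.map { $0 * $0 }.reduce(0, +))
    return dot / (normA * normB)
}

// Given the embeddings of Apple's synthetic emails and the embedding of one
// real on-device email, return the index of the closest synthetic variant.
// Only this index feeds the later differentially private report; the email
// and its embedding never leave the device.
func closestSyntheticIndex(synthetic: [[Double]], real: [Double]) -> Int {
    var bestIndex = 0
    var bestScore = -Double.infinity
    for (i, candidate) in synthetic.enumerated() {
        let score = cosineSimilarity(candidate, real)
        if score > bestScore {
            bestScore = score
            bestIndex = i
        }
    }
    return bestIndex
}

// Toy four-dimensional embeddings standing in for the real model's output.
let syntheticEmbeddings: [[Double]] = [
    [0.9, 0.1, 0.0, 0.2],  // "Tennis at 3:00 p.m.?"
    [0.1, 0.8, 0.3, 0.0],  // "Dinner reservation tonight"
    [0.0, 0.2, 0.9, 0.1],  // "Quarterly report attached"
]
let realEmailEmbedding: [Double] = [0.85, 0.15, 0.05, 0.25]
print(closestSyntheticIndex(synthetic: syntheticEmbeddings, real: realEmailEmbedding)) // 0
```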
Apple then uses differential privacy to determine which synthetic embeddings are selected most often across all devices, so it learns how emails are most commonly worded without ever seeing user emails and without knowing which specific devices selected which embeddings.
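The blog post doesn't spell out the exact aggregation mechanism, but k-ary randomized response is the textbook local differential privacy building block for this kind of histogram: each device randomizes its selected index before reporting, and the server debiases the noisy counts. A minimal Swift sketch, with an illustrative epsilon:

```swift
import Foundation

// k-ary randomized response: with probability p a device reports its true
// selection; otherwise it reports one of the other k - 1 indices uniformly
// at random. epsilon controls the privacy/accuracy trade-off.
func randomizedResponse(trueIndex: Int, k: Int, epsilon: Double) -> Int {
    let p = exp(epsilon) / (exp(epsilon) + Double(k - 1))
    if Double.random(in: 0..<1) < p { return trueIndex }
    var other = Int.random(in: 0..<(k - 1))
    if other >= trueIndex { other += 1 }
    return other
}

// Server side: debias the noisy histogram to estimate true selection counts.
func estimateCounts(noisyHistogram: [Int], epsilon: Double) -> [Double] {
    let k = noisyHistogram.count
    let n = Double(noisyHistogram.reduce(0, +))
    let p = exp(epsilon) / (exp(epsilon) + Double(k - 1))
    let q = (1 - p) / Double(k - 1)
    return noisyHistogram.map { (Double($0) - n * q) / (p - q) }
}

// 1,000 simulated devices whose true nearest-variant indices skew toward 0.
let epsilon = 2.0
let trueSelections = (0..<1000).map { _ in [0, 0, 0, 1, 2].randomElement()! }
var histogram = [Int](repeating: 0, count: 3)
for s in trueSelections {
    histogram[randomizedResponse(trueIndex: s, k: 3, epsilon: epsilon)] += 1
}
// Estimates recover roughly [600, 200, 200] despite per-device noise, while
// no individual report reveals its device's true selection.
print(estimateCounts(noisyHistogram: histogram, epsilon: epsilon))
```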
Apple says the most frequently selected synthetic embeddings can be used to generate training or testing data, or as examples for further data refinement. The process gives Apple a way to improve the topics and language of its synthetic emails, which in turn trains models to produce better text for email summaries and other features, all without violating user privacy.
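The refinement loop might look something like the following sketch, where the highest-ranked variants from the debiased histogram seed the next batch of synthetic emails. The SyntheticEmail type, the placeholder rewording step, and the whole interface are hypothetical; Apple hasn't described this part of the pipeline in code terms.

```swift
import Foundation

// Hypothetical curation step: keep the synthetic variants that the
// differentially private aggregation ranked highest, and let each one
// seed new variants for the next round.
struct SyntheticEmail {
    let text: String
}

func nextGeneration(variants: [SyntheticEmail],
                    estimatedCounts: [Double],
                    keep: Int) -> [SyntheticEmail] {
    // Rank variants by their debiased selection counts and keep the best.
    let survivors = zip(variants, estimatedCounts)
        .sorted { $0.1 > $1.1 }
        .prefix(keep)
        .map { $0.0 }
    // Each survivor seeds several reworded variants for the next round;
    // in practice an LLM would do the rewording, not string concatenation.
    return survivors.flatMap { seed in
        (1...3).map { SyntheticEmail(text: "\(seed.text) (rewording \($0))") }
    }
}
```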
Apple does something similar for Genmoji, using differential privacy to identify popular prompts and prompt patterns that can be used to improve the image generation feature. Apple uses a technique that ensures it only receives Genmoji prompts that have been used by hundreds of people, and nothing specific or unique that could identify an individual.
Apple can't see Genmoji associated with a personal device, and all signals that are relayed are anonymized and include random noise to hide user identity. Apple also doesn't link any data with an IP address or ID that could be associated with an Apple Account.
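One way to picture the thresholding Apple describes: add noise to each aggregate prompt count and keep only prompts that clear a high bar, so rare or personal prompts never surface. The threshold, epsilon, and Laplace mechanism in this Swift sketch are assumptions for illustration, not Apple's published parameters.

```swift
import Foundation

// Sample Laplace(0, scale) noise.
func laplaceNoise(scale: Double) -> Double {
    let u = Double.random(in: -0.5..<0.5)
    return -scale * (u < 0 ? -1.0 : 1.0) * log(1 - 2 * abs(u))
}

// Keep only prompts whose noisy aggregate count clears a high threshold,
// so nothing rare or unique to one user is ever surfaced.
func popularPrompts(counts: [String: Int],
                    threshold: Double = 500,
                    epsilon: Double = 1.0) -> [String] {
    counts.compactMap { (prompt, count) -> String? in
        let noisy = Double(count) + laplaceNoise(scale: 1.0 / epsilon)
        return noisy >= threshold ? prompt : nil
    }
}

let observedCounts = [
    "dinosaur on a surfboard": 1_240,
    "cat wearing a top hat": 860,
    "my dog Rex at the beach": 3,  // rare and identifying: filtered out
]
print(popularPrompts(counts: observedCounts))
```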
With both of these methods, only users who have opted in to send Device Analytics to Apple participate in the testing, so if you don't want your data used in this way, you can turn that option off.
Apple plans to expand its use of differential privacy techniques for improving Image Playground, Memories Creation, Writing Tools, and Visual Intelligence in iOS 18.5, iPadOS 18.5, and macOS Sequoia 15.5.