Scaling up AI to 2.7 billion users is not straightforward, a Facebook engineering manager explains
In a crowded presentation at the AI Expo event here, Facebook Engineering Manager Aditya Kalro described the challenges of rolling out AI at the scale required by the social media behemoth.
Kalro described a major push for AI across all of Facebook's products, though its primary use is personalizing the news feed for all 2.7 billion users. A single user simply logging into their account immediately generates 70 or 80 predictions, he said. Around 200 trillion predictions are carried out every day, covering face recognition in photos, what content should be seen, advertisement placement and more.
“Many of these new experiences are made possible because we are running inference on your phone. We don’t send the data back to the data centre and have it come back to you. We need to make this experience as quick and seamless as possible,” Kalro said. “We are running mobile optimized neural networks on 1 billion+ phones per day. That’s 8 generations of iPhones and 6 generations of Android CPU architectures. So you can imagine the amount of work that we’ve done in making this possible.”
AI is also used for such tasks as automated removal of millions of fake accounts every day, and removal of terrorist propaganda: in Q1 2019, 99% of this content was removed without a person ever seeing it.
Every single click that happens on Facebook is considered raw data, said Kalro. It’s converted into features; one feature might be how many posts the user has liked in the last few days, for example. For the biggest models, training requires many thousands of these features.
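The click-to-feature step Kalro describes can be sketched as a small windowed aggregation. This is an illustrative toy, not Facebook's actual pipeline; the event record shape and feature names are invented for the example.

```python
from collections import Counter
from datetime import datetime, timedelta

def extract_features(events, now, window_days=3):
    """Turn raw click events into numeric features.

    `events` is a list of dicts like {"type": "like", "ts": datetime},
    a hypothetical stand-in for the raw click log described in the talk.
    """
    cutoff = now - timedelta(days=window_days)
    recent = [e for e in events if e["ts"] >= cutoff]
    counts = Counter(e["type"] for e in recent)
    return {
        "likes_last_3d": counts.get("like", 0),
        "comments_last_3d": counts.get("comment", 0),
        "clicks_last_3d": len(recent),
    }

now = datetime(2019, 6, 1)
events = [
    {"type": "like", "ts": now - timedelta(days=1)},
    {"type": "like", "ts": now - timedelta(days=2)},
    {"type": "comment", "ts": now - timedelta(days=5)},  # outside the window
]
print(extract_features(events, now))
# {'likes_last_3d': 2, 'comments_last_3d': 0, 'clicks_last_3d': 2}
```

A production model would consume thousands of such features per example; the point here is only the shape of the transformation, from raw events to a fixed feature vector.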
As a result, the amount of data generated solely by training and optimizing AI and machine learning models is huge. At the start of 2018, 30% of Facebook's data warehouse was dedicated to this data; today it is 50% (and the size of the warehouse has grown significantly in that time).
Overall, “the amount of data we are using for AI has tripled in the last year, while the amount of raw data has stayed approximately the same,” Kalro said.
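Taken together, these figures are roughly self-consistent, as a back-of-envelope check shows (the starting warehouse size is arbitrary; only the ratios matter):

```python
# Back-of-envelope check on the warehouse numbers quoted above (illustrative).
w1 = 100.0        # warehouse size at start of 2018, arbitrary units
ai1 = 0.30 * w1   # AI/ML data was 30% of the warehouse then
ai2 = 3 * ai1     # "tripled in the last year"
w2 = ai2 / 0.50   # today that data is 50% of the warehouse
print(w2 / w1)    # 1.8
```

In other words, if the AI share went from 30% to 50% while tripling in absolute terms, the warehouse itself grew roughly 1.8x, matching the note that it "has grown significantly".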
Facebook uses a mix of classic machine learning and deep learning techniques for different functions: from gradient boosted decision trees used for anomaly detection right up to recurrent neural networks used for language translation, speech recognition and content understanding.
“All of this requires compute,” he said. “Since I joined Facebook 3 years ago, the fleet that we dedicate to machine learning models has grown to more than ten times its size because of all the demand… we have had to augment the fleet with specialist hardware [GPUs].”
As Kalro described it, the beginning of machine learning at Facebook was during simpler times. All the data was on a single machine, it was all read over the network and training was completed in a few hours. As machine learning grew, the network became a bottleneck – while interleaving and various hacks helped, distributed training made the problem worse, since there were multiple machines and multiple trainers. To alleviate this problem, data machines and training machines were co-located as much as possible to reduce the amount of networking required. And “readers” were introduced to read data from the warehouse and stream it directly to the trainers, to reduce pressure on the network. As a bonus, data could be cached at the reader and some pre-processing for data validation done there as well.
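The reader/trainer split can be sketched as a bounded producer-consumer pipeline. This is a minimal single-machine toy, not Facebook's system; the validation step and cache are simple stand-ins for the pre-processing Kalro mentions.

```python
import queue
import threading

def reader(warehouse_rows, out_q, cache):
    """Reader tier: stream rows from the warehouse, validating and caching
    them so trainers never read from the warehouse directly."""
    for row in warehouse_rows:
        if row in cache:                    # serve repeat reads from the cache
            validated = cache[row]
        else:
            validated = row.strip().lower() # stand-in for data validation
            cache[row] = validated
        out_q.put(validated)
    out_q.put(None)                         # sentinel: end of stream

def trainer(in_q, results):
    """Trainer tier: consume pre-validated rows from the reader's stream."""
    while (row := in_q.get()) is not None:
        results.append(row)                 # stand-in for a training step

q = queue.Queue(maxsize=4)                  # bounded queue caps in-flight data
cache, results = {}, []
t_reader = threading.Thread(target=reader, args=(["Row1 ", "Row2 ", "Row1 "], q, cache))
t_trainer = threading.Thread(target=trainer, args=(q, results))
t_reader.start(); t_trainer.start()
t_reader.join(); t_trainer.join()
print(results)  # ['row1', 'row2', 'row1']
```

The bounded queue plays the role of the network budget: the reader can run ahead of the trainer only by a fixed amount, and repeated rows are served from the reader's cache rather than re-fetched.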
“Distributed training is the way of the future,” Kalro said. “When you’re doing something at Facebook scale, training takes a lot longer – multi-day distributed training experiments are happening and as we read more data, training takes longer still.”
In fact, training can take anywhere from 43 minutes for a single, simple job right up to 29 days, a manually set limit.
“During that 29 days, everything has to work perfectly, networking and compute… and your hard disk shouldn’t fail,” he said. “We know 3% of hard disks fail on any given day so we’ve invested a lot in making systems more resilient, and checkpointing, so when things do fail, we can resume them very quickly. We also invested in scheduling, so we can start the job again very fast.”
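The checkpoint-and-resume pattern Kalro credits for fast recovery can be sketched as follows. This is an illustrative loop with invented names, not Facebook's training system; the atomic-rename trick is one common way to keep a checkpoint file safe from crashes mid-write.

```python
import json
import os
import tempfile

def train_with_checkpoints(total_steps, ckpt_path, step_fn, every=100):
    """Run `total_steps` of training, checkpointing every `every` steps and
    resuming from the latest checkpoint if one exists."""
    start, state = 0, 0.0
    if os.path.exists(ckpt_path):            # resume after a failure
        with open(ckpt_path) as f:
            saved = json.load(f)
        start, state = saved["step"], saved["state"]
    for step in range(start, total_steps):
        state = step_fn(state, step)
        if (step + 1) % every == 0:
            # Write to a temp file, then atomically replace, so a crash
            # mid-write cannot corrupt the existing checkpoint.
            tmp = ckpt_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump({"step": step + 1, "state": state}, f)
            os.replace(tmp, ckpt_path)
    return state

def step_fn(state, step):
    return state + step  # stand-in for a real training step

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
# First run: stops after 100 steps (a checkpoint was written at step 100).
train_with_checkpoints(100, path, step_fn, every=100)
# Resume: picks up at step 100 and completes steps 100..199.
final = train_with_checkpoints(200, path, step_fn, every=100)
print(final)  # 19900.0, i.e. sum(range(200))
```

The second call never repeats the first 100 steps; it reads the checkpoint and continues, which is the property that lets a 29-day job survive the 3%-per-day disk failure rate Kalro cites.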
Other challenges include exploiting diurnal compute cycles, the free compute capacity that becomes available at certain times of the day, something Facebook is still actively working on.
Infrastructure inertia also causes headaches.
“You’re never starting from scratch [with hardware],” he said. “You’ve always got something in your data center that you have to reuse. Heterogeneous hardware is a major problem – you have machines that are really old that you want to try and use as it’s free capacity, and it’s just sitting there. We used the old computers for the readers – they don’t require as much compute as the trainers.”
As a final thought, Kalro mentioned disaster recovery. What happens if a data center goes away, he asked. Inference jobs can be easily sent elsewhere, but reassigning training jobs takes more thought up front. While earlier iterations of the system were in a single place, Facebook now replicates everything, just in case disaster strikes.