Facebook’s Contest proves it’s tougher than you think!

As a machine learning enthusiast and practitioner, news of the results of Facebook’s AI contest made my ears prick up.
Starting back in September 2019, Facebook AI challenged some of the best academic institutions to develop an algorithm that could identify whether a video was generated by AI or whether it was real.
Universities including Oxford and Berkeley were given a training dataset of over 4,000 videos, and by November 2019, 115,000 videos had been released and the competition had expanded onto Kaggle.
Ultimately, the most competitive model could identify (in-sample) about 85% of fakes, but out of sample this dropped considerably to 65%! That is certainly better than chance, but it is not as good as we would have hoped.
The reasons why models perform differently in and out of sample are complicated, but they come down to how well the machine learning model can generalise. A good model that recognises a certain image should also recognise that image when it is rotated. A model that cannot generalise, by contrast, will fail on unfamiliar samples.
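To make that concrete, here is a minimal sketch of how such a generalisation check might look in PyTorch: score a model on a test set as-is, then score it again with every image rotated. The `model` and `test_loader` names are placeholders I have assumed, not anything from the contest.

```python
# Minimal sketch of a generalisation check, assuming a trained image
# classifier `model` and a standard PyTorch `test_loader`.
import torch
import torchvision.transforms.functional as TF

def accuracy(model, loader, rotate_degrees=0):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            if rotate_degrees:
                # Apply a transformation the model never saw during training.
                images = TF.rotate(images, rotate_degrees)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

# A well-generalised model scores similarly on both calls; an over-fitted
# one often degrades sharply on the rotated copies.
# print(accuracy(model, test_loader), accuracy(model, test_loader, rotate_degrees=30))
```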
Now remember, Facebook are clever.
Facebook had generated fake videos in a variety of ways so that the dataset would reflect the diversity of techniques used to make deepfakes today: image enhancement, plus additional augmentations and distractors such as blur, frame-rate modification, and overlays.
They also took advice from the universities on how to make the deepfakes even harder to identify. All in all, they made the problem difficult not in just one or two ways but in a wide variety of ways: enough that it becomes impractical to hard-code every permutation.
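To give a feel for what those frame-level augmentations and distractors can look like, here is a rough OpenCV sketch of two of them, blur and a semi-transparent overlay. It is an illustration of the idea only, not Facebook’s actual generation pipeline.

```python
# Illustrative frame-level augmentations: Gaussian blur and an overlay
# distractor. `frame` and `distractor` are assumed to be BGR uint8 arrays
# of the same size; this is not the DFDC pipeline itself.
import cv2

def blur_frame(frame, ksize=9):
    # Soften the frame; a larger (odd) kernel size gives a stronger blur.
    return cv2.GaussianBlur(frame, (ksize, ksize), 0)

def overlay_distractor(frame, distractor, alpha=0.4):
    # Blend a distractor image over the frame at the given opacity.
    return cv2.addWeighted(frame, 1.0 - alpha, distractor, alpha, 0)

# Example: augmented = overlay_distractor(blur_frame(frame), distractor)
```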
In the testing phase, each participant had to submit their code into a black-box environment, and from there 10,000 further videos would be passed through the contestant’s model to see how well it performed.
Here is where it becomes tricky
The videos were then altered in ways outside the scope of the training dataset, for example by adding random images to each frame and by changing the frame rate and resolution. These are common ways of distorting footage, and they were used to increase the difficulty level. The results indicate that the models developed could not fully adapt to these new settings.
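As a rough illustration (not the contest’s own harness), something like the following OpenCV sketch re-saves a video at a lower resolution and a different declared frame rate; the file names are made up.

```python
# Illustrative perturbation: re-save a video at half resolution and 15 fps.
# Hypothetical file names; a fuller version would also drop frames so the
# clip keeps its original duration when the frame rate changes.
import cv2

def resample_video(src_path, dst_path, scale=0.5, fps=15.0):
    cap = cv2.VideoCapture(src_path)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH) * scale)
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT) * scale)
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps,
                          (width, height))
    ok, frame = cap.read()
    while ok:
        out.write(cv2.resize(frame, (width, height)))  # drop the resolution
        ok, frame = cap.read()
    cap.release()
    out.release()

# resample_video("original.mp4", "perturbed.mp4")
```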
Methods that Competitors Used
Attention Dropping
Microsoft Research developed a “Weakly Supervised Data Augmentation Network” (WS-DAN) that explored the potential of data augmentation: each training image is first represented in terms of its object’s discriminative parts and is then augmented in ways that include attention cropping and attention dropping. This steers the learning procedure away from overfitting, as more discriminative features are identified.
Secondly, the attention regions provide an accurate location of the object, which ensures the model looks at the object more closely and further improves performance.

For this problem, that lets the model ‘see’ the pictures in greater detail and pick out discriminative parts of the face. This kind of fine-grained visual classification seems to provide an edge.
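Here is a heavily simplified PyTorch sketch of what attention cropping and attention dropping can look like. It follows the spirit of WS-DAN rather than the authors’ code, and it assumes an attention map has already been computed for the image.

```python
# Simplified attention cropping / dropping in the spirit of WS-DAN.
# `image` is a C x H x W tensor; `attention` is an H x W map (assumed here).
import torch
import torch.nn.functional as F

def attention_crop(image, attention, threshold=0.5):
    # Zoom into the region the attention map marks as most discriminative.
    mask = attention > threshold * attention.max()
    ys, xs = torch.nonzero(mask, as_tuple=True)
    crop = image[:, ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return F.interpolate(crop.unsqueeze(0), size=image.shape[1:],
                         mode="bilinear", align_corners=False).squeeze(0)

def attention_drop(image, attention, threshold=0.5):
    # Erase the most attended region so the model must find other cues.
    keep = (attention <= threshold * attention.max()).to(image.dtype)
    return image * keep  # keep broadcasts over the channel dimension

# Both augmented views are fed back to the network as extra training samples.
```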
Gaining an Edge with Custom Architectures
Architecturally, too, many of the participants used pretrained EfficientNet networks, but some found an edge in the way they combined predictions from an ensemble.
Ensemble methods are common in machine learning and the higher performers in this challenge showed that an ensemble approach is also useful for dealing with deepfakes.
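As a rough illustration of that ensembling idea (not the winners’ exact architecture), here is a sketch that averages the fake-probabilities from a couple of pretrained EfficientNet backbones, using the timm library as an assumed implementation.

```python
# Sketch of an EfficientNet ensemble for real/fake scoring. The model names,
# pretrained weights, and plain averaging are assumptions for illustration.
import timm
import torch

class DeepfakeEnsemble(torch.nn.Module):
    def __init__(self, names=("efficientnet_b4", "efficientnet_b5")):
        super().__init__()
        # One binary (real vs. fake) head per backbone.
        self.models = torch.nn.ModuleList(
            timm.create_model(n, pretrained=True, num_classes=1) for n in names
        )

    def forward(self, x):
        # Average the sigmoid outputs: a simple mean, with no learned weights.
        probs = [torch.sigmoid(m(x)) for m in self.models]
        return torch.stack(probs).mean(dim=0)

# ensemble = DeepfakeEnsemble()
# fake_prob = ensemble(face_batch)  # face_batch: N x 3 x H x W face crops
```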
Non-Learned Enhancements
Finally, an interesting point: none of the top performers used any investigative methods, such as searching for noise fingerprints or other characteristics that derive from the image-creation process. Given that none of the finalists used these methods, it suggests they either aren’t useful or simply aren’t widespread yet. Either way, there’s scope for research in this space.
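Purely to give a flavour of what such a non-learned method could look like, here is a generic sketch that pulls out a noise residual by subtracting a denoised copy of a frame from the original; to be clear, no finalist is known to have used this particular approach.

```python
# Illustrative noise-residual extraction (a classical, non-learned signal).
# The residual is dominated by sensor and processing noise, which can differ
# between camera footage and synthesised faces.
import cv2
import numpy as np

def noise_residual(frame_gray):
    # Denoise, then subtract the denoised copy from the original frame.
    denoised = cv2.fastNlMeansDenoising(frame_gray, h=10)
    return frame_gray.astype(np.float32) - denoised.astype(np.float32)

# residual = noise_residual(cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE))
```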
The results of the competition showed that deepfake videos are hard to identify because detecting them requires models that generalise well. We’ve seen time and time again that machine learning models are often over-fitted to a specific problem, so that if the input to the model is altered (such as an image being rotated), the model can no longer identify what the image shows.
That being said, robustness methods are in growing demand to ensure that models keep working, and a lot of work is being done in this space by the big players in the field. Progress here will be quick, but as lockdowns around the world continue and people spend even more time on the internet, demand for this technology can only increase.
Thanks for reading again!! Let me know if you have any questions and I’ll be happy to help.
Keep up to date with my latest work here!