Creating video content in many languages, with localized voices and facial expressions, is a costly process. To get an optimal result, one must reshoot the content for each target language. Thanks to AI, it is possible to create many translations from a single shot. Interdigital R&D and Inria combined their expertise to create a first system able to produce a 3D face animation from audio and video. The identity of the actor is automatically extracted from a single video to create a 3D face identity, and the lip animation is automatically extracted from audio to animate the lips of the 3D face.
Media content is much more enjoyable when actors speak the viewers' native language. This includes not only the spoken words but also natural expressions, especially facial expressions. To get such results, big brands must shoot and produce content locally for each market, which leads to very expensive production campaigns.
A solution to reduce costs is to shoot in one language and seamlessly repurpose the footage for many countries. Thanks to standard tools in the media industry called "face rigs", artists can model a 3D face of any actor and edit or transfer facial expressions from one actor to another. For instance, an actor can be shot for each language in a studio, where only their face is filmed. Their facial expressions can then be applied to the actor of the main footage. For example, a popular actor like Tom Cruise speaks English in the main footage, while an Italian actor speaks his native language in front of a camera. The lips and facial expressions of the Italian actor can then be applied to a digital Tom Cruise. The result is Tom Cruise speaking Italian in a very natural way.
Current solutions to achieve such a result require the support of many different artists, who manually capture facial expressions from one actor and apply them to another. This process is extremely time-consuming: capturing 2 to 3 seconds of facial expression requires 5 human days on average. Although this approach reduces the overall cost compared to reshooting, it remains very expensive.
Thanks to the power of current AI tools, we can automate as much of this workflow as possible. For the face identity and expressions, Interdigital R&D is able to automatically compute the face rig parameters from a single video. For instance, the main footage can be used to create a 3D face model of the main actors, and the footage for each language can similarly be used to obtain the face animation parameters. For the voice, Inria is able to automatically compute the mouth and lip parameters from an audio file, which can be the speech of the actual dubbing actor or speech synthesized from a text file. Finally, all parameters (identity, facial expressions and lip movements) are combined to create a realistic dubbed face animation.
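As a rough illustration of how such parameters drive a face rig: a blendshape-based rig combines a neutral mesh with weighted per-expression vertex offsets. The sketch below is a minimal, hypothetical example (the function and array names are ours, not the pilot's actual code), assuming the rig exposes one weight per blendshape.

```python
import numpy as np

def animate_frame(neutral, deltas, weights):
    """Combine a neutral face mesh with weighted blendshape deltas.

    neutral: (V, 3) array of vertex positions for the neutral face
    deltas:  (B, V, 3) array of per-blendshape vertex offsets
    weights: (B,) array of blendshape activations, typically in [0, 1]
    """
    # Contract the blendshape axis: result has shape (V, 3).
    return neutral + np.tensordot(weights, deltas, axes=1)

# Toy example: 2 vertices, 2 blendshapes (e.g. "jaw open", "smile").
neutral = np.zeros((2, 3))
deltas = np.array([
    [[0.0, -1.0, 0.0], [0.0, 0.0, 0.0]],  # "jaw open" moves vertex 0 down
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],   # "smile" moves vertex 1 sideways
])
weights = np.array([0.5, 1.0])  # half-open jaw, full smile
frame = animate_frame(neutral, deltas, weights)
```

In this setting, the identity defines `neutral` and `deltas`, while the capture and audio tools only have to produce the per-frame `weights`.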
Last but not least, the overall workflow respects industry standards. This means that artists keep full control over the end result and can edit the output of the capture and synthesis tools at every step. For instance, artists may want to slightly adjust the expressions of the dubbing actors to better fit the context of the main scene. Such adjustments are likely to be needed, since dubbing actors perform alone in front of a camera and may not be aware of every detail of the main stage.
Cooperation with the partners of the pilot
Collaboration between Inria Nancy and Interdigital R&D France started in February 2019. Several meetings helped both teams understand all the technologies involved. Starting in summer 2019, data exchanges allowed first experiments to animate face rigs from Interdigital using blendshapes computed by Inria. This resulted in a first pipeline in January 2020 that animates the lips of a face captured from a monocular video. To get this result, the 3D face rig is automatically extracted from a video (see Fig.2). In the meantime, lip blendshapes are computed from an audio speech (recorded or synthesized) (see Fig.3). Finally, the face rig is animated using the blendshapes, which leads to an actor reciting a text he has never actually spoken.
In April 2020, these preliminary results were further improved to animate the whole head. In this case, the face pose and expressions are captured from one video (not necessarily the one used to compute the face rig) and used to animate the face, while the lips are still driven by the chosen audio speech.
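The combination of the two streams can be sketched as a simple per-frame override, assuming both the video capture and the audio model output weights over the same set of blendshapes (the indices, names and function below are illustrative, not the pilot's actual code):

```python
import numpy as np

# Hypothetical rig layout: indices of the lip-related blendshapes.
LIP_SHAPES = [3, 4, 5]  # e.g. "jaw open", "lips pucker", "lips wide"

def merge_streams(video_weights, audio_lip_weights, lip_indices=LIP_SHAPES):
    """Keep pose/expression weights from the video capture, but override
    the lip-region weights with the audio-driven values, frame by frame.

    video_weights:     (T, B) weights captured from the driving video
    audio_lip_weights: (T, len(lip_indices)) weights predicted from speech
    """
    merged = video_weights.copy()
    merged[:, lip_indices] = audio_lip_weights
    return merged

# Toy example: 2 frames, 6 blendshapes.
video = np.full((2, 6), 0.2)
audio_lips = np.array([[0.9, 0.1, 0.0],
                       [0.0, 0.8, 0.4]])
merged = merge_streams(video, audio_lips)
```

This keeps the head pose and upper-face expressions of the driving video intact while the mouth follows the dubbed audio.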
Exploitation of the AI4EU platform
- Collection of User Requirements
We have collected and discussed user requirements for the AI4EU platform and confirmed with WP3 that they are already available or could be realized. These requirements include hardware and software specifications, such as the use of GPUs and deep learning frameworks.
- Contribution to the AI4EU Resource Catalogue
UPM submitted a demonstration software to the AI4EU Resource Catalogue. The code is packaged in a Docker image so that it can easily be downloaded and tested with your own videos. The algorithm takes as input a source video (for example, a video of yourself) and a second video onto which the facial expressions are transferred (a politician, a famous person, or any other face video). The output transfers the facial expressions of the input face onto the second video. The Docker image runs on CPU, or on GPU if one is available. Finally, a fine-tuning step should be performed to properly adapt to the input faces.
- Adoption of the AI4EU Experiments Service
The AI4Media pilot uses a private workspace on the Teralab platform, which is available to all partners in the pilot. The workspace is used for data sharing between partners, and for computing intermediate data and final results.
Ethical assessment of the pilot
Interdigital and Inria, with the support of WP5 partners and SRL, have participated in the ethical assessment of the pilot. We have provided a detailed description of the technical properties of the pilot's main workflow and have reported on ethical risks related to undesired uses of the technology.
The AI4Media team is currently working on producing better results, specifically better rendering, building on the expertise of the Polytechnic University of Madrid (UPM) in this area. We also plan to process more realistic videos, such as TV news. An evaluation campaign is also in preparation to help benchmark the solutions.