Awesome-Talking-Head-Synthesis

Datasets
Survey
Funny Work
Audio-driven
Text-driven
NeRF & 3D & Gaussian Splatting
Metrics
Tools & Software
Slides & Presentations
References
Star History

This repository organizes papers, code, and resources related to generative adversarial networks (GANs) 🤗 and neural radiance fields (NeRF) 🎨, with a main focus on image-driven and audio-driven talking head synthesis papers and their released code. 👤

A collection of talking head synthesis papers and their released code. ✍️

Most papers are linked to PDFs on "arXiv" or journal/conference websites 📚. However, some papers require an academic license to view 🔐.

🔆 This project Awesome-Talking-Head-Synthesis is ongoing - pull requests are welcome! If you have any suggestions (missing papers, new papers, key researchers or typos), please feel free to edit and submit a PR. You can also open an issue or contact me directly via email. 📩

⭐ If you find this repo useful, please give it a star! 🤩

2023.12 Update 📆

Thanks to https://github.com/Curated-Awesome-Lists/awesome-ai-talking-heads, from which I have added some content, such as the Tools & Software and Slides & Presentations sections. 🙏 I hope this will be helpful. 😊

If you have any feedback or ideas on extending this aggregated resource, please open an issue or PR - community contributions are vital to advancing this shared knowledge. 🤝

Let's keep pushing forward to recreate ever more realistic digital human faces! 💪 We've come so far but still have a long way to go. With continued research 🔬 and collaboration, I'm sure we'll get there! 🤗

Please feel free to star ⭐ and share this repo if you find it a valuable resource. Your support helps motivate me to keep maintaining and improving it. 🥰 Let me know if you have any other questions!

Datasets


Dataset	Download Link	Description
Faceforensics++	Download link
CelebV	Download link
VoxCeleb	Download link	`VoxCeleb`, a comprehensive audio-visual dataset for speaker recognition, encompasses both VoxCeleb1 and VoxCeleb2 datasets.
VoxCeleb1	Download link	`VoxCeleb1` contains over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube.
VoxCeleb2	Download link	Extracted from YouTube videos, VoxCeleb2 includes video URLs and utterance timestamps. It is one of the largest public audio-visual datasets and is primarily used for speaker recognition, but it can also be used to train talking-head generation models. To obtain download permission and access the dataset, apply here. Requires 300 GB+ of storage space.
ObamaSet	Download link	`ObamaSet` is a specialized audio-visual dataset focused on analyzing the visual speech of former US President Barack Obama. All video samples are collected from his weekly address footage. Unlike previous datasets, it exclusively centers on Barack Obama and does not provide any human annotations.
TalkingHead-1KH	Download link	The dataset consists of 500k video clips, of which about 80k are greater than 512x512 resolution. Only videos under permissive licenses are included. Note that the number of videos differs from that in the original paper because a more robust preprocessing script was used to split the videos.
LRW (Lip Reading in the Wild)	Download link	LRW is a diverse English-language video dataset drawn from BBC programs, featuring over 1000 speakers with various speaking styles and head poses. Each video is 1.16 seconds long (29 frames) and contains the target word along with its surrounding context.
MEAD 2020	Download link	MEAD 2020 is a Talking Head dataset annotated with emotion labels and intensity labels. The dataset focuses on facial generation for natural emotional speech, covering eight different emotions on three intensity levels.
CelebV-HQ	Download link	CelebV-HQ is a high-quality video dataset comprising 35,666 clips with a resolution of at least 512x512. It includes 15,653 identities, and each clip is manually labeled with 83 facial attributes, spanning appearance, action, and emotion. The dataset's diversity and temporal coherence make it a valuable resource for tasks like unconditional video generation and video facial attribute editing.
HDTF	Download link	HDTF, the High-definition Talking-Face Dataset, is a large in-the-wild high-resolution audio-visual dataset consisting of approximately 362 different videos totaling 15.8 hours. Original video resolutions are 720p or 1080p, and each cropped video is resized to 512 × 512.
CREMA-D	Download link	CREMA-D is a diverse dataset with 7,442 original clips featuring 91 actors, including 48 male and 43 female actors aged 20 to 74, representing various races and ethnicities. The dataset includes recordings of actors speaking from a set of 12 sentences, expressing six different emotions (Anger, Disgust, Fear, Happy, Neutral, and Sad) at four emotion levels (Low, Medium, High, and Unspecified). Emotion and intensity ratings were gathered through crowd-sourcing, with 2,443 participants rating 90 unique clips each (30 audio, 30 visual, and 30 audio-visual). Over 95% of the clips have more than 7 ratings. For additional details on CREMA-D, refer to the paper link.
LRS2	Download link	LRS2 is a lip reading dataset that includes videos recorded in diverse settings, suitable for studying lip reading and visual speech recognition.
GRID	Download link	The GRID dataset was recorded in a laboratory setting with 34 volunteers, each speaking 1000 phrases, totaling 34,000 utterance instances. Phrases follow specific rules, with six words randomly selected from six categories: "command," "color," "preposition," "letter," "number," and "adverb." Access the dataset here.
SAVEE	Download link	The SAVEE (Surrey Audio-Visual Expressed Emotion) database is a crucial component for developing an automatic emotion recognition system. It features recordings from 4 male actors expressing 7 different emotions, totaling 480 British English utterances. These sentences, selected from the standard TIMIT corpus, are phonetically balanced for each emotion. Recorded in a high-quality visual media lab, the data undergoes processing and labeling. Performance evaluation involves 10 subjects rating recordings under audio, visual, and audio-visual conditions. Classification systems for each modality achieve speaker-independent recognition rates of 61%, 65%, and 84% for audio, visual, and audio-visual, respectively.
BIWI(3D)	Download link	The Biwi 3D Audiovisual Corpus of Affective Communication serves as a compromise between data authenticity and quality, acquired at ETHZ in collaboration with SYNVO GmbH.
VOCA	Download link	VOCA is a 4D-face dataset with approximately 29 minutes of 4D face scans and synchronized audio from 12 speakers. It greatly facilitates research in 3D VSG.
Multiface(3D)	Download link	The Multiface Dataset consists of high-quality multi-view video recordings of 13 people displaying various facial expressions. It contains approximately 12,200 to 23,000 frames per subject, captured at 30 fps from around 40 to 160 camera views with uniform lighting. The dataset's size is 65TB and includes raw images (2048x1334 resolution), tracked and meshed heads, 1024x1024 unwrapped face textures, camera calibration metadata, and audio. This repository provides code for downloading the dataset and building a codec avatar using a deep appearance model.
MMFace4D	Download link	The MMFace4D dataset is a large-scale multi-modal dataset for audio-driven 3D facial animation research. It contains over 35,000 sequences captured from 431 subjects ranging in age from 15 to 68 years old. Various sentences from scenarios such as news broadcasting, conversations and storytelling were recorded, totaling around 11,000 utterances. High-fidelity data was captured using three synchronized RGB-D cameras to obtain high-resolution 3D meshes and textures. A reconstruction pipeline was developed to fuse the multi-view data and generate topology-consistent 3D mesh sequences. In addition to the 3D facial motions, synchronized speech audio is also provided. The final dataset covers a wide range of expressive talking styles and facial expressions through a diverse set of subjects and utterances. With its large scale, high quality of data and strong diversity, the MMFace4D dataset provides an ideal benchmark for developing and evaluating audio-driven 3D facial animation models.
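
As a quick sanity check on a few figures quoted in the table above, the arithmetic can be sketched in Python. This is only an illustrative sketch: the numbers are copied from the table, while the 25 fps frame rate for LRW is an assumption that the table does not state, and the variable names are hypothetical.

```python
# Figures copied from the dataset table; names are illustrative only.

# LRW: 29 frames per clip. The 25 fps frame rate is an ASSUMPTION,
# not a value given in the table.
FPS_LRW = 25
lrw_frames = 29
lrw_seconds = lrw_frames / FPS_LRW

# GRID: 34 volunteers, 1000 phrases each.
grid_speakers = 34
grid_phrases_per_speaker = 1000
grid_total = grid_speakers * grid_phrases_per_speaker

# HDTF: roughly 362 videos totaling 15.8 hours.
hdtf_hours = 15.8
hdtf_videos = 362
hdtf_minutes_per_video = hdtf_hours * 60 / hdtf_videos

print(f"LRW clip length: {lrw_seconds:.2f} s")
print(f"GRID utterances: {grid_total}")
print(f"HDTF average video length: {hdtf_minutes_per_video:.1f} min")
```

Under the 25 fps assumption, 29 frames matches the 1.16-second clip length given for LRW, and 34 speakers with 1000 phrases each matches the 34,000 GRID utterances.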

Survey

Year	Title	Conference/Journal
2024	A Survey on 3D Human Avatar Modeling — From Reconstruction to Generation	arXiv 2024
2024	Deepfake Generation and Detection: A Benchmark and Survey Github	arXiv 2024
2024	A Comparative Study of Perceptual Quality Metrics for Audio-driven Talking Head Videos Code	arXiv 2024
2024	How NeRFs and 3D Gaussian Splatting are Reshaping SLAM: a Survey 3DGS+SLAM🔥🔥🔥	arXiv 2024
2024	3D Gaussian as a New Vision Era: A Survey 3DGS🔥🔥🔥	arXiv 2024
2024	Advances in 3D Generation: A Survey	arXiv 2024
2024	A Survey on 3D Gaussian Splatting 3DGS🔥🔥🔥on going	arXiv 2024
2024	Neural Radiance Fields: Past, Present, and Future NeRF🔥🔥🔥 Amazing 413 pages	arXiv 2024
2023	From Pixels to Portraits: A Comprehensive Survey of Talking Head Generation Techniques and Applications	arXiv 2023
2023	Human-Computer Interaction System: A Survey of Talking-Head Generation	IEEE
2023	Talking human face generation: A survey	ACM
2022	Deep Learning for Visual Speech Analysis: A Survey	arXiv 2022
2020	What comprises a good talking-head video generation?: A Survey and Benchmark	arXiv 2020

Funny Work

Year	Title	Code	Project	Keywords
2024	[Audio2Photoreal] From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations	Code	Project	Photoreal
2024	[Animate Anyone] Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation	Code	Project	🔥Animate (Alibaba; used to drive the "Subject Three" dance)
2024	[3DGAN] What You See Is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs		Project	🔥Nvidia
2024	[LivePortrait] LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control	Code	Project	🔥Kuaishou

Audio-driven

Year	Title	Conference/Journal	Code	Project	Keywords
2024	[GLDiTalker] GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer	arXiv 2024
2024	Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation	arXiv 2024
2024	[JambaTalk] JambaTalk: Speech-Driven 3D Talking Head Generation Based on Hybrid Transformer-Mamba Model	arXiv 2024			3D
2024	[Talk Less, Interact Better] Talk Less, Interact Better: Evaluating In-context Conversational Adaptation in Multimodal LLMs	COLM 2024	Code		LLM
2024	[Digital Avatars] Digital Avatars: Framework Development and Their Evaluation	arXiv 2024
2024	[EmoTalk3D] EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head	ECCV 2024		Project
2024