GG-Editor: Locally Editing 3D Avatars with
Multimodal Large Language Model Guidance

ReLER Lab, CCAI, Zhejiang University
ACM MM 2024

GG-Editor is a GPT-guided framework for text-driven local editing of 3D avatars. Given a source avatar and only text prompts, GG-Editor produces realistic local editing results with drastic geometry deformations and rich texture details.

Abstract

Text-driven 3D avatar customization has attracted increasing attention in recent years, and precisely editing specific local parts of an avatar with only text prompts is particularly challenging. Previous editing methods usually use segmentation or cross-attention masks as constraints for local editing. Although these masks tightly cover existing objects or parts, they can prevent editing methods from creating drastic geometry deformations that extend beyond the covered content. Taking a different perspective, this paper presents a GPT-guided local avatar editing framework, namely GG-Editor. Specifically, GG-Editor progressively mines more reasonable candidate editing regions by harnessing multimodal large language models, which have already assimilated common-sense human knowledge. To improve the editing quality of the local areas, GG-Editor explicitly decouples geometry and appearance optimization, and adopts a global-local synergy editing strategy with GPT-generated local prompts. Moreover, to preserve the concepts residing in source avatars, GG-Editor proposes an orthogonal denoising score that orthogonally decomposes editing directions and introduces an explicit term for preservation. Comprehensive experiments demonstrate that GG-Editor, with only textual prompts, achieves realistic and high-fidelity local editing results, significantly surpassing prior works.

Method Overview

With only textual prompts, GG-Editor first mines reasonable candidate regions for local editing. Using these candidate regions as constraints, GG-Editor then performs geometry-appearance decoupled local editing with global-local view synergy and an orthogonal denoising score (ODS) loss.
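
To make the phrase "orthogonally decomposes editing directions" concrete, the following is a minimal sketch of a generic orthogonal decomposition of one vector with respect to another. It is not the ODS loss from the paper, whose exact formulation is not given on this page. The names `orthogonal_decompose`, `edit_dir`, and `source_dir` are hypothetical, and the vectors are assumed to stand in for flattened score or gradient directions.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonal_decompose(
    edit_dir: Vector, source_dir: Vector, eps: float = 1e-12
) -> Tuple[Vector, Vector]:
    """Split edit_dir into a part parallel to source_dir and a part orthogonal to it.

    Illustrative assumption: edit_dir and source_dir are flattened direction
    vectors. This is standard vector projection, not the paper's ODS loss.
    """
    denom = dot(source_dir, source_dir)
    if denom < eps:
        # A (near-)zero source direction has no parallel component to remove.
        return [0.0] * len(edit_dir), list(edit_dir)
    scale = dot(edit_dir, source_dir) / denom
    parallel = [scale * s for s in source_dir]
    orthogonal = [e - p for e, p in zip(edit_dir, parallel)]
    return parallel, orthogonal


# Toy usage: the orthogonal part has zero inner product with the source direction.
parallel, orthogonal = orthogonal_decompose([3.0, 4.0], [1.0, 0.0])
print(parallel, orthogonal, dot(orthogonal, [1.0, 0.0]))
```

In this sketch the two components always sum back to the original direction, so a method could weight or constrain them separately; how GG-Editor actually combines them with its preservation term is described in the paper.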
Hulk [mesh]
Hulk holding a can of coke [mesh]
Hulk holding a red apple [mesh]
Hulk with a cowboy hat [mesh]
Hulk wearing pink mini skirt [mesh]
Hulk holding a shoe [mesh]
A DSLR photo of Albus Dumbledore [mesh]
A DSLR photo of barefoot Albus Dumbledore [mesh]
A DSLR photo of Albus Dumbledore wearing sunglasses [mesh]
A DSLR photo of Albus Dumbledore wearing Nike shoes [mesh]
A DSLR photo of Albus Dumbledore with a clown face [mesh]
A DSLR photo of Albus Dumbledore wearing a pair of sandals [mesh]
A DSLR photo of Albus Dumbledore wearing a pair of roller skates [mesh]
A DSLR photo of Albus Dumbledore wearing a pair of football boots [mesh]
A DSLR photo of Albus Dumbledore wearing a pair of golden boots [mesh]
A DSLR photo of Albus Dumbledore holding a long magic wand [mesh]
Stormtrooper [mesh]
Stormtrooper standing on wooden skateboards [mesh]
The anime character Naruto [mesh]
The anime character Naruto with a backpack [mesh]

Incremental Editing

+fat belly [mesh]
+denim shorts [mesh]
+black beard [mesh]

BibTeX

@inproceedings{xu2024gg,
    title={GG-Editor: Locally Editing 3D Avatars with Multimodal Large Language Model Guidance},
    author={Xu, Yunqiu and Zhu, Linchao and Yang, Yi},
    booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
    pages={10910--10919},
    year={2024}
}