Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization

Song, Yeji; Kim, Jimyeong; Park, Wonhark; Shin, Wonsik; Rhee, Wonjong; Kwak, Nojun

arXiv:2403.14155 (cs)

[Submitted on 21 Mar 2024]

Abstract: Amid a surge of text-to-image (T2I) models and customization methods that generate new images of a user-provided subject, current works focus on alleviating the costs incurred by lengthy per-subject optimization. These zero-shot customization methods encode the image of a specified subject into a visual embedding, which is then used alongside the textual embedding for diffusion guidance. The visual embedding incorporates intrinsic information about the subject, while the textual embedding provides a new, transient context. However, existing methods often 1) are significantly affected by the input images, e.g., generating images with the same pose, and 2) exhibit deterioration in the subject's identity. We first pin down the problem and show that redundant pose information in the visual embedding interferes with the textual embedding, which contains the desired pose information. To address this issue, we propose an orthogonal visual embedding that effectively harmonizes with the given textual embedding. We also adopt a visual-only embedding and inject the subject's clear features using a self-attention swap. Our results demonstrate the effectiveness and robustness of our method, which offers highly flexible zero-shot generation while effectively maintaining the subject's identity.
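
The abstract does not say how the orthogonal visual embedding is computed. As a minimal sketch of the general idea only, the snippet below assumes one plausible reading: removing from a visual embedding the component that lies along a textual embedding (a standard vector rejection), so that the remainder no longer overlaps with the text direction. The function names `dot` and `orthogonalize` and the toy three-dimensional vectors are hypothetical illustrations, not the authors' implementation.

```python
def dot(a, b):
    """Inner product of two equal-length vectors given as lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(visual, textual, eps=1e-12):
    """Return the part of `visual` that is orthogonal to `textual`.

    Assumption: a plain vector rejection, used here for illustration only.
    If `textual` is (numerically) the zero vector, `visual` is returned as is.
    """
    denom = dot(textual, textual)
    if denom < eps:
        return list(visual)
    scale = dot(visual, textual) / denom
    # Subtract the projection of `visual` onto `textual`.
    return [v - scale * t for v, t in zip(visual, textual)]


# Toy embeddings (hypothetical values, not taken from the paper).
visual = [3.0, 4.0, 0.0]
textual = [1.0, 0.0, 0.0]
residual = orthogonalize(visual, textual)

# The residual has no component left along the textual direction.
print(residual, dot(residual, textual))
```

In this toy case the component of `visual` along `textual` is removed and the rest is kept unchanged, which mirrors the stated goal of keeping subject information while discarding information that conflicts with the text prompt.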

Comments:	Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2403.14155 [cs.CV]
	(or arXiv:2403.14155v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.14155

Submission history

From: Yeji Song
[v1] Thu, 21 Mar 2024 06:03:51 UTC (16,875 KB)
