MLMs / Janus: "Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling" — Translation and Commentary
Overview: This paper introduces Janus-Pro, an improved unified multimodal understanding and generation model. Through improvements on several fronts, Janus-Pro makes notable progress in unified multimodal understanding and generation, offering new ideas and directions for research in this area.
>> Background and pain points: shortcomings of existing unified multimodal models. Existing unified multimodal understanding and generation models typically process both tasks with the same visual encoder, which leads to suboptimal multimodal understanding performance because the two tasks need different image representations. Although Janus partially solved this by decoupling visual encoding, at the 1B parameter scale its limited training data and small model capacity left it weak on short-prompt image generation and unstable in text-to-image generation.
>> Solution: Janus-Pro addresses Janus's shortcomings with improvements in three areas:
● Optimized training strategy: the three-stage training pipeline of Janus is modified. Specifically, Stage I is lengthened to fully exploit ImageNet data for modeling pixel dependencies; Stage II focuses on training with ordinary text-to-image data, improving training efficiency; and the ratio of supervised fine-tuning data in Stage III is adjusted to balance multimodal understanding and visual generation.
● Data scaling: training data is substantially expanded. For multimodal understanding, about 90 million samples are added, covering image captioning, tables, charts, and document understanding; for visual generation, about 72 million synthetic aesthetic samples are added, raising data quality and improving the stability and visual appeal of generated images.
● Model scaling: the model is scaled from 1.5B to 7B parameters, validating the scalability of the visual encoding-decoding method.
>> Core approach: the core idea of Janus-Pro is to decouple visual encoding, with independent encoders for multimodal understanding and visual generation. The steps are:
● Independent encoding: a SigLIP encoder extracts high-dimensional semantic image features for the understanding task; a VQ tokenizer converts images into discrete IDs for the generation task.
● Feature mapping: an understanding adapter and a generation adapter map the image features into the LLM's input space.
● Multimodal fusion: the mapped feature sequences are concatenated with the text prompt into a single multimodal feature sequence.
● Unified processing: the multimodal feature sequence is fed into a unified autoregressive Transformer.
● Independent prediction head: the visual generation task uses a randomly initialized prediction head for image prediction.
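The decoupled pipeline above can be sketched as follows. This is a minimal illustration with made-up dimensions (random matrices stand in for the SigLIP features, adapters, and VQ embedding table, and the shared Transformer backbone is omitted); it is not the released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, SEM_DIM, CODEBOOK = 64, 32, 512   # toy sizes, not the real model's

# Understanding adapter: maps semantic image features (SigLIP-like) into the
# LLM input space.
und_adapter = rng.standard_normal((SEM_DIM, HIDDEN))
# Generation path: a VQ-style embedding table for discrete image token IDs.
gen_embed = rng.standard_normal((CODEBOOK, HIDDEN))
# Independent, separately initialized prediction head for image-token logits.
gen_head = rng.standard_normal((HIDDEN, CODEBOOK))

def build_sequence(text_emb, img_feats=None, img_ids=None):
    """Concatenate text embeddings with decoupled image features."""
    parts = [text_emb]
    if img_feats is not None:   # understanding: semantic features -> adapter
        parts.append(img_feats @ und_adapter)
    if img_ids is not None:     # generation: discrete IDs -> embeddings
        parts.append(gen_embed[img_ids])
    return np.concatenate(parts, axis=0)

text = rng.standard_normal((5, HIDDEN))    # 5 text tokens
feats = rng.standard_normal((3, SEM_DIM))  # 3 image patches (understanding)
ids = rng.integers(0, CODEBOOK, size=4)    # 4 image tokens (generation)

seq = build_sequence(text, img_feats=feats, img_ids=ids)
# In the real model, `seq` would pass through the unified autoregressive
# Transformer before the generation head produces next-token logits.
logits = seq @ gen_head
print(seq.shape, logits.shape)  # (12, 64) (12, 512)
```

The point of the sketch is that both modalities end up as rows of one sequence, so a single autoregressive backbone can serve both tasks while the encoders stay task-specific.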
>> Advantages:
● Improved multimodal understanding: Janus-Pro achieves top results on multiple multimodal understanding benchmarks, clearly outperforming Janus and several other models, and remains competitive even against models with more parameters.
● Markedly stronger text-to-image generation: on both GenEval and DPG-Bench, Janus-Pro delivers significant gains, with strong instruction following, higher image quality, richer detail, and better stability.
● Scalability: the 7B-parameter Janus-Pro validates the scalability of the approach, with the larger model converging faster.
>> Conclusions and takeaways:
● By improving the training strategy, scaling the data, and enlarging the model, Janus-Pro significantly improves multimodal understanding and text-to-image generation.
● Decoupling visual encoding is key to improving the performance of unified multimodal models.
● Despite this progress, Janus-Pro still has limitations: the 384x384 input resolution hurts performance on fine-grained tasks, and the low image resolution leaves generated images short on detail. Raising the image resolution could address these issues in future work.
"Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling": Translation and Commentary
Paper: https://github.com/deepseek-ai/Janus/blob/main/janus_pro_tech_report.pdf
Date: January 27, 2025
Authors: DeepSeek Team
Abstract
In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to a larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.
Figure 1 | Multimodal understanding and visual generation results from our Janus-Pro. For multimodal understanding, we average the accuracy of POPE, MME-Perception, GQA, and MMMU. The scores of MME-Perception are divided by 20 to scale to [0, 100]. For visual generation, we evaluate the performance on two instruction-following benchmarks, GenEval and DPG-Bench. Overall, Janus-Pro outperforms the previous state-of-the-art unified multimodal models as well as some task-specific models. Best viewed on screen.
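The aggregation described in the caption is straightforward to reproduce. In the snippet below the per-benchmark values are hypothetical placeholders, not the paper's reported numbers; only the divide-by-20 rescaling of MME-Perception comes from the text:

```python
# Hypothetical per-benchmark scores (placeholders, not the paper's numbers).
scores = {"POPE": 87.4, "GQA": 62.0, "MMMU": 41.0, "MME-Perception": 1567.1}
# MME-Perception is reported on a larger scale; divide by 20 to map to [0, 100].
scores["MME-Perception"] /= 20
# Average the four benchmarks, now all on a comparable [0, 100] scale.
avg = sum(scores.values()) / len(scores)
print(round(avg, 2))  # 67.19
```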
1. Introduction
Recent advancements in unified multimodal understanding and generation models have demonstrated significant progress [30, 40, 45, 46, 48, 50, 54, 55]. These approaches have been proven to enhance the instruction-following capabilities in visual generation tasks while reducing model redundancy. Most of these methods utilize the same visual encoder to process inputs for both multimodal understanding and generation tasks. Since the representations required for these two tasks differ, this often results in suboptimal performance in multimodal understanding. To address this issue, Janus [46] proposes decoupling visual encoding, which alleviates the conflict between multimodal understanding and generation tasks, achieving excellent performance in both tasks.
As a pioneering model, Janus is validated at the 1B parameter scale. However, due to the limited amount of training data and the relatively small model capacity, it exhibits certain shortcomings, such as suboptimal performance on short-prompt image generation and unstable text-to-image generation quality. In this paper, we introduce Janus-Pro, an enhanced version of Janus that incorporates improvements across three dimensions: training strategies, data, and model size. The Janus-Pro series includes two model sizes: 1B and 7B, demonstrating the scalability of the visual encoding-decoding method.
We evaluate Janus-Pro on multiple benchmarks, and the results reveal its superior multimodal understanding capabilities and significantly improved text-to-image instruction-following performance. Specifically, Janus-Pro-7B achieves a score of 79.2 on the multimodal understanding benchmark MMBench [29], surpassing state-of-the-art unified multimodal models such as Janus [46] (69.4), TokenFlow [34] (68.9), and MetaMorph [42] (75.2). Additionally, on the text-to-image instruction-following leaderboard GenEval [14], Janus-Pro-7B scores 0.80, outperforming Janus [46] (0.61), DALL-E 3 (0.67), and Stable Diffusion 3 Medium [11] (0.74).
Figure 2 | Comparison of text-to-image generation between Janus-Pro and its predecessor, Janus. Janus-Pro delivers more stable outputs for short prompts, with improved visual quality, richer details, and the ability to generate simple text. The image resolution is 384 × 384. Best viewed on screen.
Figure 3 | Architecture of our Janus-Pro. We decouple visual encoding for multimodal understanding and visual generation. "Und. Encoder" and "Gen. Encoder" are abbreviations for "Understanding Encoder" and "Generation Encoder", respectively. Best viewed on screen.
Conclusion
This paper introduces improvements to Janus from three aspects: training strategy, data, and model size. These enhancements have led to significant advancements in both multimodal understanding and text-to-image instruction-following capabilities. However, Janus-Pro still has certain limitations. In terms of multimodal understanding, the input resolution is limited to 384 × 384, which affects its performance in fine-grained tasks such as OCR. For text-to-image generation, the low resolution, combined with reconstruction losses introduced by the vision tokenizer, results in images that, while rich in semantic content, still lack fine details. For example, small facial regions occupying limited image space may appear under-detailed. Increasing the image resolution could mitigate these issues.