Understanding Visual ChatGPT and Foundation Models
Paper Link: arxiv.org/abs/2303.04671
GitHub Link: github
1. Introduction
In recent years, large language models (LLMs) such as T5, BLOOM, and GPT-3 have shown incredible progress. One of the most significant breakthroughs is ChatGPT, which is built upon InstructGPT and is specifically trained to interact with users in a genuinely conversational manner. However, ChatGPT is limited in its ability to process visual information, as it is trained with a single language modality.
To address this limitation, the authors propose a system named Visual ChatGPT, which combines ChatGPT with Visual Foundation Models (VFMs) to enable ChatGPT to handle complex visual tasks. Instead of training a new multimodal ChatGPT from scratch, the authors build Visual ChatGPT directly based on ChatGPT and incorporate a variety of VFMs.
To bridge the gap between ChatGPT and VFMs, the authors propose a Prompt Manager that explicitly tells ChatGPT the capability of each VFM, specifies the input-output formats, converts different kinds of visual information into language, and handles the histories, priorities, and conflicts of the different Visual Foundation Models. With the help of the Prompt Manager, ChatGPT can leverage these VFMs and receive their feedback in an iterative manner until it meets the user's requirements or reaches an ending condition.
The authors conducted massive zero-shot experiments and showed abundant cases to verify the understanding and generation ability of Visual ChatGPT. Overall, Visual ChatGPT opens the door to combining ChatGPT and Visual Foundation Models and enables ChatGPT to handle complex visual tasks.
2. Related Work
2.1. Natural Language and Vision
Language and vision are two of the main ways we receive information in our daily lives. They are often intertwined, and many tasks require both language and vision to produce satisfactory results. One example of such a task is visual question answering, which takes an image and a corresponding question and generates an answer based on the information in the image.
Large language models like InstructGPT have been successful in processing natural language, but they are incapable of processing visual information. To combine the processing ability of vision with language models, several challenges must be overcome. These challenges include the difficulty of training both large language models and vision models, as well as the need for well-designed instructions and cumbersome conversions to connect different modalities.
While some works have explored using pre-trained language models to improve performance on vision-language tasks, these methods are limited to specific tasks and require labeled data for training. Despite these challenges, combining natural language and vision is an important area of research that has the potential to revolutionize the way we interact with machines and process information in our daily lives.
2.2. Pre-trained Models for VL tasks
The paper discusses the challenge of fusing visual information with language by combining large language models (LLMs) with pre-trained image encoders, motivated by tasks such as visual question answering that require both modalities.
To extract visual features, earlier works rely on frozen pre-trained image encoders, while more recent advances such as CLIP-style pre-training and frozen ViT backbones have shown promise. On the language side, pre-trained Transformer-based LLMs offer powerful text understanding and generation capabilities that can also benefit vision-language (VL) modeling. However, training such combined models is challenging because of their large number of parameters, so the paper explores directly leveraging off-the-shelf frozen pre-trained LLMs for VL tasks.
In summary, fusing language and vision is necessary for many applications, and reusing frozen pre-trained image encoders and LLMs is a practical way to avoid the cost of training these large models jointly.
2.3. Guidance of Pre-trained LLMs for VL tasks
The paper discusses Chain-of-Thought (CoT), a technique that improves the reasoning abilities of large language models (LLMs) by asking them to generate intermediate steps for complex tasks. CoT is usually divided into two categories: in Few-Shot CoT, the LLM is prompted with several step-by-step demonstrations, while in Zero-Shot CoT, the LLM improves itself by leveraging self-generated rationales. Most prior studies focus on a single modality; Multimodal-CoT incorporates both language and vision modalities into a two-stage reasoning framework. Visual ChatGPT extends the potential of CoT to a much wider range of tasks, including text-to-image generation, image-to-image translation, and image-to-text generation, suggesting that CoT-style reasoning can make LLMs considerably more effective at handling complex multimodal tasks.
3. Visual ChatGPT
The paper introduces Visual ChatGPT, a conversational AI system that incorporates both visual and linguistic inputs when generating responses. To answer complex queries that require multi-step reasoning, it coordinates multiple visual foundation models (VFMs), chaining their intermediate outputs to reach the final response. The formulation has several components, each defined below: the system principle, the visual foundation models, the dialogue history, the user query, the history of reasoning, the intermediate answers, and the prompt manager. By invoking different VFMs logically, the system produces step-by-step intermediate answers until the query is resolved.
– System Principle P: provides basic rules for Visual ChatGPT, e.g., it should be sensitive to image filenames and should use VFMs to handle images instead of generating results based on the chat history.
– Visual Foundation Models F: one core of Visual ChatGPT is the combination of various VFMs, F = {f1, f2, …, fN}, where each foundation model fi implements a determined function with explicit inputs and outputs.
– History of Dialogue H<i: the dialogue history of the i-th round of conversation is the string concatenation of the previous question-answer pairs, i.e., {(Q1, A1), (Q2, A2), …, (Qi−1, Ai−1)}. The history is truncated with a maximum length threshold to fit the input length of the ChatGPT model.
– User Query Qi: in Visual ChatGPT, a query is a general term that can include both linguistic and visual parts; for instance, a query may contain both text and a corresponding image.
– History of Reasoning Ri(<j): to solve a complex question, Visual ChatGPT may require the collaboration of multiple VFMs. For the i-th round of conversation, Ri(<j) denotes all the reasoning history from the j previously invoked VFMs.
– Intermediate Answer Ai(j): when handling a complex query, Visual ChatGPT tries to reach the final answer step by step by logically invoking different VFMs, producing multiple intermediate answers along the way.
– Prompt Manager M: the prompt manager converts all visual signals into language so that the ChatGPT model can understand them. The following subsections describe how M manages the different parts: P, F, Qi, and F(Ai(j)).
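To make this formulation concrete, the following is a minimal sketch of how a prompt manager might concatenate these parts into a single text prompt for ChatGPT. The helper names, the template strings, and the character-based truncation threshold are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of prompt assembly; names and templates are illustrative,
# not the paper's exact implementation.

MAX_HISTORY_CHARS = 2000  # assumed truncation threshold for the dialogue history


def build_prompt(system_principle, vfm_descriptions, dialogue_history,
                 user_query, reasoning_history):
    # M(P): the system principles rendered as plain-language instructions
    parts = [system_principle]

    # M(F): one language description per Visual Foundation Model
    parts.append("TOOLS:\n" + "\n".join(vfm_descriptions))

    # M(H<i): previous (question, answer) pairs, truncated from the left so
    # that the most recent turns survive the length threshold
    history = "\n".join(f"Human: {q}\nAI: {a}" for q, a in dialogue_history)
    parts.append(history[-MAX_HISTORY_CHARS:])

    # M(Qi): the current user query (text plus any image filenames)
    parts.append(f"Human: {user_query}")

    # M(Ri(<j)): reasoning steps and VFM feedback produced so far in this round
    parts.extend(reasoning_history)

    return "\n\n".join(p for p in parts if p)
```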
3.1. Prompt Managing of System Principles
Visual ChatGPT is designed to understand and respond to both text-related and vision-related tasks, which it accomplishes by integrating different VFMs. To ensure the accuracy and reliability of the system, several system principles are customized and transferred into prompts that ChatGPT can understand; these prompts help the system perform tasks more efficiently and effectively.
One important aspect of Visual ChatGPT is its ability to access a list of VFMs to solve various visual and language-related tasks. The system is designed to be flexible and easy to support new VFMs and tasks. Additionally, the system is sensitive to the filenames of images to avoid ambiguity, making sure it retrieves and manipulates the correct files.
To tackle complex tasks that require multiple VFMs, Visual ChatGPT uses a chain-of-thought approach, which helps it to decompose a problem into subproblems and dispatch the appropriate VFMs. The system also follows strict reasoning formats and uses regex matching algorithms to parse intermediate reasoning results, which allows it to determine the next execution step, such as triggering a new VFM or returning a final response.
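The paper does not reproduce its exact reasoning format, so the snippet below assumes a ReAct-style "Action / Action Input / Final Answer" layout to illustrate how regex matching over a strict format can decide the next execution step, i.e., trigger a VFM or return the final response.

```python
import re

# Sketch of parsing one strictly formatted intermediate reasoning step.
# The "Action / Action Input / Final Answer" layout is an assumed format;
# the paper only states that strict formats and regex matching are used.
STEP_PATTERN = re.compile(
    r"Action:\s*(?P<tool>.+?)\s*\nAction Input:\s*(?P<tool_input>.+)", re.DOTALL
)
FINAL_PATTERN = re.compile(r"Final Answer:\s*(?P<answer>.+)", re.DOTALL)


def parse_step(llm_output: str):
    """Decide the next execution step from one LLM generation."""
    final = FINAL_PATTERN.search(llm_output)
    if final:
        # Ending condition reached: return the response to the user.
        return ("finish", final.group("answer").strip())
    step = STEP_PATTERN.search(llm_output)
    if step:
        # Trigger the named VFM with the parsed input (e.g., an image filename).
        return ("call_vfm", step.group("tool").strip(),
                step.group("tool_input").strip())
    raise ValueError("Output does not follow the expected reasoning format")
```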
Finally, the system is designed to be reliable by following prompts that prevent it from fabricating fake image filenames or facts. The collaboration of multiple VFMs also increases the system’s reliability, making it less likely to generate results based solely on conversation history.
3.2. Prompt Managing of Foundation Models
Visual ChatGPT uses various VFMs that perform different tasks, such as image generation and editing. However, different VFMs can share similarities, and these must be clearly distinguished so that Visual ChatGPT understands and handles each visual task correctly.
To address this challenge, the Prompt Manager provides clear guidelines that allow Visual ChatGPT to understand and invoke each VFM correctly. It defines four key aspects for every VFM: name, usage, inputs/outputs, and an optional example. The name prompt gives a concise overview of the model's overall function; the usage prompt describes the specific scenarios in which the VFM should be used; the inputs/outputs prompt outlines the format of inputs and outputs required by the VFM; and the example prompt, when provided, shows a sample invocation that helps with more complex cases.
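As an illustration of how these four aspects might be attached to a tool, here is a small sketch; the VFMTool dataclass, the tool names, and the description wording are assumptions for illustration rather than the repository's actual interface.

```python
from dataclasses import dataclass


@dataclass
class VFMTool:
    """Language-side description of one Visual Foundation Model.

    The four fields mirror the aspects defined by the Prompt Manager:
    name, usage, inputs/outputs, and an optional example.
    """
    name: str            # concise summary of the model's overall function
    usage: str           # scenarios in which this VFM should be chosen
    inputs_outputs: str  # expected input/output format, e.g., image filenames
    example: str = ""    # optional worked example for harder cases


def describe(tool: VFMTool) -> str:
    """Render one VFM as the text block injected into the prompt."""
    text = f"> {tool.name}: {tool.usage} {tool.inputs_outputs}"
    if tool.example:
        text += f"\nExample: {tool.example}"
    return text


# Two hypothetical tools with overlapping capabilities; the descriptions make
# the distinction explicit so the model dispatches the right one.
tools = [
    VFMTool(
        name="Edge Detection On Image",
        usage="useful when you want to detect the edges of an image.",
        inputs_outputs="The input is an image path; the output is a new image path.",
    ),
    VFMTool(
        name="Generate Image Condition On Canny Image",
        usage="useful when you want to generate a new image from an edge map and a text description.",
        inputs_outputs="The input is an edge-map image path and a text description, separated by a comma.",
    ),
]
```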
The Prompt Manager ensures that Visual ChatGPT understands and handles different VL (Visual Language) tasks accurately, even when multiple VFMs share similarities. This system has the potential to revolutionize the way we interact with visual information.
3.3. Prompt Managing of User Queries
To handle image-related queries, Visual ChatGPT generates a unique filename with a universally unique identifier (UUID) for each newly uploaded image and fabricates a short dialogue history for it, so that the image can be referenced unambiguously in later turns; for queries about existing images, the filename check is ignored. To ensure that the VFMs are successfully triggered, a suffix prompt is appended to the user query; it encourages Visual ChatGPT to use the foundation models instead of relying on its imagination and to report the specific outputs generated by those models. This helps Visual ChatGPT provide accurate and helpful responses to user queries.
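A rough sketch of this filename handling and suffix prompt follows; the exact naming format, directory, and prompt wording are assumptions for illustration.

```python
import os
import shutil
import uuid

# Paraphrased suffix appended to each user query so the model relies on VFMs
# for visual results rather than on its imagination (exact wording is assumed).
SUFFIX_PROMPT = ("Since Visual ChatGPT is a text language model, it must use "
                 "tools to observe images rather than imagination.")


def register_uploaded_image(uploaded_path, image_dir="image"):
    """Store a newly uploaded image under a unique UUID-based filename and
    fabricate a short dialogue turn so later queries can reference it."""
    os.makedirs(image_dir, exist_ok=True)
    new_path = os.path.join(image_dir, f"{uuid.uuid4().hex[:8]}.png")
    shutil.copy(uploaded_path, new_path)
    fake_history = (f"Human: provide a figure named {new_path}", "AI: Received.")
    return new_path, fake_history
```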
3.4. Prompt Managing of Foundation Model Outputs
Visual ChatGPT summarizes the intermediate outputs of the invoked VFMs and feeds them back to ChatGPT for subsequent interaction, which allows it to call further VFMs to finish the user's command. To make these outputs more logical and to help the LLM follow the reasoning process, images generated by VFMs are saved under a specific naming rule that hints at the attributes of the intermediate result. Visual ChatGPT can automatically call further VFMs to solve the current problem by appending the suffix "Thought:" to the end of each generation. Finally, if the user's command is ambiguous, Visual ChatGPT asks for more details before leveraging the VFMs, a critical design choice that prevents the LLM from arbitrarily tampering with or speculating about the user's intention without basis.
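A minimal sketch of such a naming rule appears below; the specific field order and separators are assumptions, since the paper only states that the filename hints at the attributes of the intermediate result.

```python
import uuid


def name_intermediate_output(operation, input_image):
    """Name a VFM output so the filename itself records how it was produced.

    The "<new-id>_<operation>_<input-image>" layout is an assumed pattern,
    not the paper's exact rule.
    """
    new_id = uuid.uuid4().hex[:8]
    stem = input_image.rsplit(".", 1)[0]
    return f"{new_id}_{operation}_{stem}.png"


# Example: an edge map derived from image "3f2a1c9b.png" might be saved as
# "7d41e0aa_canny_3f2a1c9b.png", keeping the reasoning chain visible.
```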
4. Experiments
The evaluation is largely qualitative: the authors run extensive zero-shot experiments and present abundant case studies that exercise both the visual understanding and the visual generation abilities of Visual ChatGPT across single-round and multi-round dialogues.
5. Limitations
The use of Visual ChatGPT in multi-modal dialogue shows potential, but there are some limitations that should be considered. Firstly, it heavily relies on ChatGPT and VFMs, which means its performance is influenced by the accuracy and effectiveness of these models. Secondly, it requires significant prompt engineering to convert VFMs into language, which can be time-consuming and requires expertise. Thirdly, it may have limited real-time capabilities due to invoking multiple VFMs. Fourthly, the maximum token length in ChatGPT may limit the number of foundation models used, requiring a pre-filter module. Lastly, the ability to easily plug and unplug foundation models may raise security and privacy concerns that need to be carefully considered and checked.
6. Conclusion
The authors propose a new system called Visual ChatGPT that allows users to interact with ChatGPT using visual information in addition to language. The system uses a series of prompts to inject visual information into ChatGPT so that it can solve complex visual questions step by step. Extensive experiments show the system's potential across different tasks, but limitations remain, most notably the need for a self-correction module to ensure that generated results align with human intentions. The authors plan to address this in future work, balancing the benefit of self-correction against the longer inference time it would introduce.
Reference
Wu, Chenfei, et al. “Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models.” arXiv preprint arXiv:2303.04671 (2023).