
Day 1 - Module 2: Exercises

These exercises are based on the concepts and code presented in Module 2: Working with Multimodal Models (src/02-multimodal-models/).

Exercise 2.1: Describing a Different Image

Goal: Practice using the vision model to analyze a new image and answer a specific question.

  1. Find an image online (e.g., a picture of a cat, a landscape, a famous landmark). Save it to the src/02-multimodal-models/ directory (e.g., as my_image.jpg).
  2. Open the src/02-multimodal-models/inspect-image.py script.
  3. Modify the script to use your new image file (a sketch of the edited call appears after these steps):
    • Update the filename in the get_image_data_url call (e.g., get_image_data_url("my_image.jpg", "jpg")). Make sure the format ("jpg", "png", etc.) matches your image.
  4. Modify the text part of the user prompt to ask a specific question about your image (e.g., "What is the main animal in this image?", "What time of day does it look like in this picture?", "Describe the building in this image.").
  5. Run the script (python inspect-image.py) and observe the model's description based on your question and image.
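The snippet below is a minimal sketch of what the edited request in inspect-image.py might look like. It assumes the script already provides the get_image_data_url helper, an OpenAI-compatible client object, and a model_name variable; adjust the names, file name, and question to match the actual script and your image.

```python
# Hypothetical sketch only -- `client`, `model_name`, and get_image_data_url
# are assumed to come from the existing inspect-image.py script.
response = client.chat.completions.create(
    model=model_name,
    messages=[
        {
            "role": "user",
            "content": [
                # The text part carries your specific question about the image.
                {"type": "text", "text": "What is the main animal in this image?"},
                # The image part points at your new file; the format string
                # ("jpg", "png", ...) must match the actual file type.
                {
                    "type": "image_url",
                    "image_url": {"url": get_image_data_url("my_image.jpg", "jpg")},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```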

Exercise 2.2: Comparing Different Images for Similarities

Goal: Practice using the vision model to compare two different images, focusing on similarities.

  1. Find two images online that have some similarities but also differences (e.g., two different types of dogs, two different models of cars, two different chairs). Save them to the src/02-multimodal-models/ directory (e.g., image_A.png, image_B.png).
  2. Open the src/02-multimodal-models/compare-images.py script.
  3. Modify the script to use your two new images (see the sketch after these steps):
    • Update the filenames and formats in the two get_image_data_url calls.
  4. Modify the text part of the user prompt to ask the model to focus on the similarities between the two images (e.g., "Look at these two images. What features do they have in common?").
  5. Run the script (python compare-images.py) and analyze how the model's response highlights the similarities.
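As in the previous exercise, here is a minimal sketch of the edited user message in compare-images.py. It assumes the same get_image_data_url helper, client, and model_name from the existing script; the file names and prompt text are placeholders.

```python
# Hypothetical sketch only -- `client`, `model_name`, and get_image_data_url
# are assumed to come from the existing compare-images.py script.
response = client.chat.completions.create(
    model=model_name,
    messages=[
        {
            "role": "user",
            "content": [
                # Ask the model to focus on similarities between the two images.
                {
                    "type": "text",
                    "text": "Look at these two images. What features do they have in common?",
                },
                # First image.
                {
                    "type": "image_url",
                    "image_url": {"url": get_image_data_url("image_A.png", "png")},
                },
                # Second image.
                {
                    "type": "image_url",
                    "image_url": {"url": get_image_data_url("image_B.png", "png")},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```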

Exercise 2.3 (Conceptual): Image Input as a "Tool"

Goal: Think about how multimodal input relates to the concept of tool calling.

Multimodal models can process different types of input, like text and images, simultaneously. This capability can be conceptually compared to having built-in tools for understanding different data modalities, as illustrated below:

*Figure: Multimodal Agents Concept*

  1. Review the tool-calling.py script from Module 1, where an external get_current_time function was defined and called.
  2. Review the inspect-image.py script from Module 2, where an image is provided directly in the prompt using image_url.
  3. Discuss: In the context of inspect-image.py, could the model's ability to process the image_url content be considered a built-in "tool" of the multimodal model itself? Why or why not?
  4. How does providing the image directly differ from defining an external function like describe_image_from_url(url: str) that the LLM would have to explicitly call? (The sketch after these questions contrasts the two approaches.)
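To make the contrast concrete, the sketch below shows the two shapes side by side: a multimodal message where the image is embedded in the prompt, and a tool declaration for the hypothetical describe_image_from_url function mentioned above. The helper get_image_data_url is assumed to come from the module's scripts; everything else is illustrative.

```python
# (a) Direct multimodal input: the image travels inside the prompt itself,
# so the model "sees" it without any explicit tool call.
multimodal_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image."},
        {
            "type": "image_url",
            "image_url": {"url": get_image_data_url("my_image.jpg", "jpg")},
        },
    ],
}

# (b) External tool: the model only receives text plus a declaration of a
# function (hypothetical here) that it may ask the application to run.
# The application, not the model, downloads the image and returns a result.
tools = [
    {
        "type": "function",
        "function": {
            "name": "describe_image_from_url",
            "description": "Download an image from a URL and return a text description.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "URL of the image to describe."}
                },
                "required": ["url"],
            },
        },
    }
]
```

In case (a) the image understanding happens inside the model in a single round trip; in case (b) the model must decide to call the tool, the application executes it, and the result is fed back in a follow-up message, as in the Module 1 tool-calling.py example.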