Unlock the Magic of Images: A Quick and Easy Guide to Using the Cutting-Edge SmolVLM-500M Model

SmolVLM-500M-Instruct is a state-of-the-art, compact vision-language model with 500 million parameters. Despite its relatively small size, its capabilities are remarkably impressive.
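Before running the script below, make sure the dependencies are installed: `pip install torch transformers pillow`.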

Let’s jump to the code:

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import warnings

# Silence a harmless processor-config warning
warnings.filterwarnings("ignore", message="Some kwargs in processor config are unused")

def upload_and_describe_image(image_path):
    # Load the processor and the 500M-parameter instruct model from the Hub
    processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
    model = AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")

    image = Image.open(image_path)

    # The <image> placeholder tells the processor where to splice in the image tokens
    prompt = "Describe the content of this <image> in detail, give only answers in a form of text"
    inputs = processor(text=[prompt], images=[image], return_tensors="pt")

    # Inference only, so no gradients are needed
    with torch.no_grad():
        outputs = model.generate(
            pixel_values=inputs["pixel_values"],
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=150,
            do_sample=True,
            temperature=0.7
        )

    # Decode the generated token IDs back into text
    description = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    return description.strip()

if __name__ == "__main__":
    image_path = "images/bender.jpg"

    try:
        description = upload_and_describe_image(image_path)
        print("Image Description:", description)
    except Exception as e:
        print(f"An error occurred: {e}")
```


This Python script uses the Hugging Face Transformers library to generate a textual description of an image. It loads a pre-trained vision-to-sequence model and processor, processes an input image, and generates a descriptive text based on the image content. The script handles exceptions and prints the generated description.
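The script runs on CPU by default. If you have a GPU, a minimal tweak (a sketch, assuming a CUDA-enabled PyTorch install) makes generation noticeably faster; splice it into `upload_and_describe_image` right after `inputs` is built:

```python
# Move the model and inputs to the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
```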

You can download it here: https://github.com/alexander-uspenskiy/vlm

The description below was generated from this original, non-stock image (put it in the `images` directory of the project):

Take a look at the description generated by the model (you can play with the prompt and parameters in the code to format the output better for any purpose):

> The robot is sitting on a couch. It has eyes and mouth. He is reading something. He is holding a book with his hands. He is looking at the book. In the background, there are books in a shelf. Behind the books, there is a wall and a door. At the bottom of the image, there is a chair. The chair is white. The chair has a cushion on it. In the background, the wall is brown. The floor is grey. in the image, the robot is silver and cream color. The book is brown. The book is open. The robot is holding the book with both hands. The robot is looking at the book. The robot is sitting on the couch.
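For example, swapping sampling for greedy decoding gives shorter, more repeatable descriptions. The values below are just a starting point to experiment with, not settings from the original script:

```python
outputs = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=80,   # a tighter token budget keeps the description focused
    do_sample=False      # greedy decoding: deterministic output, no temperature needed
)
```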

The result looks excellent, and the model is both fast and resource-efficient compared to full-size LLMs.

Happy coding!
