Meta unveils open source Llama 3.2: AI that sees and fits in your pocket


It's been a good week for open source AI.

On Wednesday, Meta announced an upgrade to its state-of-the-art large language model, Llama 3.2, and it doesn't just talk: it sees.

More interestingly, some versions can fit on your smartphone without losing quality, meaning you could have private, local AI interactions, apps and customizations without sending your data to third-party servers.

Llama 3.2, which was unveiled Wednesday at Meta Connect, comes in four flavors, each packing a different punch. The heavyweight contenders, the 11B and 90B parameter models, flex their muscles with both text and image processing capabilities.

They can perform complex tasks such as analyzing charts, captioning images, and annotating objects on images based on natural language descriptions.

Llama 3.2 arrived the same week as the Allen Institute's Molmo, which claimed to be the best open source multimodal vision LLM and performed on par with GPT-4o, Claude 3.5 Sonnet and Reka Core in our tests.

Zuckerberg's company also introduced two new flyweight champions: a pair of 1B and 3B parameter models designed for efficiency, speed, and limited but repetitive tasks that don't require much computation.

These small models are multilingual text specialists with “tool-calling” capabilities, meaning they can integrate more easily with programming tools. Despite their small size, they boast an impressive 128K-token context window, the same as GPT-4o and other powerful models, making them ideal for on-device summarization, instruction following and rewriting tasks.

To make this happen, Meta's engineering team pulled off some serious digital gymnastics. First, they used structured pruning to trim redundant parts from larger models, then used knowledge distillation (transferring knowledge from large models to smaller ones) to squeeze in extra smarts.
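The two steps named above can be illustrated with a toy sketch. This is not Meta's recipe, which the article does not detail. It assumes a generic form of structured pruning (dropping whole rows of a weight matrix by L2 norm) and a textbook distillation loss (KL divergence between temperature-softened teacher and student outputs). The function names, the temperature value and the sample logits are all hypothetical.

```python
import math


def prune_rows(weights, keep):
    """Structured pruning (toy): keep the `keep` rows with the largest L2 norm."""
    norms = [math.sqrt(sum(w * w for w in row)) for row in weights]
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original row order
    return [weights[i] for i in kept]


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical 3-neuron layer: the weakest row is dropped.
layer = [[0.1, 0.0], [3.0, 4.0], [1.0, 1.0]]
pruned = prune_rows(layer, keep=2)

# Hypothetical logits over a 3-token vocabulary.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 1.0, 4.0]
print(distillation_loss(teacher, close_student) < distillation_loss(teacher, far_student))
```

In training, a loss like this would be minimized so the small model's predictions drift toward the large model's. The student that already mimics the teacher scores lower than the one that disagrees with it.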

The result is a collection of compact models that outperform rivals in their weight class, including Google's Gemma 2 2.6B and Microsoft's Phi-2 2.7B, on a variety of benchmarks.

Meta is also working hard to boost on-device AI. It has teamed up with hardware titans Qualcomm, MediaTek and Arm to ensure that Llama 3.2 plays well with mobile chips from day one. Cloud computing giants aren't left out either: AWS, Google Cloud, Microsoft Azure and a host of others are offering instant access to the new models on their platforms.

Under the hood, Llama 3.2's vision capabilities come from clever architectural tweaks. Meta's engineers baked adapter weights into the existing language model, creating a bridge between pre-trained image encoders and the text-processing core.

In other words, the model's vision capabilities do not come at the expense of its text processing, so users can expect the same or better text results compared to Llama 3.1.
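To show how an adapter can add vision without disturbing the text path, here is a minimal sketch of one common design: a gated cross-attention layer whose output is added to the text hidden state. The article does not describe Meta's actual layer, so the gate, the missing learned projections and every name below are assumptions made purely for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(text_vec, image_vecs):
    """Attend from one text hidden state over a list of image feature vectors."""
    scale = math.sqrt(len(text_vec))
    weights = softmax([dot(text_vec, img) / scale for img in image_vecs])
    dim = len(image_vecs[0])
    return [sum(w * img[d] for w, img in zip(weights, image_vecs)) for d in range(dim)]


def adapter_layer(text_vec, image_vecs, gate):
    """Residual adapter: the text state plus a gated dose of image information."""
    visual = cross_attend(text_vec, image_vecs)
    g = math.tanh(gate)  # gate = 0.0 leaves the text path untouched
    return [t + g * v for t, v in zip(text_vec, visual)]


# Hypothetical 2-dimensional hidden state and two image patch features.
text_state = [0.5, -1.0]
image_patches = [[1.0, 0.0], [0.0, 1.0]]

text_only = adapter_layer(text_state, image_patches, gate=0.0)
with_vision = adapter_layer(text_state, image_patches, gate=1.0)
```

With the gate closed, the layer returns the text state unchanged, which is one way a model could keep its text-only behavior intact while image information is blended in through a separate path.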

The Llama 3.2 release is open source, at least by Meta's standards. Meta is making the models available for download on Llama.com and Hugging Face, as well as through its broad partner ecosystem.

Those interested in running it in the cloud can use their own Google Colab notebook or use Groq for text-based interactions, which can generate around 5,000 tokens in less than 3 seconds.
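For a sense of scale, the quoted speed can be turned into a throughput figure with simple arithmetic. The 5,000-token and 3-second numbers come from the paragraph above. The 500-token reply is a hypothetical example, and real speeds will vary with model size and load.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average generation throughput for a single response."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


# Figures quoted above: about 5,000 tokens in under 3 seconds,
# so this is a lower bound on the implied rate.
floor_rate = tokens_per_second(5000, 3.0)

# Hypothetical 500-token reply generated at that same rate.
reply_seconds = 500 / floor_rate

print(f"{floor_rate:.0f} tokens/s, {reply_seconds:.2f} s for 500 tokens")
```

That works out to at least roughly 1,667 tokens per second, or about a third of a second for a medium-length reply.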

Riding the Llama

We put Llama 3.2 through its paces, quickly testing its capabilities on a variety of tasks.

In text-based interactions, the model performs on par with its predecessors. However, its coding abilities yielded mixed results.

When tested on Groq's platform, Llama 3.2 successfully generated code for popular games and simple programs. Yet the smaller 70B model stumbled when asked to generate functional code for a custom game we devised. The more powerful 90B, however, was far more capable and produced a working game on the first try.

You can see the full code generated by Llama 3.2 and the other models we tested by clicking this link.

Identifying patterns and objects in images

Llama 3.2 excels at identifying objects in images. When presented with a futuristic cyberpunk-style image and asked whether it fit the steampunk aesthetic, the model accurately identified the style and its elements. It offered a satisfactory explanation, noting that the image did not qualify as steampunk because it lacked key elements associated with the genre.

Chart analysis (and SD image recognition)

Chart analysis is another strong suit for Llama 3.2, although it requires high-resolution images for optimal performance. When we input a screenshot containing a chart that other models like Molmo or Reka could interpret, Llama's vision capabilities faltered. The model apologized, explaining that it couldn't read the lettering properly because of the image quality.

Text-in-image identification

While Llama 3.2 struggled with the small text in our chart, it performed flawlessly when reading larger text in images. We showed it a presentation slide introducing a person, and the model successfully understood the context, distinguishing between the name and the job role without any errors.

The verdict

Overall, Llama 3.2 is a huge improvement over the previous generation and a great addition to the open source AI industry. Its strengths are in image interpretation and large-text recognition, with some areas that can be improved, especially in processing low-resolution images and solving complex, custom coding tasks.

The promise of on-device compatibility bodes well for the future of private, local AI tasks and is a great counterweight to closed offerings like Gemini Nano and Apple's proprietary models.

Edited by Josh Quittner and Sebastian Sinclair
