100 Days of Gen AI: Day 40
Let’s try running the new DeepSeek-R1 model locally! The full model is far too large for my M4 MacBook, so I’ll try running a much smaller distilled model instead.
I’m using this post from Simon Willison’s blog as a starting point.
I downloaded Ollama from https://ollama.com/download.
Then I ran: ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0. I’m not on the fastest internet connection right now, so the download is going to take about two hours. In the meantime, I did some reading.
I stumbled upon the Hugging Face AutoTrain feature and Unsloth for fine-tuning. Let’s put a pin in those for the future.
The blog post also mentions this browser-based way to run a model: webml-community/deepseek-r1-webgpu. Pretty cool!
As I read Unsloth’s “FAQ + Is Fine-tuning Right For Me?” page, the immediate work-related use case that comes to mind is using a fine-tuned model to determine whether pending Terraform plans across our 650 workspaces are safe to apply. It could run as a post-plan hook and tag the release with something like “AI thinks this change is safe” or “AI thinks this change needs close review.” Of course, we’d have to think of some fun name for this system other than “ai.” :)
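Just to make the idea concrete, here’s a rough sketch in Go of what such a hook could look like. To be clear, this is purely hypothetical: the model name, the prompt, and the whole tagging flow are made up; the only real API it touches is Ollama’s local /api/generate endpoint. You’d pipe `terraform show -json plan.out` into it.

```go
// Hypothetical post-plan hook: reads a Terraform plan as JSON on stdin and
// asks a locally running Ollama model whether the plan looks safe.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Read the plan JSON from stdin (e.g. `terraform show -json plan.out | hook`).
	plan, err := io.ReadAll(os.Stdin)
	if err != nil {
		panic(err)
	}

	// Ollama's generate endpoint; "stream": false returns one JSON object.
	reqBody, _ := json.Marshal(map[string]any{
		"model":  "deepseek-r1-distill-llama-8b", // placeholder model name
		"prompt": "Is this Terraform plan safe to apply? Answer SAFE or REVIEW.\n\n" + string(plan),
		"stream": false,
	})
	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(reqBody))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out struct {
		Response string `json:"response"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}

	// In the imagined system, this verdict would become a tag on the release.
	fmt.Println("AI verdict:", out.Response)
}
```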
I also poked around https://huggingface.co/spaces.
Ok, ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0 finally completed!
Let’s try asking it to generate a webserver in Go that serves a page that says “hello world.” Eh. It generates sensible-looking Go code, but it doesn’t mention setting up a go.mod file, some of the syntax spacing is off, and the server fails to run.
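For the record, here’s roughly what I was hoping to get back: a minimal hello-world server using only the standard library. (You’d run `go mod init hello` first, the step the model skipped.)

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Serve a single page that says "hello world" at the root path.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello world")
	})

	log.Println("listening on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```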
Let’s try a different prompt.
stendec@M4 ~ % ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0
>>> what is a quantized ai model?
<think>
Okay, so I'm trying to understand what a quantized AI model is. From the initial explanation, it seems like it's
related to converting traditional models into a more efficient form, probably using something called quantization.
First, I know that AI models, especially deep learning ones like neural networks, can be really computationally
heavy. They have a lot of parameters that require a lot of processing power and memory. This makes training and
deploying them expensive and resource-intensive. Quantization must be some kind of technique to make these models
more efficient.
I remember hearing about quantizing tensors in machine learning. Tensors are like multi-dimensional arrays, right?
So maybe quantizing the model involves reducing each tensor's precision or size somehow. But I'm not entirely sure
how that works.
The term "quantized" might come from the concept of quantization itself, which in mathematics is about mapping a
continuous range of values to a discrete set of values. In machine learning, this could mean taking real-number
weights and converting them into integers with a specific scale, like 8-bit integers as mentioned. That would
definitely reduce the memory footprint since each number takes up less space.
So if I have a model with, say, 100 million parameters each as 32-bit floats, that's a lot of bytes. Converting
those to 8-bit quantized integers would make each parameter take up only one byte instead of four. That's an 8x
reduction in memory per parameter. For large models, this could save a significant amount of space and computational
resources.
I wonder how the accuracy is affected by quantization. If you reduce precision, wouldn't that make the model less
accurate? The initial explanation mentioned it can affect performance but allows for faster inference. Maybe the
trade-off is worth it in some scenarios where speed or resource efficiency is more important than maximum accuracy.
The mention of frameworks supporting quantization, like TensorFlow Lite and PyTorch with their respective
quantization tools, makes sense. These frameworks are optimized for performance and often provide optimizations that
help deploy models efficiently on various platforms, including mobile devices where resources are limited.
Wait, the explanation also mentioned dynamic quantization, which adjusts based on input data. That's interesting
because it could potentially offer better performance under certain conditions but might be more complex to
implement. Static quantization is more straightforward and perhaps easier to use consistently across different
inputs.
I'm curious about how common quantization is in production environments. It would make sense for applications where
real-time processing is needed, like in autonomous vehicles or mobile apps, where the model needs to run quickly and
efficiently without draining too much battery power or using up too much bandwidth.
Another thought: when you quantize a model, does it require retraining? Or can you just convert a pre-trained model?
I think some quantization techniques allow for converting models without retraining by adjusting the weights based
on the original scale. But maybe dynamic quantization needs some adaptation over time as the model interacts with
different inputs.
I should also consider how this affects inference speed. If each layer's computation is faster because the tensors
are smaller, the overall processing time could be reduced significantly. This would be beneficial for edge computing
applications where you want to get predictions out quickly without a lot of delay.
In summary, quantizing an AI model seems to involve reducing the precision or size of the model's weights and
tensors to make it more efficient in terms of computation and memory usage. It's part of a broader trend towards
making AI models more deployable and accessible, especially in environments where resources are constrained.
</think>
**Quantized AI Model: An Overview**
A quantized AI model is a technique used to enhance the efficiency of traditional deep learning models by reducing
their computational and memory requirements. This process involves converting the model's weights and tensors from
higher precision formats, such as 32-bit floats, to lower precision formats, often using 8-bit integers. This
reduction in precision significantly lowers memory consumption and accelerates inference speed.
**Key Concepts:**
1. **Precision Reduction:** Quantization maps real-number weights to integer values with a specific scale (e.g., 8
bits), which reduces the size of each parameter from four bytes to one byte, thus saving memory and computational
resources.
2. **Impact on Accuracy:** While quantization can affect model accuracy, it often strikes a balance by allowing
faster inference, which is crucial in resource-constrained environments where speed may be more critical than
maximal accuracy.
3. **Frameworks and Tools:** Major AI frameworks like TensorFlow Lite and PyTorch provide tools for quantization,
facilitating efficient deployment on platforms with limited resources, such as mobile devices or edge computing
systems.
4. **Dynamic vs. Static Quantization:**
- **Static Quantization:** Converts models without retraining, adjusting weights based on original scales.
- **Dynamic Quantization:** Adapts to input data over time, offering potential performance benefits but requiring
more complex implementation.
5. **Applications:** Particularly useful in scenarios requiring real-time processing, such as autonomous vehicles or
mobile apps, where efficiency and resource utilization are paramount.
6. **Inference Speed:** The smaller tensor sizes lead to faster computations, enhancing overall processing speed,
which is vital for edge computing applications aiming for quick predictions.
**Conclusion:** Quantizing AI models is a strategic approach to optimize performance and deployment in
resource-sensitive environments, aligning with the broader goal of making advanced AI technologies more accessible
and efficient.
Color me impressed.
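To sanity-check the arithmetic in that answer, here’s a quick back-of-the-envelope in Go: 100 million float32 weights take 400 MB, and the same weights as int8 take 100 MB, a 4x reduction per weight. This is a toy sketch of symmetric 8-bit quantization, not how any real framework implements it, but the scale-and-round idea is the same.

```go
package main

import (
	"fmt"
	"math"
)

// quantize maps float32 weights onto int8 using a single symmetric scale,
// the simplest version of the scheme the model described.
func quantize(weights []float32) (q []int8, scale float32) {
	var max float32
	for _, w := range weights {
		if a := float32(math.Abs(float64(w))); a > max {
			max = a
		}
	}
	scale = max / 127 // map [-max, max] onto [-127, 127]
	q = make([]int8, len(weights))
	for i, w := range weights {
		q[i] = int8(math.Round(float64(w / scale)))
	}
	return q, scale
}

func main() {
	weights := []float32{0.42, -1.37, 0.05, 2.11, -0.88}

	q, scale := quantize(weights)
	for i, w := range weights {
		// Dequantize to see how much precision the round trip lost.
		back := float32(q[i]) * scale
		fmt.Printf("%+.4f -> %4d -> %+.4f (error %+.4f)\n", w, q[i], back, back-w)
	}

	// Memory math: 100M params at 4 bytes each vs 1 byte each.
	fmt.Printf("100M float32 weights: %d MB; as int8: %d MB\n", 100*4, 100*1)
}
```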