Meet the xMADified Gemma 2 (9B): High Performance, Minimal VRAM 🚀


You asked for efficiency. We delivered. Our xMADified Gemma 2 (9B) model is here, quantized from 16-bit floats down to a lean, mean 4-bit machine. With our proprietary tech, the xMADified Gemma 2 brings you accuracy, memory efficiency, and fine-tuning ease at a fraction of the VRAM you’d expect. With just an 8 GB footprint, you get top-tier performance on a 12 GB GPU, with no need for the latest or greatest hardware.

Why Choose the xMADified Model?

1. Accurate, Efficient, Reliable: Our quantization process keeps the model highly accurate across benchmarks. You’re getting the best quantized Gemma 2 (9B) model available, with all the power in less than half the space.

2. 8 GB VRAM for Easy Deployment: Unlike the standard model’s 18.5 GB footprint, our 8 GB xMADified model runs smoothly on accessible, modest GPUs, so you can achieve high performance without a VRAM-heavy setup.

3. Three-click Fine-tuning: Fine-tuning is a breeze on xMADified models: no elaborate setup or massive hardware needed. In a few clicks, you’re ready to customize your model (see the illustrative sketch below).
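For context, here is a rough sketch of what parameter-efficient (LoRA) fine-tuning on a 4-bit GPTQ checkpoint can look like with the open-source peft library. This is illustrative only, not xMAD's three-click workflow; it assumes the checkpoint loads through transformers' GPTQ integration (optimum + auto-gptq), and the LoRA hyperparameters are placeholders.

# Illustrative LoRA setup on the GPTQ checkpoint (requires: pip install peft).
# Not xMAD's three-click workflow; hyperparameters are placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = AutoModelForCausalLM.from_pretrained(
    "xmadai/gemma-2-9b-it-xMADai-INT4", device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are trained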

xMADified vs. Hugging Quants: Benchmarks Speak Louder Than Words

Our xMADified Gemma 2 outperforms the competition on multiple evaluation benchmarks. Take a look at the table below to see why developers are choosing xMAD for accuracy and efficiency.

[Benchmark comparison table image]

How to Run the Model

Loading this xMADified model’s checkpoint requires around 8 GB of VRAM, so it runs efficiently on a 12 GB GPU.

Install Package Requirements:

pip install torch==2.4.0
# If you have CUDA version 11.8, install torch with:
# pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu118
pip install transformers accelerate optimum
pip install -vvv --no-build-isolation "git+https://github.com/PanQiWei/AutoGPTQ.git@v0.7.1"
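Before loading the model, you can quickly confirm the environment is ready. A minimal sanity check, assuming the packages above installed cleanly and a single CUDA GPU is visible:

# Print the installed versions and confirm CUDA is visible.
import torch, transformers, auto_gptq
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("auto-gptq:", auto_gptq.__version__)

# Check that the GPU has headroom for the ~8 GB checkpoint.
free, total = torch.cuda.mem_get_info()
print(f"Free VRAM: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB total")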

Sample Inference Code

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "xmadai/gemma-2-9b-it-xMADai-INT4"

# Chat-style prompt. Note: Gemma 2's stock chat template may reject a
# "system" role; if you hit a template error, fold the instruction into
# the user turn instead.
prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

# Tokenize the chat prompt and move it to the GPU.
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

# Load the 4-bit GPTQ checkpoint (around 8 GB of VRAM).
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map='auto',
    trust_remote_code=True,
)

# Generate and decode the response.
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=1024)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
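If you would rather see tokens as they are produced instead of waiting for the full completion, transformers’ TextStreamer can typically be passed through generate. A minimal sketch, reusing the model, tokenizer, and inputs from the code above:

from transformers import TextStreamer

# Stream the reply to stdout token by token, reusing `model`, `tokenizer`,
# and `inputs` from the sample inference code above.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**inputs, do_sample=True, max_new_tokens=1024, streamer=streamer)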

Try xMADified Gemma 2 Today!

Curious to see what xMADified can do for your projects? Head over to our Hugging Face repository, where we’ve prepared a Colab Notebook so you can get started in minutes.

For any questions, help, or additional xMADified models, feel free to reach out at support@xmad.ai.

Be sure to follow us on LinkedIn and Hugging Face.

Thank you for reading!

xMAD.ai Team
