By xMAD.ai
Nov 01 2024
You asked for efficiency. We delivered. Our xMADified Gemma 2 (9B) model is here, quantized from 16-bit floats down to a lean, mean 4-bit machine. With our proprietary tech, the xMADified Gemma 2 brings you accuracy, memory efficiency, and fine-tuning ease at a fraction of the VRAM you’d expect. Now, even with an 8 GB footprint, you can get top-tier performance on a 12 GB GPU—no need for the latest or greatest hardware.
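For a sense of where those numbers come from, here is a rough back-of-the-envelope estimate. It is illustrative only: the parameter count is approximate, and the real on-GPU footprint also includes quantization metadata (per-group scales and zero-points), any layers kept in higher precision, and runtime buffers.

# Rough memory estimate for Gemma 2 (9B) weights (illustrative only).
params = 9.24e9                   # approximate parameter count

fp16_bytes = params * 2           # 16-bit floats: 2 bytes per weight
int4_bytes = params * 0.5         # 4-bit weights: 0.5 bytes per weight

print(f"FP16 weights: ~{fp16_bytes / 1e9:.1f} GB")   # ~18.5 GB
print(f"INT4 weights: ~{int4_bytes / 1e9:.1f} GB")   # ~4.6 GB
# The published ~8 GB footprint sits above the raw INT4 estimate because of
# quantization metadata, higher-precision layers, and runtime overhead.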
1. Accurate, Efficient, Reliable: Our quantization process keeps the model highly accurate across benchmarks. You’re getting the best quantized Gemma 2 (9B) model available, with all the power in less than half the memory.
2. 8 GB VRAM for Easy Deployment: Unlike the standard 18.5 GB, our 8 GB xMADified model can run smoothly on accessible, modest GPUs, making it possible to achieve high performance without VRAM-heavy setups.
3. Three-click Fine-tuning: Fine-tuning is a breeze on xMADified. No elaborate setup or massive hardware needed; in a few clicks, you’re ready to customize your model (or script it yourself, as in the sketch below).
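If you prefer to script it, the quantized checkpoint can also be adapted with standard LoRA tooling. The sketch below is illustrative only: it assumes the peft library (not in the install list further down), and the adapter rank, target modules, and loading path are our assumptions rather than an official xMAD recipe.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "xmadai/gemma-2-9b-it-xMADai-INT4"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
# Assumes the checkpoint loads through transformers' GPTQ integration.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The quantized base weights stay frozen; only the small LoRA adapters train.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,                                   # hypothetical adapter rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# From here, pass `model` to a standard transformers Trainer or TRL SFTTrainer.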
Our xMADified Gemma 2 outperforms the competition on multiple evaluation benchmarks. Take a look at the table below to see why developers are choosing xMAD for accuracy and efficiency.
Loading the checkpoint of this xMADified model requires around 8 GB of VRAM, so it runs comfortably on a 12 GB GPU.
Install Package Requirements:
pip install torch==2.4.0
# If you have CUDA version 11.8, install torch with the matching wheel instead:
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu118
pip install transformers accelerate optimum
pip install -vvv --no-build-isolation "git+https://github.com/PanQiWei/AutoGPTQ.git@v0.7.1"
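Once the packages are installed, a quick sanity check like the one below (optional, illustrative) confirms everything imports and that the visible GPU meets the ~12 GB guideline:

import torch
import transformers
import auto_gptq  # imported only to confirm the installation succeeded

print("torch:", torch.__version__, "| transformers:", transformers.__version__)
assert torch.cuda.is_available(), "A CUDA-capable GPU is required for this model."
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU 0 total memory: {total_gb:.1f} GB")  # aim for roughly 12 GB or more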
Sample Inference Code
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "xmadai/gemma-2-9b-it-xMADai-INT4"

# Build a chat-formatted prompt.
prompt = [
    {"role": "system", "content": "You are a helpful assistant that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

# Load the tokenizer and apply the chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

# Load the 4-bit xMADified checkpoint.
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map='auto',
    trust_remote_code=True,
)

# Generate and decode the response.
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=1024)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
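If you want to verify the footprint on your own hardware, you can query PyTorch's allocator right after generation (optional; the exact figure varies with prompt length and max_new_tokens):

import torch

# Peak GPU memory allocated by PyTorch during the run above.
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory allocated: {peak_gb:.1f} GB")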
Curious to see what xMADified can do for your projects? Head over to our Hugging Face repository, where we’ve prepared a Colab Notebook so you can get started in minutes.
For any questions, help, or additional xMADified models, feel free to reach out to us at support@xmad.ai.
Be sure to follow us on LinkedIn and Hugging Face.
xMAD.ai Team