By xMAD.ai
Nov 01 2024
You asked for efficiency. We delivered. Our xMADified Gemma 2 (9B) model is here, quantized from 16-bit floats down to a lean, mean 4-bit machine. With our proprietary tech, the xMADified Gemma 2 brings you accuracy, memory efficiency, and fine-tuning ease at a fraction of the VRAM you’d expect. Now, even with an 8 GB footprint, you can get top-tier performance on a 12 GB GPU—no need for the latest or greatest hardware.
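For a sense of where those numbers come from, here is a rough back-of-the-envelope estimate. It is illustrative only: the parameter count is approximate, and the real on-GPU footprint also includes quantization metadata (per-group scales and zero-points), any layers kept in higher precision, and runtime buffers.

# Rough memory estimate for Gemma 2 (9B) weights (illustrative only).
params = 9.24e9                   # approximate parameter count

fp16_bytes = params * 2           # 16-bit floats: 2 bytes per weight
int4_bytes = params * 0.5         # 4-bit weights: 0.5 bytes per weight

print(f"FP16 weights: ~{fp16_bytes / 1e9:.1f} GB")   # ~18.5 GB
print(f"INT4 weights: ~{int4_bytes / 1e9:.1f} GB")   # ~4.6 GB
# The published ~8 GB footprint sits above the raw INT4 estimate because of
# quantization metadata, higher-precision layers, and runtime overhead.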
1. Accurate, Efficient, Reliable: Our quantization process keeps the model highly accurate across benchmarks. You’re getting the best quantized Gemma 2 (9B) model available, with all the power in less than half the memory.
2. 8 GB VRAM for Easy Deployment: Unlike the standard 18.5 GB, our 8 GB xMADified model can run smoothly on accessible, modest GPUs, making it possible to achieve high performance without VRAM-heavy setups.
3. Three-click Fine-tuning: Fine-tuning is a breeze on xMADified. No elaborate setup or massive hardware needed; in a few clicks, you’re ready to customize your model (or script it yourself, as in the sketch below).
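If you prefer to script it, the quantized checkpoint can also be adapted with standard LoRA tooling. The sketch below is illustrative only: it assumes the peft library (not in the install list further down), and the adapter rank, target modules, and loading path are our assumptions rather than an official xMAD recipe.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "xmadai/gemma-2-9b-it-xMADai-INT4"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
# Assumes the checkpoint loads through transformers' GPTQ integration.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The quantized base weights stay frozen; only the small LoRA adapters train.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,                                   # hypothetical adapter rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# From here, pass `model` to a standard transformers Trainer or TRL SFTTrainer.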
Our xMADified Gemma 2 outperforms the competition on multiple evaluation benchmarks. Take a look at the table below to see why developers are choosing xMAD for accuracy and efficiency.
Loading the checkpoint of this xMADified model requires around 8 GB of VRAM, so it runs comfortably on a 12 GB GPU.
Install Package Requirements:
pip install torch==2.4.0
# If you have CUDA version 11.8, install torch with the matching wheel instead:
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu118
pip install transformers accelerate optimum
pip install -vvv --no-build-isolation "git+https://github.com/PanQiWei/AutoGPTQ.git@v0.7.1"
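Once the packages are installed, a quick sanity check like the one below (optional, illustrative) confirms everything imports and that the visible GPU meets the ~12 GB guideline:

import torch
import transformers
import auto_gptq  # imported only to confirm the installation succeeded

print("torch:", torch.__version__, "| transformers:", transformers.__version__)
assert torch.cuda.is_available(), "A CUDA-capable GPU is required for this model."
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU 0 total memory: {total_gb:.1f} GB")  # aim for roughly 12 GB or more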
Sample Inference Code
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "xmadai/gemma-2-9b-it-xMADai-INT4"

# Build a chat-formatted prompt.
prompt = [
    {"role": "system", "content": "You are a helpful assistant that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

# Load the tokenizer and apply the chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

# Load the 4-bit xMADified checkpoint.
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map='auto',
    trust_remote_code=True,
)

# Generate and decode the response.
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=1024)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
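If you want to verify the footprint on your own hardware, you can query PyTorch's allocator right after generation (optional; the exact figure varies with prompt length and max_new_tokens):

import torch

# Peak GPU memory allocated by PyTorch during the run above.
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory allocated: {peak_gb:.1f} GB")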
Curious to see what xMADified can do for your projects? Head over to our Hugging Face repository, where we’ve prepared a Colab Notebook so you can get started in minutes.
For any questions, help, or additional xMADified models, feel free to reach out to us at support@xmad.ai.
Be sure to follow us on LinkedIn and Hugging Face.
xMAD.ai Team