By xMAD.ai
Dec 04, 2024
Out of the numerous open-source LLMs (large language models) currently available, the standout leader in performance, complexity, and accuracy has been the Llama family of models. Since Meta released the first Llama in February 2023, each successive model has grown in popularity and performance (arXiv:2302.13971).
Though Llama was initially considered inferior to closed-source counterparts such as ChatGPT and Gemini, it is now considered the superior alternative. Llama matches the nuanced answers and complex features of OpenAI's and Google's best models while providing a degree of flexibility and customization that a closed-source model cannot match.
However, this performance comes with a tradeoff. Among developers and researchers, Llama is as well known for its notoriously large memory requirements as for its capabilities. Even the smaller Llama models have GPU memory requirements that exceed what most personal computers provide, and thus require dedicated hardware. The most capable models, such as Llama-3.1-405B, demand far more, needing at least 16 H100 GPUs, which puts the hardware barrier to entry in the hundreds of thousands of dollars.
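A quick back-of-the-envelope calculation makes the scale of the problem concrete. The sketch below assumes 16-bit (bf16) weights and 80 GB of memory per H100; it counts only the weights, not the KV cache, activations, or framework overhead, all of which add substantially on top:

```python
# Rough estimate of GPU memory needed just to hold model weights at
# 16-bit precision. Ignores KV cache, activations, and overhead.

BYTES_PER_PARAM_BF16 = 2      # bfloat16 stores each parameter in 2 bytes
H100_MEMORY_GB = 80           # one NVIDIA H100 provides 80 GB of HBM

def weight_memory_gb(params_billions: float) -> float:
    """GB required to store the weights alone at bf16 precision."""
    return params_billions * 1e9 * BYTES_PER_PARAM_BF16 / 1e9

for name, params_b in [("Llama-3.2-1B", 1), ("Llama-3.2-3B", 3),
                       ("Llama-3.1-70B", 70), ("Llama-3.1-405B", 405)]:
    gb = weight_memory_gb(params_b)
    print(f"{name}: ~{gb:,.0f} GB of weights (~{gb / H100_MEMORY_GB:.1f} H100s)")
```

Weights alone for the 405B model come to roughly 810 GB, more than ten H100s before counting the KV cache and activation memory that push real deployments to 16 GPUs.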
Thus, there has been a recent push to compress these models for casual use, with many researchers, grassroots developers, and companies developing smaller models that still boast the same performance and accuracy as the leading models.
Meta recently introduced a smaller version of its own Llama: the Lightweight Llama models. These are quantized versions of Meta's smallest models, Llama 3.2 1B and 3B, compressed until they are small enough to run on CPUs, including those in popular mobile devices. With these quantized models, Meta claims significant speedups and reductions in model size and memory use with minimal loss in quality.
These results were made possible by Meta's immense resources: training data, full evaluation suites, and safety protocols.
Meta achieved this level of compression through a series of developments culminating in quantized Llama, most notably quantization-aware training with LoRA adaptors and SpinQuant-style post-training quantization.
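To give a concrete feel for what weight quantization does, here is a toy per-group symmetric int4 scheme in Python. It is a generic illustration of the idea, not Meta's (or xMAD's) actual method:

```python
import numpy as np

def quantize_int4_groupwise(w: np.ndarray, group_size: int = 32):
    """Toy per-group symmetric 4-bit quantization of a weight vector.

    Each group of `group_size` weights shares one fp16 scale; the weights
    themselves are stored as integers in [-8, 7]. This is a generic
    illustration, not a production quantization scheme.
    """
    w = w.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0   # map max |w| to 7
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate fp32 weights from int4 codes and scales."""
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_int4_groupwise(w)
print(f"mean absolute reconstruction error: {np.abs(w - dequantize(q, s)).mean():.4f}")
```

Storing 4-bit codes plus one 16-bit scale per 32 weights costs roughly 4.5 bits per parameter, a reduction of over 70% versus bf16; savings of that order are what allow a 1B- or 3B-parameter model to fit in a phone's memory.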
This new development is exciting and will undoubtedly inspire greater innovation from Meta and others. However, these quantization methods have some clear limitations:
CPU Deployment Only: Lightweight Llama is specifically limited to mobile CPUs, making it unsuitable for large-scale applications and projects.
Limited Model Support: Current quantization applies only to the open-sourced Llama 3.2 1B and 3B models, leaving larger models unsupported.
These limitations make the current quantization unusable for industries relying on larger, more complex models.
We at xMAD.ai have also pondered the question of quantization for quite some time. After years of research and development, we offer a solid alternative to Lightweight Llama: a solution that is GPU-focused, scalable to models of any size, and accurate.
While Lightweight Llama's CPU-based compression confines it to mobile phones and small projects, xMAD's GPU-focused quantization expands the possibilities to large-scale research, production services, and enterprise deployments.
Our focus enables those who wish to innovate with LLMs but are constrained by the rapidly increasing GPU requirements of leading LLMs.
Unlike Meta's quantization, which is limited to its smallest models, xMAD.ai's methodology can be applied to models of any size, from the compact Llama 3.2 variants to the largest models, such as Llama-3.1-405B.
This flexibility allows developers, researchers, and companies to deploy models tailored to their needs.
As shown by our benchmarks and research papers (three of which were accepted at NeurIPS), no other company or research group achieves the same levels of memory reduction while maintaining accuracy.
Our unparalleled memory reduction dramatically increases access to the largest and most capable models, previously limited to those with large, often hard-to-obtain, hardware.
Want to test the capabilities of xMAD's models? Visit our HuggingFace profile to explore the nine models we currently offer.
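A quantized checkpoint published on Hugging Face can typically be loaded with the standard transformers API. The repository id below is a placeholder, not a real model name; substitute one of the models listed on our profile:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id -- replace with an actual model from
# xMAD's Hugging Face profile.
model_id = "xmadai/<model-name>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```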
If you want to unlock the full potential of xMAD’s quantization capabilities, contact us at support@xmad.ai.