Demystifying LLM Training with the Fully Open-Source
Moxin 7B Model
Recently, Large Language Models (LLMs) have
undergone a significant transformation, marked by a rapid rise in both their
popularity and capabilities. Open-source LLMs, such as LLaMA and Mistral, have made
great contributions to the ever-increasing popularity of LLMs, thanks to the ease with which
they can be customized and deployed across various applications. Although LLMs offer
unprecedented opportunities for research and innovation, their commercialization has
raised concerns about transparency, reproducibility, and safety. Many open LLM
models lack the necessary components (such as training code and data) for full
understanding and reproducibility, and some use restrictive licenses while claiming
to be “open-source”, which may hinder further innovation on LLMs. To mitigate this
issue, we follow the Model Openness Framework (MOF), a ranked classification system
that rates machine learning models based on their completeness and openness,
following principles of open science, open source, open data, and open access. We
present Moxin 7B, a truly open-source LLM, and release its pre-training code and
configurations, training and fine-tuning data, and intermediate and final
checkpoints, aiming to make a continuous commitment to fully open-source LLMs.