Kimi-Audio-7B-Instruct 🤗 | 📑 Paper
We present Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation. This repository contains the official implementation, models, and evaluation toolkit for Kimi-Audio.
Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:

- Universal Capabilities: handles diverse tasks such as speech recognition (ASR), audio question answering (AQA), audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC), and end-to-end speech conversation.
- State-of-the-Art Performance: achieves SOTA results on numerous audio benchmarks (see the evaluation results below).
- Large-Scale Pre-training: pre-trained on more than 13 million hours of diverse audio data (speech, music, general sound) and text data.
- Novel Architecture: a hybrid audio input (continuous acoustic features plus discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation.
- Efficient Inference: a chunk-wise streaming detokenizer based on flow matching enables low-latency audio generation.
Kimi-Audio consists of three main components:

- Audio Tokenizer: converts input audio into discrete semantic tokens at 12.5 Hz and extracts complementary continuous acoustic features with a Whisper encoder.
- Audio LLM: a transformer initialized from a pre-trained text LLM, whose shared layers process the multimodal input before parallel heads generate text tokens and audio tokens.
- Audio Detokenizer: a flow-matching model that converts the predicted audio tokens back into a 24 kHz waveform, streamed chunk by chunk.
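For intuition, here is a minimal runnable sketch of that three-stage flow. Every name, shape, and return value below is a hypothetical stub for illustration only; this is not the `kimia_infer` API (real inference goes through `KimiAudio.generate`, shown in the quickstart below):

```python
import numpy as np

# Hypothetical stubs illustrating the three-stage pipeline; not the real API.

def tokenize_audio(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Audio tokenizer stand-in: discrete semantic tokens at 12.5 Hz."""
    n_tokens = int(len(waveform) / sr * 12.5)
    return np.zeros(n_tokens, dtype=np.int64)

def run_audio_llm(semantic_tokens: np.ndarray, prompt: str) -> tuple[str, np.ndarray]:
    """Audio LLM stand-in: shared layers, then parallel text/audio heads."""
    return "<text output>", np.zeros_like(semantic_tokens)

def detokenize(audio_tokens: np.ndarray) -> np.ndarray:
    """Detokenizer stand-in: flow matching maps tokens to a 24 kHz waveform."""
    samples_per_token = int(24000 / 12.5)  # 1920 samples per 12.5 Hz token
    return np.zeros(len(audio_tokens) * samples_per_token, dtype=np.float32)

text, audio_tokens = run_audio_llm(tokenize_audio(np.zeros(16000)), "Transcribe:")
waveform_out = detokenize(audio_tokens)
```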
This example demonstrates basic usage for generating text from audio (ASR) and generating both text and speech in a conversational turn.
```python
import soundfile as sf
from kimia_infer.api.kimia import KimiAudio
# --- 1. Load Model ---
model_path = "moonshotai/Kimi-Audio-7B-Instruct"
model = KimiAudio(model_path=model_path, load_detokenizer=True)
# --- 2. Define Sampling Parameters ---
sampling_params = {
    "audio_temperature": 0.8,            # sampling temperature for audio tokens
    "audio_top_k": 10,                   # top-k cutoff for audio tokens
    "text_temperature": 0.0,             # 0.0 means greedy decoding for text tokens
    "text_top_k": 5,                     # top-k cutoff for text tokens
    "audio_repetition_penalty": 1.0,     # >1.0 penalizes repeated audio tokens
    "audio_repetition_window_size": 64,  # lookback window for the audio penalty
    "text_repetition_penalty": 1.0,      # >1.0 penalizes repeated text tokens
    "text_repetition_window_size": 16,   # lookback window for the text penalty
}
# --- 3. Example 1: Audio-to-Text (ASR) ---
messages_asr = [
    # You can provide context or instructions as text
    {"role": "user", "message_type": "text", "content": "Please transcribe the following audio:"},
    # Provide the audio file path
    {"role": "user", "message_type": "audio", "content": "test_audios/asr_example.wav"},
]
# Generate only text output
_, text_output = model.generate(messages_asr, **sampling_params, output_type="text")
print(">>> ASR Output Text: ", text_output) # Expected output: "这并不是告别,这是一个篇章的结束,也是新篇章的开始。"
# --- 4. Example 2: Audio-to-Audio/Text Conversation ---
messages_conversation = [
    # Start the conversation with an audio query
    {"role": "user", "message_type": "audio", "content": "test_audios/qa_example.wav"},
]
# Generate both audio and text output
wav_output, text_output = model.generate(messages_conversation, **sampling_params, output_type="both")
# Save the generated audio
output_audio_path = "output_audio.wav"
sf.write(output_audio_path, wav_output.detach().cpu().view(-1).numpy(), 24000) # Assuming 24kHz output
print(f">>> Conversational Output Audio saved to: {output_audio_path}")
print(">>> Conversational Output Text: ", text_output) # Expected output: "A."
print("Kimi-Audio inference examples complete.")
Kimi-Audio achieves state-of-the-art (SOTA) performance across a wide range of audio benchmarks.
Detailed per-benchmark results are listed below; you can easily reproduce our results and the baselines with our Kimi-Audio-Evalkit (see also Evaluation Toolkit):
Automatic Speech Recognition (ASR):

| Datasets | Model | Performance (WER↓) |
|---|---|---|
| LibriSpeech test-clean \| test-other | Qwen2-Audio-base | 1.74 \| 4.04 |
| | Baichuan-base | 3.02 \| 6.04 |
| | Step-Audio-chat | 3.19 \| 10.67 |
| | Qwen2.5-Omni | 2.37 \| 4.21 |
| | Kimi-Audio | 1.28 \| 2.42 |
| Fleurs zh \| en | Qwen2-Audio-base | 3.63 \| 5.20 |
| | Baichuan-base | 4.15 \| 8.07 |
| | Step-Audio-chat | 4.26 \| 8.56 |
| | Qwen2.5-Omni | 2.92 \| 4.17 |
| | Kimi-Audio | 2.69 \| 4.44 |
| AISHELL-1 | Qwen2-Audio-base | 1.52 |
| | Baichuan-base | 1.93 |
| | Step-Audio-chat | 2.14 |
| | Qwen2.5-Omni | 1.13 |
| | Kimi-Audio | 0.60 |
| AISHELL-2 ios | Qwen2-Audio-base | 3.08 |
| | Baichuan-base | 3.87 |
| | Step-Audio-chat | 3.89 |
| | Qwen2.5-Omni | 2.56 |
| | Kimi-Audio | 2.56 |
| WenetSpeech test-meeting \| test-net | Qwen2-Audio-base | 8.40 \| 7.64 |
| | Baichuan-base | 13.28 \| 10.13 |
| | Step-Audio-chat | 10.83 \| 9.47 |
| | Qwen2.5-Omni | 7.71 \| 6.04 |
| | Kimi-Audio | 6.28 \| 5.37 |
| Kimi-ASR Internal Testset subset1 \| subset2 | Qwen2-Audio-base | 2.31 \| 3.24 |
| | Baichuan-base | 3.41 \| 5.60 |
| | Step-Audio-chat | 2.82 \| 4.74 |
| | Qwen2.5-Omni | 1.53 \| 2.68 |
| | Kimi-Audio | 1.42 \| 2.44 |
Audio Understanding:

| Datasets | Model | Performance↑ |
|---|---|---|
| MMAU music \| sound \| speech | Qwen2-Audio-base | 58.98 \| 69.07 \| 52.55 |
| | Baichuan-chat | 49.10 \| 59.46 \| 42.47 |
| | GLM-4-Voice | 38.92 \| 43.54 \| 32.43 |
| | Step-Audio-chat | 49.40 \| 53.75 \| 47.75 |
| | Qwen2.5-Omni | 62.16 \| 67.57 \| 53.92 |
| | Kimi-Audio | 61.68 \| 73.27 \| 60.66 |
| ClothoAQA test \| dev | Qwen2-Audio-base | 71.73 \| 72.63 |
| | Baichuan-chat | 48.02 \| 48.16 |
| | Step-Audio-chat | 45.84 \| 44.98 |
| | Qwen2.5-Omni | 72.86 \| 73.12 |
| | Kimi-Audio | 71.24 \| 73.18 |
| VocalSound | Qwen2-Audio-base | 93.82 |
| | Baichuan-base | 58.17 |
| | Step-Audio-chat | 28.58 |
| | Qwen2.5-Omni | 93.73 |
| | Kimi-Audio | 94.85 |
| Nonspeech7k | Qwen2-Audio-base | 87.17 |
| | Baichuan-chat | 59.03 |
| | Step-Audio-chat | 21.38 |
| | Qwen2.5-Omni | 69.89 |
| | Kimi-Audio | 93.93 |
| MELD | Qwen2-Audio-base | 51.23 |
| | Baichuan-chat | 23.59 |
| | Step-Audio-chat | 33.54 |
| | Qwen2.5-Omni | 49.83 |
| | Kimi-Audio | 59.13 |
| TUT2017 | Qwen2-Audio-base | 33.83 |
| | Baichuan-base | 27.9 |
| | Step-Audio-chat | 7.41 |
| | Qwen2.5-Omni | 43.27 |
| | Kimi-Audio | 65.25 |
| CochlScene test \| dev | Qwen2-Audio-base | 52.69 \| 50.96 |
| | Baichuan-base | 34.93 \| 34.56 |
| | Step-Audio-chat | 10.06 \| 10.42 |
| | Qwen2.5-Omni | 63.82 \| 63.82 |
| | Kimi-Audio | 79.84 \| 80.99 |
Audio-to-Text Chat:

| Datasets | Model | Performance↑ |
|---|---|---|
| OpenAudioBench AlpacaEval \| Llama Questions \| Reasoning QA \| TriviaQA \| Web Questions | Qwen2-Audio-chat | 57.19 \| 69.67 \| 42.77 \| 40.30 \| 45.20 |
| | Baichuan-chat | 59.65 \| 74.33 \| 46.73 \| 55.40 \| 58.70 |
| | GLM-4-Voice | 57.89 \| 76.00 \| 47.43 \| 51.80 \| 55.40 |
| | StepAudio-chat | 56.53 \| 72.33 \| 60.00 \| 56.80 \| 73.00 |
| | Qwen2.5-Omni | 72.76 \| 75.33 \| 63.76 \| 57.06 \| 62.80 |
| | Kimi-Audio | 75.73 \| 79.33 \| 58.02 \| 62.10 \| 70.20 |
| VoiceBench AlpacaEval \| CommonEval \| SD-QA \| MMSU | Qwen2-Audio-chat | 3.69 \| 3.40 \| 35.35 \| 35.43 |
| | Baichuan-chat | 4.00 \| 3.39 \| 49.64 \| 48.80 |
| | GLM-4-Voice | 4.06 \| 3.48 \| 43.31 \| 40.11 |
| | StepAudio-chat | 3.99 \| 2.99 \| 46.84 \| 28.72 |
| | Qwen2.5-Omni | 4.33 \| 3.84 \| 57.41 \| 56.38 |
| | Kimi-Audio | 4.46 \| 3.97 \| 63.12 \| 62.17 |
| VoiceBench OpenBookQA \| IFEval \| AdvBench \| Avg | Qwen2-Audio-chat | 49.01 \| 22.57 \| 98.85 \| 54.72 |
| | Baichuan-chat | 63.30 \| 41.32 \| 86.73 \| 62.51 |
| | GLM-4-Voice | 52.97 \| 24.91 \| 88.08 \| 57.17 |
| | StepAudio-chat | 31.87 \| 29.19 \| 65.77 \| 48.86 |
| | Qwen2.5-Omni | 79.12 \| 53.88 \| 99.62 \| 72.83 |
| | Kimi-Audio | 83.52 \| 61.10 \| 100.00 \| 76.93 |
Speech Conversation (ability scores↑):

| Model | Speed Control | Accent Control | Emotion Control | Empathy | Style Control | Avg |
|---|---|---|---|---|---|---|
| GPT-4o | 4.21 | 3.65 | 4.05 | 3.87 | 4.54 | 4.06 |
| Step-Audio-chat | 3.25 | 2.87 | 3.33 | 3.05 | 4.14 | 3.33 |
| GLM-4-Voice | 3.83 | 3.51 | 3.77 | 3.07 | 4.04 | 3.65 |
| GPT-4o-mini | 3.15 | 2.71 | 4.24 | 3.16 | 4.01 | 3.45 |
| Kimi-Audio | 4.30 | 3.45 | 4.27 | 3.39 | 4.09 | 3.90 |
Evaluating and comparing audio foundation models is challenging due to inconsistent metrics, varying inference configurations, and a lack of standardized generation evaluation. To address this, we developed and open-sourced an Evaluation Toolkit.
Key features:
We encourage the community to use and contribute to this toolkit to foster more reliable and comparable benchmarking. Find it here: Kimi-Audio-Evalkit.
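As a small illustration of why standardized scoring matters (this sketch is not part of the toolkit), the same ASR hypothesis can yield very different WERs depending on text normalization alone; the example below uses the third-party `jiwer` package:

```python
import string

import jiwer  # third-party WER library (pip install jiwer); not part of Kimi-Audio-Evalkit

reference = "It's the end of one chapter and the beginning of a new one"
hypothesis = "its the end of one chapter, and the beginning of a new one."

# Scored raw, casing and punctuation differences count as word errors.
print("raw WER:", jiwer.wer(reference, hypothesis))

def normalize(text: str) -> str:
    # Lowercase and strip punctuation before scoring.
    return text.lower().translate(str.maketrans("", "", string.punctuation))

# After simple normalization, the same pair scores 0.0.
print("normalized WER:", jiwer.wer(normalize(reference), normalize(hypothesis)))
```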
The model is based on and modified from Qwen2.5-7B. Code derived from Qwen2.5-7B is licensed under the Apache License 2.0. Other parts of the code are licensed under the MIT License.
We would like to thank all the open-source projects and individuals whose work contributed to the development of Kimi-Audio!
If you find Kimi-Audio useful in your research or applications, please cite our technical report:
```
@misc{kimi_audio_2024,
    title={Kimi-Audio Technical Report},
    author={Kimi Team},
    year={2024},
    eprint={arXiv:placeholder},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
For questions, issues, or collaboration inquiries, please feel free to open an issue on GitHub.