## 💡 Summary

This skill converts text to speech using Kyutai's Pocket TTS, with support for voice cloning and multiple pre-made voices.
```yaml
name: text-to-voice
description: >
  Convert text to speech using Kyutai's Pocket TTS. Use when the user asks to
  "generate speech", "text to speech", "TTS", "convert text to audio",
  "voice synthesis", "generate voice", "read aloud", or "create audio from text".
  Supports voice cloning from audio samples and multiple pre-made voices
  (alba, marius, javert, jean, fantine, cosette, eponine, azelma).
license: MIT
metadata:
  contributor: Aaron Adetunmbi
  thanks: kyutai-labs
```
# Text-to-Voice with Kyutai Pocket TTS

Convert text to natural speech using Kyutai's Pocket TTS, a lightweight 100M-parameter model that runs efficiently on CPU.
## Installation

```bash
pip install pocket-tts

# or use uvx to run without installing:
uvx pocket-tts generate
```
Requires Python 3.10+ and PyTorch 2.5+. GPU not required.
## CLI Usage

### Basic Generation

```bash
# Generate with defaults (saves to ./tts_output.wav)
uvx pocket-tts generate

# Specify text
pocket-tts generate --text "Hello, this is my message."

# Specify output file location
pocket-tts generate --text "Hello" --output-path ./audio/greeting.wav

# Full example with all common options
pocket-tts generate \
  --text "Welcome to the demo." \
  --voice alba \
  --output-path ./output/welcome.wav
```
### CLI Options
| Option | Default | Description |
|--------|---------|-------------|
| --text | "Hello world..." | Text to convert to speech |
| --voice | alba | Voice name, local file path, or HuggingFace URL |
| --output-path | ./tts_output.wav | Where to save the generated audio file |
| --temperature | 0.7 | Generation temperature (higher = more expressive) |
| --lsd-decode-steps | 1 | Quality steps (higher = better quality, slower) |
| --eos-threshold | -4.0 | End detection threshold (lower = finish earlier) |
| --frames-after-eos | auto | Extra frames after end (each frame = 80ms) |
| --device | cpu | Device to use (cpu/cuda) |
| -q, --quiet | false | Disable logging output |
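The `--frames-after-eos` padding can be reasoned about numerically: at Pocket TTS's 24 kHz output rate, each 80 ms frame corresponds to 1,920 samples. A small illustrative sketch (the helper function is hypothetical, not part of the CLI or library):

```python
# Illustrative sketch of the --frames-after-eos padding math.
SAMPLE_RATE = 24000   # Hz, Pocket TTS output rate
FRAME_MS = 80         # each extra frame adds 80 ms of trailing audio

def padding_samples(frames: int) -> int:
    """Samples of trailing audio added by `frames` extra frames."""
    return frames * FRAME_MS * SAMPLE_RATE // 1000

assert padding_samples(1) == 1920   # one frame = 1920 samples
assert padding_samples(5) == 9600   # five frames = 400 ms of padding
```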
### Voice Selection (CLI)

```bash
# Use a pre-made voice by name
pocket-tts generate --voice alba --text "Hello"
pocket-tts generate --voice javert --text "Hello"

# Use a local audio file for voice cloning
pocket-tts generate --voice ./my_voice.wav --text "Hello"

# Use a voice from HuggingFace
pocket-tts generate --voice "hf://kyutai/tts-voices/alba-mackenna/merchant.wav" --text "Hello"
```
### Quality Tuning (CLI)

```bash
# Higher quality (more generation steps)
pocket-tts generate --lsd-decode-steps 5 --temperature 0.5 --output-path high_quality.wav

# More expressive/varied output
pocket-tts generate --temperature 1.0 --output-path expressive.wav

# Shorter output (finishes speaking earlier)
pocket-tts generate --eos-threshold -3.0 --output-path shorter.wav
```
## Local Web Server
For quick iteration with multiple voices/texts:
```bash
uvx pocket-tts serve
# Open http://localhost:8000
```
## Available Voices

Pre-made voices (use the name directly with `--voice`):
| Voice | Gender | License | Description |
|-------|--------|---------|-------------|
| alba | Female | CC BY 4.0 | Casual voice |
| marius | Male | CC0 | Voice donation |
| javert | Male | CC0 | Voice donation |
| jean | Male | CC-NC | EARS dataset |
| fantine | Female | CC BY 4.0 | VCTK dataset |
| cosette | Female | CC-NC | Expresso dataset |
| eponine | Female | CC BY 4.0 | VCTK dataset |
| azelma | Female | CC BY 4.0 | VCTK dataset |
Full voice catalog: https://huggingface.co/kyutai/tts-voices
For detailed voice information, see references/voices.md.
## Voice Cloning
Clone any voice from an audio sample. For best results:
- Use clean audio (minimal background noise)
- 10+ seconds recommended
- Consider Adobe Podcast Enhance to clean samples
```bash
pocket-tts generate --voice ./my_recording.wav --text "Hello" --output-path cloned.wav
```
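Since 10+ seconds of clean audio is recommended, it can help to check a prompt's duration before cloning. A minimal standard-library sketch, assuming the prompt is a WAV file (`sample_seconds` is a hypothetical helper, not part of pocket-tts):

```python
import wave

def sample_seconds(path: str) -> float:
    """Duration of a WAV voice prompt in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

# Example: warn if the prompt is shorter than the recommended 10 seconds.
# if sample_seconds("./my_recording.wav") < 10.0:
#     print("Warning: voice prompt is shorter than 10 seconds")
```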
## Output Format
- Sample Rate: 24kHz
- Channels: Mono
- Format: 16-bit PCM WAV
- Default location: `./tts_output.wav`
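These properties can be confirmed from the WAV header with the standard library. A sketch (`describe_wav` is a hypothetical helper, and it assumes the output file already exists):

```python
import wave

def describe_wav(path: str) -> dict:
    """Read header fields from a WAV file to confirm the output format."""
    with wave.open(path, "rb") as wf:
        return {
            "sample_rate": wf.getframerate(),
            "channels": wf.getnchannels(),
            "bits": 8 * wf.getsampwidth(),
        }

# For a Pocket TTS output file the expected result is:
# {"sample_rate": 24000, "channels": 1, "bits": 16}
# info = describe_wav("./tts_output.wav")
```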
## Python API
For programmatic use:
```python
from pocket_tts import TTSModel
import scipy.io.wavfile

tts_model = TTSModel.load_model()
voice_state = tts_model.get_state_for_audio_prompt("alba")
audio = tts_model.generate_audio(voice_state, "Hello world!")

# Save to specific location
scipy.io.wavfile.write("./audio/output.wav", tts_model.sample_rate, audio.numpy())
```

### `TTSModel.load_model()`

```python
model = TTSModel.load_model(
    variant="b6369a24",   # Model variant
    temp=0.7,             # Temperature (0.0-1.0)
    lsd_decode_steps=1,   # Generation steps
    noise_clamp=None,     # Max noise value
    eos_threshold=-4.0,   # End-of-sequence threshold
)
```
### Voice State

```python
# Pre-made voice
voice_state = model.get_state_for_audio_prompt("alba")

# Local file
voice_state = model.get_state_for_audio_prompt("./my_voice.wav")

# HuggingFace
voice_state = model.get_state_for_audio_prompt("hf://kyutai/tts-voices/alba-mackenna/casual.wav")
```
### Generate Audio

```python
audio = model.generate_audio(voice_state, "Text to speak")
# Returns: torch.Tensor (1D)
```
### Streaming

```python
for chunk in model.generate_audio_stream(voice_state, "Long text..."):
    # Process each chunk as it's generated
    pass
```
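One way to consume the stream is to append each chunk to a WAV file as it arrives. The sketch below is self-contained: it uses only the standard library and synthetic int16 chunks. In real use the chunks would come from `model.generate_audio_stream(...)` as tensors, and converting them to int16 samples (e.g. scaling by 32767) is an assumption about the output scale, not documented behavior:

```python
import struct
import wave

def write_stream_to_wav(path, chunks, sample_rate=24000):
    """Append successive chunks of int16 samples to a mono 16-bit WAV.

    Returns the total number of frames written.
    """
    frames = 0
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)            # 16-bit PCM
        wf.setframerate(sample_rate)
        for chunk in chunks:          # real use: model.generate_audio_stream(...)
            wf.writeframes(struct.pack(f"<{len(chunk)}h", *chunk))
            frames += len(chunk)
    return frames

# Synthetic stand-in for two 80 ms chunks (1920 samples each).
total = write_stream_to_wav("streamed.wav", [[0] * 1920, [100] * 1920])
```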
### Properties

- `model.sample_rate` - 24000 Hz
- `model.device` - "cpu" or "cuda"
## Performance
- ~200ms latency to first audio chunk
- ~6x real-time on MacBook Air M4 CPU
- Uses only 2 CPU cores
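"~6x real-time" means roughly six seconds of audio are produced per second of wall-clock time. A trivial sketch of that ratio (the helper is hypothetical):

```python
def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio generated per second of wall-clock time."""
    return audio_seconds / wall_seconds

# 12 s of speech generated in 2 s of wall time -> 6x real-time
assert real_time_factor(12.0, 2.0) == 6.0
```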
## Limitations
- English only
- No built-in pause/silence control
## Pros

- Supports multiple voices and voice cloning
- Lightweight and efficient, well suited to CPU-only use
- Clear, easy-to-use CLI
- Flexible quality and output options
## Cons

- English only
- No built-in pause or silence control
- Requires specific Python and PyTorch versions
- Voice cloning may require clean audio samples
