## 💡 Summary

This skill converts text to speech using Kyutai's Pocket TTS, with support for voice cloning and multiple pre-made voices.
```yaml
name: text-to-voice
description: >
  Convert text to speech using Kyutai's Pocket TTS. Use when the user asks to
  "generate speech", "text to speech", "TTS", "convert text to audio",
  "voice synthesis", "generate voice", "read aloud", or "create audio from text".
  Supports voice cloning from audio samples and multiple pre-made voices
  (alba, marius, javert, jean, fantine, cosette, eponine, azelma).
license: MIT
metadata:
  contributor: Aaron Adetunmbi
  thanks: kyutai-labs
```
# Text-to-Voice with Kyutai Pocket TTS

Convert text to natural speech using Kyutai's Pocket TTS, a lightweight 100M-parameter model that runs efficiently on CPU.
## Installation

```bash
pip install pocket-tts

# or use uvx to run without installing:
uvx pocket-tts generate
```
Requires Python 3.10+ and PyTorch 2.5+. GPU not required.
## CLI Usage

### Basic Generation

```bash
# Generate with defaults (saves to ./tts_output.wav)
uvx pocket-tts generate

# Specify text
pocket-tts generate --text "Hello, this is my message."

# Specify output file location
pocket-tts generate --text "Hello" --output-path ./audio/greeting.wav

# Full example with all common options
pocket-tts generate \
  --text "Welcome to the demo." \
  --voice alba \
  --output-path ./output/welcome.wav
```
### CLI Options
| Option | Default | Description |
|--------|---------|-------------|
| --text | "Hello world..." | Text to convert to speech |
| --voice | alba | Voice name, local file path, or HuggingFace URL |
| --output-path | ./tts_output.wav | Where to save the generated audio file |
| --temperature | 0.7 | Generation temperature (higher = more expressive) |
| --lsd-decode-steps | 1 | Quality steps (higher = better quality, slower) |
| --eos-threshold | -4.0 | End detection threshold (lower = finish earlier) |
| --frames-after-eos | auto | Extra frames after end (each frame = 80ms) |
| --device | cpu | Device to use (cpu/cuda) |
| -q, --quiet | false | Disable logging output |
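The `--frames-after-eos` padding can be reasoned about numerically: at Pocket TTS's 24 kHz output rate, each 80 ms frame corresponds to 1,920 samples. A small illustrative sketch (the helper function is hypothetical, not part of the CLI or library):

```python
# Illustrative sketch of the --frames-after-eos padding math.
SAMPLE_RATE = 24000   # Hz, Pocket TTS output rate
FRAME_MS = 80         # each extra frame adds 80 ms of trailing audio

def padding_samples(frames: int) -> int:
    """Samples of trailing audio added by `frames` extra frames."""
    return frames * FRAME_MS * SAMPLE_RATE // 1000

assert padding_samples(1) == 1920   # one frame = 1920 samples
assert padding_samples(5) == 9600   # five frames = 400 ms of padding
```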
### Voice Selection (CLI)

```bash
# Use a pre-made voice by name
pocket-tts generate --voice alba --text "Hello"
pocket-tts generate --voice javert --text "Hello"

# Use a local audio file for voice cloning
pocket-tts generate --voice ./my_voice.wav --text "Hello"

# Use a voice from HuggingFace
pocket-tts generate --voice "hf://kyutai/tts-voices/alba-mackenna/merchant.wav" --text "Hello"
```
### Quality Tuning (CLI)

```bash
# Higher quality (more generation steps)
pocket-tts generate --lsd-decode-steps 5 --temperature 0.5 --output-path high_quality.wav

# More expressive/varied output
pocket-tts generate --temperature 1.0 --output-path expressive.wav

# Shorter output (finishes speaking earlier)
pocket-tts generate --eos-threshold -3.0 --output-path shorter.wav
```
## Local Web Server
For quick iteration with multiple voices/texts:
```bash
uvx pocket-tts serve
# Open http://localhost:8000
```
## Available Voices

Pre-made voices (use the name directly with `--voice`):
| Voice | Gender | License | Description |
|-------|--------|---------|-------------|
| alba | Female | CC BY 4.0 | Casual voice |
| marius | Male | CC0 | Voice donation |
| javert | Male | CC0 | Voice donation |
| jean | Male | CC-NC | EARS dataset |
| fantine | Female | CC BY 4.0 | VCTK dataset |
| cosette | Female | CC-NC | Expresso dataset |
| eponine | Female | CC BY 4.0 | VCTK dataset |
| azelma | Female | CC BY 4.0 | VCTK dataset |
Full voice catalog: https://huggingface.co/kyutai/tts-voices
For detailed voice information, see references/voices.md.
## Voice Cloning
Clone any voice from an audio sample. For best results:
- Use clean audio (minimal background noise)
- 10+ seconds recommended
- Consider Adobe Podcast Enhance to clean samples
```bash
pocket-tts generate --voice ./my_recording.wav --text "Hello" --output-path cloned.wav
```
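Since 10+ seconds of clean audio is recommended, it can help to check a prompt's duration before cloning. A minimal standard-library sketch, assuming the prompt is a WAV file (`sample_seconds` is a hypothetical helper, not part of pocket-tts):

```python
import wave

def sample_seconds(path: str) -> float:
    """Duration of a WAV voice prompt in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

# Example: warn if the prompt is shorter than the recommended 10 seconds.
# if sample_seconds("./my_recording.wav") < 10.0:
#     print("Warning: voice prompt is shorter than 10 seconds")
```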
## Output Format
- Sample Rate: 24kHz
- Channels: Mono
- Format: 16-bit PCM WAV
- Default location: `./tts_output.wav`
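These properties can be confirmed from the WAV header with the standard library. A sketch (`describe_wav` is a hypothetical helper, and it assumes the output file already exists):

```python
import wave

def describe_wav(path: str) -> dict:
    """Read header fields from a WAV file to confirm the output format."""
    with wave.open(path, "rb") as wf:
        return {
            "sample_rate": wf.getframerate(),
            "channels": wf.getnchannels(),
            "bits": 8 * wf.getsampwidth(),
        }

# For a Pocket TTS output file the expected result is:
# {"sample_rate": 24000, "channels": 1, "bits": 16}
# info = describe_wav("./tts_output.wav")
```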
## Python API
For programmatic use:
```python
from pocket_tts import TTSModel
import scipy.io.wavfile

tts_model = TTSModel.load_model()
voice_state = tts_model.get_state_for_audio_prompt("alba")
audio = tts_model.generate_audio(voice_state, "Hello world!")

# Save to specific location
scipy.io.wavfile.write("./audio/output.wav", tts_model.sample_rate, audio.numpy())
```

### `TTSModel.load_model()`

```python
model = TTSModel.load_model(
    variant="b6369a24",   # Model variant
    temp=0.7,             # Temperature (0.0-1.0)
    lsd_decode_steps=1,   # Generation steps
    noise_clamp=None,     # Max noise value
    eos_threshold=-4.0,   # End-of-sequence threshold
)
```
### Voice State

```python
# Pre-made voice
voice_state = model.get_state_for_audio_prompt("alba")

# Local file
voice_state = model.get_state_for_audio_prompt("./my_voice.wav")

# HuggingFace
voice_state = model.get_state_for_audio_prompt("hf://kyutai/tts-voices/alba-mackenna/casual.wav")
```
### Generate Audio

```python
audio = model.generate_audio(voice_state, "Text to speak")
# Returns: torch.Tensor (1D)
```
### Streaming

```python
for chunk in model.generate_audio_stream(voice_state, "Long text..."):
    # Process each chunk as it's generated
    pass
```
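One way to consume the stream is to append each chunk to a WAV file as it arrives. The sketch below is self-contained: it uses only the standard library and synthetic int16 chunks. In real use the chunks would come from `model.generate_audio_stream(...)` as tensors, and converting them to int16 samples (e.g. scaling by 32767) is an assumption about the output scale, not documented behavior:

```python
import struct
import wave

def write_stream_to_wav(path, chunks, sample_rate=24000):
    """Append successive chunks of int16 samples to a mono 16-bit WAV.

    Returns the total number of frames written.
    """
    frames = 0
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)            # 16-bit PCM
        wf.setframerate(sample_rate)
        for chunk in chunks:          # real use: model.generate_audio_stream(...)
            wf.writeframes(struct.pack(f"<{len(chunk)}h", *chunk))
            frames += len(chunk)
    return frames

# Synthetic stand-in for two 80 ms chunks (1920 samples each).
total = write_stream_to_wav("streamed.wav", [[0] * 1920, [100] * 1920])
```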
### Properties

- `model.sample_rate` - 24000 Hz
- `model.device` - "cpu" or "cuda"
## Performance
- ~200ms latency to first audio chunk
- ~6x real-time on MacBook Air M4 CPU
- Uses only 2 CPU cores
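"~6x real-time" means roughly six seconds of audio are produced per second of wall-clock time. A trivial sketch of that ratio (the helper is hypothetical):

```python
def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio generated per second of wall-clock time."""
    return audio_seconds / wall_seconds

# 12 s of speech generated in 2 s of wall time -> 6x real-time
assert real_time_factor(12.0, 2.0) == 6.0
```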
## Limitations
- English only
- No built-in pause/silence control
## Pros

- Supports multiple voices and voice cloning
- Lightweight and efficient, well suited to CPU-only use
- Clear, easy-to-use CLI
- Flexible quality and output options
## Cons

- English only
- No built-in pause or silence control
- Requires specific Python and PyTorch versions
- Voice cloning may require clean audio samples
