💡 Summary
Voxtype is a push-to-talk speech-to-text tool optimized for Linux, with support for Wayland and X11.
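🤖 AI take: "Looks capable, but don't let the configuration scare people off."
Risk: Medium. Recommended checks: whether it executes shell or command-line instructions; whether it makes external network requests (SSRF or data exfiltration); the scope of file reads and writes and the risk of path traversal. Run it with least privilege, and audit the code and its dependencies before enabling it in production.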
Voxtype
Push-to-talk voice-to-text for Linux. Optimized for Wayland, works on X11 too.
Hold a hotkey (default: ScrollLock) while speaking, release to transcribe and output the text at your cursor position.
Features
- Works on any Linux desktop - Uses compositor keybindings (Hyprland, Sway, River) with evdev fallback for X11 and other environments
- Fully offline by default - Uses whisper.cpp for local transcription, with optional remote server support
- Fallback chain - Types via wtype (best CJK support), falls back to dotool (keyboard layout support), ydotool, then clipboard
- Push-to-talk or Toggle mode - Hold to record, or press once to start/stop
- Audio feedback - Optional sound cues when recording starts/stops
- Configurable - Choose your hotkey, model size, output mode, and more
- Waybar integration - Optional status indicator shows recording state in your bar
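The fallback chain in the feature list above can be pictured roughly as follows. This is a minimal Python sketch, not Voxtype's actual implementation: the function name `pick_output_backend` and its injectable `which` parameter are hypothetical, and the sketch only checks whether each tool is on PATH, whereas the real tool may fall back when a backend fails at runtime.

```python
import shutil
from typing import Callable, Optional

# Order taken from the feature list: wtype, then dotool, then ydotool.
TYPING_BACKENDS = ["wtype", "dotool", "ydotool"]


def pick_output_backend(
    which: Callable[[str], Optional[str]] = shutil.which,
    fallback_to_clipboard: bool = True,
) -> Optional[str]:
    """Return the first typing backend found on PATH, else "clipboard" or None."""
    for name in TYPING_BACKENDS:
        if which(name) is not None:
            return name
    # Nothing can type for us: fall back to the clipboard if allowed.
    return "clipboard" if fallback_to_clipboard else None
```

Calling `pick_output_backend()` with no arguments inspects the real PATH; passing a custom `which` makes the ordering easy to check without installing anything.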
Quick Start
# 1. Build
cargo build --release

# 2. Install typing backend (Wayland)
# Fedora: sudo dnf install wtype
# Arch: sudo pacman -S wtype
# Ubuntu: sudo apt install wtype

# 3. Download whisper model
./target/release/voxtype setup --download

# 4. Add keybinding to your compositor
# See "Compositor Keybindings" section below

# 5. Run
./target/release/voxtype
Compositor Keybindings
Voxtype works best with your compositor's native keybindings. Add these to your compositor config:
Hyprland (~/.config/hypr/hyprland.conf):
bind = SUPER, V, exec, voxtype record start
bindr = SUPER, V, exec, voxtype record stop
Sway (~/.config/sway/config):
bindsym --no-repeat $mod+v exec voxtype record start
bindsym --release $mod+v exec voxtype record stop
River (~/.config/river/init):
riverctl map normal Super V spawn 'voxtype record start'
riverctl map -release normal Super V spawn 'voxtype record stop'
Then disable the built-in hotkey in your config:
# ~/.config/voxtype/config.toml
[hotkey]
enabled = false
X11 / Built-in hotkey fallback: If you're on X11 or prefer voxtype's built-in hotkey (ScrollLock by default), add yourself to the input group with sudo usermod -aG input $USER, then log out and back in. See the User Manual for details.
Omarchy / Multi-modifier keybindings: If using keybindings with multiple modifiers (e.g., SUPER+CTRL+X), releasing keys slowly can cause typed text to trigger window manager shortcuts instead of inserting text. See Modifier Key Interference in the troubleshooting guide for the solution using output hooks and Hyprland submaps.
Usage
- Run voxtype (it runs as a foreground daemon)
- Hold ScrollLock (or your configured hotkey)
- Speak
- Release the key
- Text appears at your cursor (or in clipboard if typing isn't available)
Press Ctrl+C to stop the daemon.
Toggle Mode
If you prefer to press once to start recording and again to stop (instead of holding):
# Via command line
voxtype --toggle

# Or in config.toml
[hotkey]
key = "SCROLLLOCK"
mode = "toggle"
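The difference between the two hotkey modes comes down to how press and release events change the recording state. The sketch below illustrates that logic in Python under stated assumptions: the `Recorder` class and the `"push_to_talk"` label are hypothetical names for illustration (the config only documents `mode = "toggle"`), and nothing here is taken from Voxtype's source.

```python
from dataclasses import dataclass


@dataclass
class Recorder:
    """Tracks whether recording is active under a given hotkey mode."""

    mode: str = "push_to_talk"  # hypothetical label for the default hold-to-record mode
    recording: bool = False

    def on_key_press(self) -> None:
        if self.mode == "toggle":
            # Press once to start, press again to stop.
            self.recording = not self.recording
        else:
            # Hold to record.
            self.recording = True

    def on_key_release(self) -> None:
        if self.mode != "toggle":
            # In hold mode, releasing the key ends the recording.
            self.recording = False
```

In toggle mode the release event is deliberately ignored, which is what lets you take your finger off the key while speaking.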
Configuration
Config file location: ~/.config/voxtype/config.toml
[hotkey]
key = "SCROLLLOCK"  # Or: PAUSE, F13-F24, RIGHTALT, etc.
modifiers = []      # Optional: ["LEFTCTRL", "LEFTALT"]
# mode = "toggle"   # Uncomment for toggle mode (press to start/stop)

[audio]
device = "default"  # Or specific device from `pactl list sources short`
sample_rate = 16000
max_duration_secs = 60

# Audio feedback (sound cues when recording starts/stops)
# [audio.feedback]
# enabled = true
# theme = "default"  # "default", "subtle", "mechanical", or path to custom dir
# volume = 0.7       # 0.0 to 1.0

[whisper]
model = "base.en"  # tiny, base, small, medium, large-v3, large-v3-turbo
language = "en"    # Or "auto" for detection, or language code (es, fr, de, etc.)
translate = false  # Translate non-English speech to English
# threads = 4               # CPU threads for inference (omit for auto-detect)
# on_demand_loading = true  # Load model only when recording (saves memory)

[output]
mode = "type"  # "type", "clipboard", or "paste"
fallback_to_clipboard = true
type_delay_ms = 0  # Increase if characters are dropped
# auto_submit = true  # Send Enter after transcription (for chat apps, terminals)
# Note: "paste" mode copies to clipboard then simulates Ctrl+V
# Useful for non-US keyboard layouts where ydotool typing fails

[output.notification]
on_recording_start = false  # Notify when PTT activates
on_recording_stop = false   # Notify when transcribing
on_transcription = true     # Show transcribed text

# Text processing (word replacements, spoken punctuation)
# [text]
# spoken_punctuation = true  # Say "period" → ".", "open paren" → "("
# replacements = { "vox type" = "voxtype", "oh marky" = "Omarchy" }

# State file for Waybar/polybar integration (enabled by default)
state_file = "auto"  # Or custom path, or "disabled" to turn off
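The comments in the config above imply a couple of value constraints: output mode is one of three strings, and feedback volume lies between 0.0 and 1.0. As a rough illustration, the Python sketch below checks those two constraints on an already-parsed config dictionary. The function `validate_config` is hypothetical and is not part of Voxtype; how Voxtype itself validates its config is not described here.

```python
def validate_config(cfg: dict) -> list[str]:
    """Return human-readable problems found in a parsed config dict."""
    problems = []

    # [output] mode is documented as "type", "clipboard", or "paste".
    mode = cfg.get("output", {}).get("mode", "type")
    if mode not in ("type", "clipboard", "paste"):
        problems.append(f"output.mode: unknown value {mode!r}")

    # [audio.feedback] volume is documented as 0.0 to 1.0.
    volume = cfg.get("audio", {}).get("feedback", {}).get("volume")
    if volume is not None and not 0.0 <= volume <= 1.0:
        problems.append(f"audio.feedback.volume: {volume} is outside 0.0 to 1.0")

    return problems
```

On Python 3.11 or newer, the dictionary could come from the standard `tomllib` module, for example by opening the config file in binary mode and passing it to `tomllib.load`.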
Audio Feedback
Enable audio feedback to hear a sound when recording starts and stops:
[audio.feedback]
enabled = true
theme = "default"  # Built-in themes: default, subtle, mechanical
volume = 0.7       # 0.0 to 1.0
Built-in themes:
- default - Clear, pleasant two-tone beeps
- subtle - Quiet, unobtrusive clicks
- mechanical - Typewriter/keyboard-like sounds
Custom themes: Point theme to a directory containing start.wav, stop.wav, and error.wav files.
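Before pointing theme at a custom directory, it can help to confirm that the three expected files are present. The Python sketch below does only that; `missing_theme_files` is a hypothetical helper, and it does not check that the files are valid WAV audio.

```python
from pathlib import Path

# File names taken from the custom-theme description above.
REQUIRED_SOUNDS = ("start.wav", "stop.wav", "error.wav")


def missing_theme_files(theme_dir: str) -> list[str]:
    """Return the required sound files that are absent from theme_dir."""
    root = Path(theme_dir).expanduser()
    return [name for name in REQUIRED_SOUNDS if not (root / name).is_file()]
```

An empty list means the directory has all three files; anything else names what still needs to be added.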
Text Processing
Voxtype can post-process transcribed text with word replacements and spoken punctuation.
Word replacements fix commonly misheard words:
[text]
replacements = { "vox type" = "voxtype", "oh marky" = "Omarchy" }
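To make the effect of a replacements table concrete, here is a minimal Python sketch of phrase substitution. It assumes case-insensitive, whole-word matching, which is a guess made for illustration; Voxtype's actual matching rules may differ, and `apply_replacements` is a hypothetical name.

```python
import re


def apply_replacements(text: str, replacements: dict[str, str]) -> str:
    """Replace each phrase (case-insensitive, whole words) with its target."""
    # Longest phrases first, so a longer key wins over a shorter overlapping one.
    for phrase in sorted(replacements, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        target = replacements[phrase]
        # A function replacement keeps backslashes in the target literal.
        text = pattern.sub(lambda _match: target, text)
    return text
```

With the table from the config, `apply_replacements("I installed Vox Type on oh marky", {"vox type": "voxtype", "oh marky": "Omarchy"})` yields "I installed voxtype on Omarchy".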
Spoken punctuation (opt-in) converts spoken words to symbols - useful for developers:
[text]
spoken_punctuation = true
With this enabled, saying "function open paren close paren" outputs function(). Supports period, comma, brackets, braces, newlines, and many more. See CONFIGURATION.md for the full list.
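One way to picture how "function open paren close paren" becomes function() is a phrase table where each symbol declares whether it attaches to the text on its left and right. The Python sketch below is an illustration only: the four-entry table, the attachment rules, and the name `spoken_to_symbols` are assumptions, not Voxtype's real phrase list or algorithm.

```python
# phrase -> (symbol, attach_left, attach_right); a tiny hypothetical subset
SPOKEN = {
    ("open", "paren"): ("(", True, True),
    ("close", "paren"): (")", True, False),
    ("period",): (".", True, False),
    ("comma",): (",", True, False),
}


def spoken_to_symbols(text: str) -> str:
    """Convert spoken punctuation phrases into symbols with sensible spacing."""
    words = text.split()
    out = ""
    glue_next = True  # no space before the very first token
    i = 0
    while i < len(words):
        # Prefer two-word phrases ("open paren") over single words.
        for size in (2, 1):
            key = tuple(w.lower() for w in words[i:i + size])
            if len(key) == size and key in SPOKEN:
                symbol, attach_left, attach_right = SPOKEN[key]
                out += symbol if (attach_left or glue_next) else " " + symbol
                glue_next = attach_right
                i += size
                break
        else:
            # Ordinary word: separate it with a space unless glued.
            out += words[i] if glue_next else " " + words[i]
            glue_next = False
            i += 1
    return out
```

Under these rules, "hello comma world period" becomes "hello, world." and the README's own example produces function().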
Post-Processing Command (Advanced)
For advanced cleanup, you can pipe transcriptions through an external command like a local LLM for grammar correction, filler word removal, or text formatting:
[output.post_process]
command = "ollama run llama3.2:1b 'Clean up this dictation. Fix grammar, remove filler words:'"
timeout_ms = 30000  # 30 second timeout for LLM
The command receives text on stdin and outputs cleaned text on stdout. On any failure (timeout, error), Voxtype gracefully falls back to the original transcription.
See CONFIGURATION.md for more examples including scripts for LM Studio, Ollama, and llama.cpp.
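The stdin-to-stdout contract with graceful fallback described above can be sketched in Python as follows. This is an illustration, not Voxtype's code: the name `post_process` is hypothetical, and treating a non-zero exit status or empty output as a failure is an assumption about what "error" covers.

```python
import subprocess


def post_process(text: str, command: str, timeout_ms: int = 30000) -> str:
    """Pipe text through a shell command; return the original text on any failure."""
    try:
        result = subprocess.run(
            command,
            shell=True,            # the config value is a single command string
            input=text,            # transcription goes in on stdin
            capture_output=True,   # cleaned text comes back on stdout
            text=True,
            timeout=timeout_ms / 1000,
        )
    except (subprocess.TimeoutExpired, OSError):
        return text  # timeout or launch failure: keep the raw transcription

    cleaned = result.stdout.strip()
    if result.returncode != 0 or not cleaned:
        return text  # assumed: a failed or silent command also triggers fallback
    return cleaned
```

For example, `post_process("hello world", "tr a-z A-Z")` returns the upper-cased text, while a command that exits with an error leaves the input unchanged. Because the command string is run through a shell, it should only ever come from a config file you control.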
CLI Options
voxtype [OPTIONS] [COMMAND]
Commands:
  daemon      Run as background daemon (default)
  transcribe  Transcribe an audio file
  setup       Setup and installation utilities
  config      Show current configuration
  status      Show daemon status (for Waybar/polybar integration)
  record      Control recording from external sources (compositor keybindings, scripts)
Setup subcommands:
  voxtype setup             Run basic dependency checks (default)
  voxtype setup --download  Download the configured Whisper model
  voxtype setup systemd     Install/manage systemd user service
  voxtype setup waybar      Generate Waybar module configuration
  voxtype setup model       Interactive model selection and download
  voxtype setup gpu         Manage GPU acceleration (switch CPU/Vulkan)
Status options:
  voxtype status --format json       Output as JSON (for Waybar)
  voxtype status --follow            Continuously output on state changes
  voxtype status --extended          Include model, device, backend in JSON
  voxtype status --icon-theme THEME  Icon theme (emoji, nerd-font, material, etc.)
Record subcommands (for compositor keybindings):
  voxtype record start                     Start recording (send SIGUSR1 to daemon)
  voxtype record start --output-file PATH  Write transcription to a file
  voxtype record stop                      Stop recording and transcribe (send SIGUSR2 to daemon)
  voxtype record toggle                    Toggle recording state
Options:
  -c, --config <FILE>  Path to config file
  -v, --verbose        Increase verbosity (-v, -vv)
  -q, --quiet          Quiet mode (errors only)
      --clipboard      Force clipboard mode
      --paste          Force paste mode (clipboard + Ctrl+V)
      --model <MODEL>  Override whisper model
      --hotkey <KEY>   Override hotkey
      --toggle         Use toggle mode (press to start/stop)
  -h, --help           Print help
  -V, --version        Print version
Whisper Models
| Model | Size | English WER | Speed |
|-------|------|-------------|-------|
| tiny.en | 39 MB | ~10% | Fastest |
| base.en | 142 MB | ~8% | Fast |
| small.en | 466 MB | ~6% | Medium |
| medium.en | 1.5 GB | ~5% | Slow |
| large-v3 | 3 GB | ~4% | Slowest |
| large-v3-turbo | 1.6 GB | ~4% | Fast |
For most uses, base.en provides a good balance of speed and accuracy. If you have a GPU, large-v3-turbo offers excellent accuracy with fast inference.
Multilingual Support
The .en models are English-only but faster and more accurate for English. For other languages, use large-v3, which supports 99 languages.
Use Case 1: Transcribe in the spoken language (speak French, output French)
[whisper]
model = "large-v3"
language = "auto"  # Auto-detect and transcribe in that language
translate = false
Use Case 2: Translate to English (speak French, output English)
[whisper]
model = "large-v3"
language = "auto"  # Auto-detect the spoken language
translate = true   # Translate output to English
**Use Cas
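Pros
- Fully offline by default
- Highly configurable
- Supports multiple Linux environments
- Audio feedback for user interaction
Cons
- Linux only
- Requires configuration for the best experience
- Depends on external libraries
- New users may face a learning curve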
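Disclaimer: This content comes from an open-source GitHub project and is provided for display and rating analysis only.
Copyright belongs to the original author, peteonrails.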

