Updated 4 months ago

voxtype

Name: voxtype
Rating: 4.1 (290 reviews)
Author: peteonrails

peteonrails/voxtype

💡 Summary

Voxtype is a push-to-talk voice-to-text tool for Linux, optimized for Wayland and X11.

🎯 Target Audience

Linux desktop users, developers needing voice transcription, accessibility advocates, content creators, multilingual speakers

🤖 AI Roast: “Powerful, but the setup might scare off the impatient.”

Security Analysis: Medium Risk

Risk: Medium. Review: shell/CLI command execution; outbound network access (SSRF, data egress); filesystem read/write scope and path traversal. Run with least privilege and audit before enabling in production.

Voxtype

voxtype.io

Push-to-talk voice-to-text for Linux. Optimized for Wayland, works on X11 too.

Hold a hotkey (default: ScrollLock) while speaking, release to transcribe and output the text at your cursor position.

Features

Works on any Linux desktop - Uses compositor keybindings (Hyprland, Sway, River) with evdev fallback for X11 and other environments
Fully offline by default - Uses whisper.cpp for local transcription, with optional remote server support
Fallback chain - Types via wtype (best CJK support), falls back to dotool (keyboard layout support), ydotool, then clipboard
Push-to-talk or Toggle mode - Hold to record, or press once to start/stop
Audio feedback - Optional sound cues when recording starts/stops
Configurable - Choose your hotkey, model size, output mode, and more
Waybar integration - Optional status indicator shows recording state in your bar
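
The fallback chain in the feature list can be pictured as a "first tool found wins" lookup. The following is a minimal Python sketch of that idea, not Voxtype's actual implementation (Voxtype may also fall back at runtime when a backend fails); the function name `pick_backend` and the injectable `find` parameter are assumptions made for illustration.

```python
import shutil
from typing import Callable, Optional, Sequence

# Order taken from the feature list above; "clipboard" is the last resort.
TYPING_BACKENDS = ("wtype", "dotool", "ydotool")


def pick_backend(
    candidates: Sequence[str] = TYPING_BACKENDS,
    find: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return the first candidate found on PATH, else "clipboard"."""
    for name in candidates:
        if find(name) is not None:
            return name
    return "clipboard"


if __name__ == "__main__":
    # Reports which backend this sketch would choose on the current machine.
    print(pick_backend())
```

Passing a custom `find` callable makes the selection easy to exercise without installing any of the tools.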

Quick Start

# 1. Build
cargo build --release

# 2. Install typing backend (Wayland)
# Fedora:
sudo dnf install wtype
# Arch:
sudo pacman -S wtype
# Ubuntu:
sudo apt install wtype

# 3. Download whisper model
./target/release/voxtype setup --download

# 4. Add keybinding to your compositor
# See "Compositor Keybindings" section below

# 5. Run
./target/release/voxtype

Compositor Keybindings

Voxtype works best with your compositor's native keybindings. Add these to your compositor config:

Hyprland (~/.config/hypr/hyprland.conf):

bind = SUPER, V, exec, voxtype record start
bindr = SUPER, V, exec, voxtype record stop

Sway (~/.config/sway/config):

bindsym --no-repeat $mod+v exec voxtype record start
bindsym --release $mod+v exec voxtype record stop

River (~/.config/river/init):

riverctl map normal Super V spawn 'voxtype record start'
riverctl map -release normal Super V spawn 'voxtype record stop'

Then disable the built-in hotkey in your config:

# ~/.config/voxtype/config.toml
[hotkey]
enabled = false

X11 / Built-in hotkey fallback: If you're on X11 or prefer voxtype's built-in hotkey (ScrollLock by default), add yourself to the input group with sudo usermod -aG input $USER, then log out and back in so the new group membership takes effect. See the User Manual for details.

Omarchy / Multi-modifier keybindings: If using keybindings with multiple modifiers (e.g., SUPER+CTRL+X), releasing keys slowly can cause typed text to trigger window manager shortcuts instead of inserting text. See Modifier Key Interference in the troubleshooting guide for the solution using output hooks and Hyprland submaps.

Usage

1. Run voxtype (it runs as a foreground daemon)
2. Hold ScrollLock (or your configured hotkey)
3. Speak
4. Release the key
5. Text appears at your cursor (or in clipboard if typing isn't available)

Press Ctrl+C to stop the daemon.

Toggle Mode

If you prefer to press once to start recording and again to stop (instead of holding):

# Via command line
voxtype --toggle

# Or in config.toml
[hotkey]
key = "SCROLLLOCK"
mode = "toggle"
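
The difference between holding and toggling comes down to which key events change the recording state. Here is a small Python sketch of that logic, assuming a simplified model with one key and press/release events; the `Mode` and `Recorder` names are hypothetical and do not come from Voxtype's source.

```python
from enum import Enum


class Mode(Enum):
    HOLD = "hold"      # hypothetical name for the default push-to-talk behavior
    TOGGLE = "toggle"  # matches mode = "toggle" in config.toml


class Recorder:
    def __init__(self, mode: Mode) -> None:
        self.mode = mode
        self.recording = False

    def on_key(self, pressed: bool) -> bool:
        """Feed one key event; return True when a recording just ended."""
        was_recording = self.recording
        if self.mode is Mode.HOLD:
            # Recording simply mirrors the key: down = record, up = stop.
            self.recording = pressed
        elif pressed:
            # In toggle mode only presses matter; releases are ignored.
            self.recording = not self.recording
        return was_recording and not self.recording
```

A `True` return value marks the moment at which the captured audio would be handed to transcription.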

Configuration

Config file location: ~/.config/voxtype/config.toml

[hotkey]
key = "SCROLLLOCK"  # Or: PAUSE, F13-F24, RIGHTALT, etc.
modifiers = []      # Optional: ["LEFTCTRL", "LEFTALT"]
# mode = "toggle"   # Uncomment for toggle mode (press to start/stop)

[audio]
device = "default"  # Or specific device from `pactl list sources short`
sample_rate = 16000
max_duration_secs = 60

# Audio feedback (sound cues when recording starts/stops)
# [audio.feedback]
# enabled = true
# theme = "default"   # "default", "subtle", "mechanical", or path to custom dir
# volume = 0.7        # 0.0 to 1.0

[whisper]
model = "base.en"   # tiny, base, small, medium, large-v3, large-v3-turbo
language = "en"     # Or "auto" for detection, or language code (es, fr, de, etc.)
translate = false   # Translate non-English speech to English
# threads = 4       # CPU threads for inference (omit for auto-detect)
# on_demand_loading = true  # Load model only when recording (saves memory)

[output]
mode = "type"       # "type", "clipboard", or "paste"
fallback_to_clipboard = true
type_delay_ms = 0   # Increase if characters are dropped
# auto_submit = true  # Send Enter after transcription (for chat apps, terminals)
# Note: "paste" mode copies to clipboard then simulates Ctrl+V
#       Useful for non-US keyboard layouts where ydotool typing fails

[output.notification]
on_recording_start = false  # Notify when PTT activates
on_recording_stop = false   # Notify when transcribing
on_transcription = true     # Show transcribed text

# Text processing (word replacements, spoken punctuation)
# [text]
# spoken_punctuation = true  # Say "period" → ".", "open paren" → "("
# replacements = { "vox type" = "voxtype", "oh marky" = "Omarchy" }

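# State file for Waybar/polybar integration (enabled by default)
# Note: TOML assigns a bare key to the most recent [table] header, so if this is
# meant as a top-level key, place it above [hotkey] when copying this example.
state_file = "auto"  # Or custom path, or "disabled" to turn off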

Audio Feedback

Enable audio feedback to hear a sound when recording starts and stops:

[audio.feedback]
enabled = true
theme = "default"  # Built-in themes: default, subtle, mechanical
volume = 0.7       # 0.0 to 1.0

Built-in themes:

default - Clear, pleasant two-tone beeps
subtle - Quiet, unobtrusive clicks
mechanical - Typewriter/keyboard-like sounds

Custom themes: Point theme to a directory containing start.wav, stop.wav, and error.wav files.
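
Before pointing theme at a custom directory, you may want to confirm that all three files are present. This is a minimal Python sketch under that assumption; `missing_sounds` is a hypothetical helper, not part of Voxtype.

```python
from pathlib import Path

# File names required for a custom theme, as listed above.
REQUIRED_SOUNDS = ("start.wav", "stop.wav", "error.wav")


def missing_sounds(theme_dir: str) -> list[str]:
    """Return the required file names that are absent from theme_dir."""
    root = Path(theme_dir).expanduser()
    return [name for name in REQUIRED_SOUNDS if not (root / name).is_file()]
```

An empty result means the directory contains every file the custom theme needs.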

Text Processing

Voxtype can post-process transcribed text with word replacements and spoken punctuation.

Word replacements fix commonly misheard words:

[text]
replacements = { "vox type" = "voxtype", "oh marky" = "Omarchy" }

Spoken punctuation (opt-in) converts spoken words to symbols - useful for developers:

[text]
spoken_punctuation = true

With this enabled, saying "function open paren close paren" outputs function(). Supports period, comma, brackets, braces, newlines, and many more. See CONFIGURATION.md for the full list.
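
To make the two text-processing steps concrete, here is a rough Python sketch of how word replacements and spoken punctuation could work. It is an illustration only: the `SPOKEN` table is a tiny subset, case-insensitive matching is an assumption, and Voxtype's real rules (see CONFIGURATION.md) may differ.

```python
import re

# A tiny, illustrative subset of spoken symbols; not the full list.
SPOKEN = {
    "open paren": "(",
    "close paren": ")",
    "period": ".",
    "comma": ",",
}


def apply_replacements(text: str, replacements: dict[str, str]) -> str:
    """Replace misheard phrases, matching whole words and ignoring case."""
    for heard, wanted in replacements.items():
        pattern = rf"\b{re.escape(heard)}\b"
        text = re.sub(pattern, lambda _m, w=wanted: w, text, flags=re.IGNORECASE)
    return text


def apply_spoken_punctuation(text: str) -> str:
    """Turn spoken symbol names into symbols attached to the previous word."""
    for phrase, symbol in SPOKEN.items():
        # Swallow the preceding space so "function open paren" becomes "function(".
        pattern = rf"\s*\b{re.escape(phrase)}\b"
        text = re.sub(pattern, lambda _m, s=symbol: s, text, flags=re.IGNORECASE)
    # Drop any space left directly after an opening parenthesis.
    return re.sub(r"\(\s+", "(", text)
```

A naive table like this would also rewrite ordinary uses of words such as "period", which is one reason the feature is opt-in.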

Post-Processing Command (Advanced)

For advanced cleanup, you can pipe transcriptions through an external command, such as a local LLM, for grammar correction, filler word removal, or text formatting:

[output.post_process]
command = "ollama run llama3.2:1b 'Clean up this dictation. Fix grammar, remove filler words:'"
timeout_ms = 30000  # 30 second timeout for LLM

The command receives text on stdin and outputs cleaned text on stdout. On any failure (timeout, error), Voxtype gracefully falls back to the original transcription.

See CONFIGURATION.md for more examples including scripts for LM Studio, Ollama, and llama.cpp.
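
The stdin/stdout contract and the fallback behavior described above can be sketched in a few lines of Python. This is not Voxtype's code: `post_process` is a hypothetical function, and splitting the command with `shlex` rather than running it through a shell is an assumption of this sketch.

```python
import shlex
import subprocess


def post_process(text: str, command: str, timeout_ms: int = 30000) -> str:
    """Pipe text through command; return the original text on any failure."""
    try:
        argv = shlex.split(command)
        if not argv:
            return text
        result = subprocess.run(
            argv,
            input=text,                 # transcription goes in on stdin
            capture_output=True,
            text=True,
            timeout=timeout_ms / 1000,  # subprocess expects seconds
            check=True,                 # non-zero exit counts as a failure
        )
    except (OSError, ValueError, subprocess.SubprocessError):
        # Missing binary, bad quoting, timeout, or non-zero exit status.
        return text
    # Treat empty output as a failure too, so dictation is never lost.
    return result.stdout.strip() or text
```

Writing a cleanup script against this contract means reading everything from stdin and printing only the cleaned text to stdout.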

CLI Options

voxtype [OPTIONS] [COMMAND]

Commands:
  daemon      Run as background daemon (default)
  transcribe  Transcribe an audio file
  setup       Setup and installation utilities
  config      Show current configuration
  status      Show daemon status (for Waybar/polybar integration)
  record      Control recording from external sources (compositor keybindings, scripts)

Setup subcommands:
  voxtype setup              Run basic dependency checks (default)
  voxtype setup --download   Download the configured Whisper model
  voxtype setup systemd      Install/manage systemd user service
  voxtype setup waybar       Generate Waybar module configuration
  voxtype setup model        Interactive model selection and download
  voxtype setup gpu          Manage GPU acceleration (switch CPU/Vulkan)

Status options:
  voxtype status --format json       Output as JSON (for Waybar)
  voxtype status --follow            Continuously output on state changes
  voxtype status --extended          Include model, device, backend in JSON
  voxtype status --icon-theme THEME  Icon theme (emoji, nerd-font, material, etc.)

Record subcommands (for compositor keybindings):
  voxtype record start                     Start recording (send SIGUSR1 to daemon)
  voxtype record start --output-file PATH  Write transcription to a file
  voxtype record stop                      Stop recording and transcribe (send SIGUSR2 to daemon)
  voxtype record toggle                    Toggle recording state

Options:
  -c, --config <FILE>  Path to config file
  -v, --verbose        Increase verbosity (-v, -vv)
  -q, --quiet          Quiet mode (errors only)
  --clipboard          Force clipboard mode
  --paste              Force paste mode (clipboard + Ctrl+V)
  --model <MODEL>      Override whisper model
  --hotkey <KEY>       Override hotkey
  --toggle             Use toggle mode (press to start/stop)
  -h, --help           Print help
  -V, --version        Print version
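
The record subcommands above are described as sending SIGUSR1 and SIGUSR2 to the daemon. As a rough picture of the receiving side, here is a Python sketch of a process that maps those two signals to start and stop. The `RecordingState` class and `install_handlers` function are hypothetical, and the sketch is Unix-only; it does not reflect how Voxtype itself is written.

```python
import signal


class RecordingState:
    def __init__(self) -> None:
        self.recording = False
        self.completed = 0  # recordings that would be handed to transcription

    def start(self, signum, frame) -> None:
        self.recording = True

    def stop(self, signum, frame) -> None:
        # Ignore a stop that arrives when nothing is being recorded.
        if self.recording:
            self.recording = False
            self.completed += 1


def install_handlers(state: RecordingState) -> None:
    """Map SIGUSR1 to start and SIGUSR2 to stop, as the CLI help describes."""
    signal.signal(signal.SIGUSR1, state.start)
    signal.signal(signal.SIGUSR2, state.stop)
```

Signals carry no payload, which fits a start/stop control channel driven by compositor keybindings.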

Whisper Models

| Model | Size | English WER | Speed |
|-------|------|-------------|-------|
| tiny.en | 39 MB | ~10% | Fastest |
| base.en | 142 MB | ~8% | Fast |
| small.en | 466 MB | ~6% | Medium |
| medium.en | 1.5 GB | ~5% | Slow |
| large-v3 | 3 GB | ~4% | Slowest |
| large-v3-turbo | 1.6 GB | ~4% | Fast |

For most uses, base.en provides a good balance of speed and accuracy. If you have a GPU, large-v3-turbo offers excellent accuracy with fast inference.
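
If you would rather choose a model by download size than by feel, the table can be turned into a small lookup. The Python sketch below picks the lowest listed WER that fits a size budget and breaks ties in favor of the smaller file; the sizes and WER figures are the approximate values from the table, and `best_model_within` is a hypothetical helper that ignores speed and GPU availability.

```python
from typing import Optional

# (name, approximate size in MB, approximate English WER in %) from the table above.
MODELS = [
    ("tiny.en", 39, 10),
    ("base.en", 142, 8),
    ("small.en", 466, 6),
    ("medium.en", 1500, 5),
    ("large-v3", 3000, 4),
    ("large-v3-turbo", 1600, 4),
]


def best_model_within(budget_mb: int) -> Optional[str]:
    """Return the lowest-WER model that fits the budget, or None if none fits."""
    fitting = [m for m in MODELS if m[1] <= budget_mb]
    if not fitting:
        return None
    # Sort key: lower WER first, then smaller file size on ties.
    return min(fitting, key=lambda m: (m[2], m[1]))[0]
```

With these numbers, large-v3-turbo wins over large-v3 whenever both fit, because they share the same approximate WER and the turbo file is smaller.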

Multilingual Support

The .en models are English-only, but they are faster and more accurate for English. For other languages, use large-v3, which supports 99 languages.

Use Case 1: Transcribe in the spoken language (speak French, output French)

[whisper]
model = "large-v3"
language = "auto"     # Auto-detect and transcribe in that language
translate = false

Use Case 2: Translate to English (speak French, output English)

[whisper]
model = "large-v3"
language = "auto"     # Auto-detect the spoken language
translate = true      # Translate output to English

**Use Cas

5-Dim Analysis

Clarity: 8/10

Novelty: 7/10

Utility: 9/10

Completeness: 9/10

Maintainability: 8/10

Pros & Cons

Pros

Fully offline by default
Highly configurable
Supports multiple Linux environments
Audio feedback for user interaction

Cons

Limited to Linux platforms
Requires configuration for optimal use
Dependency on external libraries
Potential learning curve for new users


Disclaimer: This content is sourced from GitHub open source projects for display and rating purposes only.