Model Specifications

Detailed technical specifications for all transcription models supported by FloWords.


Whisper is OpenAI’s automatic speech recognition (ASR) system. FloWords runs it through whisper.cpp, a C/C++ port of Whisper that includes optimizations for Apple Silicon.

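| Model | Parameters | Size | VRAM | English WER | Speed Factor |
|---|---|---|---|---|---|
| Tiny | 39M | 75 MB | ~1 GB | 8.4% | ~32x |
| Tiny.en | 39M | 75 MB | ~1 GB | 7.5% | ~32x |
| Base | 74M | 142 MB | ~1.5 GB | 5.0% | ~16x |
| Base.en | 74M | 142 MB | ~1.5 GB | 4.3% | ~16x |
| Small | 244M | 466 MB | ~2 GB | 3.4% | ~6x |
| Small.en | 244M | 466 MB | ~2 GB | 3.0% | ~6x |
| Medium | 769M | 1.5 GB | ~5 GB | 2.5% | ~2x |
| Medium.en | 769M | 1.5 GB | ~5 GB | 2.1% | ~2x |
| Large-v3 | 1550M | 3 GB | ~10 GB | 2.0% | ~1x |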
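
The English WER column reports word error rate: the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of reference words. The sketch below shows that calculation. The function name `word_error_rate` and the plain lowercase, whitespace-split tokenization are illustrative assumptions; they are not necessarily how the figures in the table were normalized.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the quick brown fox", "the quick brown box"))
```

By this measure, a WER of 5.0% corresponds to about one word error per 20 reference words.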

Name: Whisper Tiny
Parameters: 39 million
File Size: 75 MB
Memory Usage: ~1 GB
Layers: 4 encoder, 4 decoder
Dimension: 384
Heads: 6

Best for:

  • Quick notes
  • Low-memory systems
  • Fastest transcription

Trade-offs:

  • Lower accuracy
  • May struggle with accents
  • Limited noise handling

Name: Whisper Base
Parameters: 74 million
File Size: 142 MB
Memory Usage: ~1.5 GB
Layers: 6 encoder, 6 decoder
Dimension: 512
Heads: 8

Best for:

  • General daily use
  • Good balance of speed and accuracy
  • Most Mac configurations

Trade-offs:

  • Moderate accuracy
  • Some errors with technical terms

Name: Whisper Small
Parameters: 244 million
File Size: 466 MB
Memory Usage: ~2 GB
Layers: 12 encoder, 12 decoder
Dimension: 768
Heads: 12

Best for:

  • Better accuracy needs
  • Professional work
  • When speed isn’t critical

Trade-offs:

  • Slower than Tiny/Base
  • Higher memory requirements

Name: Whisper Medium
Parameters: 769 million
File Size: 1.5 GB
Memory Usage: ~5 GB
Layers: 24 encoder, 24 decoder
Dimension: 1024
Heads: 16

Best for:

  • Professional transcription
  • Challenging audio conditions
  • Accented speech

Trade-offs:

  • Significant memory usage
  • Slower processing
  • Requires 8GB+ RAM

Name: Whisper Large v3
Parameters: 1550 million
File Size: 3 GB
Memory Usage: ~10 GB
Layers: 32 encoder, 32 decoder
Dimension: 1280
Heads: 20

Best for:

  • Maximum accuracy
  • Difficult audio
  • Professional production

Trade-offs:

  • Very high memory usage
  • Slowest processing
  • Requires 16GB+ RAM
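
One pattern in the architecture figures above is easy to check: dividing each model's dimension by its head count gives the width of a single attention head. The short sketch below uses only the numbers listed in the profiles; the names `ARCHITECTURES` and `head_dimension` are illustrative.

```python
# (layers per encoder/decoder stack, model dimension, attention heads),
# copied from the model profiles above.
ARCHITECTURES = {
    "tiny": (4, 384, 6),
    "base": (6, 512, 8),
    "small": (12, 768, 12),
    "medium": (24, 1024, 16),
    "large-v3": (32, 1280, 20),
}


def head_dimension(model: str) -> int:
    """Width of one attention head: model dimension divided by head count."""
    _, dimension, heads = ARCHITECTURES[model]
    if dimension % heads:
        raise ValueError(f"{model}: dimension is not divisible by head count")
    return dimension // heads


for name in ARCHITECTURES:
    print(f"{name}: {head_dimension(name)} per head")
```

Every model works out to 64 per head, so the larger models grow by adding heads and layers rather than by widening each head.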

Models ending in .en are optimized for English only:

| Model | Multilingual | English-Only |
|---|---|---|
| Tiny | tiny | tiny.en |
| Base | base | base.en |
| Small | small | small.en |
| Medium | medium | medium.en |
| Large | large-v3 | (No .en variant) |

  • Faster processing - No language detection
  • Slightly better accuracy - Optimized for English
  • Lower resource usage - Smaller effective vocabulary
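
The naming rule in the table above can be sketched as a small lookup. `resolve_model_name` is a hypothetical helper written for illustration, not a FloWords API; it falls back to the multilingual name where no `.en` variant is listed.

```python
MULTILINGUAL_MODELS = ("tiny", "base", "small", "medium", "large-v3")
# large-v3 has no English-only variant, per the table above.
ENGLISH_ONLY_SIZES = ("tiny", "base", "small", "medium")


def resolve_model_name(size: str, english_only: bool) -> str:
    """Return the model identifier, adding `.en` when such a variant exists."""
    if size not in MULTILINGUAL_MODELS:
        raise ValueError(f"unknown model size: {size}")
    if english_only and size in ENGLISH_ONLY_SIZES:
        return f"{size}.en"
    return size


print(resolve_model_name("base", english_only=True))      # base.en
print(resolve_model_name("large-v3", english_only=True))  # large-v3
```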

FloWords runs NVIDIA’s Parakeet models through the FluidAudio framework.

Name: Parakeet RNNT
Architecture: RNN-Transducer
Focus: Real-time English ASR
Latency: Very low
Streaming: Yes

Best for:

  • Real-time transcription
  • Live dictation
  • English content

| Aspect | Whisper | Parakeet |
|---|---|---|
| Languages | 99+ | English focus |
| Accuracy | Higher | Good |
| Latency | Higher | Lower |
| Streaming | Limited | Native |
| Memory | Higher | Lower |

| Model | Recommended | Performance |
|---|---|---|
| Tiny | ✓ Excellent | ~30x real-time |
| Base | ✓ Excellent | ~15x real-time |
| Small | ✓ Good | ~5x real-time |
| Medium | ⚠️ Usable | ~2x real-time |
| Large | ⚠️ Slow | ~0.5x real-time |

| Model | 8GB RAM | 16GB RAM | 32GB+ RAM |
|---|---|---|---|
| Tiny | ✓ Good | ✓ Good | ✓ Good |
| Base | ✓ Usable | ✓ Good | ✓ Good |
| Small | ⚠️ Slow | ✓ Usable | ✓ Good |
| Medium | ❌ Not recommended | ⚠️ Slow | ✓ Usable |
| Large | ❌ Not recommended | ❌ Not recommended | ⚠️ Slow |
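
A real-time factor of ~30x means one minute of audio takes roughly two seconds to process, while ~0.5x means processing takes twice as long as the recording. The sketch below does that arithmetic with the approximate figures from the first table above. `estimate_processing_seconds` is a hypothetical helper, and actual times will vary with hardware and audio content.

```python
# Approximate real-time factors from the performance table above.
REALTIME_FACTOR = {
    "tiny": 30.0,
    "base": 15.0,
    "small": 5.0,
    "medium": 2.0,
    "large": 0.5,
}


def estimate_processing_seconds(model: str, audio_seconds: float) -> float:
    """Rough processing time: audio duration divided by the real-time factor."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds / REALTIME_FACTOR[model]


for model in REALTIME_FACTOR:
    seconds = estimate_processing_seconds(model, 60)
    print(f"{model}: about {seconds:.0f} s per minute of audio")
```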

English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Japanese, Chinese, Korean

Arabic, Czech, Danish, Finnish, Greek, Hebrew, Hindi, Hungarian, Indonesian, Norwegian, Polish, Romanian, Swedish, Thai, Turkish, Ukrainian, Vietnamese

All other 70+ languages supported by Whisper


| Specification | Value |
|---|---|
| Sample Rate | 16000 Hz |
| Bit Depth | 16-bit |
| Channels | Mono |
| Format | PCM |

FloWords automatically converts audio to these specifications.
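
To see whether a WAV file already meets these specifications before import, something like the following sketch could be used. It relies only on Python's standard `wave` module, which reads uncompressed PCM WAV files; the function names are illustrative, and this is not a description of how FloWords performs its own conversion.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # Hz
TARGET_SAMPLE_WIDTH = 2      # bytes per sample (16-bit)
TARGET_CHANNELS = 1          # mono


def matches_target_format(path: str) -> bool:
    """Return True if the PCM WAV file at `path` already meets the spec."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == TARGET_SAMPLE_RATE
            and wav.getsampwidth() == TARGET_SAMPLE_WIDTH
            and wav.getnchannels() == TARGET_CHANNELS
        )


def bytes_per_second() -> int:
    """Uncompressed data rate implied by the target format."""
    return TARGET_SAMPLE_RATE * TARGET_SAMPLE_WIDTH * TARGET_CHANNELS
```

At these settings uncompressed audio occupies 32,000 bytes per second, or about 1.9 MB per minute.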

| Format | Extension | Notes |
|---|---|---|
| WAV | .wav | Native support |
| MP3 | .mp3 | Converted to WAV |
| M4A | .m4a | Converted to WAV |
| AAC | .aac | Converted to WAV |
| FLAC | .flac | Converted to WAV |
| AIFF | .aiff | Converted to WAV |
| CAF | .caf | Converted to WAV |
| MP4 | .mp4 | Audio extracted |
| MOV | .mov | Audio extracted |
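
The table above amounts to a three-way classification by file extension. A minimal sketch of that lookup follows; `input_handling` and the returned labels are hypothetical, chosen only to mirror the Notes column.

```python
from pathlib import Path

NATIVE = {".wav"}
CONVERTED = {".mp3", ".m4a", ".aac", ".flac", ".aiff", ".caf"}
EXTRACTED = {".mp4", ".mov"}


def input_handling(filename: str) -> str:
    """Classify a file by extension, following the supported-formats table."""
    ext = Path(filename).suffix.lower()
    if ext in NATIVE:
        return "native"
    if ext in CONVERTED:
        return "convert to WAV"
    if ext in EXTRACTED:
        return "extract audio"
    raise ValueError(f"unsupported format: {ext or filename}")


print(input_handling("Interview.MP3"))  # convert to WAV
```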

| Use Case | Recommended Model |
|---|---|
| Quick notes | Tiny |
| Daily use | Base |
| Documents | Small |
| Professional | Medium |
| Maximum accuracy | Large-v3 |

| RAM | Maximum Model |
|---|---|
| 4 GB | Base |
| 8 GB | Small |
| 16 GB | Medium |
| 32 GB+ | Large-v3 |

| Content | Recommended |
|---|---|
| Clear speech | Any |
| Background noise | Medium+ |
| Technical terms | Medium+ with dictionary |
| Multiple speakers | Medium+ |
| Accented speech | Medium+ |
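
The use-case and RAM tables above can be combined: start from the model recommended for the use case, then cap it at the largest model the available RAM allows. The sketch below is one way to express that. `recommend_model` and `max_model_for_ram` are hypothetical helpers, and raising an error below 4 GB is an assumption, since the table lists no model for that case.

```python
MODEL_ORDER = ["tiny", "base", "small", "medium", "large-v3"]

USE_CASE_MODEL = {
    "quick notes": "tiny",
    "daily use": "base",
    "documents": "small",
    "professional": "medium",
    "maximum accuracy": "large-v3",
}

# (minimum RAM in GB, largest model), from the RAM table above.
RAM_LIMITS = [(32, "large-v3"), (16, "medium"), (8, "small"), (4, "base")]


def max_model_for_ram(ram_gb: float) -> str:
    """Largest model the RAM table allows for this amount of memory."""
    for minimum, model in RAM_LIMITS:
        if ram_gb >= minimum:
            return model
    raise ValueError("the RAM table lists no model below 4 GB")


def recommend_model(use_case: str, ram_gb: float) -> str:
    """Use-case recommendation, capped by what the RAM table permits."""
    wanted = USE_CASE_MODEL[use_case]
    ceiling = max_model_for_ram(ram_gb)
    # Pick whichever of the two comes first in size order.
    return min(wanted, ceiling, key=MODEL_ORDER.index)


print(recommend_model("professional", ram_gb=8))  # small
```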

| Parameter | Range | Default | Effect |
|---|---|---|---|
| beam_size | 1-10 | 5 | Accuracy vs speed |
| best_of | 1-5 | 1 | Candidates considered |
| temperature | 0.0-1.0 | 0.0 | Prediction randomness |
| patience | 0.0-2.0 | 1.0 | Early stopping |
| length_penalty | 0.0-2.0 | 1.0 | Length bias |

| Goal | Adjustment |
|---|---|
| Faster | Lower beam_size |
| More accurate | Higher beam_size, best_of |
| More variety | Higher temperature |
| Shorter outputs | Lower length_penalty |
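
For scripts that keep these settings alongside other configuration, the ranges and defaults above could be captured in a small validated structure. This is a sketch only: `DecodingOptions` is a hypothetical class, and this page does not describe how FloWords exposes these parameters programmatically.

```python
from dataclasses import dataclass

# (minimum, maximum) for each parameter, from the table above.
RANGES = {
    "beam_size": (1, 10),
    "best_of": (1, 5),
    "temperature": (0.0, 1.0),
    "patience": (0.0, 2.0),
    "length_penalty": (0.0, 2.0),
}


@dataclass(frozen=True)
class DecodingOptions:
    """Decoding settings with the documented defaults and range checks."""
    beam_size: int = 5
    best_of: int = 1
    temperature: float = 0.0
    patience: float = 1.0
    length_penalty: float = 1.0

    def __post_init__(self) -> None:
        for name, (low, high) in RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside {low}-{high}")


# Presets following the goal/adjustment table above.
FASTER = DecodingOptions(beam_size=1)
MORE_ACCURATE = DecodingOptions(beam_size=10, best_of=5)
```

Constructing `DecodingOptions(beam_size=11)` raises `ValueError`, which catches out-of-range values before they reach the transcription step.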

FloWords checks for model updates automatically. To manually check:

  1. Open Settings > Model
  2. Click Check for Updates

Model updates may include:

  • Accuracy improvements
  • New language support
  • Performance optimizations
  • Bug fixes