AI Term · circa 2016 · Added May 29, 2026
WordPiece Tokenization
WordPiece tokenization splits words into subword units for more efficient text processing in NLP tasks.
Originally developed for speech recognition, WordPiece tokenization breaks words into smaller subword units, down to single characters when necessary. This keeps the vocabulary at a fixed, manageable size while still representing words that never appeared in the training data, which a word-level vocabulary would treat as out-of-vocabulary (OOV). Models like BERT can therefore build a representation of an unseen term from the pieces it shares with known words, and can be trained with a vocabulary of limited size. The method is particularly useful for morphologically rich languages, where new word forms appear frequently across corpora.
Examples
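- In BERT preprocessing, "unbelievably" might be split into ['un', '##believ', '##ably'], where the "##" prefix marks a piece that continues the word rather than starting it.
- A low-frequency term like 'archaeopteryx' might be split into ['arch', '##aeo', '##pte', '##ryx'].
- Long German compound words can be broken into smaller known pieces, so the compound itself does not need its own vocabulary entry.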
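The splits above can be reproduced with a minimal sketch of greedy longest-match-first lookup, the strategy BERT's WordPiece tokenizer applies to each word at inference time. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made up for illustration; a real vocabulary is learned from a corpus and holds tens of thousands of entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark word-internal pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, invented for this example only.
toy_vocab = {
    "un", "believ", "##believ", "##ably", "##able",
    "arch", "##aeo", "##pte", "##ryx",
}

print(wordpiece_tokenize("unbelievably", toy_vocab))
print(wordpiece_tokenize("archaeopteryx", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With this toy vocabulary the first two calls yield the splits listed above, and "xyz" falls back to the unknown token because no piece matches its first character. A different vocabulary would give different splits for the same word.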
Common misconceptions
- "It's only used for English." In fact, many languages benefit from subword tokenization methods.
- "Subwords always correspond directly to syllables." Not necessarily: the splits are data-driven, learned from corpus statistics rather than from linguistic rules.