AI Term · circa 2016 · Added May 29, 2026

WordPiece Tokenization

WordPiece tokenization splits words into subword units for more efficient text processing in NLP tasks.

Originally developed for speech recognition, WordPiece tokenization breaks words into smaller subword units, falling back to single characters when necessary. This balances vocabulary size against the handling of out-of-vocabulary (OOV) words: frequent words stay whole, while rare words are composed of pieces that are in the vocabulary. Models like BERT can therefore represent unseen terms through their word pieces while training with a fixed, limited vocabulary. The method is particularly effective for morphologically rich languages, where new word forms appear frequently across corpora.

Examples

In BERT preprocessing, "unbelievably" might be split into ['un', '##believ', '##ably']; the '##' prefix marks a piece that continues a word.
A low-frequency term like 'archaeopteryx' might be handled as ['arch', '##aeo', '##pte', '##ryx'].
Long German compound words can be broken into smaller, reusable pieces instead of being treated as unknown words.
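
Splits like the ones listed above can be reproduced with a greedy longest-match-first lookup, which is how BERT-style tokenizers are usually described as applying a trained vocabulary. The following is a minimal sketch, not a library implementation: the function name `wordpiece_tokenize` and the tiny `toy_vocab` are illustrative assumptions, and a real vocabulary holds tens of thousands of entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word (illustrative sketch)."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


# Hypothetical toy vocabulary, chosen only to match the examples above.
toy_vocab = {"un", "##believ", "##ably", "##able",
             "arch", "##aeo", "##pte", "##ryx"}

print(wordpiece_tokenize("unbelievably", toy_vocab))
print(wordpiece_tokenize("archaeopteryx", toy_vocab))
print(wordpiece_tokenize("zebra", toy_vocab))
```

With this toy vocabulary, a word that has no matching piece at some position is mapped to a single unknown token rather than being partially split.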

Common misconceptions

It's only used for English: in fact, many languages benefit from subword tokenization methods.
Subwords always correspond directly to syllables: not necessarily, because the pieces are learned from data rather than from linguistic rules.
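
To illustrate what "data-driven" means here, the sketch below scores adjacent symbol pairs with the rule commonly attributed to WordPiece training: the pair's frequency divided by the product of its parts' frequencies. This is an assumption based on public descriptions, and production training procedures may differ. The function name `pair_scores` and the word counts are made up for illustration.

```python
from collections import Counter


def pair_scores(word_freqs):
    """Score adjacent symbol pairs as freq(pair) / (freq(a) * freq(b))."""
    symbol_freq = Counter()
    pair_freq = Counter()
    for word, freq in word_freqs.items():
        if not word:
            continue
        # Start from characters; non-initial characters carry the '##' prefix.
        symbols = [word[0]] + ["##" + ch for ch in word[1:]]
        for s in symbols:
            symbol_freq[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += freq
    return {
        (a, b): f / (symbol_freq[a] * symbol_freq[b])
        for (a, b), f in pair_freq.items()
    }


# Hypothetical toy corpus: word -> count.
toy_corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

scores = pair_scores(toy_corpus)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```

In this toy corpus the highest-scoring pair is ('##g', '##s'), even though ('##u', '##g') occurs more often. The score favors pieces that appear together more than their individual frequencies would suggest, which is why the resulting subwords need not line up with syllables.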
