Natural Language Processing (NLP) is a field of computer science and artificial intelligence that enables computers to understand, interpret, and generate natural language, facilitating seamless communication between humans and machines. Its evolution spans statistical models, deep learning models, pre-trained models, and ultimately large language models (LLMs) as the culmination of progress. Here I try to mark some important technology evolutions about NLP and show the history of LLMs. I will also provide some important articles about the technique details. If you are interested in the the whole history of AI, I recommend Annotated History of Modern AI and Deep Learning.
Statistical Models
The n-gram model (Class-Based n-gram Models of Natural Language) predicted the next word or generated text by analyzing the frequency of consecutive n words in a corpus, capturing local contextual features.
Deep Learning Models
- Feedforward neural networks were initially applied to language modeling but were gradually replaced by more advanced RNNs (recurrent neural networks; A Critical Review of Recurrent Neural Networks for Sequence Learning), which struggled with long-term dependencies due to the vanishing/exploding gradient problem.
- LSTM (Long Short-Term Memory; Long Short-term Memory RNN) introduced gating mechanisms to retain long-range dependencies, overcoming gradient issues and improving contextual understanding.
- In 2013, word2vec (Efficient Estimation of Word Representations in Vector Space) pioneered word embeddings by mapping words to low-dimensional vectors, enabling computational semantics and advancing language understanding.
Pre-trained Models
- Breakthroughs came with Transformer (Attention Is All You Need), which replaced RNNs, enhanced parallel processing, and improved contextual awareness. Architectures like BERT (BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding) and GPT leveraged Transformer for pre-training on large corpora, achieving state-of-the-art (SOTA) performance on tasks like text classification, entity recognition, and sentiment analysis.
- ELMo (Deep contextualized word representations) and BERT dominated NLP benchmarks, with BERT becoming the go-to model for over a dozen tasks before the rise of LLMs.
Large Language Models (LLMs)
- GPT (Language Models are Unsupervised Multitask Learners), based on the Transformer-decoder architecture (a "sibling" to BERT), can address most NLP problems through conversational interfaces. With multimodal capabilities, LLMs even extend to computer vision (CV) tasks.
- While powerful, LLMs are not universally deployed for all NLP tasks. Smaller models remain effective for niche applications and can assist LLMs. For example:
- n-gram aids in LLM corpus deduplication.
- word2vec and BERT assist in LLM corpus quality filtering.
 
- Llama
- Qwen
- GPT-4
- DeepSeek
- Gemini


