NLP: The Subtle Orchestra of Language
Series of Articles on AI
This is the second article in a series of four:
- LLMs: understanding what they are and how they work.
- NLP: a deep dive into the fundamental building blocks of natural language processing (this article).
- AI Agents: discovering autonomous artificial intelligences.
- Comparison and AI Smarttalk’s positioning: synthesis and perspective.
If language were a symphony, its score would be infinitely complex—sometimes grand, sometimes intimate—driven by the diversity of languages, contexts, and cultural nuances. At the heart of this symphony lies a subtle yet crucial orchestra: NLP (Natural Language Processing), which orchestrates words and meaning in the world of AI.
In the first article, we likened LLMs (Large Language Models) to enormous swarms of bees producing textual honey. Here, we return to the more fundamental, often less visible building blocks that underpin how text is understood and generated in AI. This exploration will help you grasp:
- The historical roots of NLP
- The main methods and techniques (statistical, symbolic, neural)
- The key stages of an NLP pipeline (tokenization, stemming, lemmatization, etc.)
- The varied applications (semantic analysis, translation, automatic summarization...)
- The ethical, cultural, and technological challenges
- How classical NLP coexists with LLMs and what differentiates one from the other
We’ll see that NLP can be viewed as a set of musicians each playing a part: tokenization is the subtle flute, morphological analysis the thoughtful clarinet, dependency parsing the cello grounding the melody, and so on. From this harmony emerges a comprehension (or at least a manipulation) of natural language.
Ready to tune your instruments? Let’s dive into NLP, that subtle orchestra conductor of language.
1. Definition and History: When Language Became (Also) a Matter for Machines
1.1. Early Steps: Computational Linguistics and Symbolic Approaches
NLP dates back several decades, long before the advent of powerful LLMs. As early as the 1950s and ’60s, researchers wondered how to make machines process language. The first approaches were mostly symbolic: researchers tried to hand-code grammatical rules, word lists, ontologies (representations of world concepts), and so on.
These so-called “knowledge-based” methods rely on the assumption that if you provide enough linguistic rules, the system can analyze and generate text accurately. Unfortunately, human language is so complex that it’s nearly impossible to codify every linguistic nuance in fixed rules.
Example of Linguistic Complexity
In French, the rules of gender for nouns have countless exceptions (e.g., “le poêle” vs. “la poêle,” “le mousse” vs. “la mousse,” etc.). Every rule can spawn new counterexamples, and the list of special cases keeps growing.
1.2. The Statistical Era: When Numbers Were Allowed to Speak
As computing power progressed, statistical approaches to NLP arose: instead of manually coding rules, the machine infers patterns from annotated data.
For example, you can assemble a corpus of translated texts and learn a probabilistic model that estimates the likelihood that a word in the source language corresponds to a word (or group of words) in the target language. This is how, in the early 2000s, statistical machine translation (as in early Google Translate) took off, relying primarily on methods such as Hidden Markov Models and phrase-based alignment.
Gradually, simple count-based representations (word occurrences, n-grams, TF-IDF, etc.) proved highly effective for classification and keyword-detection tasks. Researchers discovered that language largely follows statistical patterns, although these are far from explaining everything.
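To make this concrete, here is a minimal, pure-Python sketch of the TF-IDF idea on a toy three-document corpus (invented for illustration); real systems rely on optimized libraries and more careful weighting and smoothing.

```python
import math

# Toy corpus: three tiny "documents" (invented for illustration)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are animals",
]
tokenized = [d.split() for d in docs]

def tf_idf(term, doc_tokens, all_docs):
    # Term frequency: how often the term appears in this document
    tf = doc_tokens.count(term)
    # Document frequency: how many documents contain the term
    df = sum(1 for d in all_docs if term in d)
    # Inverse document frequency (smoothed to avoid division by zero)
    idf = math.log(len(all_docs) / (1 + df))
    return tf * idf

# "the" appears almost everywhere, so it scores (close to) zero;
# "cat" is more distinctive and gets a higher weight.
print(tf_idf("the", tokenized[0], tokenized))
print(tf_idf("cat", tokenized[0], tokenized))
```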
1.3. The Age of Neural Networks: RNN, LSTM, and Transformers
The 2010s brought large-scale neural models, starting with RNNs (Recurrent Neural Networks), LSTMs (Long Short-Term Memory), and GRUs (Gated Recurrent Units). These architectures enabled better handling of word order and context in a sentence compared to purely statistical approaches.
Then in 2017, the paper “Attention Is All You Need” introduced the Transformer architecture, sparking the wave that led to LLMs (GPT, BERT, etc.). Yet even with this spectacular advance, the fundamental building blocks of NLP still matter: we still talk about tokenization, lemmatization, syntactic analysis, and so on, even if they are sometimes integrated implicitly into these large models.
2. Key Stages of an NLP Pipeline: The Orchestra in Action
To better understand the richness of NLP, let’s imagine a classic pipeline where text passes through different stages (different “musicians”):
2.1. Tokenization: The Flute That Provides the Basic Notes
Tokenization breaks down text into elementary units known as tokens. In languages like French, this often aligns with words separated by spaces or punctuation, though it’s not always straightforward (contractions, embedded punctuation, etc.).
It’s the indispensable first step of any NLP pipeline, because the machine doesn’t “understand” raw character strings. Proper tokenization makes it easier to work with these units of meaning.
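As an illustration, here is a deliberately naive regex tokenizer in Python; production tokenizers (spaCy, NLTK, or the subword tokenizers used by LLMs) handle far more edge cases.

```python
import re

text = "L'orchestre joue, ce soir, à 20h30 !"

# Naive rule: a token is either a run of word characters (letters, digits,
# accented characters) or a single punctuation mark. Contractions, URLs,
# emojis, multi-word expressions, etc. would all need extra handling.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['L', "'", 'orchestre', 'joue', ',', 'ce', 'soir', ',', 'à', '20h30', '!']
```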
2.2. Normalization and Noise Removal
Once you’ve split the text, you can normalize it (e.g., convert to lowercase) and remove unnecessary punctuation or stop words (function words like “the,” “and,” “of,” which don’t always carry meaning).
It’s also at this stage that you address linguistic specifics: handling accents in French, character segmentation in Chinese, and so on. This phase is somewhat like a clarinet clarifying the melody by filtering out extra noise.
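A minimal sketch of this cleaning step, assuming a tiny hand-picked stop-word list (real pipelines use curated lists from NLTK, spaCy, and similar libraries):

```python
# Output of the naive tokenizer from the previous step
tokens = ["L", "'", "orchestre", "joue", ",", "ce", "soir", ",", "à", "20h30", "!"]

# Tiny, hand-picked French stop-word list (illustrative only)
STOP_WORDS = {"le", "la", "les", "l", "de", "des", "et", "à", "ce", "un", "une"}

normalized = [
    t.lower()                        # lowercase everything
    for t in tokens
    if t.isalpha()                   # drop punctuation and number-like tokens
    and t.lower() not in STOP_WORDS  # drop stop words
]
print(normalized)  # ['orchestre', 'joue', 'soir']
```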
2.3. Stemming vs. Lemmatization: The Viola and Violin of Morphological Analysis
- Stemming: trims words down to a root, or “stem,” by stripping suffixes. For example, “manger,” “manges,” and “mangeons” might all become “mang.” It’s fast but imprecise, since the stem isn’t always a valid word.
- Lemmatization: identifies the canonical form of a word (its lemma), such as “manger” (to eat). It’s more accurate but requires a richer lexicon or linguistic rules.
Both methods help reduce lexical variability and group words sharing the same semantic root. It’s akin to the viola and violin tuning their notes to create a harmonious ensemble.
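A quick sketch comparing the two, assuming NLTK’s French Snowball stemmer and spaCy with the fr_core_news_sm model are installed (common choices, but not the only ones):

```python
from nltk.stem.snowball import SnowballStemmer
import spacy

# Stemming: crude suffix stripping (the result is not always a real word)
stemmer = SnowballStemmer("french")
for word in ["manger", "manges", "mangeons"]:
    # typically collapses these forms to a short stem such as "mang"
    print(word, "->", stemmer.stem(word))

# Lemmatization: maps each word back to its dictionary form (lemma)
# Requires the model: python -m spacy download fr_core_news_sm
nlp = spacy.load("fr_core_news_sm")
doc = nlp("Nous mangeons une pomme")
print([(token.text, token.lemma_) for token in doc])  # e.g. "mangeons" -> "manger"
```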
2.4. Syntactic Analysis (Parsing), Part-of-Speech Tagging (POS Tagging)
Syntactic analysis identifies a sentence’s structure: which word is the subject, which is the verb, which is the object, which clauses play an adverbial role, and so on. Often referred to as “parsing,” it can be done with dependency grammars or constituency trees.
POS tagging assigns each token a grammatical category (noun, verb, adjective, etc.). It’s crucial for deeper understanding: knowing whether “bank” is a noun (a financial institution, a riverbank) or a verb, for instance, changes how the sentence is interpreted.
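Here is a short sketch with spaCy (one possible toolkit among others), assuming the fr_core_news_sm French model is installed:

```python
import spacy

# Requires: python -m spacy download fr_core_news_sm
nlp = spacy.load("fr_core_news_sm")
doc = nlp("Le chef d'orchestre dirige les musiciens.")

for token in doc:
    # token.pos_  : part-of-speech tag (DET, NOUN, VERB, ...)
    # token.dep_  : dependency relation to its head (nsubj, obj, det, ...)
    # token.head  : the token this word syntactically depends on
    print(f"{token.text:<12} {token.pos_:<6} {token.dep_:<10} head={token.head.text}")
```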
2.5. Semantic Analysis, Named Entity Recognition
Semantic analysis aims to grasp the meaning of words and sentences. This can include sentiment analysis (“Is the text positive, negative, or neutral?”), named entity recognition (people, places, organizations), coreference resolution (knowing which pronoun refers to which noun), and more.
Here the orchestra truly starts to play in harmony: each instrument (step) offers clues about what the text “means” and how its elements connect.
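For example, named entity recognition with spaCy might look like this (again assuming the fr_core_news_sm model; the exact entities returned depend on the model):

```python
import spacy

nlp = spacy.load("fr_core_news_sm")
doc = nlp("Hector Berlioz a composé la Symphonie fantastique à Paris en 1830.")

for ent in doc.ents:
    # ent.label_ is the predicted entity type (PER, LOC, ORG, ...)
    print(ent.text, "->", ent.label_)
```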
2.6. Final Output: Classification, Summarization, Translation, Generation
Finally, depending on the task, there can be a variety of final outputs: a label (spam/not spam), a translation, a summary, etc. Each context corresponds to a different “piece,” performed by the NLP orchestra.
Of course, in modern LLMs, many of these steps are integrated or implicitly “learned.” But in practice, for targeted applications, these building blocks are often still assembled explicitly, module by module.
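As a small end-to-end illustration, here is a sketch of a spam classifier built from classic modular pieces (TF-IDF features plus a Naive Bayes classifier, via scikit-learn), trained on an invented four-example dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set, for illustration only
texts = [
    "Win a free prize now",
    "Cheap pills, click here",
    "Meeting rescheduled to Monday",
    "Here are the minutes from today's call",
]
labels = ["spam", "spam", "ham", "ham"]

# Vectorize the text with TF-IDF, then fit a Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["Free prize if you click now"]))  # most likely ['spam']
```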
3. Main NLP Methods: Symbolic, Statistical, and Neural Scores
3.1. Symbolic Approaches
Based on explicit rules, these approaches attempt to model grammar, semantics, and vocabulary. The upside: they can be highly accurate in a narrow domain (e.g., legal contexts with domain-specific hand-coded rules). The downside: they require heavy human effort (from linguists and engineers) and do not generalize well.
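As a caricature of the symbolic approach, here is a sketch with a few hand-written regular-expression rules mapping customer messages to intents (real knowledge-based systems rely on much richer grammars and ontologies):

```python
import re

# Hand-written rules: pattern -> intent (illustrative, far from exhaustive)
RULES = [
    (re.compile(r"\b(résilier|annuler)\b.*\bcontrat\b", re.I), "contract_cancellation"),
    (re.compile(r"\b(facture|paiement)\b", re.I), "billing_question"),
]

def classify(utterance: str) -> str:
    # Return the intent of the first matching rule, or "unknown"
    for pattern, intent in RULES:
        if pattern.search(utterance):
            return intent
    return "unknown"

print(classify("Je veux résilier mon contrat"))  # contract_cancellation
print(classify("Où est ma facture ?"))           # billing_question
print(classify("Bonjour, comment ça va ?"))      # unknown
```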
3.2. Statistical Approaches
Here, we estimate probabilities from annotated corpora: for example, the probability that one word follows another, or that a sequence of words belongs to a given category. Classic examples include n-gram models, HMMs (Hidden Markov Models), and CRFs (Conditional Random Fields).
These approaches dominated NLP from the 1990s through the 2010s, enabling systems like statistical machine translation and large-scale named entity recognition. They can require substantial amounts of data, but generally are less resource-intensive than the most recent neural methods.
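For intuition, here is a minimal bigram language model estimated by simple counting on a toy corpus (real systems add smoothing and are trained on vastly more data):

```python
from collections import Counter, defaultdict

# Toy corpus (invented); real models are estimated on millions of sentences
corpus = [
    "le chat dort",
    "le chat mange",
    "le chien dort",
]

bigram_counts = defaultdict(Counter)
context_counts = Counter()

for sentence in corpus:
    tokens = ["<s>"] + sentence.split()  # <s> marks the start of the sentence
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[prev][cur] += 1
        context_counts[prev] += 1

def prob(cur, prev):
    """Maximum-likelihood estimate of P(cur | prev)."""
    return bigram_counts[prev][cur] / context_counts[prev]

print(prob("chat", "le"))    # 2/3: "chat" follows "le" in two of three sentences
print(prob("dort", "chat"))  # 1/2
```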