Full Text Search & Elasticsearch β€” Lecture Notes


The LIKE Query Approach

SELECT * FROM products
WHERE name ILIKE '%laptop%'
   OR description ILIKE '%laptop%';

This works fine at small scale. At millions of rows, it breaks.

Why It Fails

Performance: The database performs a full sequential scan β€” it reads every row, examines every text field, and does character-by-character pattern matching. No index can help with a leading wildcard (%laptop%).

ScaleLatency
5,000 products~50ms
Millions of products30+ seconds

No relevance: Results come back in arbitrary order. A product where β€œlaptop” appears once in a buried description ranks the same as a MacBook Pro with β€œlaptop” in the title, name, and specs.

No typo tolerance: labtop β†’ zero results. The query is exact character matching only.

The Librarian Analogy

Think of a relational database as a librarian with a fatal flaw: to find books about β€œmachine learning,” they go shelf by shelf, book by book, checking every title and every page. Thorough, but painfully slow. And once they find matches, they have no way to rank which result is more relevant β€” they return them in whatever order they were found.


2. The Solution: Inverted Index

The Key Insight

Instead of searching through documents to find terms, flip the problem: while storing documents, extract all terms and build an index that maps terms β†’ documents.

How it Works

At storage time (not query time), every word in every document is extracted and indexed:

Term: "machine"
  β†’ Introduction to Machine Learning  (pages 1, 15, 23)
  β†’ The Machine Age                   (pages 5, 89)
  β†’ Coffee Machine Manual             (page 1)

Term: "learning"
  β†’ Introduction to Machine Learning  (pages 1, 16, 24)
  β†’ Learning to Cook                  (pages 3, 45)
  β†’ Deep Learning Fundamentals        (pages 2, 8, 31)

At query time for β€œmachine learning”:

  1. Look up β€œmachine” in the index β†’ get document list instantly
  2. Look up β€œlearning” in the index β†’ get document list instantly
  3. Find intersection / union β†’ score by relevance
  4. Return ranked results

No scanning. No waiting. Just lookups.

This technique β€” inverted index β€” is the core idea that powers Elasticsearch, Apache Lucene, and even PostgreSQL’s built-in full text search.


3. Elasticsearch

Background

Elasticsearch is built on top of Apache Lucene β€” the open-source library that implements the inverted index and relevance scoring. Lucene is the engine; Elasticsearch wraps it with a distributed, scalable, REST API-based interface.

Other tools (including PostgreSQL’s tsvector/tsquery full text search) also use similar inverted index concepts.

Data Model

In Elasticsearch, the equivalent of a database row is called a document (JSON format, similar to MongoDB). A collection of documents is called an index.

{
  "id": "1",
  "review": "Great MacBook Pro, perfect for development",
  "sentiment": "positive"
}

Field types matter:

  • text β€” analyzed; broken into terms, good for full text search
  • keyword β€” not analyzed; exact match only (useful for filtering, e.g., sentiment = "positive")

4. Relevance Scoring β€” BM25

Elasticsearch uses the BM25 algorithm to score how relevant a document is to a query. You don’t need to understand the math β€” just understand the factors that influence the score:

FactorDescriptionEffect on Score
Term FrequencyHow often the search term appears in this documentMore occurrences β†’ higher score
Document FrequencyHow common the term is across all documentsRare terms β†’ higher score (more distinctive)
Document LengthLength of the documentShort doc with many matches β†’ higher score than long doc with few
Field BoostingWhich field the term appears inTitle > Description > Content (configurable)

Field Boosting Example

Query: "machine learning"

Book: "Introduction to Machine Learning"
  β†’ Term in title β†’ big relevance boost
  β†’ Term appears many times β†’ additional boost
  β†’ Score: HIGH β†’ ranked #1

Book: "The Machine Age"
  β†’ Term in title β†’ boost
  β†’ Term appears less frequently
  β†’ Score: MEDIUM β†’ ranked #2

Book: "Some Random Book" (mentions "machine" once on last page)
  β†’ Term not in title, mentioned once
  β†’ Score: LOW β†’ ranked #3 or filtered out

Elasticsearch can detect and correct typos. Searching for "treading" when you meant "trending" still returns relevant results. The algorithm uses edit distance (Levenshtein distance) to find close matches.

This is how Google, Amazon, and similar platforms handle misspellings gracefully.

Type-Ahead / Autocomplete

As a user types, Elasticsearch can return partial-match suggestions in real time β€” the β€œsearch-as-you-type” experience on Amazon, YouTube, etc.

A single query can search across multiple fields with different weights:

{
  "query": {
    "multi_match": {
      "query": "machine learning",
      "fields": ["title^3", "description^2", "content^1"]
    }
  }
}

^3 means β€œtitle matches count 3x more than content matches.”


PostgreSQL ILIKEPostgreSQL Full Text SearchElasticsearch
Speed at scaleVery slow (full scan)Fast (inverted index)Very fast (optimized distributed inverted index)
Relevance rankingNoneBasicAdvanced (BM25, field boosting, custom scoring)
Typo toleranceNoLimitedYes (fuzzy matching)
Setup complexityNone (already in DB)LowHigher (separate service)
When to useNever for large datasetsGood for simple search on existing Postgres dataWhen search is a core feature, or you already use Elasticsearch (ELK stack)

Decision Rule

  • Already using PostgreSQL, simple search needs β†’ use PostgreSQL full text search (tsvector, tsquery)
  • Company already uses Elasticsearch (ELK stack for logging) β†’ use Elasticsearch for search too
  • Search is a core product feature requiring relevance + typo tolerance at scale β†’ Elasticsearch

7. The ELK Stack

Elasticsearch is commonly used as part of the ELK Stack:

ComponentRole
ElasticsearchFast search and storage
Logstash (or Logstash)Log ingestion and transformation pipeline
KibanaVisualization and dashboards

If your company uses the ELK stack for log management (very common), Elasticsearch is already in your infrastructure β€” use it for full text search too.


8. Demo Results: ILIKE vs Elasticsearch

Setup: 50,000 product reviews in both PostgreSQL (Neon) and Elasticsearch (Elastic Cloud), same US-west region.

QueryPostgreSQL ILIKEElasticsearch
”laptop”~3–4 seconds~1 second
”only” (~8,000 matches)~7.5 seconds~500ms

Same number of results. Elasticsearch is 6–15x faster on this dataset. The gap widens as dataset size grows.


9. Practical Advice

Elasticsearch is a tool, not a mastery requirement. Unlike databases (which you must deeply understand β€” indexes, query plans, transactions, etc.), Elasticsearch is something you can get productive with quickly:

  • Know when to use it (search-heavy features, type-ahead, typo tolerance, log search)
  • Refer to the official docs or Elasticsearch DSL examples for your specific query type
  • Use field boosting and the multi-match query for most search use cases
  • For complex scoring, read the BM25 docs β€” but don’t block on theory

The Elasticsearch Query DSL is JSON-based, well-documented, and most common patterns (basic search, fuzzy, type-ahead, filters) are copy-paste ready from docs.


Quick Revision Checklist

  • ILIKE '%term%' = full table scan = slow at scale; no relevance, no typo tolerance
  • Inverted index = map from terms β†’ documents (flip the search direction); eliminates scanning
  • Elasticsearch is built on Apache Lucene (which implements the inverted index)
  • Elasticsearch document = JSON object; index = collection of documents
  • Field types: text (analyzed, full text search) vs keyword (exact match, filtering)
  • BM25 scoring factors: term frequency, document frequency, document length, field boosting
  • Field boosting: title match > description match > content match (configurable)
  • Typo tolerance: fuzzy matching via edit distance (Levenshtein)
  • Type-ahead: partial match suggestions as user types
  • PostgreSQL has built-in full text search (tsvector/tsquery) β€” good for simple needs
  • Choose Elasticsearch when: search is core feature, or already using ELK stack for logs
  • ELK Stack: Elasticsearch + Logstash + Kibana β†’ common log management + visualization stack
  • Demo: Elasticsearch is 6–15x faster than ILIKE on 50k rows; gap grows with scale
  • Priority: master databases deeply; treat Elasticsearch as a specialized tool (docs-first approach is fine)