03 — Data Management MOC
← Back to Software Engineering - Map of Content
Storing, retrieving, transforming, and moving data. Arguably the most critical concern in most applications — get this wrong and nothing else matters.
Relational Databases
SQL Fundamentals
- DDL (Data Definition Language) — CREATE, ALTER, DROP, TRUNCATE
- DML (Data Manipulation Language) — SELECT, INSERT, UPDATE, DELETE
- Joins — INNER, LEFT/RIGHT OUTER, FULL OUTER, CROSS, self-joins
- Subqueries — Correlated, non-correlated, EXISTS, IN
- Aggregation — GROUP BY, HAVING, window functions (ROW_NUMBER, RANK, LAG, LEAD, SUM OVER)
- CTEs (Common Table Expressions) — WITH clauses, recursive CTEs (see the sketch after this list)
- Set Operations — UNION, INTERSECT, EXCEPT
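A minimal runnable sketch of a recursive CTE and a window function, using Python's stdlib sqlite3 (the same SQL runs on Postgres and MySQL 8+); the employees table is made up purely for illustration:

```python
# Recursive CTE + window function on an illustrative employees table.
# Requires SQLite 3.25+ (bundled with modern Python) for window functions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER, salary INTEGER);
INSERT INTO employees VALUES
  (1, 'Ada', NULL, 200), (2, 'Bob', 1, 120), (3, 'Cid', 1, 150), (4, 'Dee', 2, 90);
""")

# Recursive CTE: walk the reporting chain from the top-level manager down.
print(conn.execute("""
WITH RECURSIVE chain(id, name, depth) AS (
    SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.name, c.depth + 1
    FROM employees e JOIN chain c ON e.manager_id = c.id
)
SELECT name, depth FROM chain ORDER BY depth
""").fetchall())   # [('Ada', 0), ('Bob', 1), ('Cid', 1), ('Dee', 2)] (tie order may vary)

# Window function: rank by salary without collapsing rows the way GROUP BY would.
print(conn.execute("""
SELECT name, salary, RANK() OVER (ORDER BY salary DESC) AS salary_rank FROM employees
""").fetchall())   # [('Ada', 200, 1), ('Cid', 150, 2), ('Bob', 120, 3), ('Dee', 90, 4)]
```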
Normalization
- 1NF — Atomic values, no repeating groups
- 2NF — No partial dependencies on composite keys
- 3NF — No transitive dependencies
- BCNF — Every determinant is a candidate key
- Denormalization — Strategic redundancy for read performance (used in Data Modeling)
Indexing
- B-Tree Indexes — Default, good for range queries and equality
- Hash Indexes — O(1) equality lookups, no range support
- Composite Indexes — Multi-column, leftmost prefix rule (see the sketch after this list)
- Covering Indexes — Include all queried columns, avoid table lookups
- Partial/Filtered Indexes — Index a subset of rows
- Full-Text Indexes — Text search with ranking (tsvector in Postgres)
- GIN/GiST Indexes — JSONB, arrays, geometric data (Postgres)
- Index Anti-Patterns — Over-indexing, unused indexes, low selectivity
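To make the leftmost-prefix and covering-index rules concrete, a small sketch with stdlib sqlite3; EXPLAIN QUERY PLAN is SQLite's equivalent of EXPLAIN, and the orders schema is hypothetical:

```python
# Composite index (leftmost prefix) and covering index, shown via query plans.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT, total INTEGER);
CREATE INDEX idx_orders_customer_status ON orders (customer_id, status);
""")

def plan(sql):
    # Each EXPLAIN QUERY PLAN row is (id, parent, notused, detail); keep the detail text.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Uses the composite index: the query filters on its leftmost column.
print(plan("SELECT * FROM orders WHERE customer_id = 42 AND status = 'shipped'"))

# Typically a full table scan: 'status' alone is not a leftmost prefix of the index.
print(plan("SELECT * FROM orders WHERE status = 'shipped'"))

# Covering index: include 'total' so this query is answered from the index alone
# (SQLite reports a COVERING INDEX; Postgres would show an Index Only Scan).
conn.execute("CREATE INDEX idx_orders_cover ON orders (customer_id, status, total)")
print(plan("SELECT total FROM orders WHERE customer_id = 42 AND status = 'shipped'"))
```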
Query Optimization
- Query Plans — EXPLAIN / EXPLAIN ANALYZE
- Query Planner — Cost-based optimization, statistics, cardinality estimation
- Common Optimizations — Index selection, join ordering, predicate pushdown
- N+1 Query Problem — Eager loading, batch loading, DataLoader pattern (see the sketch after this list)
- Slow Query Analysis — Query logs, pg_stat_statements, slow query logs
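A sketch of the N+1 problem and the batched fix, using stdlib sqlite3 and made-up authors/books tables; ORM eager loading and the DataLoader pattern do essentially the batched version:

```python
# N+1: one query per parent row, vs. a single batched IN query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
INSERT INTO authors VALUES (1, 'Le Guin'), (2, 'Borges');
INSERT INTO books VALUES (1, 1, 'The Dispossessed'), (2, 2, 'Ficciones'), (3, 1, 'The Lathe of Heaven');
""")

authors = conn.execute("SELECT id, name FROM authors").fetchall()

# N+1: one extra query per author (fine for 2 rows, painful for 10,000).
for author_id, _ in authors:
    conn.execute("SELECT title FROM books WHERE author_id = ?", (author_id,)).fetchall()

# Batched: a single IN query, then group in memory.
ids = [a[0] for a in authors]
placeholders = ",".join("?" * len(ids))
rows = conn.execute(
    f"SELECT author_id, title FROM books WHERE author_id IN ({placeholders})", ids
).fetchall()
books_by_author = {}
for author_id, title in rows:
    books_by_author.setdefault(author_id, []).append(title)
print(books_by_author)  # {1: ['The Dispossessed', 'The Lathe of Heaven'], 2: ['Ficciones']}
```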
Transactions & Concurrency
- ACID Properties — Atomicity, Consistency, Isolation, Durability
- Isolation Levels — Read Uncommitted, Read Committed, Repeatable Read, Serializable
- Concurrency Phenomena — Dirty reads, non-repeatable reads, phantom reads, write skew
- Locking — Row-level, table-level, advisory locks, deadlock detection
- MVCC (Multi-Version Concurrency Control) — Snapshot isolation, used in Postgres and MySQL InnoDB
- Optimistic vs Pessimistic Locking — Version columns vs SELECT FOR UPDATE (see the sketch after this list)
- Two-Phase Commit (2PC) — Distributed transactions (see Distributed Systems)
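A minimal sketch of optimistic locking with a version column (stdlib sqlite3, hypothetical accounts table); the pessimistic alternative would be SELECT ... FOR UPDATE inside a transaction:

```python
# Optimistic locking: guard the write with the version read earlier.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100, 0)")

# Read the row, remembering the version we saw.
balance, version = conn.execute(
    "SELECT balance, version FROM accounts WHERE id = 1").fetchone()

# Meanwhile another session commits its own change, bumping the version.
conn.execute("UPDATE accounts SET balance = balance - 10, version = version + 1 WHERE id = 1")

# Our write is conditioned on the version we originally read, so it matches 0 rows.
cur = conn.execute(
    "UPDATE accounts SET balance = ?, version = version + 1 WHERE id = 1 AND version = ?",
    (balance - 30, version))
print(cur.rowcount)  # 0 -> conflict detected; re-read and retry instead of silently overwriting
```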
Major Relational Databases
- PostgreSQL — Open source, advanced features (JSONB, CTEs, window functions, extensions like PostGIS, pg_vector), strong SQL compliance, MVCC, extensible type system. The default choice for most new projects.
- MySQL / MariaDB — Open source, widely deployed (WordPress, many web apps), InnoDB engine, good replication story, simpler than Postgres. MariaDB is a community fork with additional features.
- SQLite — Open source, embedded, serverless, single-file, zero configuration. Ideal for mobile, desktop, edge, and testing. Not for concurrent write-heavy workloads.
- SQL Server — Commercial (Microsoft), T-SQL, strong BI/reporting tooling, tight Azure integration, enterprise features. Free tier: SQL Server Express.
- Oracle Database — Commercial, PL/SQL, RAC clustering, Exadata, mature enterprise features. Expensive licensing, strong in legacy enterprise environments.
- CockroachDB — Source-available, distributed SQL (Postgres-compatible wire protocol), serializable isolation by default, multi-region, auto-sharding.
- YugabyteDB — Open source, distributed SQL, Postgres-compatible, tunable consistency, multi-region.
- AlloyDB — Google Cloud managed Postgres-compatible, columnar engine for analytics, Postgres wire protocol.
Partitioning (Single-Node)
- Table Partitioning — Split a large table into smaller physical pieces while presenting one logical table
- Range Partitioning — Partition by value range (e.g., date ranges, ID ranges). Good for time-series, archival. Risk: uneven partition sizes.
- List Partitioning — Partition by specific values (e.g., region = ‘US’, ‘EU’, ‘APAC’). Good for geographic or categorical splits.
- Hash Partitioning — Partition by hash of a column. Even distribution, but range queries span all partitions.
- Composite Partitioning — Combine methods (e.g., range by year, then hash by user_id within each year)
- Partition Pruning — Query planner skips irrelevant partitions, major performance win for large tables (see the sketch after this list)
- Postgres Partitioning — Declarative partitioning (PARTITION BY RANGE/LIST/HASH), partition-wise joins, partition-wise aggregation
- MySQL Partitioning — Supports RANGE, LIST, HASH, KEY partitioning; less flexible than Postgres
- When to Partition — Tables exceeding tens of millions of rows, time-based data with retention policies, queries that naturally filter on the partition key
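A conceptual sketch of range vs hash routing and partition pruning; in Postgres this is declarative (PARTITION BY RANGE/LIST/HASH) and the planner prunes automatically, so the helper and table names here are purely illustrative:

```python
# Range vs hash partition routing, plus manual "pruning" for a date-range query.
from datetime import date

def range_partition(d: date) -> str:
    # Range partitioning by month: natural for time-series and retention (drop old partitions).
    return f"events_{d.year}_{d.month:02d}"

def hash_partition(user_id: int, n_partitions: int = 8) -> str:
    # Hash partitioning: even spread, but a date-range query must touch every partition.
    return f"events_p{hash(user_id) % n_partitions}"

def pruned_partitions(start: date, end: date) -> list[str]:
    # Pruning: a query filtered on the partition key only touches these partitions.
    parts, y, m = [], start.year, start.month
    while (y, m) <= (end.year, end.month):
        parts.append(f"events_{y}_{m:02d}")
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return parts

print(range_partition(date(2024, 5, 17)))                       # events_2024_05
print(pruned_partitions(date(2024, 5, 1), date(2024, 7, 31)))   # only 3 partitions scanned
```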
Choosing a Relational Database — Tradeoffs
- Open Source vs Commercial — Postgres/MySQL are free, avoid vendor lock-in, large community. Oracle/SQL Server have dedicated support, enterprise tooling, but expensive licensing and lock-in.
- Postgres vs MySQL — Postgres: richer SQL features, better extensibility, stricter correctness, JSONB. MySQL: simpler operations, faster simple reads, wider hosting support, more forgiving defaults.
- Single-Node vs Distributed SQL — Single-node (Postgres, MySQL): simpler operations, mature tooling. Distributed SQL (CockroachDB, YugabyteDB): horizontal scale, multi-region, but more operational complexity and latency overhead.
- Managed vs Self-Hosted — Managed (RDS, Cloud SQL, Aurora): less ops burden, automated backups/patching, higher cost. Self-hosted: full control, lower cost at scale, but you own uptime.
- Decision Factors — Data model complexity, read/write ratio, transaction requirements, scale needs (vertical limit?), team expertise, ecosystem/tooling, licensing cost, cloud provider alignment, compliance requirements.
NoSQL Databases
Document Stores
- Concept — JSON/BSON documents, flexible schema, nested data
- MongoDB — Source-available (SSPL), collections, documents, aggregation pipeline, Atlas (managed). Most popular document DB, large ecosystem, sharding support.
- CouchDB — Open source (Apache), multi-master replication, eventual consistency, HTTP API
- Amazon DocumentDB — Managed, MongoDB-compatible API (not actual MongoDB engine)
- Firestore — Google Cloud managed, real-time sync, offline support, good for mobile/web apps
- Use Cases — Content management, user profiles, catalogs, event data, configuration
- Tradeoffs — Flexible schema speeds development but can lead to data inconsistency. Limited join support (MongoDB's $lookup is the main exception) pushes toward denormalization and data duplication. Better for read-heavy, evolving schemas; worse for complex transactions.
Key-Value Stores
- Concept — Simple get/set by key, extremely fast, minimal query capabilities
- Redis — Originally BSD, relicensed to source-available (RSAL/SSPL) in 2024; in-memory, data structures (lists, sets, sorted sets, streams), pub/sub, Lua scripting. Also used as cache, message broker, rate limiter.
- Valkey — Open source fork of Redis (Linux Foundation), community-driven after Redis license change
- DynamoDB — AWS managed, partition key + sort key, auto-scaling, single-digit ms latency, pay-per-request pricing
- etcd — Open source, distributed KV store, used in Kubernetes for coordination, Raft consensus
- Memcached — Open source, simple in-memory cache, multi-threaded, no persistence
- Use Cases — Session storage, caching, configuration, feature flags, leaderboards
- Tradeoffs — Fastest access patterns but limited query flexibility. Redis vs Memcached: Redis has richer data structures and persistence; Memcached is simpler, multi-threaded, and slightly faster for basic caching. DynamoDB: serverless and scalable but expensive at high throughput, vendor-locked.
Column-Family Stores
- Concept — Wide columns, designed for write-heavy, time-series workloads, LSM-tree storage
- Apache Cassandra — Open source (Apache), distributed, tunable consistency, no single point of failure, linear scalability. Good for write-heavy workloads at massive scale.
- HBase — Open source (Apache), Hadoop ecosystem, strong consistency, good for random read/write on large datasets
- ScyllaDB — Open source core, C++ rewrite of Cassandra, lower tail latency, better hardware utilization
- Google Bigtable — Managed; the original Bigtable design inspired Cassandra and HBase; good for analytics and time-series at massive scale
- Use Cases — IoT data, event logs, time-series, messaging, recommendation data
- Tradeoffs — Excellent write throughput and horizontal scale, but poor for ad-hoc queries, joins, or transactions. Data modeling is query-driven (must know access patterns upfront). Operational complexity is high.
Graph Databases
- Concept — Nodes, edges, and properties; optimized for relationship traversal
- Neo4j — Commercial + community edition, Cypher query language, native graph storage, ACID transactions
- Amazon Neptune — Managed, supports both Gremlin (property graph) and SPARQL (RDF)
- ArangoDB — Open source, multi-model (document + graph + key-value), AQL query language
- Dgraph — Open source, distributed graph DB, GraphQL-native
- Use Cases — Social networks, recommendation engines, knowledge graphs, fraud detection, access control
- Tradeoffs — Excels at traversing deep relationships (where relational joins would be expensive). Poor for bulk analytics, aggregations, or simple CRUD. Smaller ecosystem and tooling. Consider whether Postgres recursive CTEs or a relational model suffice before reaching for a graph DB.
Time-Series Databases
- Concept — Optimized for timestamped data, compression, downsampling, retention policies
- InfluxDB — Open source, Flux/InfluxQL, retention policies, built for metrics and events
- TimescaleDB — Open source, PostgreSQL extension (full SQL support), hypertables, good when you want time-series + relational in one DB
- Prometheus — Open source, pull-based metrics collection, PromQL, designed for monitoring (see Observability)
- QuestDB — Open source, high-performance ingestion, SQL interface
- ClickHouse — Open source, column-oriented OLAP database, extremely fast analytical queries, also strong for time-series
- Use Cases — Monitoring, IoT, financial data, sensor data, log analytics
- Tradeoffs — Fast ingestion and time-range queries, but limited general-purpose query capabilities. TimescaleDB wins when you need Postgres compatibility. ClickHouse wins for analytical queries. Prometheus is purpose-built for monitoring, not general time-series storage.
Search Engines (as databases)
- Elasticsearch — Source-available (SSPL/Elastic License), full-text search, inverted indexes, aggregations, ELK stack, distributed
- OpenSearch — Open source (Apache 2.0), AWS fork of Elasticsearch
- Solr — Open source (Apache), Lucene-based, mature, strong for faceted search
- Typesense / Meilisearch — Open source, modern developer-friendly alternatives, simpler to operate than Elasticsearch
- Use Cases — Full-text search, log analysis, product search, autocomplete, observability
- Tradeoffs — Best for text search and analytics, not a primary data store. Elasticsearch is powerful but operationally heavy (JVM tuning, cluster management). Typesense/Meilisearch are simpler but less featureful at scale.
Database Selection Guide
Key Decision Factors
- Data Model — Relational (structured, joins, transactions) vs document (flexible, nested) vs graph (relationships) vs time-series (temporal) vs key-value (simple lookups)
- Read vs Write Ratio — Read-heavy → Postgres + read replicas, caching. Write-heavy → Cassandra, DynamoDB, ClickHouse.
- Consistency Requirements — Strong consistency (financial, inventory) → Postgres, CockroachDB. Eventual consistency acceptable (social feeds, analytics) → Cassandra, DynamoDB, MongoDB.
- Scale — Vertical scaling sufficient (<10TB, <100K QPS) → Postgres/MySQL. Horizontal scaling needed → DynamoDB, Cassandra, CockroachDB, Vitess (MySQL sharding).
- Query Complexity — Complex joins, aggregations, ad-hoc queries → Relational. Known access patterns, denormalized → NoSQL. Full-text search → Elasticsearch/Typesense.
- Latency Requirements — Sub-millisecond → Redis/Memcached (in-memory). Single-digit ms → DynamoDB. Low ms → Postgres with proper indexing.
- Operational Complexity — Team size and expertise. Managed services (RDS, DynamoDB, Atlas) reduce burden. Self-hosted gives control but requires expertise.
- Cost — Open source: free but self-hosting costs. Managed: pay for convenience. Commercial licenses (Oracle, SQL Server): significant at scale. Serverless (DynamoDB, Firestore): pay-per-request, can be expensive at high throughput.
- Ecosystem & Tooling — ORMs, migration tools, monitoring, backup solutions, community support, hiring pool.
- Vendor Lock-In — Postgres, MySQL, Redis are portable. DynamoDB, Firestore, Cosmos DB are cloud-specific. Consider multi-cloud requirements.
Common Combinations (Polyglot Persistence)
- Web App — Postgres (primary) + Redis (cache/sessions) + Elasticsearch (search)
- IoT Platform — TimescaleDB or InfluxDB (time-series) + Postgres (metadata) + Kafka (ingestion)
- Social Platform — Postgres or MySQL (users, posts) + Redis (feeds, caching) + Neo4j or Postgres (social graph) + Elasticsearch (search)
- E-Commerce — Postgres (orders, inventory) + Redis (cart, sessions) + Elasticsearch (product search) + ClickHouse (analytics)
- Analytics Platform — ClickHouse or BigQuery (OLAP) + Postgres (metadata) + Kafka (ingestion) + S3 (data lake)
Sharding and Partitioning
Partitioning splits data within a single database instance. Sharding distributes data across multiple database instances. Both reduce per-node data size; sharding also distributes load. See Distributed Systems for the theoretical foundations (CAP, consistency models, consensus).
Sharding Strategies
- Range-Based Sharding — Assign key ranges to shards (e.g., users A-M → shard 1, N-Z → shard 2). Simple to understand. Risk: hotspots if access is uneven (e.g., new users cluster in one range).
- Hash-Based Sharding — Hash the shard key, mod by shard count. Even distribution. Downside: range queries must hit all shards (scatter-gather).
- Consistent Hashing — Minimizes data movement when adding/removing shards. Used by DynamoDB, Cassandra, Riak. Virtual nodes improve balance. See the sketch after this list.
- Directory-Based Sharding — A lookup table maps keys to shards. Maximum flexibility but the directory is a single point of failure and bottleneck.
- Geographic Sharding — Shard by region to keep data close to users. Reduces latency, aids data residency compliance (GDPR). Complicates global queries.
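A minimal consistent-hash ring with virtual nodes, showing why adding a shard moves only about 1/N of the keys, whereas hash-mod-N would reshuffle most of them; purely illustrative, not a production implementation:

```python
# Consistent hashing with virtual nodes: keys map to the next ring position clockwise.
import bisect
import hashlib

class Ring:
    def __init__(self, shards, vnodes=100):
        self.ring = []  # sorted list of (hash_position, shard_name)
        for shard in shards:
            for v in range(vnodes):
                self.ring.append((self._h(f"{shard}#{v}"), shard))
        self.ring.sort()

    @staticmethod
    def _h(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        i = bisect.bisect(self.ring, (self._h(key), ""))
        return self.ring[i % len(self.ring)][1]

keys = [f"user:{i}" for i in range(10_000)]
before = Ring(["s1", "s2", "s3"])
after = Ring(["s1", "s2", "s3", "s4"])
moved = sum(before.shard_for(k) != after.shard_for(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 25%, vs ~75% with plain mod-N
```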
Choosing a Shard Key
- High Cardinality — Key should have many distinct values to distribute evenly (user_id good, country bad)
- Even Distribution — Avoid keys that create hotspots (e.g., timestamp as shard key → all writes hit latest shard)
- Query Alignment — Most queries should include the shard key to avoid scatter-gather across all shards
- Immutability — Shard key should rarely change; changing it means moving data between shards
- Compound Shard Keys — Combine fields (e.g., tenant_id + user_id) for multi-tenant systems
Sharding Challenges
- Cross-Shard Queries — Joins across shards are expensive or impossible; requires scatter-gather or denormalization
- Cross-Shard Transactions — ACID across shards requires 2PC or saga pattern; most sharded systems avoid this (see Distributed Systems)
- Rebalancing — Adding shards requires data migration; consistent hashing minimizes movement, but it’s still operationally complex
- Hotspots — Uneven access patterns concentrate load on specific shards; monitor and split hot shards
- Operational Complexity — Schema changes across shards, backup coordination, monitoring per-shard health
- Referential Integrity — Foreign keys can’t span shards; enforce at application level or colocate related data
Sharding Tools & Approaches
- Application-Level Sharding — Application code decides which shard to query. Full control but couples business logic to data distribution.
- Vitess — MySQL sharding middleware (created at YouTube/Google). Transparent sharding, connection pooling, online schema changes. Powers YouTube, Slack, Square.
- Citus — Postgres extension for distributed tables. Colocated joins, distributed queries, reference tables. Now part of Azure Cosmos DB for PostgreSQL.
- ProxySQL / MaxScale — MySQL proxies that can route queries to shards based on rules
- Native Distributed Databases — CockroachDB, YugabyteDB, Cloud Spanner, TiDB — handle sharding internally with automatic rebalancing. Trade: simpler operations but less control and some latency overhead.
Sharding vs Alternatives
- Vertical Scaling First — Modern hardware (64+ cores, TBs of RAM, NVMe SSDs) handles more than most assume. A single large Postgres instance comfortably handles billions of rows. Don't shard until you must.
- Read Replicas — If reads are the bottleneck, replicas may be enough without sharding (see Replication below)
- Partitioning — If data is large but queries are naturally scoped (by time, tenant), table partitioning within one instance may suffice
- Caching Layer — If read latency is the issue, a cache (Redis, Memcached) in front of the DB may eliminate the need to shard (see Caching)
- Archive Old Data — If table size is the problem but most queries hit recent data, archive old rows to cold storage
Replication (for Data Management)
- Read Replicas — Copies of the primary that serve reads. Offload read traffic. Replication lag means stale reads.
- Synchronous vs Asynchronous Replication — Sync: no data loss on failover, higher write latency. Async: low latency, but risk of data loss on failover.
- Streaming Replication (Postgres) — WAL-based, continuous, configurable sync/async
- MySQL Replication — Binary log (binlog) replication, GTID-based for consistency
- Multi-Primary / Multi-Master — Multiple writable nodes. Requires conflict resolution (last-writer-wins, CRDTs, application-level). Used in Galera Cluster, Aurora Multi-Master.
- Logical Replication — Replicate specific tables/rows, cross-version, cross-platform (Postgres logical replication, Debezium CDC)
Data Modeling
Conceptual Modeling
- ER Diagrams — Entities, relationships, cardinality (1:1, 1:N, N:M)
- Conceptual vs Logical vs Physical Models — Abstraction levels
Relational Modeling
- Normalization — Eliminating redundancy (see Relational Databases above)
- Denormalization — Pre-joining for read performance
- Schema Design Patterns — Star schema, snowflake schema (for data warehousing)
- Migrations — Schema evolution, forward-only, reversible, zero-downtime migrations
- Migration Tools — Flyway, Liquibase, Alembic, ActiveRecord Migrations, Prisma
NoSQL Modeling
- Document Modeling — Embedding vs referencing, polymorphic patterns
- Access Pattern-Driven Design — Design schema around queries, not entities
- Single Table Design — DynamoDB pattern, overloaded GSIs
- Event Sourcing — Store events as the source of truth, derive current state (see Architectural Patterns)
Caching
Caching Strategies
- Cache-Aside (Lazy Loading) — App checks cache, fetches from DB on miss, writes to cache (see the sketch after this list)
- Write-Through — Write to cache and DB simultaneously
- Write-Behind (Write-Back) — Write to cache, async write to DB
- Read-Through — Cache itself fetches from DB on miss
- Refresh-Ahead — Proactively refresh before expiry
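A minimal cache-aside sketch; the dict stands in for Redis or Memcached, and load_user_from_db is a hypothetical query function:

```python
# Cache-aside: check the cache, read through to the DB on a miss, write back with a TTL.
import time

CACHE: dict[str, tuple[float, object]] = {}   # key -> (expires_at, value)
TTL_SECONDS = 300

def load_user_from_db(user_id: int) -> dict:
    # Placeholder for a real SELECT against the primary database.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    entry = CACHE.get(key)
    if entry and entry[0] > time.time():             # cache hit, still fresh
        return entry[1]
    value = load_user_from_db(user_id)               # cache miss: fall back to the DB
    CACHE[key] = (time.time() + TTL_SECONDS, value)  # populate the cache with a TTL
    return value

print(get_user(7))  # miss -> DB
print(get_user(7))  # hit -> served from cache
```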
Cache Invalidation
- TTL (Time to Live) — Expiry-based, simple but stale data risk
- Event-Based Invalidation — Invalidate on write events
- Cache Stampede / Thundering Herd — Locking, request coalescing, probabilistic early expiry (see the sketch after this list)
- The Two Hard Problems — “Cache invalidation and naming things”
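One way to soften stampedes is probabilistic early expiry (the "XFetch" approach): each reader may refresh a value slightly before its TTL, so expiry never hits every reader at once. A sketch with an in-process dict standing in for the cache; names and defaults are illustrative:

```python
# Probabilistic early expiry: refresh probability grows as the entry nears its TTL.
import math
import random
import time

CACHE: dict[str, tuple[object, float, float]] = {}  # key -> (value, recompute_cost, expires_at)

def fetch(key: str, ttl: float, recompute, beta: float = 1.0):
    entry = CACHE.get(key)
    now = time.time()
    if entry:
        value, delta, expires_at = entry
        # -delta * beta * log(rand) is a positive jitter, so some callers refresh early;
        # delta (the cost of the last recompute) widens the early-refresh window.
        if now - delta * beta * math.log(random.random()) < expires_at:
            return value
    start = time.time()
    value = recompute()                      # only the unlucky early callers pay this cost
    delta = time.time() - start
    CACHE[key] = (value, delta, time.time() + ttl)
    return value

config = fetch("site-config", ttl=60, recompute=lambda: {"feature_x": True})
```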
Caching Layers
- Application-Level — In-memory (Guava, Caffeine), local process cache
- Distributed Cache — Redis, Memcached, Hazelcast
- CDN Caching — Static assets, edge caching, cache-control headers (see Networking)
- Database Query Cache — Query result caching, materialized views
- HTTP Caching — ETag, Last-Modified, Cache-Control, conditional requests
- Browser Caching — Service workers, localStorage (frontend-specific)
Data Pipelines
ETL vs ELT
- ETL (Extract, Transform, Load) — Transform before loading, traditional data warehousing
- ELT (Extract, Load, Transform) — Load raw data, transform in-place (modern approach with cheap storage)
Batch Processing
- Apache Hadoop — MapReduce, HDFS, YARN
- Apache Spark — In-memory processing, RDDs, DataFrames, SparkSQL
- dbt (data build tool) — SQL-first transforms, version-controlled analytics
- Airflow — DAG-based workflow orchestration, scheduling
Stream Processing
- Apache Kafka — Distributed log, topics, partitions, consumer groups, exactly-once semantics
- Apache Flink — Stateful stream processing, event time, watermarks
- Apache Pulsar — Multi-tenancy, tiered storage
- Kafka Streams / ksqlDB — Stream processing as a library / SQL on streams
- Event-Driven Architecture — Events as first-class citizens, CQRS, event sourcing (see Architectural Patterns)
Data Storage Tiers
- Data Lake — Raw, unstructured storage (S3, ADLS, GCS)
- Data Lakehouse — Combines lake + warehouse (Delta Lake, Apache Iceberg, Apache Hudi)
- Data Warehouse — Structured, optimized for analytics (Snowflake, BigQuery, Redshift, Databricks)
- Data Mart — Subset of warehouse for specific business domain
Data Governance & Quality
- Data Catalogs — Metadata management, discovery (Amundsen, DataHub, Atlan)
- Data Lineage — Tracking data flow from source to consumption
- Data Quality — Validation, profiling, Great Expectations, dbt tests
- Data Contracts — Schema agreements between producers and consumers
- PII & Compliance — GDPR, CCPA, data masking, anonymization (see Infrastructure Security)
Change Data Capture (CDC)
- Log-Based CDC — Read database transaction logs (Debezium, Maxwell)
- Trigger-Based CDC — Database triggers on changes
- Polling-Based CDC — Periodic queries for changes
- Use Cases — Real-time sync, cache invalidation, event sourcing
Database Operations
Connection Management
- Connection Pooling — Reuse database connections to avoid per-request overhead; PgBouncer (Postgres, transaction/session pooling), HikariCP (JVM, the default pool in Spring Boot), pgpool-II (Postgres, also does load balancing)
- Connection Limits — max_connections (Postgres default 100), each connection consumes memory (~10MB), more connections ≠ more throughput
- Idle Timeout — Reclaim unused connections, idle_in_transaction_session_timeout, connection lifetime limits
- Pool Sizing — Formula: connections ≈ (core_count × 2) + disk_spindles; too many connections cause contention and context switching
Query Analysis
- EXPLAIN / EXPLAIN ANALYZE — EXPLAIN shows planned execution; EXPLAIN ANALYZE actually executes and shows real timing, rows, loops
- Query Plans — Seq Scan, Index Scan, Index Only Scan, Bitmap Scan, Nested Loop, Hash Join, Merge Join, Sort, Aggregate
- Sequential vs Index Scans — Planner chooses based on selectivity; low selectivity → seq scan can be faster than index scan
- Cost Estimation — startup_cost..total_cost, rows estimate, width; based on table statistics (pg_statistic)
- Slow Query Logs — log_min_duration_statement (Postgres), slow_query_log (MySQL), identifying expensive queries in production
Maintenance
- VACUUM / ANALYZE (Postgres) — VACUUM reclaims dead tuples from MVCC, ANALYZE updates planner statistics, autovacuum handles this automatically but may need tuning
- Compaction (LSM-Tree Databases) — Merging SSTables, leveled vs size-tiered compaction, write amplification (Cassandra, RocksDB)
- Index Rebuilding — REINDEX (Postgres), index fragmentation, CONCURRENTLY option to avoid locking
- Table Bloat — Dead rows consuming space, pg_repack for online table compaction, monitoring bloat with pgstattuple
- Statistics Collection — ANALYZE frequency, default_statistics_target, extended statistics for correlated columns
Replication Operations
- Replication Lag Monitoring — pg_stat_replication (Postgres), SHOW REPLICA STATUS (MySQL 8.0.22+; formerly SHOW SLAVE STATUS), alerting on lag thresholds
- Promoting Replicas — pg_promote() (Postgres), failover procedures, split-brain prevention, fencing
- Failover Procedures — Automated (Patroni, repmgr, Orchestrator) vs manual, DNS updates, connection string updates
- Logical vs Physical Replication — Physical: byte-for-byte WAL shipping of the entire instance (every database on it). Logical: table-level, cross-version, selective, allows writes on the replica.
Backup & Recovery
- pg_dump / pg_restore — Logical backups, custom format (-Fc), parallel restore (-j), schema-only (--schema-only), selective table backup
- Point-in-Time Recovery (PITR) — Continuous WAL archiving + base backup, restore to any point in time, recovery_target_time
- WAL Archiving — archive_command, archive_mode, WAL-G (modern backup tool), pgBackRest (parallel backup/restore, incremental)
- Snapshot-Based Backups — Storage-level snapshots (EBS, LVM), fast but requires filesystem quiesce or pg_start_backup
- Backup Testing — Regularly restore backups to verify, automate restore testing, document RTO/RPO (see Disaster Recovery & Business Continuity)
Monitoring
- Key Metrics — Active connections, queries/sec, cache hit ratio (should be >99%), replication lag, disk usage, transaction rate, lock waits
- pg_stat_statements — Track query statistics (calls, total_time, mean_time, rows), identify top queries by time/calls
- Slow Query Analysis — Identifying expensive queries, correlating with application behavior, query fingerprinting
- Tools — pgMonitor, Datadog Postgres integration, CloudWatch (RDS), pgAdmin, pg_stat_activity for active queries
Schema Migrations in Production
- Online DDL — Adding columns (nullable or with default), creating indexes CONCURRENTLY, ALTER TABLE challenges with large tables
- Zero-Downtime Migrations — Backward-compatible changes only, no column renames/removes while old code runs
- Expand-Contract Pattern — Phase 1: add new column/table (expand), Phase 2: backfill + dual-write, Phase 3: switch reads, Phase 4: remove old (contract); see the sketch after this list
- Backward-Compatible Changes — Add columns (nullable), add tables, add indexes; avoid: remove columns, rename columns, add NOT NULL without default
- Migration Tools — Flyway (Java, SQL-based, versioned), Alembic (Python/SQLAlchemy, autogenerate), golang-migrate, Prisma Migrate, Liquibase (XML/YAML/SQL, rollback support)
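The expand-contract phases, compressed into one runnable stdlib sqlite3 sketch (the users table and column names are made up; in production each phase is a separate deploy and the backfill runs in batches):

```python
# Expand-contract: rename users.fullname to display_name without breaking old code.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada Lovelace')")

# Phase 1 (expand): add the new nullable column; old code keeps working untouched.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Phase 2 (backfill + dual-write): copy existing rows; new code writes both columns.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# Phase 3 (switch reads): deploy code that reads display_name only.
print(conn.execute("SELECT display_name FROM users").fetchall())

# Phase 4 (contract): once no running code touches fullname, drop it.
conn.execute("ALTER TABLE users DROP COLUMN fullname")  # DROP COLUMN needs SQLite 3.35+
```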
Search Systems
Inverted Index
- Tokenization — Breaking text into tokens (words), analyzers, character filters, token filters
- Posting Lists — For each term, a list of documents containing it, with positions and frequencies (see the sketch after this list)
- Term Frequency (TF) — How often a term appears in a document; higher TF → more relevant
- Document Frequency (DF) — How many documents contain the term; higher DF → less discriminating (common words)
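A toy inverted index showing tokenization, posting lists with positions, and TF/DF; real engines (Lucene) layer analyzers, compression, and skip lists on top:

```python
# Build a tiny inverted index: term -> {doc_id: [positions]}.
from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "quick quick fox",
}

index: dict[str, dict[int, list[int]]] = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, token in enumerate(text.lower().split()):   # trivial analyzer: lowercase + split
        index[token].setdefault(doc_id, []).append(pos)

def term_frequency(term: str, doc_id: int) -> int:
    return len(index.get(term, {}).get(doc_id, []))

def document_frequency(term: str) -> int:
    return len(index.get(term, {}))

print(index["quick"])              # posting list: {1: [1], 3: [0, 1]}
print(term_frequency("quick", 3))  # 2 -> appears twice in doc 3
print(document_frequency("the"))   # 2 -> common term, less discriminating
```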
Text Analysis
- Analyzers — Pipeline of character filters → tokenizer → token filters; standard, simple, whitespace, language-specific, custom
- Tokenizers — Standard (word boundaries), keyword (entire input as one token), pattern (regex-based), path hierarchy
- Filters — Lowercase, stop words, stemming (algorithmic: Porter, Snowball), lemmatization (dictionary-based), synonyms, n-grams, edge n-grams
- Stop Words — Common words (the, is, at) excluded from indexing to reduce index size and improve relevance
- Synonyms — Synonym expansion at index or query time, synonym graphs, trade-off: flexibility vs index size
Ranking & Relevance
- TF-IDF — Term Frequency × Inverse Document Frequency; classic relevance scoring, weights rare terms higher
- BM25 — Improved TF-IDF, Okapi BM25, saturation of term frequency, document length normalization; default in Elasticsearch/Lucene (see the scoring sketch after this list)
- Vector Similarity — Cosine similarity, dot product, Euclidean distance; used for semantic search with embeddings
- Learning to Rank — Use ML to learn ranking functions from user behavior (clicks, conversions), feature engineering, RankNet, LambdaMART
- Boosting — Increase/decrease relevance of specific fields or query clauses, function scoring, decay functions (recency, proximity)
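A compact BM25 scorer over a three-document toy corpus, using the usual k1 and b defaults; it shows the IDF weighting, term-frequency saturation, and length normalization named above (all names here are illustrative):

```python
# BM25 scoring from scratch over a tiny in-memory corpus.
import math
from collections import Counter

docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "quick quick fox",
}
tokens = {d: t.lower().split() for d, t in docs.items()}
N = len(docs)
avg_len = sum(len(t) for t in tokens.values()) / N
df = Counter(term for t in tokens.values() for term in set(t))  # document frequency per term
k1, b = 1.2, 0.75

def bm25(query: str, doc_id: int) -> float:
    score = 0.0
    counts = Counter(tokens[doc_id])
    for term in query.lower().split():
        tf = counts[term]
        if tf == 0:
            continue
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)    # rare terms weigh more
        norm = k1 * (1 - b + b * len(tokens[doc_id]) / avg_len)         # length normalization
        score += idf * tf * (k1 + 1) / (tf + norm)                      # tf saturation
    return score

print(sorted(docs, key=lambda d: bm25("quick fox", d), reverse=True))   # doc 3 ranks first
```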
Search Architecture
- Sharding — Distribute index across nodes for scale, routing queries to relevant shards
- Replication — Copies of each shard for fault tolerance and read throughput
- Near-Real-Time Indexing — Refresh interval (default 1s in Elasticsearch), trade-off between indexing latency and search visibility
- Segment Merging — Lucene segments, background merging, force merge for read-heavy indexes, write amplification
Query Features
- Full-Text Search — Match queries, multi-match, phrase matching, slop (proximity), fuzziness
- Faceted Search — Aggregations for filtering (category counts, price ranges, ratings), post-filter for facet accuracy
- Autocomplete / Typeahead — Edge n-grams, completion suggesters, search-as-you-type fields, debouncing
- Fuzzy Matching — Edit distance (Levenshtein), configurable fuzziness, phonetic matching (Soundex, Metaphone)
- Highlighting — Return matching fragments with highlighted terms, configurable pre/post tags
- Aggregations — Metric (avg, sum, min, max), bucket (terms, histogram, date_histogram, range), pipeline aggregations
Tools
- Elasticsearch / OpenSearch — Distributed, Lucene-based, REST API, rich query DSL, aggregations, ELK/EFK stack, Kibana/OpenSearch Dashboards
- Solr — Apache Lucene-based, mature, SolrCloud for distributed mode, strong faceted search
- Meilisearch — Rust-based, typo-tolerant, instant search, simple API, good developer experience, limited aggregations
- Typesense — C++-based, typo-tolerant, easy to operate, geo search, good for small-medium datasets
- PostgreSQL Full-Text Search — tsvector/tsquery, GIN indexes, ts_rank, good enough for many applications without a separate search service
Vector Search
- Embedding-Based Retrieval — Convert text/images to dense vectors, search by similarity rather than keyword matching (see Natural Language Processing and the sketch after this list)
- ANN Algorithms — HNSW (hierarchical navigable small worlds, high recall), IVF (inverted file index, faster at scale), PQ (product quantization, compression)
- pgvector — PostgreSQL extension for vector similarity search, ivfflat and hnsw indexes, integrates with existing Postgres queries
- Dedicated Vector Databases — Pinecone (managed, serverless), Weaviate (open source, hybrid search), Qdrant (Rust-based, filtering), Milvus (open source, GPU-accelerated), Chroma (lightweight, Python-native)
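Brute-force cosine-similarity retrieval is the baseline that ANN indexes (HNSW, IVF) approximate at scale; a tiny sketch with made-up three-dimensional "embeddings":

```python
# Exact nearest-neighbour search by cosine similarity over a toy corpus.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

corpus = {
    "doc-cats":  [0.9, 0.1, 0.0],
    "doc-dogs":  [0.8, 0.2, 0.1],
    "doc-stock": [0.0, 0.1, 0.9],
}

def search(query_vec: list[float], k: int = 2):
    scored = sorted(corpus.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return scored[:k]

print(search([0.85, 0.15, 0.05]))  # the two "animal" documents rank above "doc-stock"
```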
Data Serialization Formats
Text Formats
- JSON — Schema-less, human-readable, verbose, ubiquitous (APIs, config, NoSQL), no comments (JSONC/JSON5 add them), limited types (no dates, binary)
- XML — Schema support (XSD), namespaces, XSLT transforms, verbose, SOAP/legacy APIs, still used in enterprise/config (Maven pom.xml, Android layouts)
- YAML — Superset of JSON, indentation-based, anchors/aliases, multi-document, used in config (Kubernetes, CI/CD, Ansible), gotchas (Norway problem, implicit typing)
- TOML — Minimal, unambiguous, explicit types, good for config files (Cargo.toml, pyproject.toml), less suitable for deeply nested data
Binary Formats
- Protocol Buffers (Protobuf) — Schema-defined (.proto files), compact binary encoding, backward/forward compatible, code generation, gRPC default (see API Design)
- Avro — Schema evolution (reader/writer schemas), self-describing (schema embedded), Kafka default serialization, good for data pipelines
- MessagePack — Binary JSON, compact, schema-less, faster than JSON, cross-language, used in Redis/Fluentd
- CBOR — Concise Binary Object Representation, IETF standard (RFC 8949), self-describing, designed for constrained environments (IoT)
- FlatBuffers — Zero-copy deserialization (access data without unpacking), Google, used in games and performance-critical paths
- Cap’n Proto — Zero-copy, no encoding/decoding step, mmap-friendly, created by original Protobuf author
Columnar Formats
- Parquet — Column-oriented, efficient compression (per-column encoding), predicate pushdown, row groups, Apache ecosystem (Spark, Hive, Presto), de facto standard for analytics
- ORC — Optimized Row Columnar, Hive-optimized, ACID support, similar to Parquet, stronger in Hive ecosystem
- Arrow — In-memory columnar format, zero-copy IPC (inter-process communication), language-agnostic, standard for data interchange between systems (Pandas, Spark, DuckDB)
Schema Management
- Schema Registries — Confluent Schema Registry (Kafka), store and version schemas, enforce compatibility at produce time
- Compatibility Modes — Backward (new schema reads old data), forward (old schema reads new data), full (both), none
- Schema Evolution Strategies — Add optional fields, never remove required fields, use field IDs not names (Protobuf), union types (Avro)
Choosing a Format
- Human-Readability vs Performance — JSON/YAML for config and debugging, Protobuf/Avro for production data, Parquet for analytics
- Schema Enforcement — Protobuf/Avro enforce schemas and catch errors early; JSON/YAML are flexible but error-prone
- Language Support — JSON universal, Protobuf/Avro have code generation for most languages, Parquet/Arrow have strong Python/JVM support
- Use Case Mapping — API payloads (JSON, Protobuf), config files (YAML, TOML), event streaming (Avro, Protobuf), analytics storage (Parquet), in-memory processing (Arrow)
Tags: data, databases, caching, pipelines, database-selection, sharding, partitioning, replication, search, serialization, database-operations