03 — Data Management MOC

← Back to Software Engineering - Map of Content

Storing, retrieving, transforming, and moving data. Arguably the most critical concern in most applications — get this wrong and nothing else matters.


Relational Databases

SQL Fundamentals

  • DDL (Data Definition Language) — CREATE, ALTER, DROP, TRUNCATE
  • DML (Data Manipulation Language) — SELECT, INSERT, UPDATE, DELETE
  • Joins — INNER, LEFT/RIGHT OUTER, FULL OUTER, CROSS, self-joins
  • Subqueries — Correlated, non-correlated, EXISTS, IN
  • Aggregation — GROUP BY, HAVING, window functions (ROW_NUMBER, RANK, LAG, LEAD, SUM OVER)
  • CTEs (Common Table Expressions) — WITH clauses, recursive CTEs (sketch after this list)
  • Set Operations — UNION, INTERSECT, EXCEPT
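
A minimal, runnable sketch of a recursive CTE and a window function, using Python's stdlib sqlite3 (window functions need SQLite 3.25+; recursive CTEs are older). The employees table and its rows are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER, salary INTEGER);
INSERT INTO employees VALUES
  (1, 'Ada',    NULL, 200),
  (2, 'Grace',  1,    150),
  (3, 'Alan',   1,    140),
  (4, 'Edsger', 2,    120);
""")

# Recursive CTE: walk the management chain starting from the root.
print(con.execute("""
WITH RECURSIVE chain(id, name, depth) AS (
  SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
  UNION ALL
  SELECT e.id, e.name, c.depth + 1
  FROM employees e JOIN chain c ON e.manager_id = c.id
)
SELECT name, depth FROM chain ORDER BY depth;
""").fetchall())  # [('Ada', 0), ('Grace', 1), ('Alan', 1), ('Edsger', 2)]

# Window function: rank rows without collapsing them the way GROUP BY would.
print(con.execute("""
SELECT name, salary, RANK() OVER (ORDER BY salary DESC) AS rnk FROM employees;
""").fetchall())
```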

Normalization

  • 1NF — Atomic values, no repeating groups
  • 2NF — No partial dependencies on composite keys
  • 3NF — No transitive dependencies
  • BCNF — Every determinant is a candidate key
  • Denormalization — Strategic redundancy for read performance (see Data Modeling below)

Indexing

  • B-Tree Indexes — Default, good for range queries and equality
  • Hash Indexes — O(1) equality lookups, no range support
  • Composite Indexes — Multi-column, leftmost prefix rule (sketch after this list)
  • Covering Indexes — Include all queried columns, avoid table lookups
  • Partial/Filtered Indexes — Index a subset of rows
  • Full-Text Indexes — Text search with ranking (tsvector in Postgres)
  • GIN/GiST Indexes — JSONB, arrays, geometric data (Postgres)
  • Index Anti-Patterns — Over-indexing, unused indexes, low selectivity
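
A quick demonstration of the leftmost prefix rule via sqlite3's EXPLAIN QUERY PLAN (plan text varies by SQLite version; Postgres EXPLAIN behaves analogously). The orders schema is hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT, total INTEGER);
CREATE INDEX idx_orders_cust_status ON orders (customer_id, status);
""")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail).
    return [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

# Leading column alone: the composite index is usable.
print(plan("SELECT * FROM orders WHERE customer_id = 42"))
# e.g. ['SEARCH orders USING INDEX idx_orders_cust_status (customer_id=?)']

# Both columns: same index, narrower search.
print(plan("SELECT * FROM orders WHERE customer_id = 42 AND status = 'paid'"))

# Trailing column alone: the index cannot be used -> full table scan.
print(plan("SELECT * FROM orders WHERE status = 'paid'"))
# e.g. ['SCAN orders']
```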

Query Optimization

  • Query Plans — EXPLAIN / EXPLAIN ANALYZE
  • Query Planner — Cost-based optimization, statistics, cardinality estimation
  • Common Optimizations — Index selection, join ordering, predicate pushdown
  • N+1 Query Problem — Eager loading, batch loading, DataLoader pattern (sketch after this list)
  • Slow Query Analysis — Query logs, pg_stat_statements, slow query logs
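
The N+1 problem and a batch-loading fix in miniature, against a hypothetical authors/posts schema in sqlite3; ORM eager loading and the DataLoader pattern automate the same idea:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
INSERT INTO authors VALUES (1, 'ann'), (2, 'bob'), (3, 'cat');
INSERT INTO posts VALUES (1, 1, 'x'), (2, 2, 'y'), (3, 3, 'z');
""")

posts = con.execute("SELECT id, author_id, title FROM posts").fetchall()

# N+1: one query for the posts, then one query per post for its author.
for _, author_id, _ in posts:
    con.execute("SELECT name FROM authors WHERE id = ?", (author_id,))

# Batch fix: collect the ids and load every author in a single query.
ids = sorted({author_id for _, author_id, _ in posts})
placeholders = ",".join("?" * len(ids))
authors = dict(con.execute(
    f"SELECT id, name FROM authors WHERE id IN ({placeholders})", ids))
print(authors)  # {1: 'ann', 2: 'bob', 3: 'cat'} -- join in memory, not N round trips
```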

Transactions & Concurrency

  • ACID Properties — Atomicity, Consistency, Isolation, Durability
  • Isolation Levels — Read Uncommitted, Read Committed, Repeatable Read, Serializable
  • Concurrency Phenomena — Dirty reads, non-repeatable reads, phantom reads, write skew
  • Locking — Row-level, table-level, advisory locks, deadlock detection
  • MVCC (Multi-Version Concurrency Control) — Snapshot isolation, used in Postgres and MySQL InnoDB
  • Optimistic vs Pessimistic Locking — Version columns vs SELECT FOR UPDATE (sketch after this list)
  • Two-Phase Commit (2PC) — Distributed transactions (see Distributed Systems)
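
A sketch of optimistic locking with a version column, using a hypothetical accounts table; the conditional UPDATE succeeds only if no concurrent writer bumped the version after we read it:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)")
con.execute("INSERT INTO accounts VALUES (1, 100, 0)")

def withdraw(amount):
    balance, version = con.execute(
        "SELECT balance, version FROM accounts WHERE id = 1").fetchone()
    cur = con.execute(
        "UPDATE accounts SET balance = ?, version = version + 1 "
        "WHERE id = 1 AND version = ?",
        (balance - amount, version))
    # rowcount == 0 means another writer won the race: retry or report a conflict.
    return cur.rowcount == 1

print(withdraw(30))  # True when uncontended; the pessimistic alternative is SELECT ... FOR UPDATE
```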

Major Relational Databases

  • PostgreSQL — Open source, advanced features (JSONB, CTEs, window functions, extensions like PostGIS, pgvector), strong SQL compliance, MVCC, extensible type system. The default choice for most new projects.
  • MySQL / MariaDB — Open source, widely deployed (WordPress, many web apps), InnoDB engine, good replication story, simpler than Postgres. MariaDB is a community fork with additional features.
  • SQLite — Open source, embedded, serverless, single-file, zero configuration. Ideal for mobile, desktop, edge, and testing. Not for concurrent write-heavy workloads.
  • SQL Server — Commercial (Microsoft), T-SQL, strong BI/reporting tooling, tight Azure integration, enterprise features. Free tier: SQL Server Express.
  • Oracle Database — Commercial, PL/SQL, RAC clustering, Exadata, mature enterprise features. Expensive licensing, strong in legacy enterprise environments.
  • CockroachDB — Source-available core (BSL), distributed SQL (Postgres-compatible wire protocol), serializable isolation by default, multi-region, auto-sharding.
  • YugabyteDB — Open source, distributed SQL, Postgres-compatible, tunable consistency, multi-region.
  • AlloyDB — Google Cloud managed Postgres-compatible, columnar engine for analytics, Postgres wire protocol.

Partitioning (Single-Node)

  • Table Partitioning — Split a large table into smaller physical pieces while presenting one logical table
  • Range Partitioning — Partition by value range (e.g., date ranges, ID ranges). Good for time-series, archival. Risk: uneven partition sizes.
  • List Partitioning — Partition by specific values (e.g., region = 'US', 'EU', 'APAC'). Good for geographic or categorical splits.
  • Hash Partitioning — Partition by hash of a column. Even distribution, but range queries span all partitions.
  • Composite Partitioning — Combine methods (e.g., range by year, then hash by user_id within each year)
  • Partition Pruning — Query planner skips irrelevant partitions, major performance win for large tables
  • Postgres Partitioning — Declarative partitioning (PARTITION BY RANGE/LIST/HASH), partition-wise joins, partition-wise aggregation (sketch after this list)
  • MySQL Partitioning — Supports RANGE, LIST, HASH, KEY partitioning; less flexible than Postgres
  • When to Partition — Tables exceeding tens of millions of rows, time-based data with retention policies, queries that naturally filter on the partition key

Choosing a Relational Database — Tradeoffs

  • Open Source vs Commercial — Postgres/MySQL are free, avoid vendor lock-in, large community. Oracle/SQL Server have dedicated support, enterprise tooling, but expensive licensing and lock-in.
  • Postgres vs MySQL — Postgres: richer SQL features, better extensibility, stricter correctness, JSONB. MySQL: simpler operations, faster simple reads, wider hosting support, more forgiving defaults.
  • Single-Node vs Distributed SQL — Single-node (Postgres, MySQL): simpler operations, mature tooling. Distributed SQL (CockroachDB, YugabyteDB): horizontal scale, multi-region, but more operational complexity and latency overhead.
  • Managed vs Self-Hosted — Managed (RDS, Cloud SQL, Aurora): less ops burden, automated backups/patching, higher cost. Self-hosted: full control, lower cost at scale, but you own uptime.
  • Decision Factors — Data model complexity, read/write ratio, transaction requirements, scale needs (vertical limit?), team expertise, ecosystem/tooling, licensing cost, cloud provider alignment, compliance requirements.

NoSQL Databases

Document Stores

  • Concept — JSON/BSON documents, flexible schema, nested data
  • MongoDB — Source-available (SSPL), collections, documents, aggregation pipeline, Atlas (managed). Most popular document DB, large ecosystem, sharding support.
  • CouchDB — Open source (Apache), multi-master replication, eventual consistency, HTTP API
  • Amazon DocumentDB — Managed, MongoDB-compatible API (not actual MongoDB engine)
  • Firestore — Google Cloud managed, real-time sync, offline support, good for mobile/web apps
  • Use Cases — Content management, user profiles, catalogs, event data, configuration
  • Tradeoffs — Flexible schema speeds development but can lead to data inconsistency. No joins means denormalization and data duplication. Better for read-heavy, evolving schemas; worse for complex transactions.

Key-Value Stores

  • Concept — Simple get/set by key, extremely fast, minimal query capabilities
  • Redis — In-memory, data structures (lists, sets, sorted sets, streams), pub/sub, Lua scripting; relicensed from BSD to RSAL/SSPL dual licensing in 2024. Also used as cache, message broker, rate limiter.
  • Valkey — Open source fork of Redis (Linux Foundation), community-driven after Redis license change
  • DynamoDB — AWS managed, partition key + sort key, auto-scaling, single-digit ms latency, pay-per-request pricing
  • etcd — Open source, distributed KV store, used in Kubernetes for coordination, Raft consensus
  • Memcached — Open source, simple in-memory cache, multi-threaded, no persistence
  • Use Cases — Session storage, caching, configuration, feature flags, leaderboards
  • Tradeoffs — Fastest access patterns but limited query flexibility. Redis vs Memcached: Redis has richer data structures and persistence; Memcached is simpler, multi-threaded, and slightly faster for basic caching. DynamoDB: serverless and scalable but expensive at high throughput, vendor-locked.

Column-Family Stores

  • Concept — Wide columns, designed for write-heavy, time-series workloads, LSM-tree storage
  • Apache Cassandra — Open source (Apache), distributed, tunable consistency, no single point of failure, linear scalability. Good for write-heavy workloads at massive scale.
  • HBase — Open source (Apache), Hadoop ecosystem, strong consistency, good for random read/write on large datasets
  • ScyllaDB — Open source core, C++ rewrite of Cassandra, lower tail latency, better hardware utilization
  • Google Bigtable — Managed; the original Bigtable paper inspired HBase and Cassandra; good for analytics and time-series at Google scale
  • Use Cases — IoT data, event logs, time-series, messaging, recommendation data
  • Tradeoffs — Excellent write throughput and horizontal scale, but poor for ad-hoc queries, joins, or transactions. Data modeling is query-driven (must know access patterns upfront). Operational complexity is high.

Graph Databases

  • Concept — Nodes, edges, and properties; optimized for relationship traversal
  • Neo4j — Commercial + community edition, Cypher query language, native graph storage, ACID transactions
  • Amazon Neptune — Managed, supports both Gremlin (property graph) and SPARQL (RDF)
  • ArangoDB — Open source, multi-model (document + graph + key-value), AQL query language
  • Dgraph — Open source, distributed graph DB, GraphQL-native
  • Use Cases — Social networks, recommendation engines, knowledge graphs, fraud detection, access control
  • Tradeoffs — Excels at traversing deep relationships (where relational joins would be expensive). Poor for bulk analytics, aggregations, or simple CRUD. Smaller ecosystem and tooling. Consider whether Postgres recursive CTEs or a relational model suffice before reaching for a graph DB.

Time-Series Databases

  • Concept — Optimized for timestamped data, compression, downsampling, retention policies
  • InfluxDB — Open source, Flux/InfluxQL, retention policies, built for metrics and events
  • TimescaleDB — Open source, PostgreSQL extension (full SQL support), hypertables, good when you want time-series + relational in one DB
  • Prometheus — Open source, pull-based metrics collection, PromQL, designed for monitoring (see Observability)
  • QuestDB — Open source, high-performance ingestion, SQL interface
  • ClickHouse — Open source, column-oriented OLAP database, extremely fast analytical queries, also strong for time-series
  • Use Cases — Monitoring, IoT, financial data, sensor data, log analytics
  • Tradeoffs — Fast ingestion and time-range queries, but limited general-purpose query capabilities. TimescaleDB wins when you need Postgres compatibility. ClickHouse wins for analytical queries. Prometheus is purpose-built for monitoring, not general time-series storage.

Search Engines (as databases)

  • Elasticsearch — Source-available (SSPL/Elastic License, AGPL option since 2024), full-text search, inverted indexes, aggregations, ELK stack, distributed
  • OpenSearch — Open source (Apache 2.0), AWS fork of Elasticsearch
  • Solr — Open source (Apache), Lucene-based, mature, strong for faceted search
  • Typesense / Meilisearch — Open source, modern developer-friendly alternatives, simpler to operate than Elasticsearch
  • Use Cases — Full-text search, log analysis, product search, autocomplete, observability
  • Tradeoffs — Best for text search and analytics, not a primary data store. Elasticsearch is powerful but operationally heavy (JVM tuning, cluster management). Typesense/Meilisearch are simpler but less featureful at scale.

Database Selection Guide

Key Decision Factors

  • Data Model — Relational (structured, joins, transactions) vs document (flexible, nested) vs graph (relationships) vs time-series (temporal) vs key-value (simple lookups)
  • Read vs Write Ratio — Read-heavy → Postgres + read replicas, caching. Write-heavy → Cassandra, DynamoDB, ClickHouse.
  • Consistency Requirements — Strong consistency (financial, inventory) → Postgres, CockroachDB. Eventual consistency acceptable (social feeds, analytics) → Cassandra, DynamoDB, MongoDB.
  • Scale — Vertical scaling sufficient (<10TB, <100K QPS) → Postgres/MySQL. Horizontal scaling needed → DynamoDB, Cassandra, CockroachDB, Vitess (MySQL sharding).
  • Query Complexity — Complex joins, aggregations, ad-hoc queries → Relational. Known access patterns, denormalized → NoSQL. Full-text search → Elasticsearch/Typesense.
  • Latency Requirements — Sub-millisecond → Redis/Memcached (in-memory). Single-digit ms → DynamoDB. Low ms → Postgres with proper indexing.
  • Operational Complexity — Team size and expertise. Managed services (RDS, DynamoDB, Atlas) reduce burden. Self-hosted gives control but requires expertise.
  • Cost — Open source: free but self-hosting costs. Managed: pay for convenience. Commercial licenses (Oracle, SQL Server): significant at scale. Serverless (DynamoDB, Firestore): pay-per-request, can be expensive at high throughput.
  • Ecosystem & Tooling — ORMs, migration tools, monitoring, backup solutions, community support, hiring pool.
  • Vendor Lock-In — Postgres, MySQL, Redis are portable. DynamoDB, Firestore, Cosmos DB are cloud-specific. Consider multi-cloud requirements.

Common Combinations (Polyglot Persistence)

  • Web App — Postgres (primary) + Redis (cache/sessions) + Elasticsearch (search)
  • IoT Platform — TimescaleDB or InfluxDB (time-series) + Postgres (metadata) + Kafka (ingestion)
  • Social Platform — Postgres or MySQL (users, posts) + Redis (feeds, caching) + Neo4j or Postgres (social graph) + Elasticsearch (search)
  • E-Commerce — Postgres (orders, inventory) + Redis (cart, sessions) + Elasticsearch (product search) + ClickHouse (analytics)
  • Analytics Platform — ClickHouse or BigQuery (OLAP) + Postgres (metadata) + Kafka (ingestion) + S3 (data lake)

Sharding and Partitioning

Partitioning splits data within a single database instance. Sharding distributes data across multiple database instances. Both reduce per-node data size; sharding also distributes load. See Distributed Systems for the theoretical foundations (CAP, consistency models, consensus).

Sharding Strategies

  • Range-Based Sharding — Assign key ranges to shards (e.g., users A-M → shard 1, N-Z → shard 2). Simple to understand. Risk: hotspots if access is uneven (e.g., new users cluster in one range).
  • Hash-Based Sharding — Hash the shard key, mod by shard count. Even distribution. Downside: range queries must hit all shards (scatter-gather).
  • Consistent Hashing — Minimizes data movement when adding/removing shards. Used by DynamoDB, Cassandra, Riak. Virtual nodes improve balance. (Sketch after this list.)
  • Directory-Based Sharding — A lookup table maps keys to shards. Maximum flexibility but the directory is a single point of failure and bottleneck.
  • Geographic Sharding — Shard by region to keep data close to users. Reduces latency, aids data residency compliance (GDPR). Complicates global queries.
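
A minimal consistent-hash ring with virtual nodes; shard names are placeholders, and real systems layer on replication, weighting, and stronger hash mixing:

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, shards, vnodes=100):
        # Each shard appears vnodes times on the ring; virtual nodes smooth
        # out the uneven arcs a handful of physical nodes would produce.
        self.ring = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in shards for i in range(vnodes))

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def shard_for(self, key):
        # First virtual node clockwise from the key's position (wrapping).
        i = bisect.bisect(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["shard-a", "shard-b", "shard-c"])
print(ring.shard_for("user:42"))
# Adding a shard remaps only the keys landing in its virtual-node ranges;
# naive hash-mod sharding would move nearly every key.
```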

Choosing a Shard Key

  • High Cardinality — Key should have many distinct values to distribute evenly (user_id good, country bad)
  • Even Distribution — Avoid keys that create hotspots (e.g., timestamp as shard key → all writes hit latest shard)
  • Query Alignment — Most queries should include the shard key to avoid scatter-gather across all shards
  • Immutability — Shard key should rarely change; changing it means moving data between shards
  • Compound Shard Keys — Combine fields (e.g., tenant_id + user_id) for multi-tenant systems

Sharding Challenges

  • Cross-Shard Queries — Joins across shards are expensive or impossible; requires scatter-gather or denormalization
  • Cross-Shard Transactions — ACID across shards requires 2PC or saga pattern; most sharded systems avoid this (see Distributed Systems)
  • Rebalancing — Adding shards requires data migration; consistent hashing minimizes movement, but it’s still operationally complex
  • Hotspots — Uneven access patterns concentrate load on specific shards; monitor and split hot shards
  • Operational Complexity — Schema changes across shards, backup coordination, monitoring per-shard health
  • Referential Integrity — Foreign keys can’t span shards; enforce at application level or colocate related data

Sharding Tools & Approaches

  • Application-Level Sharding — Application code decides which shard to query. Full control but couples business logic to data distribution.
  • Vitess — MySQL sharding middleware (created at YouTube/Google). Transparent sharding, connection pooling, online schema changes. Powers YouTube, Slack, Square.
  • Citus — Postgres extension for distributed tables. Colocated joins, distributed queries, reference tables. Also available managed as Azure Cosmos DB for PostgreSQL.
  • ProxySQL / MaxScale — MySQL proxies that can route queries to shards based on rules
  • Native Distributed Databases — CockroachDB, YugabyteDB, Cloud Spanner, TiDB — handle sharding internally with automatic rebalancing. Trade: simpler operations but less control and some latency overhead.

Sharding vs Alternatives

  • Vertical Scaling First — Modern hardware (64+ cores, TBs of RAM, NVMe SSDs) handles more than most assume. A single large Postgres instance can serve billions of rows. Don’t shard until you must.
  • Read Replicas — If reads are the bottleneck, replicas may be enough without sharding (see Replication below)
  • Partitioning — If data is large but queries are naturally scoped (by time, tenant), table partitioning within one instance may suffice
  • Caching Layer — If read latency is the issue, a cache (Redis, Memcached) in front of the DB may eliminate the need to shard (see Caching)
  • Archive Old Data — If table size is the problem but most queries hit recent data, archive old rows to cold storage

Replication (for Data Management)

  • Read Replicas — Copies of the primary that serve reads. Offload read traffic. Replication lag means stale reads.
  • Synchronous vs Asynchronous Replication — Sync: no data loss on failover, higher write latency. Async: low latency, but risk of data loss on failover.
  • Streaming Replication (Postgres) — WAL-based, continuous, configurable sync/async
  • MySQL Replication — Binary log (binlog) replication, GTID-based for consistency
  • Multi-Primary / Multi-Master — Multiple writable nodes. Requires conflict resolution (last-writer-wins, CRDTs, application-level). Used in Galera Cluster, Aurora Multi-Master.
  • Logical Replication — Replicate specific tables/rows, cross-version, cross-platform (Postgres logical replication, Debezium CDC)

Data Modeling

Conceptual Modeling

  • ER Diagrams — Entities, relationships, cardinality (1:1, 1:N, N:M)
  • Conceptual vs Logical vs Physical Models — Abstraction levels

Relational Modeling

  • Normalization — Eliminating redundancy (see Relational Databases above)
  • Denormalization — Pre-joining for read performance
  • Schema Design Patterns — Star schema, snowflake schema (for data warehousing)
  • Migrations — Schema evolution, forward-only, reversible, zero-downtime migrations
  • Migration Tools — Flyway, Liquibase, Alembic, ActiveRecord Migrations, Prisma

NoSQL Modeling

  • Document Modeling — Embedding vs referencing, polymorphic patterns
  • Access Pattern-Driven Design — Design schema around queries, not entities
  • Single Table Design — DynamoDB pattern, overloaded GSIs
  • Event Sourcing — Store events as the source of truth, derive current state (see Architectural Patterns)

Caching

Caching Strategies

  • Cache-Aside (Lazy Loading) — App checks cache, fetches from DB on miss, writes to cache (sketch after this list)
  • Write-Through — Write to cache and DB simultaneously
  • Write-Behind (Write-Back) — Write to cache, async write to DB
  • Read-Through — Cache itself fetches from DB on miss
  • Refresh-Ahead — Proactively refresh before expiry
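
Cache-aside in miniature: a dict with TTLs stands in for Redis, and db_fetch_user is a hypothetical placeholder for the real database query:

```python
import time

cache = {}  # key -> (value, expires_at)
TTL_SECONDS = 60

def db_fetch_user(user_id):  # placeholder for a real DB query
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    key = f"user:{user_id}"
    hit = cache.get(key)
    if hit and hit[1] > time.monotonic():
        return hit[0]                       # cache hit
    value = db_fetch_user(user_id)          # miss: go to the database
    cache[key] = (value, time.monotonic() + TTL_SECONDS)
    return value

def update_user(user_id, value):
    # write to the DB first, then invalidate (or overwrite) the cache entry
    cache.pop(f"user:{user_id}", None)

print(get_user(1))  # miss -> DB
print(get_user(1))  # hit
```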

Cache Invalidation

  • TTL (Time to Live) — Expiry-based, simple but stale data risk
  • Event-Based Invalidation — Invalidate on write events
  • Cache Stampede / Thundering Herd — Locking, request coalescing, probabilistic early expiry (sketch after this list)
  • The Two Hard Problems — “Cache invalidation and naming things” (Phil Karlton)
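
A sketch of request coalescing against stampedes using per-key locks (stdlib threading; a distributed setup would use a distributed lock or probabilistic early expiry instead):

```python
import threading

cache = {}
locks = {}
locks_guard = threading.Lock()

def get_or_compute(key, compute):
    if key in cache:
        return cache[key]
    with locks_guard:                 # one lock object per key
        lock = locks.setdefault(key, threading.Lock())
    with lock:
        if key in cache:              # another thread filled it while we waited
            return cache[key]
        cache[key] = compute()        # exactly one caller hits the backend
        return cache[key]

print(get_or_compute("user:1", lambda: "expensive result"))
```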

Caching Layers

  • Application-Level — In-memory (Guava, Caffeine), local process cache
  • Distributed Cache — Redis, Memcached, Hazelcast
  • CDN Caching — Static assets, edge caching, cache-control headers (see Networking)
  • Database Query Cache — Query result caching, materialized views
  • HTTP Caching — ETag, Last-Modified, Cache-Control, conditional requests
  • Browser Caching — Service workers, localStorage (frontend-specific)

Data Pipelines

ETL vs ELT

  • ETL (Extract, Transform, Load) — Transform before loading, traditional data warehousing
  • ELT (Extract, Load, Transform) — Load raw data, transform in-place (modern approach with cheap storage)

Batch Processing

  • Apache Hadoop — MapReduce, HDFS, YARN
  • Apache Spark — In-memory processing, RDDs, DataFrames, SparkSQL
  • dbt (data build tool) — SQL-first transforms, version-controlled analytics
  • Airflow — DAG-based workflow orchestration, scheduling

Stream Processing

  • Apache Kafka — Distributed log, topics, partitions, consumer groups, exactly-once semantics
  • Apache Flink — Stateful stream processing, event time, watermarks
  • Apache Pulsar — Multi-tenancy, tiered storage
  • Kafka Streams / ksqlDB — Stream processing as a library / SQL on streams
  • Event-Driven Architecture — Events as first-class citizens, CQRS, event sourcing (see Architectural Patterns)

Data Storage Tiers

  • Data Lake — Raw, unstructured storage (S3, ADLS, GCS)
  • Data Lakehouse — Combines lake + warehouse (Delta Lake, Apache Iceberg, Apache Hudi)
  • Data Warehouse — Structured, optimized for analytics (Snowflake, BigQuery, Redshift, Databricks)
  • Data Mart — Subset of warehouse for specific business domain

Data Governance & Quality

  • Data Catalogs — Metadata management, discovery (Amundsen, DataHub, Atlan)
  • Data Lineage — Tracking data flow from source to consumption
  • Data Quality — Validation, profiling, Great Expectations, dbt tests
  • Data Contracts — Schema agreements between producers and consumers
  • PII & Compliance — GDPR, CCPA, data masking, anonymization (see Infrastructure Security)

Change Data Capture (CDC)

  • Log-Based CDC — Read database transaction logs (Debezium, Maxwell)
  • Trigger-Based CDC — Database triggers on changes
  • Polling-Based CDC — Periodic queries for changes (sketch after this list)
  • Use Cases — Real-time sync, cache invalidation, event sourcing
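
Polling-based CDC in miniature, watching a hypothetical products table via a high-water mark; it misses deletes and adds query load, which is why log-based CDC is preferred at scale:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)")
con.execute("INSERT INTO products VALUES (1, 'widget', 100), (2, 'gadget', 105)")

last_seen = 0

def poll_changes():
    global last_seen
    rows = con.execute(
        "SELECT id, name, updated_at FROM products WHERE updated_at > ? "
        "ORDER BY updated_at", (last_seen,)).fetchall()
    if rows:
        last_seen = rows[-1][2]   # advance the high-water mark
    return rows

print(poll_changes())  # both rows on the first poll
print(poll_changes())  # [] until something changes again
```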

Database Operations

Connection Management

  • Connection Pooling — Reuse database connections to avoid per-request overhead; PgBouncer (Postgres, transaction/session pooling), HikariCP (JVM, fastest JDBC pool), pgpool-II (Postgres, also does load balancing)
  • Connection Limits — max_connections (Postgres default 100), each connection consumes memory (~10MB), more connections ≠ more throughput
  • Idle Timeout — Reclaim unused connections, idle_in_transaction_session_timeout, connection lifetime limits
  • Pool Sizing — Rule of thumb: connections ≈ (core_count × 2) + disk_spindles (e.g., 8 cores + 1 SSD → ~17); too many connections cause contention and context switching

Query Analysis

  • EXPLAIN / EXPLAIN ANALYZE — EXPLAIN shows planned execution; EXPLAIN ANALYZE actually executes and shows real timing, rows, loops
  • Query Plans — Seq Scan, Index Scan, Index Only Scan, Bitmap Scan, Nested Loop, Hash Join, Merge Join, Sort, Aggregate
  • Sequential vs Index Scans — Planner chooses based on selectivity; low selectivity → seq scan can be faster than index scan
  • Cost Estimation — startup_cost..total_cost, rows estimate, width; based on table statistics (pg_statistic)
  • Slow Query Logs — log_min_duration_statement (Postgres), slow_query_log (MySQL), identifying expensive queries in production

Maintenance

  • VACUUM / ANALYZE (Postgres) — VACUUM reclaims dead tuples from MVCC, ANALYZE updates planner statistics, autovacuum handles this automatically but may need tuning
  • Compaction (LSM-Tree Databases) — Merging SSTables, leveled vs size-tiered compaction, write amplification (Cassandra, RocksDB)
  • Index Rebuilding — REINDEX (Postgres), index fragmentation, CONCURRENTLY option to avoid locking
  • Table Bloat — Dead rows consuming space, pg_repack for online table compaction, monitoring bloat with pgstattuple
  • Statistics Collection — ANALYZE frequency, default_statistics_target, extended statistics for correlated columns

Replication Operations

  • Replication Lag Monitoring — pg_stat_replication (Postgres), SHOW REPLICA STATUS (MySQL 8.0.22+; formerly SHOW SLAVE STATUS), alerting on lag thresholds
  • Promoting Replicas — pg_promote() (Postgres), failover procedures, split-brain prevention, fencing
  • Failover Procedures — Automated (Patroni, repmgr, Orchestrator) vs manual, DNS updates, connection string updates
  • Logical vs Physical Replication — Physical: byte-for-byte WAL shipping, replicates the entire instance (all databases). Logical: table-level, cross-version, selective, allows writes on the replica.

Backup & Recovery

  • pg_dump / pg_restore — Logical backups, custom format (-Fc), parallel restore (-j), schema-only (--schema-only), selective table backup
  • Point-in-Time Recovery (PITR) — Continuous WAL archiving + base backup, restore to any point in time, recovery_target_time
  • WAL Archiving — archive_command, archive_mode, WAL-G (modern backup tool), pgBackRest (parallel backup/restore, incremental)
  • Snapshot-Based Backups — Storage-level snapshots (EBS, LVM), fast but requires filesystem quiesce or pg_start_backup
  • Backup Testing — Regularly restore backups to verify, automate restore testing, document RTO/RPO (see Disaster Recovery & Business Continuity)

Monitoring

  • Key Metrics — Active connections, queries/sec, cache hit ratio (should be >99%), replication lag, disk usage, transaction rate, lock waits
  • pg_stat_statements — Track query statistics (calls, total_time, mean_time, rows), identify top queries by time/calls
  • Slow Query Analysis — Identifying expensive queries, correlating with application behavior, query fingerprinting
  • Tools — pgMonitor, Datadog Postgres integration, CloudWatch (RDS), pgAdmin, pg_stat_activity for active queries

Schema Migrations in Production

  • Online DDL — Adding columns (nullable or with default), creating indexes CONCURRENTLY, ALTER TABLE challenges with large tables
  • Zero-Downtime Migrations — Backward-compatible changes only, no column renames/removes while old code runs
  • Expand-Contract Pattern — Phase 1: add new column/table (expand), Phase 2: backfill + dual-write, Phase 3: switch reads, Phase 4: remove old (contract) (sketch after this list)
  • Backward-Compatible Changes — Add columns (nullable), add tables, add indexes; avoid: remove columns, rename columns, add NOT NULL without default
  • Migration Tools — Flyway (Java, SQL-based, versioned), Alembic (Python/SQLAlchemy, autogenerate), golang-migrate, Prisma Migrate, Liquibase (XML/YAML/SQL, rollback support)
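
The expand-contract phases sketched as ordered, Postgres-flavored statements for a hypothetical fullname-to-display_name rename; each phase ships alongside the matching application release:

```python
# Expand-contract for renaming users.fullname -> users.display_name without
# downtime. Table and column names are hypothetical.
PHASES = {
    "1_expand": "ALTER TABLE users ADD COLUMN display_name text;",  # old code unaffected
    "2_backfill": "UPDATE users SET display_name = fullname WHERE display_name IS NULL;",
    # the application now dual-writes both columns while still reading the old one
    "3_switch_reads": "-- deploy the app version that reads display_name",
    "4_contract": "ALTER TABLE users DROP COLUMN fullname;",  # only after no reader remains
}
for phase, sql in PHASES.items():
    print(phase, "->", sql)
```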

Search Systems

Inverted Index

  • Tokenization — Breaking text into tokens (words), analyzers, character filters, token filters
  • Posting Lists — For each term, a list of documents containing it, with positions and frequencies
  • Term Frequency (TF) — How often a term appears in a document; higher TF → more relevant
  • Document Frequency (DF) — How many documents contain the term; higher DF → less discriminating (common words)
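
A toy inverted index with positional posting lists; real engines add analyzers, relevance scoring, and compressed on-disk postings:

```python
from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
}

# term -> {doc_id: [positions]}
index = defaultdict(lambda: defaultdict(list))
for doc_id, text in docs.items():
    for pos, token in enumerate(text.lower().split()):
        index[token][doc_id].append(pos)

print(dict(index["brown"]))  # {1: [2], 2: [2]}

def search(term):
    return sorted(index[term.lower()].keys())

print(search("brown"))  # [1, 2]
```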

Text Analysis

  • Analyzers — Pipeline of character filters → tokenizer → token filters; standard, simple, whitespace, language-specific, custom
  • Tokenizers — Standard (word boundaries), keyword (entire input as one token), pattern (regex-based), path hierarchy
  • Filters — Lowercase, stop words, stemming (algorithmic: Porter, Snowball), lemmatization (dictionary-based), synonyms, n-grams, edge n-grams
  • Stop Words — Common words (the, is, at) excluded from indexing to reduce index size and improve relevance
  • Synonyms — Synonym expansion at index or query time, synonym graphs, trade-off: flexibility vs index size

Ranking & Relevance

  • TF-IDF — Term Frequency × Inverse Document Frequency; classic relevance scoring, weights rare terms higher
  • BM25 — Improved TF-IDF, Okapi BM25, saturation of term frequency, document length normalization; default in Elasticsearch/Lucene (sketch after this list)
  • Vector Similarity — Cosine similarity, dot product, Euclidean distance; used for semantic search with embeddings
  • Learning to Rank — Use ML to learn ranking functions from user behavior (clicks, conversions), feature engineering, RankNet, LambdaMART
  • Boosting — Increase/decrease relevance of specific fields or query clauses, function scoring, decay functions (recency, proximity)
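
A compact BM25 scorer (the Okapi formulation with Lucene-style IDF) over pre-tokenized documents; a real engine precomputes these statistics in the index instead of scanning the corpus per query:

```python
import math

docs = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "quick quick fox".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
k1, b = 1.2, 0.75  # common defaults

def idf(term):
    df = sum(term in d for d in docs)   # document frequency
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

def bm25(query, doc):
    score = 0.0
    for term in query:
        tf = doc.count(term)
        if tf == 0:
            continue
        norm = k1 * (1 - b + b * len(doc) / avgdl)  # length normalization
        score += idf(term) * tf * (k1 + 1) / (tf + norm)
    return score

for i, d in enumerate(docs):
    print(i, round(bm25(["quick", "fox"], d), 3))
# Term-frequency saturation: the repeated "quick" in doc 2 helps, but sublinearly.
```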

Search Architecture

  • Sharding — Distribute index across nodes for scale, routing queries to relevant shards
  • Replication — Copies of each shard for fault tolerance and read throughput
  • Near-Real-Time Indexing — Refresh interval (default 1s in Elasticsearch), trade-off between indexing latency and search visibility
  • Segment Merging — Lucene segments, background merging, force merge for read-heavy indexes, write amplification

Query Features

  • Full-Text Search — Match queries, multi-match, phrase matching, slop (proximity), fuzziness
  • Faceted Search — Aggregations for filtering (category counts, price ranges, ratings), post-filter for facet accuracy
  • Autocomplete / Typeahead — Edge n-grams, completion suggesters, search-as-you-type fields, debouncing
  • Fuzzy Matching — Edit distance (Levenshtein), configurable fuzziness, phonetic matching (Soundex, Metaphone)
  • Highlighting — Return matching fragments with highlighted terms, configurable pre/post tags
  • Aggregations — Metric (avg, sum, min, max), bucket (terms, histogram, date_histogram, range), pipeline aggregations

Tools

  • Elasticsearch / OpenSearch — Distributed, Lucene-based, REST API, rich query DSL, aggregations, ELK/EFK stack, Kibana/OpenSearch Dashboards
  • Solr — Apache Lucene-based, mature, SolrCloud for distributed mode, strong faceted search
  • Meilisearch — Rust-based, typo-tolerant, instant search, simple API, good developer experience, limited aggregations
  • Typesense — C++-based, typo-tolerant, easy to operate, geo search, good for small-medium datasets
  • PostgreSQL Full-Text Search — tsvector/tsquery, GIN indexes, ts_rank, good enough for many applications without a separate search service
  • Embedding-Based Retrieval — Convert text/images to dense vectors, search by similarity rather than keyword matching (see Natural Language Processing)
  • ANN Algorithms — HNSW (Hierarchical Navigable Small World graphs, high recall), IVF (inverted file index, faster at scale), PQ (product quantization, compression)
  • pgvector — PostgreSQL extension for vector similarity search, ivfflat and hnsw indexes, integrates with existing Postgres queries
  • Dedicated Vector Databases — Pinecone (managed, serverless), Weaviate (open source, hybrid search), Qdrant (Rust-based, filtering), Milvus (open source, GPU-accelerated), Chroma (lightweight, Python-native)

Data Serialization Formats

Text Formats

  • JSON — Schema-less, human-readable, verbose, ubiquitous (APIs, config, NoSQL), no comments (JSONC/JSON5 add them), limited types (no dates, binary; sketch after this list)
  • XML — Schema support (XSD), namespaces, XSLT transforms, verbose, SOAP/legacy APIs, still used in enterprise/config (Maven pom.xml, Android layouts)
  • YAML — Superset of JSON, indentation-based, anchors/aliases, multi-document, used in config (Kubernetes, CI/CD, Ansible), gotchas (Norway problem, implicit typing)
  • TOML — Minimal, unambiguous, explicit types, good for config files (Cargo.toml, pyproject.toml), less suitable for deeply nested data
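
A stdlib-only illustration of JSON's type limits: datetimes are not serializable without an explicit encoding choice, and the type information is lost on the round trip:

```python
import json
from datetime import datetime, timezone

record = {"id": 1, "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc)}

try:
    json.dumps(record)
except TypeError as e:
    print(e)  # Object of type datetime is not JSON serializable

# Common workaround: encode as ISO-8601 strings and re-parse on read;
# the schema then lives in application code, not in the format.
encoded = json.dumps(record, default=lambda o: o.isoformat())
decoded = json.loads(encoded)
print(type(decoded["created_at"]))  # <class 'str'>: the type info was lost
```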

Binary Formats

  • Protocol Buffers (Protobuf) — Schema-defined (.proto files), compact binary encoding, backward/forward compatible, code generation, gRPC default (see API Design)
  • Avro — Schema evolution (reader/writer schemas), self-describing (schema embedded), the de facto serialization for Kafka with Schema Registry, good for data pipelines
  • MessagePack — Binary JSON, compact, schema-less, faster than JSON, cross-language, used in Redis/Fluentd
  • CBOR — Concise Binary Object Representation, IETF standard (RFC 8949), self-describing, designed for constrained environments (IoT)
  • FlatBuffers — Zero-copy deserialization (access data without unpacking), Google, used in games and performance-critical paths
  • Cap’n Proto — Zero-copy, no encoding/decoding step, mmap-friendly, created by original Protobuf author

Columnar Formats

  • Parquet — Column-oriented, efficient compression (per-column encoding), predicate pushdown, row groups, Apache ecosystem (Spark, Hive, Presto), de facto standard for analytics
  • ORC — Optimized Row Columnar, Hive-optimized, ACID support, similar to Parquet, stronger in Hive ecosystem
  • Arrow — In-memory columnar format, zero-copy IPC (inter-process communication), language-agnostic, standard for data interchange between systems (Pandas, Spark, DuckDB)

Schema Management

  • Schema Registries — Confluent Schema Registry (Kafka), store and version schemas, enforce compatibility at produce time
  • Compatibility Modes — Backward (new schema reads old data), forward (old schema reads new data), full (both), none
  • Schema Evolution Strategies — Add optional fields, never remove required fields, use field IDs not names (Protobuf), union types (Avro)

Choosing a Format

  • Human-Readability vs Performance — JSON/YAML for config and debugging, Protobuf/Avro for production data, Parquet for analytics
  • Schema Enforcement — Protobuf/Avro enforce schemas and catch errors early; JSON/YAML are flexible but error-prone
  • Language Support — JSON universal, Protobuf/Avro have code generation for most languages, Parquet/Arrow have strong Python/JVM support
  • Use Case Mapping — API payloads (JSON, Protobuf), config files (YAML, TOML), event streaming (Avro, Protobuf), analytics storage (Parquet), in-memory processing (Arrow)

data databases caching pipelines database-selection sharding partitioning replication search serialization database-operations