pgvector-tutorial

A hands-on learning sandbox for PostgreSQL vector database operations, designed as a bridge between theoretical knowledge and production implementation for the Homer AI agent (cori.agent).

Purpose

This project is a practical tutorial and experimentation environment for understanding PostgreSQL vector database capabilities before integrating vector search into Homer AI's production architecture. It addresses the critical gap between studying Amazon Bedrock + AWS Amplify AI Kit documentation and actually building a working RAG (Retrieval Augmented Generation) system.

The tutorial serves three strategic goals:

  1. Hands-on Learning - Provides executable R code examples that demonstrate vector embeddings, semantic search, and RAG patterns in a safe sandbox environment, separate from production systems.

  2. Cost-Effective Architecture Validation - Tests the PostgreSQL pgvector approach (estimated $15-20/month via AWS Aurora Serverless) as a more economical alternative to AWS OpenSearch ($350/month), validating the technology choice before production deployment.

  3. Homer AI Integration Preparation - Uses Homer's actual domain (rural county data, RIN communities, economic metrics) to develop RAG patterns that will directly transfer to the production Rural Insights Hub (RIH) platform on AWS Bedrock.

This tutorial implements a "learn by doing, then deploy" philosophy: experiment with real data and realistic queries in development, then apply proven patterns to production infrastructure.

Current Scope

What This Tutorial Covers

The tutorial provides comprehensive coverage across three main areas:

1. Vector Database Fundamentals

  • Understanding embeddings as semantic representations of text
  • PostgreSQL pgvector extension setup and configuration
  • Vector similarity search using cosine distance
  • Database schema design for vector storage with metadata
  • Vector indexing strategies (IVFFlat vs HNSW) for performance optimization
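The fundamentals above can be sketched as a small schema plus a cosine-distance query, issued from R via DBI. The table and column names (`documents`, `content`, `metadata`, `embedding`), the 768-dimension size, and the connection details are illustrative assumptions; the tutorial's actual schema lives in vector_database_tutorial.qmd.

```r
library(DBI)

# Placeholder connection; adjust dbname/host/credentials for your setup.
con <- dbConnect(RPostgreSQL::PostgreSQL(), dbname = "tutorial")

# Hypothetical schema: text chunks, JSON metadata, and a 768-dim vector.
dbExecute(con, "CREATE EXTENSION IF NOT EXISTS vector")
dbExecute(con, "
  CREATE TABLE IF NOT EXISTS documents (
    id        SERIAL PRIMARY KEY,
    content   TEXT,
    metadata  JSONB,
    embedding VECTOR(768)
  )")

# HNSW index on cosine distance (pgvector 0.5+); IVFFlat is the
# alternative for larger, more static collections.
dbExecute(con, "
  CREATE INDEX IF NOT EXISTS documents_embedding_idx
  ON documents USING hnsw (embedding vector_cosine_ops)")

# Top-5 nearest neighbours: <=> is pgvector's cosine-distance operator,
# and query_vec is a numeric vector produced by the embedding model.
vec_literal <- paste0("'[", paste(query_vec, collapse = ","), "]'")
res <- dbGetQuery(con, paste0(
  "SELECT id, content, embedding <=> ", vec_literal, "::vector AS distance
   FROM documents ORDER BY distance LIMIT 5"))
```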

2. Complete RAG Pipeline Implementation

Document Processing:

  • Text extraction from multiple formats (Markdown, TXT, CSV, PDF)
  • Intelligent chunking with configurable overlap for context preservation
  • Chunk size optimization for different content types (technical docs vs narrative)
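A character-based chunker with overlap can be sketched in a few lines of R. The `chunk_size` and `overlap` defaults here are illustrative, and the tutorial's `chunk_text()` may differ in detail:

```r
# Split text into fixed-size character chunks, each sharing `overlap`
# characters with its predecessor so sentences cut at a chunk boundary
# keep some surrounding context.
chunk_text <- function(text, chunk_size = 1000, overlap = 200) {
  stopifnot(overlap < chunk_size)
  starts <- seq(1, max(1, nchar(text)), by = chunk_size - overlap)
  chunks <- vapply(starts, function(s) substr(text, s, s + chunk_size - 1),
                   character(1))
  chunks[nzchar(chunks)]
}

# Smaller chunks suit dense technical docs; larger ones suit narrative.
length(chunk_text(strrep("a", 2500)))  # 4 overlapping chunks
```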

Embedding Generation:

  • Google AI text-embedding-004 integration (768-dimensional vectors)
  • Batch processing with rate limiting and error handling
  • Fallback mechanisms for testing without API keys
  • Alternative approaches: local models and Vertex AI
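Batch processing with rate limiting and the key-less fallback might look like the sketch below. The endpoint URL and request/response shapes follow the Google AI REST documentation for text-embedding-004, but treat them as assumptions that may lag the current API; the tutorial's implementation may differ.

```r
library(httr)

# Sketch: embed a vector of texts one at a time with a crude rate limit,
# falling back to random vectors when no API key is set (fine for
# pipeline testing, meaningless for actual semantic search).
embed_texts <- function(texts, dims = 768, pause = 0.2) {
  key <- Sys.getenv("GOOGLE_API_KEY")
  lapply(texts, function(txt) {
    if (!nzchar(key)) {
      return(rnorm(dims))  # fallback path
    }
    Sys.sleep(pause)  # simple rate limiting between requests
    resp <- POST(
      paste0("https://generativelanguage.googleapis.com/v1beta/",
             "models/text-embedding-004:embedContent?key=", key),
      body = list(content = list(parts = list(list(text = txt)))),
      encode = "json")
    stop_for_status(resp)
    content(resp)$embedding$values
  })
}
```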

Semantic Search:

  • Basic similarity search with configurable thresholds
  • Advanced hybrid search (vector similarity + metadata filters)
  • Query pattern matching for structured data translation
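Hybrid search combines a structured metadata filter with vector ranking in a single statement. A sketch, assuming a `documents` table with a JSONB `metadata` column and a pgvector `embedding` column (names are illustrative, as is the 0.7 threshold):

```r
# Hybrid search sketch: filter on metadata first, then rank survivors by
# cosine similarity, keeping only matches above a threshold. query_vec
# is a numeric embedding of the user's query; con is a DBI connection.
vec_literal <- paste0("'[", paste(query_vec, collapse = ","), "]'")
sql <- paste0("
  SELECT content,
         1 - (embedding <=> ", vec_literal, "::vector) AS similarity
  FROM documents
  WHERE metadata->>'state' = 'OR'
    AND 1 - (embedding <=> ", vec_literal, "::vector) > 0.7
  ORDER BY similarity DESC
  LIMIT 5")
res <- DBI::dbGetQuery(con, sql)
```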

RAG with Claude:

  • Retrieval-augmented generation pipeline
  • Context assembly from semantic search results
  • LLM prompt enhancement with retrieved documents
  • Complete question-answering workflow
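The retrieval-augmented step itself is mostly string assembly: stitch the retrieved chunks into a context block, then ask Claude to answer from that context. The sketch below follows the Anthropic Messages API; the model name and message shape are assumptions that may differ from the tutorial's code.

```r
library(httr)

# Sketch: build a context-grounded prompt from retrieved chunks and send
# it to Claude. Requires ANTHROPIC_API_KEY in the environment.
answer_with_context <- function(question, chunks) {
  context <- paste(chunks, collapse = "\n\n---\n\n")
  prompt <- paste0(
    "Answer the question using only the context below.\n\n",
    "Context:\n", context, "\n\nQuestion: ", question)
  resp <- POST(
    "https://api.anthropic.com/v1/messages",
    add_headers(`x-api-key` = Sys.getenv("ANTHROPIC_API_KEY"),
                `anthropic-version` = "2023-06-01"),
    body = list(model = "claude-3-5-sonnet-latest",
                max_tokens = 1024,
                messages = list(list(role = "user", content = prompt))),
    encode = "json")
  stop_for_status(resp)
  content(resp)$content[[1]]$text
}
```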

3. Homer-Specific Use Cases

Structured Data Integration:

  • County-level economic data (188 counties in rin_service_areas.parquet)
  • Four query patterns: county FIPS lookup, RIN community name lookup (fuzzy), state filtering, ranking/sorting
  • DuckDB for efficient Parquet file querying with case-insensitive partial matching
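Querying the Parquet file from R with DuckDB needs no import step. A sketch of the fuzzy community lookup, with column names taken from the examples in this README:

```r
library(duckdb)

# Case-insensitive partial match against the Parquet file; ILIKE is
# DuckDB's case-insensitive LIKE. `name` is assumed to be
# article-stripped already ("The Dalles" -> "Dalles").
con <- dbConnect(duckdb())
name <- "Dalles"
res <- dbGetQuery(con, sprintf("
  SELECT * FROM read_parquet('data/rin_service_areas.parquet')
  WHERE rin_community ILIKE '%%%s%%'", name))
dbDisconnect(con, shutdown = TRUE)
```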

Knowledge Base Documents:

  • Geographic descriptions (FIPS codes, RIN communities, boundaries) - Updated with fuzzy matching guidelines
  • Query pattern training data for natural language → SQL translation (now includes a 4th pattern)
  • Domain knowledge vs procedural knowledge separation

Query Pattern Matching (Example 2): Demonstrates how vector search enables flexible query routing with fuzzy matching:

User asks: "What is the location of The Dalles?"
Preprocessing: "The Dalles" → "Dalles" (article stripping)
Vector search matches to pattern: "RIN community lookup"
System executes: SELECT * FROM rin_service_areas WHERE rin_community ILIKE '%Dalles%'
Result: The Dalles (FIPS 41065), Wasco County, Lat 45.5415, Lon -121.2041

This demonstrates the critical fuzzy matching capability: handling natural language variations ("The Dalles", "the dalles", "Dalles") through article stripping and case-insensitive partial matching. This is essential for Homer to translate natural language into precise database queries.
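The article-stripping step can be sketched as a one-line helper. This is a hypothetical version of `strip_leading_articles()`; the tutorial's actual implementation may differ:

```r
# Drop a leading "the"/"a"/"an" (any case) so "The Dalles" and
# "the dalles" both normalize to a matchable community name.
strip_leading_articles <- function(x) {
  sub("(?i)^(the|a|an)\\s+", "", x, perl = TRUE)
}

strip_leading_articles("The Dalles")  # "Dalles"
strip_leading_articles("the dalles")  # "dalles"
```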

Tutorial Structure

The main tutorial document is vector_database_tutorial.qmd, organized as:

  1. Introduction - Why vector databases, key concepts, use cases
  2. Prerequisites - R packages, data files, API keys
  3. Workflow Overview - Document ingestion pipeline visualization
  4. Schema Design - PostgreSQL table structures for vectors + metadata
  5. Document Processing - Extract, chunk, embed pipeline
  6. Semantic Search - Basic and advanced similarity queries
  7. Complete Examples - Three end-to-end workflows:
    • Example 1: Document ingestion with Claude analysis
    • Example 2: Query pattern matching (demonstrates the promise made in the intro)
    • Example 3: Incremental updates

Data Assets

  • data/rin_service_areas.parquet (7.6 MB) - County-level data for 188 rural counties
  • knowledge_base/docs/geographic_descriptions.md - Unstructured documentation for semantic search, includes fuzzy matching guidelines for RIN communities
  • knowledge_base/patterns/query_patterns.json - Training data for 4 query patterns including RIN community name lookup with preprocessing instructions

Validation Tools

  • run_tests.R - Comprehensive setup validation script:
    • Checks R package dependencies
    • Verifies data files exist
    • Tests DuckDB queries (all 4 patterns including fuzzy match)
    • Validates document processing functions (including strip_leading_articles())
    • Checks PostgreSQL connectivity
    • Reports API key status

Run before starting: Rscript run_tests.R

Testing Documentation

  • TEST_RESULTS.md - Complete testing documentation from 2025-11-09:

    • All core functionality tested and passing
    • Bug fixes applied (chunking function)
    • Setup recommendations
    • Known limitations
  • UPDATES_RIN_COMMUNITY_MATCHING.md - Comprehensive change log for fuzzy matching implementation:

    • Problem statement
    • All changes made to knowledge base and tutorial
    • SQL pattern comparisons
    • Production implementation notes
    • Testing results

What's NOT Included (Yet)

AWS Aurora PostgreSQL Setup Tutorial

Status: πŸ“ Not Yet Created

The tutorial currently assumes you have access to a PostgreSQL database with pgvector installed, either via:

  • Local PostgreSQL installation
  • Existing RDS/Aurora instance
  • cori.db package connection (if available)

A companion guide for AWS Aurora PostgreSQL with pgvector setup is planned but not yet written. This guide would cover:

  • Aurora Serverless v2 configuration
  • pgvector extension installation
  • Security groups and IAM roles
  • Connection string configuration
  • Cost optimization strategies

AWS Bedrock Knowledge Base Integration

This tutorial focuses on self-managed PostgreSQL pgvector. A separate guide would be needed to cover:

  • AWS Bedrock Knowledge Base setup
  • Automatic embedding and chunking via Bedrock
  • S3-based document ingestion
  • Managed vector stores (Aurora, OpenSearch, Pinecone)

Strategic Alignment with Homer AI

This tutorial directly supports the Homer AI Development Roadmap.

Problems It Solves:

  • ✅ Provides hands-on experience before production integration
  • ✅ Uses county-level data examples (matches Homer's domain)
  • ✅ Explores PostgreSQL pgvector (the cost-effective choice)
  • ✅ Tests RAG patterns needed for Homer's query capabilities
  • ✅ Validates hybrid search architecture (vectors + SQL)

What It's NOT:

  • ❌ Not a production implementation
  • ❌ Not directly integrated with Amplify AI Kit yet
  • ❌ Not connected to Homer's current Gemini backend

Next Steps (After This Tutorial):

  1. Phase 1: Integrate Learnings - Apply patterns from this tutorial to cori.agent codebase
  2. Phase 2: Amplify AI Kit - Implement AWS Amplify AI Kit with knowledge base
  3. Phase 3: Production RAG - Deploy to AWS Bedrock for Rural Insights Hub
  4. Phase 4: Homer Enhancement - Replace Gemini backend with Bedrock + pgvector RAG

Getting Started

Prerequisites

Required:

  • R (version 4.0+)
  • R packages: DBI, RPostgreSQL, duckdb, httr, jsonlite, dplyr, readr, stringr
  • PostgreSQL database with pgvector extension

Optional (for full RAG examples):

  • Google AI API key (or Vertex AI credentials)
  • Anthropic Claude API key

Quick Start

  1. Validate your setup:

    Rscript run_tests.R
  2. Open the tutorial:

    • Open vector_database_tutorial.qmd in RStudio
    • Or render with: quarto render vector_database_tutorial.qmd
  3. Start with structured data examples (no API keys needed):

    • Run the DuckDB query patterns (4 patterns total)
    • See query results for:
      • Hood River County (FIPS lookup)
      • The Dalles (RIN community fuzzy match)
      • Arizona RIN communities (state filter)
      • Largest counties (ranking)
  4. Move to document processing:

    • Test extract_text() and chunk_text() functions
    • Process the geographic descriptions markdown file
  5. Try embeddings and RAG (requires API keys):

    • Set environment variables: GOOGLE_API_KEY, ANTHROPIC_API_KEY
    • Run Example 1: Document ingestion with semantic search
    • Run Example 2: Query pattern matching workflow

Testing Without API Keys

The tutorial includes fallback mechanisms:

  • Embedding generation uses random vectors if GOOGLE_API_KEY is not set
  • This allows testing the pipeline without API costs
  • ⚠️ Random embeddings won't produce meaningful search results

Project Files

pgvector-tutorial/
├── README.md                              # This file
├── CLAUDE.md                              # Project-specific Claude instructions
├── vector_database_tutorial.qmd           # Main tutorial (Quarto markdown)
├── run_tests.R                            # Setup validation script
├── TEST_RESULTS.md                        # Comprehensive testing documentation
├── UPDATES_RIN_COMMUNITY_MATCHING.md      # Fuzzy matching implementation changelog
├── PG_VECTOR_AURORA_DB.md                 # Aurora PostgreSQL setup (placeholder)
├── VECTOR_DATABASE_TUTORIAL.md            # Tutorial documentation notes
├── data/
│   └── rin_service_areas.parquet          # County-level structured data (188 counties)
└── knowledge_base/
    ├── docs/
    │   └── geographic_descriptions.md     # Unstructured docs + fuzzy matching guidelines
    └── patterns/
        └── query_patterns.json            # 4 query patterns with preprocessing notes

Key Learning Outcomes

By completing this tutorial, you will understand:

  1. How embeddings capture semantic meaning in high-dimensional vector space
  2. Why chunking matters and how to optimize chunk size/overlap for your content
  3. How vector similarity search works using cosine distance in PostgreSQL
  4. The RAG pattern - retrieval → context assembly → LLM generation
  5. Hybrid search architecture - combining semantic search with structured filters
  6. Query pattern matching - translating natural language to SQL via vector similarity
  7. Production considerations - batching, indexing, error handling, monitoring

Contributing and Evolution

This is a living tutorial that will evolve as Homer AI development progresses. Planned enhancements:

  • AWS Aurora PostgreSQL setup guide
  • AWS Bedrock Knowledge Base parallel implementation
  • Performance benchmarking with larger datasets
  • Multi-language embedding models for non-English content
  • Advanced RAG patterns (re-ranking, query decomposition, iterative retrieval)

Recent Updates (2025-11-09)

✅ Fuzzy Matching Implementation

Added critical RIN community name fuzzy matching capability:

  • Article stripping function: strip_leading_articles() in R
  • 4th query pattern: RIN community lookup with case-insensitive partial matching
  • Updated knowledge base: geographic_descriptions.md with fuzzy matching guidelines
  • Updated query patterns: query_patterns.json now includes preprocessing instructions
  • Fixed chunking bug: Character-based overlap now works correctly
  • Complete testing: All query patterns validated with DuckDB

See UPDATES_RIN_COMMUNITY_MATCHING.md for detailed changelog.

🔗 Related Projects

  • Amplify AI Kit Tutorial: See /Users/johnhall/Documents/CORI/amplify-2/ai-kit/docs/AMPLIFY_AI_KIT_TUTORIAL_complete_implementation.md for production implementation roadmap incorporating these learnings

Status

  • Status: ✅ Ready for use - all tests passing
  • Last updated: 2025-11-09
  • Tested on: macOS with R 4.x, PostgreSQL 15+, pgvector 0.5+
  • Latest enhancement: RIN community fuzzy matching with article stripping
  • Next milestone: AWS Aurora setup guide + Amplify AI Kit integration

About

An introductory example of converting knowledge base documents into vector embeddings for use in Retrieval Augmented Generation (RAG)
