A hands-on learning sandbox for PostgreSQL vector database operations, designed as a bridge between theoretical knowledge and production implementation for the Homer AI agent (cori.agent).
This project is a practical tutorial and experimentation environment for understanding PostgreSQL vector database capabilities before integrating vector search into Homer AI's production architecture. It addresses the critical gap between studying Amazon Bedrock + AWS Amplify AI Kit documentation and actually building a working RAG (Retrieval Augmented Generation) system.
The tutorial serves three strategic goals:
- **Hands-on Learning** - Provides executable R code examples that demonstrate vector embeddings, semantic search, and RAG patterns in a safe sandbox environment, separate from production systems.
- **Cost-Effective Architecture Validation** - Tests the PostgreSQL pgvector approach (estimated $15-20/month via AWS Aurora Serverless) as a more economical alternative to AWS OpenSearch ($350/month), validating the technology choice before production deployment.
- **Homer AI Integration Preparation** - Uses Homer's actual domain (rural county data, RIN communities, economic metrics) to develop RAG patterns that will directly transfer to the production Rural Insights Hub (RIH) platform on AWS Bedrock.
This tutorial implements a "learn by doing, then deploy" philosophy: experiment with real data and realistic queries in development, then apply proven patterns to production infrastructure.
The tutorial provides comprehensive coverage across three main areas:
- Understanding embeddings as semantic representations of text
- PostgreSQL pgvector extension setup and configuration
- Vector similarity search using cosine distance
- Database schema design for vector storage with metadata
- Vector indexing strategies (IVFFlat vs HNSW) for performance optimization
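The concepts above can be sketched end to end in a few statements. This is a minimal, illustrative example only: the table name `documents`, its columns, and the connection details are assumptions, not the tutorial's actual schema, and the HNSW syntax requires pgvector 0.5+.

```r
library(DBI)
library(RPostgreSQL)

# Hypothetical local connection; adjust dbname/host/user for your setup
con <- dbConnect(RPostgreSQL::PostgreSQL(), dbname = "pgvector_tutorial")

# Enable pgvector and create a table for 768-dim embeddings plus metadata
dbExecute(con, "CREATE EXTENSION IF NOT EXISTS vector;")
dbExecute(con, "
  CREATE TABLE IF NOT EXISTS documents (
    id        SERIAL PRIMARY KEY,
    content   TEXT NOT NULL,
    metadata  JSONB,
    embedding vector(768)
  );")

# HNSW index on cosine distance; IVFFlat is the alternative when index
# build time matters more than query recall
dbExecute(con, "
  CREATE INDEX IF NOT EXISTS documents_embedding_idx
  ON documents USING hnsw (embedding vector_cosine_ops);")

# Nearest neighbours by cosine distance (pgvector's <=> operator);
# query_vec here is a placeholder vector literal
query_vec <- paste0("[", paste(rep(0.1, 768), collapse = ","), "]")
hits <- dbGetQuery(con, sprintf("
  SELECT id, content, 1 - (embedding <=> '%s') AS cosine_similarity
  FROM documents
  ORDER BY embedding <=> '%s'
  LIMIT 5;", query_vec, query_vec))
```

Note that `<=>` returns cosine *distance*, so similarity is `1 - distance`; ordering by the raw operator lets the index be used.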
Document Processing:
- Text extraction from multiple formats (Markdown, TXT, CSV, PDF)
- Intelligent chunking with configurable overlap for context preservation
- Chunk size optimization for different content types (technical docs vs narrative)
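A character-based chunker with overlap can be written in a few lines. This is a minimal sketch; the tutorial's own `chunk_text()` signature and defaults may differ.

```r
# Minimal character-based chunker with overlap; chunk_size and overlap
# are in characters, tuned per content type (smaller chunks often suit
# dense technical docs, larger ones narrative text)
chunk_text <- function(text, chunk_size = 1000, overlap = 200) {
  stopifnot(overlap < chunk_size)
  step   <- chunk_size - overlap
  starts <- seq(1, max(1, nchar(text)), by = step)
  chunks <- vapply(starts, function(s) {
    substr(text, s, min(s + chunk_size - 1, nchar(text)))
  }, character(1))
  # drop empty or exact-duplicate chunks from the tail
  chunks[nchar(chunks) > 0 & !duplicated(chunks)]
}

doc    <- paste(rep("Rural county economic data. ", 100), collapse = "")
chunks <- chunk_text(doc, chunk_size = 500, overlap = 100)
length(chunks)      # 7 chunks for this 2800-character document
nchar(chunks[[1]])  # 500
```

Because each chunk starts `chunk_size - overlap` characters after the previous one, consecutive chunks share `overlap` characters, which preserves context across chunk boundaries.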
Embedding Generation:
- Google AI text-embedding-004 integration (768-dimensional vectors)
- Batch processing with rate limiting and error handling
- Fallback mechanisms for testing without API keys
- Alternative approaches: local models and Vertex AI
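A single embedding call plus the no-key fallback can be sketched as below. The endpoint and response shape follow Google's Generative Language API for `text-embedding-004`, but verify them against the current documentation; the batch delay value is arbitrary.

```r
library(httr)
library(jsonlite)

# Sketch: embed one string with text-embedding-004, falling back to a
# random unit vector when GOOGLE_API_KEY is unset (for pipeline testing)
embed_text <- function(text, api_key = Sys.getenv("GOOGLE_API_KEY")) {
  if (!nzchar(api_key)) {
    v <- rnorm(768)
    return(v / sqrt(sum(v^2)))  # random fallback, not semantically meaningful
  }
  url <- paste0(
    "https://generativelanguage.googleapis.com/v1beta/",
    "models/text-embedding-004:embedContent?key=", api_key)
  resp <- POST(url,
               body = list(content = list(parts = list(list(text = text)))),
               encode = "json")
  stop_for_status(resp)
  unlist(content(resp)$embedding$values)  # 768-dimensional numeric vector
}

# Naive batch processing with a fixed delay as crude rate limiting
embed_batch <- function(texts, delay = 0.5) {
  lapply(texts, function(t) { Sys.sleep(delay); embed_text(t) })
}
```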
Semantic Search:
- Basic similarity search with configurable thresholds
- Advanced hybrid search (vector similarity + metadata filters)
- Query pattern matching for structured data translation
RAG with Claude:
- Retrieval-augmented generation pipeline
- Context assembly from semantic search results
- LLM prompt enhancement with retrieved documents
- Complete question-answering workflow
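The retrieval-to-generation steps above reduce to: concatenate retrieved chunks into a context block, prepend it to the user's question, and send the prompt to Claude. The sketch below uses the Anthropic Messages API; the model name and prompt wording are illustrative choices, not the tutorial's exact code.

```r
library(httr)
library(jsonlite)

# Assemble retrieved chunks into a grounded prompt and ask Claude.
answer_with_rag <- function(question, retrieved_chunks,
                            api_key = Sys.getenv("ANTHROPIC_API_KEY")) {
  # Context assembly: label each chunk so the answer can be traced back
  context <- paste(sprintf("[Doc %d] %s", seq_along(retrieved_chunks),
                           retrieved_chunks), collapse = "\n\n")
  prompt <- paste0("Answer using only the context below.\n\n",
                   "Context:\n", context,
                   "\n\nQuestion: ", question)
  resp <- POST(
    "https://api.anthropic.com/v1/messages",
    add_headers("x-api-key" = api_key,
                "anthropic-version" = "2023-06-01"),
    body = list(model = "claude-3-5-sonnet-20241022",  # illustrative model id
                max_tokens = 1024,
                messages = list(list(role = "user", content = prompt))),
    encode = "json")
  stop_for_status(resp)
  content(resp)$content[[1]]$text
}
```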
Structured Data Integration:
- County-level economic data (188 counties in `rin_service_areas.parquet`)
- Four query patterns: county FIPS lookup, RIN community name lookup (fuzzy), state filtering, ranking/sorting
- DuckDB for efficient Parquet file querying with case-insensitive partial matching
Knowledge Base Documents:
- Geographic descriptions (FIPS codes, RIN communities, boundaries) - Updated with fuzzy matching guidelines
- Query pattern training data for natural language → SQL translation - Now includes a 4th pattern
- Domain knowledge vs procedural knowledge separation
Query Pattern Matching (Example 2): Demonstrates how vector search enables flexible query routing with fuzzy matching:
1. User asks: "What is the location of The Dalles?"
2. Preprocessing: "The Dalles" → "Dalles" (article stripping)
3. Vector search matches the pattern: "RIN community lookup"
4. System executes: `SELECT * FROM rin_service_areas WHERE rin_community LIKE '%Dalles%'`
5. Result: The Dalles (FIPS 41065), Wasco County, Lat 45.5415, Lon -121.2041
This demonstrates the critical fuzzy matching capability: handling natural language variations ("The Dalles", "the dalles", "Dalles") through article stripping and case-insensitive partial matching. This is essential for Homer to translate natural language into precise database queries.
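The preprocessing step can be sketched as a one-line regex plus a DuckDB query over the Parquet file. This is a simplified version: the tutorial's actual `strip_leading_articles()` may handle more cases, and the column names selected below (`county_name`, `state`) are assumptions about the Parquet schema.

```r
library(DBI)
library(duckdb)

# Strip a leading English article so "The Dalles" matches "Dalles"
strip_leading_articles <- function(x) {
  sub("^(the|a|an)\\s+", "", trimws(x), ignore.case = TRUE)
}

con  <- dbConnect(duckdb::duckdb())
name <- strip_leading_articles("The Dalles")  # "Dalles"

# Case-insensitive partial match against the Parquet file via DuckDB
sql <- sprintf("
  SELECT rin_community, county_name, state
  FROM read_parquet('data/rin_service_areas.parquet')
  WHERE lower(rin_community) LIKE lower('%%%s%%');", name)
res <- dbGetQuery(con, sql)
dbDisconnect(con, shutdown = TRUE)
```

Lower-casing both sides of the `LIKE` gives the case-insensitive behavior, and the surrounding `%` wildcards give the partial match.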
The main tutorial document is vector_database_tutorial.qmd, organized as:
- Introduction - Why vector databases, key concepts, use cases
- Prerequisites - R packages, data files, API keys
- Workflow Overview - Document ingestion pipeline visualization
- Schema Design - PostgreSQL table structures for vectors + metadata
- Document Processing - Extract, chunk, embed pipeline
- Semantic Search - Basic and advanced similarity queries
- Complete Examples - Three end-to-end workflows:
- Example 1: Document ingestion with Claude analysis
- Example 2: Query pattern matching (demonstrates the promise made in the intro)
- Example 3: Incremental updates
- `data/rin_service_areas.parquet` (7.6 MB) - County-level data for 188 rural counties
- `knowledge_base/docs/geographic_descriptions.md` - Unstructured documentation for semantic search, includes fuzzy matching guidelines for RIN communities
- `knowledge_base/patterns/query_patterns.json` - Training data for 4 query patterns including RIN community name lookup with preprocessing instructions
- `run_tests.R` - Comprehensive setup validation script:
  - Checks R package dependencies
  - Verifies data files exist
  - Tests DuckDB queries (all 4 patterns including fuzzy match)
  - Validates document processing functions (including `strip_leading_articles()`)
  - Checks PostgreSQL connectivity
  - Reports API key status

  Run before starting: `/usr/local/bin/Rscript run_tests.R`
- `TEST_RESULTS.md` - Complete testing documentation from 2025-11-09:
  - All core functionality tested and passing
  - Bug fixes applied (chunking function)
  - Setup recommendations
  - Known limitations
- `UPDATES_RIN_COMMUNITY_MATCHING.md` - Comprehensive changelog for the fuzzy matching implementation:
  - Problem statement
  - All changes made to the knowledge base and tutorial
  - SQL pattern comparisons
  - Production implementation notes
  - Testing results
Status: Not Yet Created
The tutorial currently assumes you have access to a PostgreSQL database with pgvector installed, either via:
- Local PostgreSQL installation
- Existing RDS/Aurora instance
- `cori.db` package connection (if available)
A companion guide for AWS Aurora PostgreSQL with pgvector setup is planned but not yet written. This guide would cover:
- Aurora Serverless v2 configuration
- pgvector extension installation
- Security groups and IAM roles
- Connection string configuration
- Cost optimization strategies
This tutorial focuses on self-managed PostgreSQL pgvector. A separate guide would be needed to cover:
- AWS Bedrock Knowledge Base setup
- Automatic embedding and chunking via Bedrock
- S3-based document ingestion
- Managed vector stores (Aurora, OpenSearch, Pinecone)
This tutorial directly supports the Homer AI Development Roadmap.
- ✅ Provides hands-on experience before production integration
- ✅ Uses county-level data examples (matches Homer's domain)
- ✅ Explores PostgreSQL pgvector (the cost-effective choice)
- ✅ Tests RAG patterns needed for Homer's query capabilities
- ✅ Validates hybrid search architecture (vectors + SQL)
- ❌ Not a production implementation
- ❌ Not directly integrated with Amplify AI Kit yet
- ❌ Not connected to Homer's current Gemini backend
- Phase 1: Integrate Learnings - Apply patterns from this tutorial to the `cori.agent` codebase
- Phase 2: Amplify AI Kit - Implement AWS Amplify AI Kit with knowledge base
- Phase 3: Production RAG - Deploy to AWS Bedrock for Rural Insights Hub
- Phase 4: Homer Enhancement - Replace Gemini backend with Bedrock + pgvector RAG
Required:
- R (version 4.0+)
- R packages: `DBI`, `RPostgreSQL`, `duckdb`, `httr`, `jsonlite`, `dplyr`, `readr`, `stringr`
- PostgreSQL database with pgvector extension
Optional (for full RAG examples):
- Google AI API key (or Vertex AI credentials)
- Anthropic Claude API key
- Validate your setup: `Rscript R/run_tests.R`
- Open the tutorial:
  - Open `vector_database_tutorial.qmd` in RStudio
  - Or render with: `quarto render vector_database_tutorial.qmd`
- Start with structured data examples (no API keys needed):
- Run the DuckDB query patterns (4 patterns total)
- See query results for:
- Hood River County (FIPS lookup)
- The Dalles (RIN community fuzzy match)
- Arizona RIN communities (state filter)
- Largest counties (ranking)
- Move to document processing:
  - Test `extract_text()` and `chunk_text()` functions
  - Process the geographic descriptions markdown file
- Try embeddings and RAG (requires API keys):
  - Set environment variables: `GOOGLE_API_KEY`, `ANTHROPIC_API_KEY`
  - Run Example 1: Document ingestion with semantic search
  - Run Example 2: Query pattern matching workflow
The tutorial includes fallback mechanisms:
- Embedding generation uses random vectors if `GOOGLE_API_KEY` is not set
- This allows testing the pipeline without API costs
- ⚠️ Random embeddings won't produce meaningful search results
```
pgvector-tutorial/
├── README.md                          # This file
├── CLAUDE.md                          # Project-specific Claude instructions
├── vector_database_tutorial.qmd       # Main tutorial (Quarto markdown)
├── run_tests.R                        # Setup validation script
├── TEST_RESULTS.md                    # Comprehensive testing documentation
├── UPDATES_RIN_COMMUNITY_MATCHING.md  # Fuzzy matching implementation changelog
├── PG_VECTOR_AURORA_DB.md             # Aurora PostgreSQL setup (placeholder)
├── VECTOR_DATABASE_TUTORIAL.md        # Tutorial documentation notes
├── data/
│   └── rin_service_areas.parquet      # County-level structured data (188 counties)
└── knowledge_base/
    ├── docs/
    │   └── geographic_descriptions.md # Unstructured docs + fuzzy matching guidelines
    └── patterns/
        └── query_patterns.json        # 4 query patterns with preprocessing notes
```
By completing this tutorial, you will understand:
- How embeddings capture semantic meaning in high-dimensional vector space
- Why chunking matters and how to optimize chunk size/overlap for your content
- How vector similarity search works using cosine distance in PostgreSQL
- The RAG pattern - retrieval → context assembly → LLM generation
- Hybrid search architecture - combining semantic search with structured filters
- Query pattern matching - translating natural language to SQL via vector similarity
- Production considerations - batching, indexing, error handling, monitoring
This is a living tutorial that will evolve as Homer AI development progresses. Planned enhancements:
- AWS Aurora PostgreSQL setup guide
- AWS Bedrock Knowledge Base parallel implementation
- Performance benchmarking with larger datasets
- Multi-language embedding models for non-English content
- Advanced RAG patterns (re-ranking, query decomposition, iterative retrieval)
Added critical RIN community name fuzzy matching capability:
- Article stripping function: `strip_leading_articles()` in R
- 4th query pattern: RIN community lookup with case-insensitive partial matching
- Updated knowledge base: `geographic_descriptions.md` with fuzzy matching guidelines
- Updated query patterns: `query_patterns.json` now includes preprocessing instructions
- Fixed chunking bug: character-based overlap now works correctly
- Complete testing: all query patterns validated with DuckDB
See UPDATES_RIN_COMMUNITY_MATCHING.md for detailed changelog.
- Amplify AI Kit Tutorial: See `/Users/johnhall/Documents/CORI/amplify-2/ai-kit/docs/AMPLIFY_AI_KIT_TUTORIAL_complete_implementation.md` for the production implementation roadmap incorporating these learnings
- Tutorial Document: vector_database_tutorial.qmd
- Validation Script: run_tests.R
- Testing Results: TEST_RESULTS.md
- Fuzzy Matching Updates: UPDATES_RIN_COMMUNITY_MATCHING.md
Status: ✅ Ready for use - All tests passing
Last Updated: 2025-11-09
Tested On: macOS with R 4.x, PostgreSQL 15+, pgvector 0.5+
Latest Enhancement: RIN community fuzzy matching with article stripping
Next Milestone: AWS Aurora setup guide + Amplify AI Kit integration