pgvector-tutorial

A hands-on learning sandbox for PostgreSQL vector database operations, designed as a bridge between theoretical knowledge and production implementation for the Homer AI agent (cori.agent).

Purpose

This project is a practical tutorial and experimentation environment for understanding PostgreSQL vector database capabilities before integrating vector search into Homer AI's production architecture. It addresses the critical gap between studying Amazon Bedrock + AWS Amplify AI Kit documentation and actually building a working RAG (Retrieval Augmented Generation) system.

The tutorial serves three strategic goals:

  1. Hands-on Learning - Provides executable R code examples that demonstrate vector embeddings, semantic search, and RAG patterns in a safe sandbox environment, separate from production systems.

  2. Cost-Effective Architecture Validation - Tests the PostgreSQL pgvector approach (estimated $15-20/month via AWS Aurora Serverless) as a more economical alternative to AWS OpenSearch ($350/month), validating the technology choice before production deployment.

  3. Homer AI Integration Preparation - Uses Homer's actual domain (rural county data, RIN communities, economic metrics) to develop RAG patterns that will directly transfer to the production Rural Insights Hub (RIH) platform on AWS Bedrock.

This tutorial implements a "learn by doing, then deploy" philosophy: experiment with real data and realistic queries in development, then apply proven patterns to production infrastructure.

Current Scope

What This Tutorial Covers

The tutorial provides comprehensive coverage across three main areas:

1. Vector Database Fundamentals

  • Understanding embeddings as semantic representations of text
  • PostgreSQL pgvector extension setup and configuration
  • Vector similarity search using cosine distance
  • Database schema design for vector storage with metadata
  • Vector indexing strategies (IVFFlat vs HNSW) for performance optimization
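The fundamentals above can be sketched as a small schema plus a cosine-distance query, issued from R via DBI. The table and column names (`documents`, `content`, `metadata`, `embedding`), the 768-dimension size, and the connection details are illustrative assumptions; the tutorial's actual schema lives in vector_database_tutorial.qmd.

```r
library(DBI)

# Placeholder connection; adjust dbname/host/credentials for your setup.
con <- dbConnect(RPostgreSQL::PostgreSQL(), dbname = "tutorial")

# Hypothetical schema: text chunks, JSON metadata, and a 768-dim vector.
dbExecute(con, "CREATE EXTENSION IF NOT EXISTS vector")
dbExecute(con, "
  CREATE TABLE IF NOT EXISTS documents (
    id        SERIAL PRIMARY KEY,
    content   TEXT,
    metadata  JSONB,
    embedding VECTOR(768)
  )")

# HNSW index on cosine distance (pgvector 0.5+); IVFFlat is the
# alternative for larger, more static collections.
dbExecute(con, "
  CREATE INDEX IF NOT EXISTS documents_embedding_idx
  ON documents USING hnsw (embedding vector_cosine_ops)")

# Top-5 nearest neighbours: <=> is pgvector's cosine-distance operator,
# and query_vec is a numeric vector produced by the embedding model.
vec_literal <- paste0("'[", paste(query_vec, collapse = ","), "]'")
res <- dbGetQuery(con, paste0(
  "SELECT id, content, embedding <=> ", vec_literal, "::vector AS distance
   FROM documents ORDER BY distance LIMIT 5"))
```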

2. Complete RAG Pipeline Implementation

Document Processing:

  • Text extraction from multiple formats (Markdown, TXT, CSV, PDF)
  • Intelligent chunking with configurable overlap for context preservation
  • Chunk size optimization for different content types (technical docs vs narrative)
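A character-based chunker with overlap can be sketched in a few lines of R. The `chunk_size` and `overlap` defaults here are illustrative, and the tutorial's `chunk_text()` may differ in detail:

```r
# Split text into fixed-size character chunks, each sharing `overlap`
# characters with its predecessor so sentences cut at a chunk boundary
# keep some surrounding context.
chunk_text <- function(text, chunk_size = 1000, overlap = 200) {
  stopifnot(overlap < chunk_size)
  starts <- seq(1, max(1, nchar(text)), by = chunk_size - overlap)
  chunks <- vapply(starts, function(s) substr(text, s, s + chunk_size - 1),
                   character(1))
  chunks[nzchar(chunks)]
}

# Smaller chunks suit dense technical docs; larger ones suit narrative.
length(chunk_text(strrep("a", 2500)))  # 4 overlapping chunks
```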

Embedding Generation:

  • Google AI text-embedding-004 integration (768-dimensional vectors)
  • Batch processing with rate limiting and error handling
  • Fallback mechanisms for testing without API keys
  • Alternative approaches: local models and Vertex AI
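Batch processing with rate limiting and the key-less fallback might look like the sketch below. The endpoint URL and request/response shapes follow the Google AI REST documentation for text-embedding-004, but treat them as assumptions that may lag the current API; the tutorial's implementation may differ.

```r
library(httr)

# Sketch: embed a vector of texts one at a time with a crude rate limit,
# falling back to random vectors when no API key is set (fine for
# pipeline testing, meaningless for actual semantic search).
embed_texts <- function(texts, dims = 768, pause = 0.2) {
  key <- Sys.getenv("GOOGLE_API_KEY")
  lapply(texts, function(txt) {
    if (!nzchar(key)) {
      return(rnorm(dims))  # fallback path
    }
    Sys.sleep(pause)  # simple rate limiting between requests
    resp <- POST(
      paste0("https://generativelanguage.googleapis.com/v1beta/",
             "models/text-embedding-004:embedContent?key=", key),
      body = list(content = list(parts = list(list(text = txt)))),
      encode = "json")
    stop_for_status(resp)
    content(resp)$embedding$values
  })
}
```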

Semantic Search:

  • Basic similarity search with configurable thresholds
  • Advanced hybrid search (vector similarity + metadata filters)
  • Query pattern matching for structured data translation
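Hybrid search combines a structured metadata filter with vector ranking in a single statement. A sketch, assuming a `documents` table with a JSONB `metadata` column and a pgvector `embedding` column (names are illustrative, as is the 0.7 threshold):

```r
# Hybrid search sketch: filter on metadata first, then rank survivors by
# cosine similarity, keeping only matches above a threshold. query_vec
# is a numeric embedding of the user's query; con is a DBI connection.
vec_literal <- paste0("'[", paste(query_vec, collapse = ","), "]'")
sql <- paste0("
  SELECT content,
         1 - (embedding <=> ", vec_literal, "::vector) AS similarity
  FROM documents
  WHERE metadata->>'state' = 'OR'
    AND 1 - (embedding <=> ", vec_literal, "::vector) > 0.7
  ORDER BY similarity DESC
  LIMIT 5")
res <- DBI::dbGetQuery(con, sql)
```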

RAG with Claude:

  • Retrieval-augmented generation pipeline
  • Context assembly from semantic search results
  • LLM prompt enhancement with retrieved documents
  • Complete question-answering workflow
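The retrieval-augmented step itself is mostly string assembly: stitch the retrieved chunks into a context block, then ask Claude to answer from that context. The sketch below follows the Anthropic Messages API; the model name and message shape are assumptions that may differ from the tutorial's code.

```r
library(httr)

# Sketch: build a context-grounded prompt from retrieved chunks and send
# it to Claude. Requires ANTHROPIC_API_KEY in the environment.
answer_with_context <- function(question, chunks) {
  context <- paste(chunks, collapse = "\n\n---\n\n")
  prompt <- paste0(
    "Answer the question using only the context below.\n\n",
    "Context:\n", context, "\n\nQuestion: ", question)
  resp <- POST(
    "https://api.anthropic.com/v1/messages",
    add_headers(`x-api-key` = Sys.getenv("ANTHROPIC_API_KEY"),
                `anthropic-version` = "2023-06-01"),
    body = list(model = "claude-3-5-sonnet-latest",
                max_tokens = 1024,
                messages = list(list(role = "user", content = prompt))),
    encode = "json")
  stop_for_status(resp)
  content(resp)$content[[1]]$text
}
```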

3. Homer-Specific Use Cases

Structured Data Integration:

  • County-level economic data (188 counties in rin_service_areas.parquet)
  • Four query patterns: county FIPS lookup, RIN community name lookup (fuzzy), state filtering, ranking/sorting
  • DuckDB for efficient Parquet file querying with case-insensitive partial matching
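Querying the Parquet file from R with DuckDB needs no import step. A sketch of the fuzzy community lookup, with column names taken from the examples in this README:

```r
library(duckdb)

# Case-insensitive partial match against the Parquet file; ILIKE is
# DuckDB's case-insensitive LIKE. `name` is assumed to be
# article-stripped already ("The Dalles" -> "Dalles").
con <- dbConnect(duckdb())
name <- "Dalles"
res <- dbGetQuery(con, sprintf("
  SELECT * FROM read_parquet('data/rin_service_areas.parquet')
  WHERE rin_community ILIKE '%%%s%%'", name))
dbDisconnect(con, shutdown = TRUE)
```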

Knowledge Base Documents:

  • Geographic descriptions (FIPS codes, RIN communities, boundaries) - Updated with fuzzy matching guidelines
  • Query pattern training data for natural language → SQL translation (now includes a 4th pattern)
  • Domain knowledge vs procedural knowledge separation

Query Pattern Matching (Example 2): Demonstrates how vector search enables flexible query routing with fuzzy matching:

User asks: "What is the location of The Dalles?"
Preprocessing: "The Dalles" → "Dalles" (article stripping)
Vector search matches to pattern: "RIN community lookup"
System executes: SELECT * FROM rin_service_areas WHERE rin_community ILIKE '%Dalles%'
Result: The Dalles (FIPS 41065), Wasco County, Lat 45.5415, Lon -121.2041

This demonstrates the critical fuzzy matching capability: handling natural language variations ("The Dalles", "the dalles", "Dalles") through article stripping and case-insensitive partial matching. This is essential for Homer to translate natural language into precise database queries.
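The article-stripping step can be sketched as a one-line helper. This is a hypothetical version of `strip_leading_articles()`; the tutorial's actual implementation may differ:

```r
# Drop a leading "the"/"a"/"an" (any case) so "The Dalles" and
# "the dalles" both normalize to a matchable community name.
strip_leading_articles <- function(x) {
  sub("(?i)^(the|a|an)\\s+", "", x, perl = TRUE)
}

strip_leading_articles("The Dalles")  # "Dalles"
strip_leading_articles("the dalles")  # "dalles"
```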

Tutorial Structure

The main tutorial document is vector_database_tutorial.qmd, organized as:

  1. Introduction - Why vector databases, key concepts, use cases
  2. Prerequisites - R packages, data files, API keys
  3. Workflow Overview - Document ingestion pipeline visualization
  4. Schema Design - PostgreSQL table structures for vectors + metadata
  5. Document Processing - Extract, chunk, embed pipeline
  6. Semantic Search - Basic and advanced similarity queries
  7. Complete Examples - Three end-to-end workflows:
    • Example 1: Document ingestion with Claude analysis
    • Example 2: Query pattern matching (demonstrates the promise made in the intro)
    • Example 3: Incremental updates

Data Assets

  • data/rin_service_areas.parquet (7.6 MB) - County-level data for 188 rural counties
  • knowledge_base/docs/geographic_descriptions.md - Unstructured documentation for semantic search, includes fuzzy matching guidelines for RIN communities
  • knowledge_base/patterns/query_patterns.json - Training data for 4 query patterns including RIN community name lookup with preprocessing instructions

Validation Tools

  • run_tests.R - Comprehensive setup validation script:
    • Checks R package dependencies
    • Verifies data files exist
    • Tests DuckDB queries (all 4 patterns including fuzzy match)
    • Validates document processing functions (including strip_leading_articles())
    • Checks PostgreSQL connectivity
    • Reports API key status

Run before starting: Rscript run_tests.R

Testing Documentation

  • TEST_RESULTS.md - Complete testing documentation from 2025-11-09:

    • All core functionality tested and passing
    • Bug fixes applied (chunking function)
    • Setup recommendations
    • Known limitations
  • UPDATES_RIN_COMMUNITY_MATCHING.md - Comprehensive change log for fuzzy matching implementation:

    • Problem statement
    • All changes made to knowledge base and tutorial
    • SQL pattern comparisons
    • Production implementation notes
    • Testing results

What's NOT Included (Yet)

AWS Aurora PostgreSQL Setup Tutorial

Status: πŸ“ Not Yet Created

The tutorial currently assumes you have access to a PostgreSQL database with pgvector installed, either via:

  • Local PostgreSQL installation
  • Existing RDS/Aurora instance
  • cori.db package connection (if available)

A companion guide for AWS Aurora PostgreSQL with pgvector setup is planned but not yet written. This guide would cover:

  • Aurora Serverless v2 configuration
  • pgvector extension installation
  • Security groups and IAM roles
  • Connection string configuration
  • Cost optimization strategies

AWS Bedrock Knowledge Base Integration

This tutorial focuses on self-managed PostgreSQL pgvector. A separate guide would be needed to cover:

  • AWS Bedrock Knowledge Base setup
  • Automatic embedding and chunking via Bedrock
  • S3-based document ingestion
  • Managed vector stores (Aurora, OpenSearch, Pinecone)

Strategic Alignment with Homer AI

This tutorial directly supports the Homer AI Development Roadmap.

Problems It Solves:

  • ✅ Provides hands-on experience before production integration
  • ✅ Uses county-level data examples (matches Homer's domain)
  • ✅ Explores PostgreSQL pgvector (the cost-effective choice)
  • ✅ Tests RAG patterns needed for Homer's query capabilities
  • ✅ Validates hybrid search architecture (vectors + SQL)

What It's NOT:

  • ❌ Not a production implementation
  • ❌ Not directly integrated with Amplify AI Kit yet
  • ❌ Not connected to Homer's current Gemini backend

Next Steps (After This Tutorial):

  1. Phase 1: Integrate Learnings - Apply patterns from this tutorial to cori.agent codebase
  2. Phase 2: Amplify AI Kit - Implement AWS Amplify AI Kit with knowledge base
  3. Phase 3: Production RAG - Deploy to AWS Bedrock for Rural Insights Hub
  4. Phase 4: Homer Enhancement - Replace Gemini backend with Bedrock + pgvector RAG

Getting Started

Prerequisites

Required:

  • R (version 4.0+)
  • R packages: DBI, RPostgreSQL, duckdb, httr, jsonlite, dplyr, readr, stringr
  • PostgreSQL database with pgvector extension

Optional (for full RAG examples):

  • Google AI API key (or Vertex AI credentials)
  • Anthropic Claude API key

Quick Start

  1. Validate your setup:

    Rscript run_tests.R
  2. Open the tutorial:

    • Open vector_database_tutorial.qmd in RStudio
    • Or render with: quarto render vector_database_tutorial.qmd
  3. Start with structured data examples (no API keys needed):

    • Run the DuckDB query patterns (4 patterns total)
    • See query results for:
      • Hood River County (FIPS lookup)
      • The Dalles (RIN community fuzzy match)
      • Arizona RIN communities (state filter)
      • Largest counties (ranking)
  4. Move to document processing:

    • Test extract_text() and chunk_text() functions
    • Process the geographic descriptions markdown file
  5. Try embeddings and RAG (requires API keys):

    • Set environment variables: GOOGLE_API_KEY, ANTHROPIC_API_KEY
    • Run Example 1: Document ingestion with semantic search
    • Run Example 2: Query pattern matching workflow

Testing Without API Keys

The tutorial includes fallback mechanisms:

  • Embedding generation uses random vectors if GOOGLE_API_KEY is not set
  • This allows testing the pipeline without API costs
  • ⚠️ Random embeddings won't produce meaningful search results

Project Files

pgvector-tutorial/
├── README.md                              # This file
├── CLAUDE.md                              # Project-specific Claude instructions
├── vector_database_tutorial.qmd           # Main tutorial (Quarto markdown)
├── run_tests.R                            # Setup validation script
├── TEST_RESULTS.md                        # Comprehensive testing documentation
├── UPDATES_RIN_COMMUNITY_MATCHING.md      # Fuzzy matching implementation changelog
├── PG_VECTOR_AURORA_DB.md                 # Aurora PostgreSQL setup (placeholder)
├── VECTOR_DATABASE_TUTORIAL.md            # Tutorial documentation notes
├── data/
│   └── rin_service_areas.parquet          # County-level structured data (188 counties)
└── knowledge_base/
    ├── docs/
    │   └── geographic_descriptions.md     # Unstructured docs + fuzzy matching guidelines
    └── patterns/
        └── query_patterns.json            # 4 query patterns with preprocessing notes

Key Learning Outcomes

By completing this tutorial, you will understand:

  1. How embeddings capture semantic meaning in high-dimensional vector space
  2. Why chunking matters and how to optimize chunk size/overlap for your content
  3. How vector similarity search works using cosine distance in PostgreSQL
  4. The RAG pattern - retrieval → context assembly → LLM generation
  5. Hybrid search architecture - combining semantic search with structured filters
  6. Query pattern matching - translating natural language to SQL via vector similarity
  7. Production considerations - batching, indexing, error handling, monitoring

Contributing and Evolution

This is a living tutorial that will evolve as Homer AI development progresses. Planned enhancements:

  • AWS Aurora PostgreSQL setup guide
  • AWS Bedrock Knowledge Base parallel implementation
  • Performance benchmarking with larger datasets
  • Multi-language embedding models for non-English content
  • Advanced RAG patterns (re-ranking, query decomposition, iterative retrieval)

Recent Updates (2025-11-09)

✅ Fuzzy Matching Implementation

Added critical RIN community name fuzzy matching capability:

  • Article stripping function: strip_leading_articles() in R
  • 4th query pattern: RIN community lookup with case-insensitive partial matching
  • Updated knowledge base: geographic_descriptions.md with fuzzy matching guidelines
  • Updated query patterns: query_patterns.json now includes preprocessing instructions
  • Fixed chunking bug: Character-based overlap now works correctly
  • Complete testing: All query patterns validated with DuckDB

See UPDATES_RIN_COMMUNITY_MATCHING.md for detailed changelog.

🔗 Related Projects

  • Amplify AI Kit Tutorial: See /Users/johnhall/Documents/CORI/amplify-2/ai-kit/docs/AMPLIFY_AI_KIT_TUTORIAL_complete_implementation.md for production implementation roadmap incorporating these learnings

Status

  • Status: ✅ Ready for use - all tests passing
  • Last updated: 2025-11-09
  • Tested on: macOS with R 4.x, PostgreSQL 15+, pgvector 0.5+
  • Latest enhancement: RIN community fuzzy matching with article stripping
  • Next milestone: AWS Aurora setup guide + Amplify AI Kit integration

About

An introductory example of converting knowledge base documents into vector embeddings for use in Retrieval Augmented Generation (RAG)
