
EchoRidge Search - AI-Powered Business Intelligence Platform

Complete business intelligence platform: AI pipeline + cloud storage + real-time visualization

Quick Start • Architecture • Pipeline Backend • Frontend v2

Project Echo Ridge - AI Business Intelligence Pipeline

🏗️ Architecture Overview

EchoRidge Search consists of three integrated systems:

Pipeline Backend (Batch Processing)

CLI Input → Places Discovery → Web Scraping → Hybrid AI Scoring → GCS Storage

Pipeline API (Orchestration)

Frontend → Pipeline API → PMF Backend → GCS Upload → Status Polling

Frontend v2 (Real-time Visualization)

User Interface → Pipeline API → Run Status → GCS Results → Map Visualization

Complete Data Flow

User Query → Pipeline API (8082) → PMF Backend → JSONL Files → GCS Bucket
                                                                    ↓
Frontend (3000) ← Status Updates ← Pipeline API ← GCS Run Data

🎯 Key Features

Cloud-Native Storage: Google Cloud Storage for pipeline results
Hybrid AI Scoring: Combined LLM + deterministic echo-ridge scoring
Real-time Pipeline: Live status updates and progress tracking
Geographic Visualization: Map-based results display
Production Ready: GCS integration, status polling, error handling
Scalable: Run results are kept in cloud storage, so run history is not limited by local disk space

🚀 Quick Start

Prerequisites

Required API Keys (add to .env file):
- GOOGLE_PLACES_API_KEY - For place discovery
- FIRECRAWL_API_KEY - For web scraping
- OPENAI_API_KEY - For AI scoring

Google Cloud Setup (for GCS storage):
- Create a GCS bucket
- Set up service account credentials
- Download credentials JSON file

Option 1: Full Stack (Pipeline + Frontend)

# 1. Setup environment
cp .env.example .env
# Edit .env with your API keys and GCS configuration

# 2. Install backend dependencies
cd pmf_finder_backend
pip install -r requirements.txt

# 3. Start Pipeline API
cd ../services/pipeline
python main.py &  # Runs on port 8082

# 4. Start Frontend v2
cd ../../echoridge_search_frontend_v2
npm install
npm run dev  # Visit http://localhost:3000

Option 2: CLI Pipeline Only

# Run pipeline directly with CLI
cd pmf_finder_backend
python run_pipeline.py --query "Private schools in Tampa" --max 25

# With GCS upload enabled
export GCS_ENABLED=true
export GCS_BUCKET_NAME=your-bucket-name
export GCS_CREDENTIALS_PATH=/path/to/credentials.json
python run_pipeline.py --query "Restaurants in Austin" --max 50

Option 3: API-Triggered Pipeline

# Start Pipeline API
cd services/pipeline
python main.py

# Trigger from another terminal
curl -X POST http://localhost:8082/v1/pipeline/execute \
  -H "Content-Type: application/json" \
  -d '{"query": "coffee shops in Seattle", "max_results": 20}'

# Check status (replace {job_id} with the ID of the job you triggered)
curl http://localhost:8082/v1/pipeline/status/{job_id}
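The same status check can be scripted. The following is a minimal sketch, not the project's own client: it polls the status endpoint shown above until the job finishes. The `status` field and its terminal values `completed` and `failed` are assumptions about the response schema, and `http_json` and `wait_for_job` are hypothetical helper names.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

BASE_URL = "http://localhost:8082"  # Pipeline API port from this README


def http_json(method: str, path: str, body: Optional[dict] = None) -> dict:
    """Send a request to the Pipeline API and decode the JSON reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_job(
    job_id: str,
    fetch: Callable[..., dict] = http_json,
    interval_s: float = 2.0,
    max_polls: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll the status endpoint until the job reaches a terminal state.

    ASSUMPTION: the payload carries a "status" field whose terminal values
    are "completed" or "failed". Adjust this to the real schema.
    """
    terminal = {"completed", "failed"}
    for _ in range(max_polls):
        status = fetch("GET", f"/v1/pipeline/status/{job_id}")
        if status.get("status") in terminal:
            return status
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

To trigger a run first, `http_json("POST", "/v1/pipeline/execute", {"query": "coffee shops in Seattle", "max_results": 20})` sends the same body as the curl call. Where the job ID appears in that response is not documented here, so read it from the actual reply before calling `wait_for_job`.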

📦 Services Architecture

Pipeline Backend (`pmf_finder_backend/`)

Language: Python 3.11+ with asyncio
Purpose: Batch business discovery and hybrid AI scoring
APIs: Google Places, OpenAI GPT-4, Firecrawl, echo-ridge
Output: Structured JSONL files (places, scrapes, scores, hybrid results)
Storage: Local filesystem + Google Cloud Storage

Pipeline API (`services/pipeline/`)

Language: Python FastAPI
Port: 8082
Purpose: Orchestrate pipeline execution and serve results from GCS
Key Endpoints:
- /v1/pipeline/execute - Trigger new pipeline run
- /v1/pipeline/status/{job_id} - Get execution status
- /v1/runs/list - List all runs from GCS
- /v1/runs/{run_id} - Get complete run data
- /v1/runs/{run_id}/map-data - Get visualization data

GCS Storage Client (`pmf_finder_backend/storage/`)

Purpose: Upload and retrieve pipeline results from cloud
Features:
- Automatic upload after pipeline completion
- Organized storage: results/YYYY/MM/DD/run_id/
- Manifest files with run metadata
- Signed URL generation for file access

Frontend v2 (`echoridge_search_frontend_v2/`)

Framework: Next.js 15 + React 19 + TypeScript
Port: 3000
Purpose: Real-time pipeline execution and results visualization
Features:
- Pipeline trigger interface
- Live status polling
- Run history from GCS
- Map-based results display (coming soon)
- Responsive dark UI with Tailwind CSS

Storage Layer

Google Cloud Storage: Primary storage for pipeline results
Local JSONL Files: Temporary storage during pipeline execution
Structure: Date-organized with run manifests

🛠️ Development Commands

# Backend Pipeline
cd pmf_finder_backend
python run_pipeline.py --query "Schools in NYC" --max 50

# ETL Operations
cd services/etl
python main.py ingest --latest
python main.py status
python main.py list-runs

# API Development
cd services/api
python main.py  # Start on localhost:8081

# Frontend Development
cd echoridge_search_frontend
npm run dev  # Start on localhost:3000

🌐 User Interfaces

Modern Search Frontend (Port 3000)

🔍 Universal Search: Query 50k+ companies with instant results
📊 Faceted Filtering: Filter by region, industry, score ranges
⚡ SQLite WASM: Client-side caching for <350ms performance
🔗 Evidence Links: Direct access to scraped web content
👤 User Overlays: Bookmarks, notes, comparisons (local storage)

Legacy Flask Dashboard (Port 5000)

📈 Run Analytics: Cross-run business intelligence
🗺️ Heat Maps: Geographic density visualization
📋 Run Management: Pipeline execution history
🔧 Development: Backend data exploration and debugging

🔌 API Endpoints

Catalog API (`localhost:8081`)

# Company search with pagination
GET /v1/catalog/companies?q=schools&region=florida&limit=20

# Single company details
GET /v1/catalog/companies/{id}

# Evidence for a company
GET /v1/catalog/companies/{id}/evidence

# Statistics and health
GET /v1/catalog/stats
GET /health
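When calling the catalog search from code, building the query string with `urlencode` keeps parameters correctly separated and escaped. This is a small sketch under the assumption that the endpoint accepts `q`, `region`, and `limit` as query parameters, as in the search example above; `company_search_url` is a hypothetical helper name.

```python
from typing import Optional
from urllib.parse import urlencode

CATALOG_BASE = "http://localhost:8081"  # Catalog API address from this README


def company_search_url(q: str, region: Optional[str] = None, limit: int = 20) -> str:
    """Build a company search URL with properly encoded query parameters."""
    params = {"q": q}
    if region:
        params["region"] = region
    params["limit"] = limit
    # urlencode joins pairs with "&" and escapes spaces and reserved characters.
    return f"{CATALOG_BASE}/v1/catalog/companies?{urlencode(params)}"
```

For example, `company_search_url("schools", region="florida")` yields the search URL shown above, prefixed with the Catalog API host.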

Legacy Flask API (`localhost:5000`)

GET /api/runs           # Available pipeline runs
GET /api/analytics      # Aggregated statistics
GET /api/heatmap-data   # Geographic distribution

⚙️ Configuration

Pipeline Environment (`.env`)

# Required API Keys
GOOGLE_PLACES_API_KEY="your_google_places_api_key"
OPENAI_API_KEY="your_openai_api_key"
FIRECRAWL_API_KEY="your_firecrawl_api_key"

# Google Cloud Storage Configuration
GCS_ENABLED=true
GCS_BUCKET_NAME="your-bucket-name"
GCS_CREDENTIALS_PATH="/path/to/service-account-key.json"
GCP_PROJECT_ID="your-project-id"

# Hybrid Scoring Configuration
HYBRID_SCORING_ENABLED=true
EMBEDDED_ECHO_RIDGE_ENABLED=true
ECHO_RIDGE_HOST=127.0.0.1
ECHO_RIDGE_PORT=8070

# AI Model Configuration
GPT_MODEL="gpt-4o-mini"
GOOGLE_PLACES_REQUESTS_PER_MINUTE=600
OPENAI_REQUESTS_PER_MINUTE=500

Frontend v2 Environment (`.env.local`)

# Pipeline API endpoint
NEXT_PUBLIC_API_URL=http://localhost:8082

# Optional: Disable telemetry
NEXT_TELEMETRY_DISABLED=1

Geographic Coverage

Predefined Metro Areas (instant resolution):

Chicago, NYC, Los Angeles, Boston, Seattle, Denver, Phoenix, Miami,
Atlanta, Dallas, Houston, San Francisco, Portland, Nashville, Detroit,
Minneapolis, Tampa, Orlando, Charlotte, Indianapolis, Columbus, Austin,
Jacksonville, San Antonio, San Diego, Fort Worth, Philadelphia,
Washington DC, Las Vegas, Rapid City, Syracuse, Madison + more...

Dynamic Geocoding: Automatic fallback to Google Geocoding API for unlisted regions
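The lookup-then-fallback behaviour can be sketched as follows. This is an assumption-laden illustration, not the code in `geofence.py`: the table holds a small subset of metros with approximate city-center coordinates, the real module may store bounding boxes instead, and `geocode` stands in for whatever function calls the Google Geocoding API.

```python
from typing import Callable, Dict, Optional, Tuple

LatLng = Tuple[float, float]

# Illustrative subset only; approximate city-center coordinates.
PREDEFINED_METROS: Dict[str, LatLng] = {
    "tampa": (27.9506, -82.4572),
    "seattle": (47.6062, -122.3321),
    "austin": (30.2672, -97.7431),
}


def resolve_region(
    name: str, geocode: Callable[[str], Optional[LatLng]]
) -> Optional[LatLng]:
    """Resolve a region name, preferring the predefined table.

    `geocode` is a hypothetical callable wrapping a geocoding service;
    it is only invoked when the name is not in the table.
    """
    key = " ".join(name.lower().split())  # normalize case and whitespace
    hit = PREDEFINED_METROS.get(key)
    if hit is not None:
        return hit
    return geocode(name)
```

Passing the geocoder in as a parameter keeps the fast path free of network calls and makes the fallback easy to stub in tests.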

Project Structure

pmf_finder_backend/
├── run_pipeline.py              # Main CLI entry point
├── input/
│   ├── cli.py                   # Query processing & orchestration
│   └── geofence.py              # Geographic boundary resolution
├── places_search/
│   └── places_google.py         # Google Places API integration
├── dedupe.py                    # Multi-strategy deduplication
├── scraping_module/
│   └── scrape_firecrawl.py      # Web content extraction
├── llm_scoring/
│   └── llm_score.py             # OpenAI GPT-4 DIMB analysis
├── flask_interface/
│   ├── app.py                   # Flask web application
│   ├── templates/               # Web dashboard UI
│   └── static/                  # Assets & styling
├── data/                        # Organized output storage
│   └── YYYY/MM/DD/run_id/       # Timestamped hierarchies
└── common_models.py             # Pydantic data validation

Advanced Usage

Custom Queries

# Geographic specificity
python run_pipeline.py --query "Dentists in Rapid City South Dakota" --max 30

# Category + location patterns
python run_pipeline.py --query "Coffee shops in Syracuse NY" --max 40
python run_pipeline.py --query "Tech companies in Madison Wisconsin" --max 50

Output Structure

Each run writes its data locally under data/YYYY/MM/DD/run_id/ and, when GCS upload is enabled, to the bucket at results/YYYY/MM/DD/run_id/:

├── query.json              # Original query & geofence
├── run.json                # Execution metadata & statistics
├── manifest.json           # GCS upload metadata
├── places/
│   ├── places_raw.jsonl    # Raw API responses
│   └── places_norm.jsonl   # Discovered & normalized businesses
├── scrapes/
│   └── scrapes.jsonl       # Web content snapshots
└── scores/
    ├── scores.jsonl        # Full AI scorecards with reasoning
    ├── scores_condensed.jsonl  # Summary data
    └── hybrid_results.jsonl    # Combined AI + deterministic scores
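Because each output file is JSONL (one JSON object per line), it can be read with the standard library alone. The sketch below is a generic reader. It makes no assumption about the fields inside each record, and `read_jsonl` and `hybrid_results_path` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Iterator, Union


def read_jsonl(path: Union[str, Path]) -> Iterator[dict]:
    """Yield one decoded object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the file and line so a truncated run is easy to locate.
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc


def hybrid_results_path(run_dir: Union[str, Path]) -> Path:
    """Locate hybrid_results.jsonl inside a run directory (layout shown above)."""
    return Path(run_dir) / "scores" / "hybrid_results.jsonl"
```

For example, `sum(1 for _ in read_jsonl(hybrid_results_path(run_dir)))` counts the scored businesses in a run without loading the whole file into memory.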

Run ID Format: YYYYMMDD_HHMMSS_category_region

Example: 20251012_143025_private_schools_tampa
Enables date-based organization and easy identification
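A run ID in this format can be split into its timestamp and descriptive part with a short parser. This is a sketch rather than the project's own code. Because category and region are both joined by underscores, the boundary between them is ambiguous, so the remainder is kept as a single slug. Deriving the date folders from the run ID timestamp is an assumption, and `parse_run_id` and `gcs_prefix` are hypothetical names.

```python
import re
from datetime import datetime
from typing import NamedTuple


class RunId(NamedTuple):
    started_at: datetime
    slug: str  # category and region, left joined (the split is ambiguous)


_RUN_ID = re.compile(r"^(\d{8})_(\d{6})_(.+)$")


def parse_run_id(run_id: str) -> RunId:
    """Split YYYYMMDD_HHMMSS_category_region into a timestamp and a slug."""
    match = _RUN_ID.match(run_id)
    if match is None:
        raise ValueError(f"not a valid run ID: {run_id!r}")
    # strptime also rejects impossible dates such as month 13.
    started_at = datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H%M%S")
    return RunId(started_at, match.group(3))


def gcs_prefix(run_id: str) -> str:
    """Build the results/YYYY/MM/DD/run_id/ prefix for a run.

    ASSUMPTION: the date folders match the timestamp embedded in the run ID.
    """
    ts = parse_run_id(run_id).started_at
    return f"results/{ts:%Y/%m/%d}/{run_id}/"
```

For the example ID above, `gcs_prefix` returns `results/2025/10/12/20251012_143025_private_schools_tampa/`.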

🔧 Troubleshooting

Common Issues

Pipeline API not accessible

# Check if Pipeline API is running
curl http://localhost:8082/health

# Start Pipeline API if needed
cd services/pipeline
python main.py

GCS upload failing

# Verify GCS configuration
echo $GCS_ENABLED
echo $GCS_BUCKET_NAME

# Test credentials
gcloud auth application-default login
# OR set credentials path
export GCS_CREDENTIALS_PATH=/path/to/credentials.json

# Verify bucket exists and is accessible
gsutil ls gs://$GCS_BUCKET_NAME

Frontend not showing runs

# Check Pipeline API GCS endpoints
curl http://localhost:8082/v1/runs/list

# Verify GCS_ENABLED in pipeline environment
cd services/pipeline
grep GCS_ENABLED ../../.env

# Check frontend API configuration (path is relative to services/pipeline)
cd ../../echoridge_search_frontend_v2
grep NEXT_PUBLIC_API_URL .env.local

No backend data found locally

# Generate sample data
cd pmf_finder_backend
python run_pipeline.py --query "Schools in Tampa" --max 10

# Check data was created
find data -name "*.jsonl" -type f | head -5

# List recent runs
python run_pipeline.py --list-runs

Performance Optimization

Faster ETL processing

# Use condensed scores for large datasets
docker-compose -f docker-compose.catalog.yml run --rm catalog-etl python main.py ingest --latest --fast

Frontend caching

SQLite WASM cache persists between sessions
Clear cache: Browser DevTools > Storage > Clear All
Disable cache: Set NEXT_PUBLIC_DB_DRIVER=indexeddb

Data Pipeline Issues

API Rate Limits

# Lower the per-minute request limits in .env
GOOGLE_PLACES_REQUESTS_PER_MINUTE=300
OPENAI_REQUESTS_PER_MINUTE=200
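To illustrate what a per-minute limit means in practice, the sketch below spaces calls evenly so that no more than the configured number are issued per minute. It is a simplified stand-in, not the pipeline's actual throttling code; `MinuteRateLimiter` is a hypothetical name, and the injectable `clock` and `sleep` exist only to make it testable.

```python
import time
from typing import Callable


class MinuteRateLimiter:
    """Space calls evenly to stay under a requests-per-minute budget."""

    def __init__(
        self,
        requests_per_minute: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if requests_per_minute <= 0:
            raise ValueError("requests_per_minute must be positive")
        self.min_interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = clock()

    def acquire(self) -> float:
        """Block until the next call is allowed; return the seconds waited."""
        now = self._clock()
        wait = max(0.0, self._next_allowed - now)
        if wait > 0:
            self._sleep(wait)
        # Schedule the following slot one interval after this one.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return wait
```

With `OPENAI_REQUESTS_PER_MINUTE=200`, a limiter built as `MinuteRateLimiter(200)` would space calls 0.3 seconds apart; call `acquire()` immediately before each API request.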

Incomplete scoring

# Check scoring logs
cd pmf_finder_backend
tail -f data/latest_run/logs/scoring.log

API Key Issues: Verify GOOGLE_PLACES_API_KEY, OPENAI_API_KEY, and FIRECRAWL_API_KEY in .env
Rate Limits: Adjust *_REQUESTS_PER_MINUTE variables based on your API tier
Geographic Issues: Check the predefined metro list, or verify internet connectivity for dynamic geocoding
Web Dashboard: Ensure the Flask dependencies are installed: pip install -r flask_interface/requirements.txt