EchoRidge Search - AI-Powered Business Intelligence Platform

Python Next.js FastAPI GCS Docker

Complete business intelligence platform: AI pipeline + cloud storage + real-time visualization


Project Echo Ridge - AI Business Intelligence Pipeline

🏗️ Architecture Overview

EchoRidge Search consists of three integrated systems:

Pipeline Backend (Batch Processing)

CLI Input → Places Discovery → Web Scraping → Hybrid AI Scoring → GCS Storage

Pipeline API (Orchestration)

Frontend → Pipeline API → PMF Backend → GCS Upload → Status Polling

Frontend v2 (Real-time Visualization)

User Interface → Pipeline API → Run Status → GCS Results → Map Visualization

Complete Data Flow

User Query → Pipeline API (8082) → PMF Backend → JSONL Files → GCS Bucket
                                                                    ↓
Frontend (3000) ← Status Updates ← Pipeline API ← GCS Run Data

🎯 Key Features

🚀 Quick Start

Prerequisites

  1. Required API Keys (add to .env file):
    • GOOGLE_PLACES_API_KEY - For place discovery
    • FIRECRAWL_API_KEY - For web scraping
    • OPENAI_API_KEY - For AI scoring
  2. Google Cloud Setup (for GCS storage):
    • Create a GCS bucket
    • Set up service account credentials
    • Download credentials JSON file

Option 1: Full Stack (Pipeline + Frontend)

# 1. Setup environment
cp .env.example .env
# Edit .env with your API keys and GCS configuration

# 2. Install backend dependencies
cd pmf_finder_backend
pip install -r requirements.txt

# 3. Start Pipeline API
cd ../services/pipeline
python main.py &  # Runs on port 8082

# 4. Start Frontend v2
cd ../../echoridge_search_frontend_v2
npm install
npm run dev  # Visit http://localhost:3000

Option 2: CLI Pipeline Only

# Run pipeline directly with CLI
cd pmf_finder_backend
python run_pipeline.py --query "Private schools in Tampa" --max 25

# With GCS upload enabled
export GCS_ENABLED=true
export GCS_BUCKET_NAME=your-bucket-name
export GCS_CREDENTIALS_PATH=/path/to/credentials.json
python run_pipeline.py --query "Restaurants in Austin" --max 50

Option 3: API-Triggered Pipeline

# Start Pipeline API
cd services/pipeline
python main.py

# Trigger from another terminal
curl -X POST http://localhost:8082/v1/pipeline/execute \
  -H "Content-Type: application/json" \
  -d '{"query": "coffee shops in Seattle", "max_results": 20}'

# Check status
curl http://localhost:8082/v1/pipeline/status/{job_id}
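The two curl calls above can also be driven from Python. The following is a minimal sketch, not the project's own client: the endpoints and request body come from this README, but the response field names (`job_id`, `status`) and the terminal state names (`completed`, `failed`) are assumptions and may differ in the real Pipeline API.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8082"  # Pipeline API port used in this README


def http_json(method, url, payload=None):
    """Send a request with an optional JSON body and decode the JSON reply."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_and_wait(query, max_results, request=http_json, sleep=time.sleep,
                 poll_seconds=5.0, max_polls=120):
    """Start a pipeline run, then poll its status until it reaches an end state."""
    started = request(
        "POST",
        f"{BASE_URL}/v1/pipeline/execute",
        {"query": query, "max_results": max_results},
    )
    job_id = started["job_id"]  # ASSUMPTION: the response carries a "job_id" field
    for _ in range(max_polls):
        status = request("GET", f"{BASE_URL}/v1/pipeline/status/{job_id}")
        # ASSUMPTION: end states are reported as "completed" or "failed"
        if status.get("status") in ("completed", "failed"):
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

With the Pipeline API running, `run_and_wait("coffee shops in Seattle", 20)` would submit the same request as the curl example and return the last status payload. The `request` and `sleep` parameters exist only so the loop can be exercised without a live server.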

📦 Services Architecture

Pipeline Backend (pmf_finder_backend/)

Pipeline API (services/pipeline/)

GCS Storage Client (pmf_finder_backend/storage/)

Frontend v2 (echoridge_search_frontend_v2/)

Storage Layer

🛠️ Development Commands

# Backend Pipeline
cd pmf_finder_backend
python run_pipeline.py --query "Schools in NYC" --max 50

# ETL Operations
cd services/etl
python main.py ingest --latest
python main.py status
python main.py list-runs

# API Development
cd services/api
python main.py  # Start on localhost:8081

# Frontend Development
cd echoridge_search_frontend
npm run dev  # Start on localhost:3000

🌐 User Interfaces

Modern Search Frontend (Port 3000)

Legacy Flask Dashboard (Port 5000)

🔌 API Endpoints

Catalog API (localhost:8081)

# Company search with pagination
GET /v1/catalog/companies?q=schools&region=florida&limit=20

# Single company details
GET /v1/catalog/companies/{id}

# Evidence for a company
GET /v1/catalog/companies/{id}/evidence

# Statistics and health
GET /v1/catalog/stats
GET /health
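When calling the company search from code, it is safer to let a library encode the query string than to build it by hand. A small sketch, assuming the search accepts the parameters `q`, `region`, and `limit` and that the Catalog API listens on port 8081 as stated above; `company_search_url` is a hypothetical helper, not part of the project:

```python
from urllib.parse import urlencode

CATALOG_BASE = "http://localhost:8081"  # Catalog API address from this README


def company_search_url(q, region=None, limit=20):
    """Build a company search URL with a correctly encoded query string."""
    params = [("q", q)]
    if region is not None:
        params.append(("region", region))  # ASSUMPTION: parameter is named "region"
    params.append(("limit", limit))
    return f"{CATALOG_BASE}/v1/catalog/companies?{urlencode(params)}"
```

For example, `company_search_url("schools", region="florida")` yields a URL ending in `?q=schools&region=florida&limit=20`, and spaces or ampersands inside a search term are escaped automatically.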

Legacy Flask API (localhost:5000)

GET /api/runs           # Available pipeline runs
GET /api/analytics      # Aggregated statistics
GET /api/heatmap-data   # Geographic distribution

⚙️ Configuration

Pipeline Environment (.env)

# Required API Keys
GOOGLE_PLACES_API_KEY="your_google_places_api_key"
OPENAI_API_KEY="your_openai_api_key"
FIRECRAWL_API_KEY="your_firecrawl_api_key"

# Google Cloud Storage Configuration
GCS_ENABLED=true
GCS_BUCKET_NAME="your-bucket-name"
GCS_CREDENTIALS_PATH="/path/to/service-account-key.json"
GCP_PROJECT_ID="your-project-id"

# Hybrid Scoring Configuration
HYBRID_SCORING_ENABLED=true
EMBEDDED_ECHO_RIDGE_ENABLED=true
ECHO_RIDGE_HOST=127.0.0.1
ECHO_RIDGE_PORT=8070

# AI Model Configuration
GPT_MODEL="gpt-4o-mini"
GOOGLE_PLACES_REQUESTS_PER_MINUTE=600
OPENAI_REQUESTS_PER_MINUTE=500
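A missing key is easier to diagnose at startup than halfway through a run. The sketch below shows one way to validate these settings before the pipeline starts. It is illustrative only: `PipelineConfig` and `load_config` are hypothetical names, and the fallback values simply mirror the sample values shown above rather than any documented defaults.

```python
import os
from dataclasses import dataclass
from typing import Optional

REQUIRED_KEYS = ("GOOGLE_PLACES_API_KEY", "OPENAI_API_KEY", "FIRECRAWL_API_KEY")


@dataclass(frozen=True)
class PipelineConfig:
    gcs_enabled: bool
    gcs_bucket_name: Optional[str]
    gpt_model: str
    places_rpm: int
    openai_rpm: int


def load_config(env=None):
    """Read pipeline settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise ValueError("Missing required settings: " + ", ".join(missing))

    gcs_enabled = env.get("GCS_ENABLED", "false").strip().lower() == "true"
    bucket = env.get("GCS_BUCKET_NAME") or None
    if gcs_enabled and not bucket:
        raise ValueError("GCS_ENABLED is true but GCS_BUCKET_NAME is not set")

    return PipelineConfig(
        gcs_enabled=gcs_enabled,
        gcs_bucket_name=bucket,
        # ASSUMPTION: fallbacks copy the sample values from this README
        gpt_model=env.get("GPT_MODEL", "gpt-4o-mini"),
        places_rpm=int(env.get("GOOGLE_PLACES_REQUESTS_PER_MINUTE", "600")),
        openai_rpm=int(env.get("OPENAI_REQUESTS_PER_MINUTE", "500")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the validation easy to check in isolation.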

Frontend v2 Environment (.env.local)

# Pipeline API endpoint
NEXT_PUBLIC_API_URL=http://localhost:8082

# Optional: Disable telemetry
NEXT_TELEMETRY_DISABLED=1

Geographic Coverage

Predefined Metro Areas (instant resolution):

Chicago, NYC, Los Angeles, Boston, Seattle, Denver, Phoenix, Miami,
Atlanta, Dallas, Houston, San Francisco, Portland, Nashville, Detroit,
Minneapolis, Tampa, Orlando, Charlotte, Indianapolis, Columbus, Austin,
Jacksonville, San Antonio, San Diego, Fort Worth, Philadelphia,
Washington DC, Las Vegas, Rapid City, Syracuse, Madison + more...

Dynamic Geocoding: Automatic fallback to Google Geocoding API for unlisted regions
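The lookup-then-fallback behavior described above can be pictured roughly as follows. This is a sketch under assumptions, not the contents of geofence.py: the real module may store bounding boxes or polygons rather than center points, the three entries are an illustrative subset with approximate city-center coordinates, and `geocode` stands in for whatever function calls the Google Geocoding API.

```python
from typing import Callable, Dict, Tuple

LatLng = Tuple[float, float]

# Illustrative subset only; approximate city-center coordinates.
PREDEFINED_METROS: Dict[str, LatLng] = {
    "chicago": (41.8781, -87.6298),
    "seattle": (47.6062, -122.3321),
    "tampa": (27.9506, -82.4572),
}


def resolve_region(region: str, geocode: Callable[[str], LatLng]) -> LatLng:
    """Return coordinates from the predefined table, else ask the geocoder."""
    key = region.strip().lower()
    if key in PREDEFINED_METROS:
        return PREDEFINED_METROS[key]  # instant resolution, no API call
    return geocode(region)  # hypothetical wrapper around the Geocoding API
```

The benefit of this shape is that common metro areas cost no API call, while unlisted regions such as smaller towns still resolve through the fallback.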

Project Structure

pmf_finder_backend/
├── run_pipeline.py              # Main CLI entry point
├── input/
│   ├── cli.py                   # Query processing & orchestration
│   └── geofence.py              # Geographic boundary resolution
├── places_search/
│   └── places_google.py         # Google Places API integration
├── dedupe.py                    # Multi-strategy deduplication
├── scraping_module/
│   └── scrape_firecrawl.py      # Web content extraction
├── llm_scoring/
│   └── llm_score.py             # OpenAI GPT-4 DIMB analysis
├── flask_interface/
│   ├── app.py                   # Flask web application
│   ├── templates/               # Web dashboard UI
│   └── static/                  # Assets & styling
├── data/                        # Organized output storage
│   └── YYYY/MM/DD/run_id/       # Timestamped hierarchies
└── common_models.py             # Pydantic data validation

Advanced Usage

Custom Queries

# Geographic specificity
python run_pipeline.py --query "Dentists in Rapid City South Dakota" --max 30

# Category + location patterns
python run_pipeline.py --query "Coffee shops in Syracuse NY" --max 40
python run_pipeline.py --query "Tech companies in Madison Wisconsin" --max 50

Output Structure

Each run writes its output locally (under data/YYYY/MM/DD/run_id/) and to GCS (under results/YYYY/MM/DD/run_id/):

├── query.json              # Original query & geofence
├── run.json                # Execution metadata & statistics
├── manifest.json           # GCS upload metadata
├── places/
│   ├── places_raw.jsonl    # Raw API responses
│   └── places_norm.jsonl   # Discovered & normalized businesses
├── scrapes/
│   └── scrapes.jsonl       # Web content snapshots
└── scores/
    ├── scores.jsonl        # Full AI scorecards with reasoning
    ├── scores_condensed.jsonl  # Summary data
    └── hybrid_results.jsonl    # Combined AI + deterministic scores
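The .jsonl files in this tree hold one JSON object per line, so they can be streamed without loading a whole file into memory. A minimal reader sketch follows; the record fields are not documented here, so the code makes no assumption about them, and `read_jsonl` is a hypothetical helper name.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one decoded JSON value per non-empty line of a .jsonl file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the file and line so a truncated record is easy to find
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
```

For instance, `sum(1 for _ in read_jsonl("scores/scores_condensed.jsonl"))` counts the records in a run's condensed score file.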

Run ID Format: YYYYMMDD_HHMMSS_category_region
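Because the run ID begins with a timestamp, the dated storage path can be derived from the ID alone. The sketch below assumes the first two underscore-separated fields are always the date and time. It leaves the trailing `category_region` part unsplit, since a category or region may itself contain underscores. The helper names and the sample run ID in the usage note are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple


class RunId(NamedTuple):
    started_at: datetime
    label: str  # the "category_region" remainder, kept as one string


def parse_run_id(run_id: str) -> RunId:
    """Split a YYYYMMDD_HHMMSS_category_region run ID into timestamp and label."""
    parts = run_id.split("_", 2)
    if len(parts) < 3 or len(parts[0]) != 8 or len(parts[1]) != 6:
        raise ValueError(f"unexpected run ID format: {run_id!r}")
    date_part, time_part, label = parts
    started_at = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    return RunId(started_at, label)


def results_prefix(run_id: str) -> str:
    """Build the results/YYYY/MM/DD/run_id/ prefix described in this README."""
    started_at = parse_run_id(run_id).started_at
    return f"results/{started_at:%Y/%m/%d}/{run_id}/"
```

A made-up ID such as `20240315_142530_schools_tampa` would map to `results/2024/03/15/20240315_142530_schools_tampa/`.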

🔧 Troubleshooting

Common Issues

Pipeline API not accessible

# Check if Pipeline API is running
curl http://localhost:8082/health

# Start Pipeline API if needed
cd services/pipeline
python main.py

GCS upload failing

# Verify GCS configuration
echo $GCS_ENABLED
echo $GCS_BUCKET_NAME

# Authenticate with Application Default Credentials
gcloud auth application-default login
# OR point the pipeline at a service account key file
export GCS_CREDENTIALS_PATH=/path/to/credentials.json

# Verify bucket exists and is accessible
gsutil ls gs://$GCS_BUCKET_NAME

Frontend not showing runs

# Check Pipeline API GCS endpoints
curl http://localhost:8082/v1/runs/list

# Verify GCS_ENABLED in pipeline environment
cd services/pipeline
grep GCS_ENABLED ../../.env

# Check frontend API configuration
cd ../../echoridge_search_frontend_v2
grep NEXT_PUBLIC_API_URL .env.local

No backend data found locally

# Generate sample data
cd pmf_finder_backend
python run_pipeline.py --query "Schools in Tampa" --max 10

# Check data was created
find data -name "*.jsonl" -type f | head -5

# List recent runs
python run_pipeline.py --list-runs

Performance Optimization

Faster ETL processing

# Use condensed scores for large datasets
docker-compose -f docker-compose.catalog.yml run --rm catalog-etl python main.py ingest --latest --fast

Frontend caching

Data Pipeline Issues

API Rate Limits

# Lower the per-minute request limits in .env
GOOGLE_PLACES_REQUESTS_PER_MINUTE=300
OPENAI_REQUESTS_PER_MINUTE=200
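To see what a per-minute setting means in practice, here is a simple pacing sketch that spaces calls 60 / N seconds apart. It illustrates the idea only and is not how the pipeline itself enforces these limits; `MinuteRateLimiter` is a hypothetical class.

```python
import time


class MinuteRateLimiter:
    """Space calls so they start no more often than once every 60 / per_minute seconds."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        if per_minute <= 0:
            raise ValueError("per_minute must be positive")
        self._interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self._interval
```

With `OPENAI_REQUESTS_PER_MINUTE=200`, a limiter built as `MinuteRateLimiter(200)` would leave 0.3 seconds between consecutive calls when `wait()` is invoked before each request.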

Incomplete scoring

# Check scoring logs
cd pmf_finder_backend
tail -f data/latest_run/logs/scoring.log