Complete business intelligence platform: AI pipeline + cloud storage + real-time visualization
EchoRidge Search consists of three integrated systems:
CLI Input → Places Discovery → Web Scraping → Hybrid AI Scoring → GCS Storage
Frontend → Pipeline API → PMF Backend → GCS Upload → Status Polling
User Interface → Pipeline API → Run Status → GCS Results → Map Visualization
User Query → Pipeline API (8082) → PMF Backend → JSONL Files → GCS Bucket
↓
Frontend (3000) ← Status Updates ← Pipeline API ← GCS Run Data
Required API keys (.env file):
GOOGLE_PLACES_API_KEY - For place discovery
FIRECRAWL_API_KEY - For web scraping
OPENAI_API_KEY - For AI scoring
# 1. Set up environment
cp .env.example .env
# Edit .env with your API keys and GCS configuration
# 2. Install backend dependencies
cd pmf_finder_backend
pip install -r requirements.txt
# 3. Start Pipeline API
cd ../services/pipeline
python main.py & # Runs on port 8082
# 4. Start Frontend v2
cd ../../echoridge_search_frontend_v2
npm install
npm run dev # Visit http://localhost:3000
# Run pipeline directly with CLI
cd pmf_finder_backend
python run_pipeline.py --query "Private schools in Tampa" --max 25
# With GCS upload enabled
export GCS_ENABLED=true
export GCS_BUCKET_NAME=your-bucket-name
export GCS_CREDENTIALS_PATH=/path/to/credentials.json
python run_pipeline.py --query "Restaurants in Austin" --max 50
# Start Pipeline API
cd services/pipeline
python main.py
# Trigger from another terminal
curl -X POST http://localhost:8082/v1/pipeline/execute \
-H "Content-Type: application/json" \
-d '{"query": "coffee shops in Seattle", "max_results": 20}'
# Check status
curl http://localhost:8082/v1/pipeline/status/{job_id}
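The same two calls can be made from Python. This is a minimal sketch using only the standard library; the endpoint paths, port, and request body come from the curl examples above, while the `job_id` field in the response is an assumption (the response schema is not documented here), and the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8082"  # Pipeline API port used in the curl examples


def build_execute_request(query: str, max_results: int,
                          base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request equivalent to the curl 'execute' example."""
    body = json.dumps({"query": query, "max_results": max_results}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/pipeline/execute",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def status_url(job_id: str, base_url: str = BASE_URL) -> str:
    """URL of the status endpoint for one job."""
    return f"{base_url}/v1/pipeline/status/{job_id}"


def fetch_json(request_or_url) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request_or_url, timeout=30) as response:
        return json.load(response)


def trigger_and_check(query: str, max_results: int) -> dict:
    """Start a run, then fetch its status once."""
    started = fetch_json(build_execute_request(query, max_results))
    # ASSUMPTION: the execute response carries the identifier under "job_id".
    return fetch_json(status_url(started["job_id"]))
```

With the Pipeline API running, `trigger_and_check("coffee shops in Seattle", 20)` would return whatever status document the API reports for the new job.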
Backend Pipeline (pmf_finder_backend/)
Pipeline API (services/pipeline/)
/v1/pipeline/execute - Trigger new pipeline run
/v1/pipeline/status/{job_id} - Get execution status
/v1/runs/list - List all runs from GCS
/v1/runs/{run_id} - Get complete run data
/v1/runs/{run_id}/map-data - Get visualization data
GCS Storage (pmf_finder_backend/storage/)
results/YYYY/MM/DD/run_id/
Frontend (echoridge_search_frontend_v2/)
# Backend Pipeline
cd pmf_finder_backend
python run_pipeline.py --query "Schools in NYC" --max 50
# ETL Operations
cd services/etl
python main.py ingest --latest
python main.py status
python main.py list-runs
# API Development
cd services/api
python main.py # Start on localhost:8081
# Frontend Development
cd echoridge_search_frontend
npm run dev # Start on localhost:3000
Catalog API (localhost:8081)
# Company search with pagination
GET /v1/catalog/companies?q=schools&region=florida&limit=20
# Single company details
GET /v1/catalog/companies/{id}
# Evidence for a company
GET /v1/catalog/companies/{id}/evidence
# Statistics and health
GET /v1/catalog/stats
GET /health
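When calling the company search endpoint from code, letting a library encode the query string avoids hand-escaping mistakes. A small sketch, assuming the `q`, `region`, and `limit` parameters shown above and the localhost:8081 port; the function name is hypothetical.

```python
from urllib.parse import urlencode

CATALOG_BASE = "http://localhost:8081"  # API port from the development commands


def company_search_url(q: str, region: str, limit: int = 20,
                       base: str = CATALOG_BASE) -> str:
    """Build the company search URL with a properly encoded query string."""
    params = {"q": q, "region": region, "limit": limit}
    return f"{base}/v1/catalog/companies?{urlencode(params)}"


print(company_search_url("schools", "florida"))
```

Multi-word queries are encoded automatically, so `company_search_url("private schools", "florida")` yields `q=private+schools` in the URL.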
Flask Interface (localhost:5000)
GET /api/runs # Available pipeline runs
GET /api/analytics # Aggregated statistics
GET /api/heatmap-data # Geographic distribution
Backend (.env)
# Required API Keys
GOOGLE_PLACES_API_KEY="your_google_places_api_key"
OPENAI_API_KEY="your_openai_api_key"
FIRECRAWL_API_KEY="your_firecrawl_api_key"
# Google Cloud Storage Configuration
GCS_ENABLED=true
GCS_BUCKET_NAME="your-bucket-name"
GCS_CREDENTIALS_PATH="/path/to/service-account-key.json"
GCP_PROJECT_ID="your-project-id"
# Hybrid Scoring Configuration
HYBRID_SCORING_ENABLED=true
EMBEDDED_ECHO_RIDGE_ENABLED=true
ECHO_RIDGE_HOST=127.0.0.1
ECHO_RIDGE_PORT=8070
# AI Model Configuration
GPT_MODEL="gpt-4o-mini"
GOOGLE_PLACES_REQUESTS_PER_MINUTE=600
OPENAI_REQUESTS_PER_MINUTE=500
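To make the `*_REQUESTS_PER_MINUTE` settings concrete: 600 requests per minute allows one request every 0.1 s. How the pipeline actually enforces these limits is not shown here, so the following is only an illustrative sketch of an evenly spaced limiter driven by those variables; `MinuteRateLimiter` and `limiter_from_env` are hypothetical names.

```python
import os
import time


def min_interval_seconds(requests_per_minute: int) -> float:
    """Smallest gap between requests that stays within the per-minute budget."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    return 60.0 / requests_per_minute


class MinuteRateLimiter:
    """Spaces calls evenly; clock and sleep are injectable for testing."""

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        self._interval = min_interval_seconds(requests_per_minute)
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request has been made yet

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self._interval
        return delay


def limiter_from_env(var_name: str, default_rpm: int) -> MinuteRateLimiter:
    """Read a *_REQUESTS_PER_MINUTE variable, falling back to a default."""
    return MinuteRateLimiter(int(os.environ.get(var_name, default_rpm)))
```

Calling `limiter.wait()` before each outbound request keeps the call rate at or below the configured budget, for example `limiter_from_env("OPENAI_REQUESTS_PER_MINUTE", 500)`.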
Frontend (.env.local)
# Pipeline API endpoint
NEXT_PUBLIC_API_URL=http://localhost:8082
# Optional: Disable telemetry
NEXT_TELEMETRY_DISABLED=1
Predefined Metro Areas (instant resolution):
Chicago, NYC, Los Angeles, Boston, Seattle, Denver, Phoenix, Miami,
Atlanta, Dallas, Houston, San Francisco, Portland, Nashville, Detroit,
Minneapolis, Tampa, Orlando, Charlotte, Indianapolis, Columbus, Austin,
Jacksonville, San Antonio, San Diego, Fort Worth, Philadelphia,
Washington DC, Las Vegas, Rapid City, Syracuse, Madison + more...
Dynamic Geocoding: Automatic fallback to Google Geocoding API for unlisted regions
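The lookup-then-fallback behavior described above can be sketched as follows. This is not the code in `geofence.py`: the table holds only three illustrative entries with approximate city-center coordinates, the real module may store bounding boxes instead of points, and `geocode` stands in for whatever calls the Google Geocoding API.

```python
from typing import Callable, Dict, Tuple

LatLng = Tuple[float, float]

# Illustrative subset only; approximate city-center coordinates.
PREDEFINED_METROS: Dict[str, LatLng] = {
    "tampa": (27.95, -82.46),
    "seattle": (47.61, -122.33),
    "austin": (30.27, -97.74),
}


def resolve_region(name: str, geocode: Callable[[str], LatLng]) -> Tuple[LatLng, str]:
    """Return (coordinates, source), preferring the predefined table.

    The geocode callable is only invoked for regions missing from the table,
    which is what makes predefined metros resolve instantly.
    """
    key = name.strip().lower()
    if key in PREDEFINED_METROS:
        return PREDEFINED_METROS[key], "predefined"
    return geocode(name), "geocoded"
```

Passing the geocoder in as a function keeps the fallback easy to stub in tests and makes it obvious when a network call is made.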
pmf_finder_backend/
├── run_pipeline.py # Main CLI entry point
├── input/
│ ├── cli.py # Query processing & orchestration
│ └── geofence.py # Geographic boundary resolution
├── places_search/
│ └── places_google.py # Google Places API integration
├── dedupe.py # Multi-strategy deduplication
├── scraping_module/
│ └── scrape_firecrawl.py # Web content extraction
├── llm_scoring/
│ └── llm_score.py # OpenAI GPT-4 DIMB analysis
├── flask_interface/
│ ├── app.py # Flask web application
│ ├── templates/ # Web dashboard UI
│ └── static/ # Assets & styling
├── data/ # Organized output storage
│ └── YYYY/MM/DD/run_id/ # Timestamped hierarchies
└── common_models.py # Pydantic data validation
# Geographic specificity
python run_pipeline.py --query "Dentists in Rapid City South Dakota" --max 30
# Category + location patterns
python run_pipeline.py --query "Coffee shops in Syracuse NY" --max 40
python run_pipeline.py --query "Tech companies in Madison Wisconsin" --max 50
Each run creates organized data locally and in GCS at results/YYYY/MM/DD/run_id/:
├── query.json # Original query & geofence
├── run.json # Execution metadata & statistics
├── manifest.json # GCS upload metadata
├── places/
│ ├── places_raw.jsonl # Raw API responses
│ └── places_norm.jsonl # Discovered & normalized businesses
├── scrapes/
│ └── scrapes.jsonl # Web content snapshots
└── scores/
├── scores.jsonl # Full AI scorecards with reasoning
├── scores_condensed.jsonl # Summary data
└── hybrid_results.jsonl # Combined AI + deterministic scores
Run ID Format: YYYYMMDD_HHMMSS_category_region
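A run ID in this format can be split into its timestamp and the remaining slug. The sketch below does not separate category from region, because both may contain underscores and the format alone does not say where one ends. It also assumes, without confirmation from the pipeline code, that the date directories in the storage path match the date stamp in the run ID.

```python
from datetime import datetime
from typing import NamedTuple


class RunId(NamedTuple):
    started_at: datetime
    slug: str  # category and region, still joined by underscores


def parse_run_id(run_id: str) -> RunId:
    """Split 'YYYYMMDD_HHMMSS_category_region' into a timestamp and a slug."""
    date_part, time_part, slug = run_id.split("_", 2)
    started_at = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    return RunId(started_at, slug)


def results_prefix(run_id: str) -> str:
    """Storage prefix for a run.

    ASSUMPTION: the YYYY/MM/DD directories use the date in the run ID.
    """
    started_at = parse_run_id(run_id).started_at
    return f"results/{started_at:%Y/%m/%d}/{run_id}/"
```

For the ID `20251012_143025_private_schools_tampa`, `parse_run_id` gives a start time of 2025-10-12 14:30:25 and the slug `private_schools_tampa`.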
Example: 20251012_143025_private_schools_tampa
Pipeline API not accessible
# Check if Pipeline API is running
curl http://localhost:8082/health
# Start Pipeline API if needed
cd services/pipeline
python main.py
GCS upload failing
# Verify GCS configuration
echo $GCS_ENABLED
echo $GCS_BUCKET_NAME
# Test credentials
gcloud auth application-default login
# OR set credentials path
export GCS_CREDENTIALS_PATH=/path/to/credentials.json
# Verify bucket exists and is accessible
gsutil ls gs://$GCS_BUCKET_NAME
Frontend not showing runs
# Check Pipeline API GCS endpoints
curl http://localhost:8082/v1/runs/list
# Verify GCS_ENABLED in pipeline environment
cd services/pipeline
grep GCS_ENABLED ../../.env
# Check frontend API configuration
cd echoridge_search_frontend_v2
grep NEXT_PUBLIC_API_URL .env.local
No backend data found locally
# Generate sample data
cd pmf_finder_backend
python run_pipeline.py --query "Schools in Tampa" --max 10
# Check data was created
find data -name "*.jsonl" -type f | head -5
# List recent runs
python run_pipeline.py --list-runs
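To inspect the generated files from Python instead of the shell, a sketch like the following counts records in every JSONL file under `data/`. It relies only on the JSONL convention of one JSON object per line; nothing about the record fields is assumed, and the function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed object per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_records(data_root: Path) -> Dict[str, int]:
    """Map each *.jsonl file (path relative to data_root) to its record count."""
    return {
        p.relative_to(data_root).as_posix(): sum(1 for _ in read_jsonl(p))
        for p in sorted(data_root.rglob("*.jsonl"))
    }
```

Running `count_records(Path("data"))` from `pmf_finder_backend` after a pipeline run shows at a glance whether the places, scrapes, and scores files were populated.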
Faster ETL processing
# Use condensed scores for large datasets
docker-compose -f docker-compose.catalog.yml run --rm catalog-etl python main.py ingest --latest --fast
Frontend caching
NEXT_PUBLIC_DB_DRIVER=indexeddb
API Rate Limits
# Lower the per-minute request limits in .env
GOOGLE_PLACES_REQUESTS_PER_MINUTE=300
OPENAI_REQUESTS_PER_MINUTE=200
Incomplete scoring
# Check scoring logs
cd pmf_finder_backend
tail -f data/latest_run/logs/scoring.log
Set GOOGLE_PLACES_API_KEY, OPENAI_API_KEY, and FIRECRAWL_API_KEY in .env
Adjust *_REQUESTS_PER_MINUTE variables based on your API tier
pip install -r flask_interface/requirements.txt
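A quick pre-flight check for the three required keys might look like the sketch below. It inspects the process environment rather than parsing `.env` itself, so it assumes the variables have already been exported or loaded; `missing_keys` is a hypothetical helper, not part of the repository.

```python
import os
from typing import List, Mapping

REQUIRED_KEYS = ("GOOGLE_PLACES_API_KEY", "OPENAI_API_KEY", "FIRECRAWL_API_KEY")


def missing_keys(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required keys that are unset or blank in the given mapping."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
```

An empty list from `missing_keys()` means all three keys are present; any names returned are the ones still to be filled in.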