Poly-DB - Polymarket Vector Derivatives

Find similar markets, detect derivatives, and discover arbitrage opportunities through semantic vector search

Quick Start • Features • CLI Usage • API Endpoints

Features

Semantic Search: Find markets using natural language queries
Derivative Detection: Identify related markets based on semantic similarity
Arbitrage Detection: Find pricing inefficiencies between similar markets
Market Clustering: Group markets by topic automatically
FastAPI Backend: Production-ready REST API
Minimal Web UI: Clean Flask frontend for exploration
CLI Tool: Command-line interface for batch operations
Docker Support: Fully containerized deployment

Architecture

Embedding Model: Sentence Transformers (all-MiniLM-L6-v2, 384 dimensions)
Vector Database: ChromaDB with persistent local storage
Market Data: Polymarket Gamma API (political markets)
Backend: FastAPI with async support
Frontend: Flask with minimal vanilla JavaScript
Storage: Local ChromaDB for vectors, no external database service required
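
The components above produce 384-dimensional vectors and compare them, but this README does not say which distance metric is configured. A minimal sketch, assuming cosine similarity (a common choice for sentence embeddings); the function name is illustrative and not taken from the project:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)
```

Identical directions score 1.0 and orthogonal vectors score 0.0, which is the scale the similarity thresholds in this README appear to use.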

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Scrape and Vectorize Markets

python cli.py scrape --max-markets 1000

This fetches political markets from Polymarket, extracts their text, generates embeddings with Sentence Transformers, and stores the vectors in ChromaDB. Expect roughly 5-10 minutes for 1000 markets.
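
The scrape step can be pictured as an extract, embed, store loop. The sketch below is an assumption about its shape, not the project's code: the field names `question` and `description`, and the `embed` and `store` callables, are hypothetical stand-ins for the real API client, embedding wrapper, and ChromaDB interface.

```python
def build_market_text(market):
    """Join a market's text fields into one string to embed.

    The field names "question" and "description" are assumptions.
    """
    parts = [market.get("question", ""), market.get("description", "")]
    return " ".join(p.strip() for p in parts if p and p.strip())


def vectorize_markets(markets, embed, store):
    """Run extract -> embed -> store over a list of market dicts.

    embed: callable taking a list of strings and returning a list of vectors.
    store: callable taking (ids, vectors, texts).
    Returns the number of markets stored.
    """
    kept = [(str(m["id"]), build_market_text(m)) for m in markets]
    kept = [(mid, text) for mid, text in kept if text]  # skip markets with no text
    if not kept:
        return 0
    ids = [mid for mid, _ in kept]
    texts = [text for _, text in kept]
    vectors = embed(texts)  # one batched call rather than one call per market
    store(ids, vectors, texts)
    return len(ids)
```

Passing the embedding and storage steps in as callables keeps the loop testable without downloading a model or creating a database.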

3. Run API Server

python api.py

API available at http://localhost:8000

4. Run Frontend

python frontend.py

UI available at http://localhost:5000

Docker Deployment

# Build and Run
docker-compose up -d

# Initial Scrape
docker-compose exec api python cli.py scrape --max-markets 1000

CLI Usage

Scrape Markets

# Scrape active political markets
python cli.py scrape --max-markets 1000

# Include closed markets
python cli.py scrape --max-markets 1000 --closed

# Custom database path
python cli.py scrape --db-path /path/to/db --max-markets 500

Search Markets

# Natural language search
python cli.py search "Trump wins 2024" --limit 10

# Return more results
python cli.py search "senate control" --limit 20

Find Similar Markets

python cli.py similar 12345 --limit 10

Find Derivatives

python cli.py derivatives 12345 --min-similarity 0.75 --limit 20

Detect Arbitrage

python cli.py arbitrage 12345 --min-similarity 0.90 --min-price-diff 0.05

API Endpoints

GET /stats

Get database statistics

{
  "total_markets": 847,
  "name": "polymarket_political_markets",
  "metadata": {
    "embedding_model": "all-MiniLM-L6-v2",
    "embedding_dimensions": 384
  }
}

POST /search

Natural language search

// Request
{
  "query": "Trump wins 2024",
  "limit": 10
}

// Response
{
  "query": "Trump wins 2024",
  "results": [
    {
      "id": "12345",
      "similarity": 0.92,
      "metadata": {
        "question": "Will Trump win the 2024 election?",
        "volume": 150000,
        "last_price": 0.65,
        "active": true
      }
    }
  ]
}
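
A minimal client sketch for this endpoint using only the Python standard library. It assumes the server is running at the Quick Start address and accepts a JSON body shaped like the request above; the helper names are illustrative.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # default address from the Quick Start


def build_search_request(query, limit=10, base_url=API_BASE):
    """Build a POST /search request with a JSON body."""
    body = json.dumps({"query": query, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, limit=10, base_url=API_BASE):
    """Send the request and return the decoded JSON response."""
    request = build_search_request(query, limit, base_url)
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

With the API running, `search("Trump wins 2024")["results"]` would return the list of matches shown in the response above.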

POST /derivatives

Find derivative markets

// Request
{
  "market_id": "12345",
  "min_similarity": 0.75,
  "max_results": 20
}

// Response
{
  "market_id": "12345",
  "derivatives": [
    {
      "id": "67890",
      "similarity": 0.88,
      "relationship": "strong_derivative",
      "metadata": {...}
    }
  ]
}
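
The request parameters suggest a filter-and-rank step over nearest-neighbour results. The sketch below shows one way that could work; it is an assumption about the behaviour, and `find_derivatives` with its `candidates` input is a hypothetical helper rather than the function in derivatives.py.

```python
def find_derivatives(market_id, candidates, min_similarity=0.75, max_results=20):
    """Filter and rank candidate neighbours of a market.

    candidates: list of dicts with "id" and "similarity" keys, for example
    the raw nearest-neighbour results from the vector store.
    """
    matches = [
        c for c in candidates
        # Drop the source market itself and anything below the threshold.
        if c["id"] != market_id and c["similarity"] >= min_similarity
    ]
    matches.sort(key=lambda c: c["similarity"], reverse=True)
    return matches[:max_results]
```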

POST /arbitrage

Find arbitrage opportunities

// Request
{
  "market_id": "12345",
  "min_similarity": 0.90,
  "min_price_diff": 0.05
}

// Response
{
  "market_id": "12345",
  "opportunities": [
    {
      "market_a": {
        "id": "12345",
        "question": "Trump wins 2024",
        "price": 0.65
      },
      "market_b": {
        "id": "67890",
        "question": "Republican wins 2024",
        "price": 0.60
      },
      "similarity": 0.93,
      "price_diff": 0.05,
      "arb_type": "subset",
      "expected_profit": 0.05,
      "confidence": "high"
    }
  ]
}
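
In the example response, price_diff is the absolute gap between the two prices (0.65 - 0.60 = 0.05). The sketch below applies the two request thresholds under that assumption. It does not reproduce arb_type, expected_profit, or confidence, whose logic this README does not describe, and `find_price_gaps` is a hypothetical name.

```python
def find_price_gaps(source, candidates, min_similarity=0.90, min_price_diff=0.05):
    """Return candidates that are similar to `source` but priced differently.

    Each market is a dict with "id" and "price"; candidates also carry
    a "similarity" score relative to the source market.
    """
    gaps = []
    for c in candidates:
        if c["id"] == source["id"] or c["similarity"] < min_similarity:
            continue
        # Round to 4 decimals so float noise does not affect the comparison.
        diff = round(abs(source["price"] - c["price"]), 4)
        if diff >= min_price_diff:
            gaps.append({
                "market_a": source["id"],
                "market_b": c["id"],
                "similarity": c["similarity"],
                "price_diff": diff,
            })
    # Largest gaps first.
    return sorted(gaps, key=lambda g: g["price_diff"], reverse=True)
```

A price gap between similar markets is only a candidate. Whether it is a real mispricing depends on how the two questions relate logically, as in the "subset" example above, where the narrower outcome is priced above the broader one.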

Similarity Interpretation

Score Range	Interpretation
0.95-1.00	Near duplicates (same question, different outcome)
0.85-0.95	Strong derivatives (closely related)
0.75-0.85	Moderate derivatives (related but distinct)
0.65-0.75	Weak correlation
< 0.65	Unrelated
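
The table can be expressed as a small lookup. In this sketch each band's lower bound is treated as inclusive, which the table leaves ambiguous. Only the label "strong_derivative" appears in the API examples; the other label strings are assumptions chosen to match it.

```python
def classify_similarity(score):
    """Map a similarity score to a relationship label per the table above."""
    if score >= 0.95:
        return "near_duplicate"       # assumed label
    if score >= 0.85:
        return "strong_derivative"    # label seen in the /derivatives example
    if score >= 0.75:
        return "moderate_derivative"  # assumed label
    if score >= 0.65:
        return "weak_correlation"     # assumed label
    return "unrelated"                # assumed label
```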

Performance

Initial Scrape (1000 markets)

Scraping: 2-5 minutes
Text processing: 5 seconds
Embedding generation: 2-5 minutes (CPU)
Storage: 10 seconds
Total: ~5-10 minutes

Query Performance

Single market search: < 50ms
Batch query (10 markets): < 200ms
Text search: < 100ms

Storage

1000 markets: ~150-200 MB
Per market: ~3-7 KB
Embedding size: 384 floats × 4 bytes = 1.5 KB
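
The embedding-size figure is plain arithmetic and can be checked directly. The sketch assumes 4-byte (float32) values, as the line above implies, and covers raw vector data only, not metadata or index overhead.

```python
EMBEDDING_DIMENSIONS = 384  # all-MiniLM-L6-v2 output size
BYTES_PER_FLOAT = 4         # assuming float32 storage


def embedding_bytes(n_markets=1):
    """Raw bytes needed to hold the vectors for n_markets markets."""
    return n_markets * EMBEDDING_DIMENSIONS * BYTES_PER_FLOAT


def to_kib(n_bytes):
    """Convert bytes to KiB (1024 bytes)."""
    return n_bytes / 1024
```

Here `to_kib(embedding_bytes())` evaluates to 1.5, matching the 1.5 KB per embedding quoted above.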

Project Structure

vectorization_derivatives/
├── gamma_api.py              # Polymarket API client
├── text_processor.py         # Text extraction
├── embeddings.py             # Sentence Transformers wrapper
├── chroma_client.py          # ChromaDB interface
├── derivatives.py            # Analysis functions
├── market_vectorizer.py      # Orchestrator
├── cli.py                    # CLI tool
├── api.py                    # FastAPI backend
├── frontend.py               # Flask frontend
├── templates/
│   └── index.html            # Web UI
├── requirements.txt          # Dependencies
├── docker-compose.yml        # Docker orchestration
├── Dockerfile.api            # API container
├── Dockerfile.frontend       # Frontend container
└── README.md                 # Documentation