🔧 Dataproc MCP Server Documentation

Production-ready Model Context Protocol server for Google Cloud Dataproc operations.

Knowledge Base Semantic Search 🧠

The Dataproc MCP Server includes a knowledge base semantic search feature that supports natural language queries against stored cluster configurations, job data, and operational insights. The feature depends on an optional Qdrant vector database; when Qdrant is unavailable, the server degrades gracefully and returns standard results.

🎯 How Semantic Search Works

The following diagram illustrates the difference between standard queries and semantic search using a real example:

graph LR
    A["🔍 User Query:<br/>'show me clusters with machine learning packages'"] --> B{"Qdrant Available?"}

    B -->|"✅ Semantic Search ON"| C["🧠 Semantic Processing"]
    B -->|"❌ Standard Mode"| D["📋 Standard Processing"]

    C --> C1["Vector Embedding Generation"]
    C1 --> C2["Similarity Search in Knowledge Base"]
    C2 --> C3["Confidence Scoring & Ranking"]
    C3 --> C4["🎯 Intelligent Results"]

    D --> D1["Keyword-based Filtering"]
    D1 --> D2["Standard Data Retrieval"]
    D2 --> D3["📊 Standard Results + Setup Guidance"]

    C4 --> E["📊 Response Comparison"]
    D3 --> E

    E --> F["🚀 WITH SEMANTIC SEARCH<br/><br/>🎯 Query: 'machine learning packages' (confidence: 0.89)<br/><br/>✨ Found 3 relevant clusters:<br/><br/>🤖 ml-training (confidence: 0.92)<br/>   • ML Packages: tensorflow, torch<br/>   • Machine: n1-highmem-8<br/><br/>📊 analytics-prod (confidence: 0.87)<br/>   • ML Packages: sklearn, pandas<br/>   • Machine: n1-standard-16<br/><br/>⚙️ data-pipeline (confidence: 0.84)<br/>   • ML Packages: numpy, scipy<br/>   • Machine: n1-standard-4<br/><br/>💡 Sources: pip packages, init scripts, configs"]

    E --> G["📋 WITHOUT SEMANTIC SEARCH<br/><br/>⚠️ Enhanced search unavailable - Qdrant not connected<br/>📋 Showing standard cluster information instead<br/><br/>🚀 Quick Setup:<br/>   docker run -p 6334:6333 qdrant/qdrant<br/><br/>📋 Standard Cluster List:<br/><br/>🟢 ml-training • RUNNING • n1-highmem-8<br/>🟢 analytics-prod • RUNNING • n1-standard-16<br/>🟢 data-pipeline • RUNNING • n1-standard-4<br/>🟢 web-cluster • RUNNING • n1-standard-2<br/>⭕ test-cluster • STOPPED • n1-standard-1<br/><br/>📝 No package analysis available"]

    style C fill:#e1f5fe
    style C4 fill:#c8e6c9
    style D fill:#fff3e0
    style D3 fill:#ffecb3
    style F fill:#e8f5e8
    style G fill:#fff8e1
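
The semantic branch of the diagram (embedding, similarity search, confidence ranking) can be illustrated with a minimal sketch. It is not the server's implementation: in practice Qdrant performs the similarity search, and the three-dimensional vectors and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical stand-ins for real 384-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, documents, limit=3):
    """Return (name, score) pairs sorted by descending similarity to the query."""
    scored = [(name, cosine_similarity(query_vector, vec)) for name, vec in documents.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]


# Toy embeddings with made-up values; real vectors would come from an embedding model.
clusters = {
    "ml-training": [0.9, 0.1, 0.0],
    "analytics-prod": [0.6, 0.4, 0.1],
    "web-cluster": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]  # stands in for the embedded user query
results = rank_by_similarity(query, clusters, limit=2)
for name, score in results:
    print(f"{name}: {score:.2f}")
```

The score plays the role of the confidence value shown in the diagram: clusters whose vectors point in nearly the same direction as the query rank first.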

🌟 Feature Overview

The knowledge base semantic search lets you query your Dataproc infrastructure by intent rather than by exact keywords.

Benefits Over Regular Queries

| Traditional Queries       | Semantic Queries                    |
|---------------------------|-------------------------------------|
| Exact keyword matching    | Intent-based understanding          |
| Rigid syntax requirements | Natural language flexibility        |
| Limited context awareness | Rich contextual relationships       |
| Manual data filtering     | Intelligent relevance scoring       |
| Shows all data            | Shows relevant data with confidence |

Optional Enhancement with Graceful Degradation

🎯 Key Feature: This is an optional enhancement that doesn't break core functionality. When Qdrant is unavailable, the server falls back to standard responses.

🚀 Setup Instructions

Prerequisites

Step 1: Install and Start Qdrant

# Pull and run Qdrant vector database
docker run -p 6334:6333 qdrant/qdrant

# Verify Qdrant is running
curl http://localhost:6334/health

Expected Response:

{"status":"ok"}

Step 2: Configure Response Filter

The semantic search uses the configuration in config/response-filter.json:

{
  "qdrant": {
    "url": "http://localhost:6334",
    "collectionName": "dataproc_knowledge",
    "vectorSize": 384,
    "distance": "Cosine"
  }
}

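Configuration Parameters: url is the Qdrant REST endpoint; collectionName is the collection that stores the knowledge base vectors; vectorSize is the embedding dimension and must match the output size of the embedding model; distance is the metric used to compare vectors.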
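
A quick sanity check of the qdrant block can catch typos before the server starts. The sketch below is an assumption about what such a check might look like, not the server's own validation; `validate_qdrant_config` is a hypothetical helper, and the allowed distance names are the metric names Qdrant accepts.

```python
import json

# Distance metric names accepted by Qdrant.
ALLOWED_DISTANCES = {"Cosine", "Euclid", "Dot", "Manhattan"}


def validate_qdrant_config(config):
    """Return a list of problems found in the "qdrant" block (empty if none)."""
    qdrant = config.get("qdrant")
    if not isinstance(qdrant, dict):
        return ['missing "qdrant" object']
    problems = []
    url = qdrant.get("url")
    if not isinstance(url, str) or not url.startswith(("http://", "https://")):
        problems.append("url must start with http:// or https://")
    name = qdrant.get("collectionName")
    if not isinstance(name, str) or not name:
        problems.append("collectionName must be a non-empty string")
    size = qdrant.get("vectorSize")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(size, bool) or not isinstance(size, int) or size <= 0:
        problems.append("vectorSize must be a positive integer")
    if qdrant.get("distance") not in ALLOWED_DISTANCES:
        problems.append("distance must be one of " + ", ".join(sorted(ALLOWED_DISTANCES)))
    return problems


sample = json.loads("""
{
  "qdrant": {
    "url": "http://localhost:6334",
    "collectionName": "dataproc_knowledge",
    "vectorSize": 384,
    "distance": "Cosine"
  }
}
""")
print(validate_qdrant_config(sample))
```

To check the real file, load config/response-filter.json with `json.load` and pass the result to the same function.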

Step 3: Verify Setup

Test the semantic search functionality:

# Check if Qdrant collections are created
curl http://localhost:6334/collections

# Test MCP server connection
# Use your MCP client to run: "List clusters with semantic query for machine types"

Step 4: Optional Port Configuration

If you need to use a different port:

{
  "qdrant": {
    "url": "http://localhost:6335",
    "collectionName": "dataproc_knowledge",
    "vectorSize": 384,
    "distance": "Cosine"
  }
}

Then start Qdrant with the custom port:

docker run -p 6335:6333 qdrant/qdrant

📖 Usage Examples

Basic Semantic Queries

Using query_cluster_data Tool

// Natural language cluster data queries
{
  "query": "pip packages installed on clusters",
  "limit": 5
}

{
  "query": "machine types and configurations",
  "projectId": "my-project",
  "region": "us-central1"
}

{
  "query": "network configuration and subnets",
  "clusterName": "my-cluster"
}
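
MCP clients normally wrap these arguments for you. For readers scripting against the server directly, the sketch below shows how the first argument object above could be placed in a JSON-RPC 2.0 `tools/call` request, the method MCP defines for invoking tools. `build_tool_call` is a hypothetical helper, and transport details (stdio or HTTP) are left out.

```python
import itertools
import json

# Monotonically increasing request ids, as JSON-RPC expects a unique id per request.
_request_ids = itertools.count(1)


def build_tool_call(tool_name, arguments):
    """Build a JSON-RPC 2.0 request body for the MCP "tools/call" method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = build_tool_call(
    "query_cluster_data",
    {"query": "pip packages installed on clusters", "limit": 5},
)
print(json.dumps(request, indent=2))
```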

Using list_clusters with Semantic Query

// Enhanced cluster listing with semantic filtering
{
  "projectId": "my-project",
  "region": "us-central1",
  "semanticQuery": "high memory configurations"
}

{
  "semanticQuery": "clusters with Jupyter notebooks",
  "verbose": false
}

Using get_cluster with Semantic Query

// Focused cluster details extraction
{
  "projectId": "my-project",
  "region": "us-central1", 
  "clusterName": "analytics-cluster",
  "semanticQuery": "pip packages and Python libraries"
}

{
  "projectId": "my-project",
  "region": "us-central1",
  "clusterName": "ml-cluster", 
  "semanticQuery": "machine types and worker configuration"
}

Advanced Query Examples

Infrastructure Analysis

{
  "query": "clusters using preemptible instances for cost optimization",
  "limit": 10
}

Component Discovery

{
  "query": "Spark and Hadoop configurations with custom properties",
  "projectId": "analytics-project"
}

Network and Security

{
  "query": "service accounts and IAM configurations",
  "region": "us-central1"
}

Performance Optimization

{
  "query": "SSD disk configurations and storage optimization",
  "limit": 8
}

⚙️ Configuration Details

Response Filter Configuration Structure

The config/response-filter.json file controls all aspects of the semantic search:

{
  "tokenLimits": {
    "list_clusters": 500,
    "get_cluster": 300,
    "default": 400
  },
  "extractionRules": {
    "list_clusters": {
      "maxClusters": 10,
      "essentialFields": [
        "clusterName", "status", "machineType", "numWorkers"
      ],
      "summaryFormat": "table"
    }
  },
  "qdrant": {
    "url": "http://localhost:6334",
    "collectionName": "dataproc_knowledge",
    "vectorSize": 384,
    "distance": "Cosine"
  },
  "formatting": {
    "useEmojis": true,
    "compactTables": true,
    "includeResourceLinks": true
  }
}

Qdrant Collection Management

The system automatically manages two collections:

1. dataproc_knowledge Collection

2. dataproc_responses Collection

Configuration Options

Token Limits

{
  "tokenLimits": {
    "list_clusters": 500,    // Max tokens for cluster lists
    "get_cluster": 300,      // Max tokens for single cluster
    "submit_hive_query": 400, // Max tokens for query responses
    "default": 400           // Default limit for other operations
  }
}
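
The lookup these limits imply (a per-tool value with a fallback to "default") can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, and counting whitespace-separated words is only a crude stand-in for whatever tokenizer the server actually uses.

```python
TOKEN_LIMITS = {
    "list_clusters": 500,
    "get_cluster": 300,
    "submit_hive_query": 400,
    "default": 400,
}


def token_limit_for(tool_name, limits=TOKEN_LIMITS):
    """Use the tool-specific limit if present, otherwise the "default" entry."""
    return limits.get(tool_name, limits["default"])


def truncate_to_limit(text, tool_name, limits=TOKEN_LIMITS):
    """Cut a response to the tool's limit, treating each word as one token (assumption)."""
    limit = token_limit_for(tool_name, limits)
    words = text.split()
    if len(words) <= limit:
        return text
    return " ".join(words[:limit]) + " ..."


long_response = " ".join(["field"] * 350)
shortened = truncate_to_limit(long_response, "get_cluster")
print(token_limit_for("get_cluster"), token_limit_for("some_other_tool"))
```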

Extraction Rules

{
  "extractionRules": {
    "list_clusters": {
      "maxClusters": 10,           // Limit clusters in response
      "essentialFields": [...],     // Key fields to include
      "summaryFormat": "table"      // Response format
    },
    "get_cluster": {
      "essentialSections": [...],   // Important config sections
      "includeMetrics": false,      // Performance data inclusion
      "includeHistory": false       // Historical data inclusion
    }
  }
}
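
As a rough illustration of what the list_clusters rules do, the sketch below caps the number of clusters and keeps only the essential fields. The field names come from the full configuration example earlier in this section; `apply_list_rules` and the sample cluster records are hypothetical, and the summaryFormat setting is not modelled.

```python
LIST_CLUSTERS_RULES = {
    "maxClusters": 10,
    "essentialFields": ["clusterName", "status", "machineType", "numWorkers"],
    "summaryFormat": "table",
}


def apply_list_rules(clusters, rules):
    """Keep at most maxClusters entries and only the essential fields of each."""
    fields = rules["essentialFields"]
    kept = clusters[: rules["maxClusters"]]
    return [{name: c[name] for name in fields if name in c} for c in kept]


# Twelve made-up cluster records carrying extra fields that should be dropped.
raw = [
    {
        "clusterName": f"cluster-{i}",
        "status": "RUNNING",
        "machineType": "n1-standard-4",
        "numWorkers": 2,
        "labels": {"team": "data"},
        "metrics": {"cpu": 0.4},
    }
    for i in range(12)
]
trimmed = apply_list_rules(raw, LIST_CLUSTERS_RULES)
print(len(trimmed), sorted(trimmed[0]))
```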

Troubleshooting Configuration Issues

Qdrant Connection Problems

# Check if Qdrant is running
docker ps | grep qdrant

# Test connection
curl http://localhost:6334/health

# Check logs
docker logs $(docker ps -q --filter ancestor=qdrant/qdrant)

Collection Issues

# List collections
curl http://localhost:6334/collections

# Check collection info
curl http://localhost:6334/collections/dataproc_knowledge

Configuration Validation

# Validate JSON syntax
jq . config/response-filter.json

# Check MCP server logs for configuration errors
# Look for "Qdrant" or "semantic" in log output

🔄 Behavior Without Qdrant

Graceful Degradation Explanation

When Qdrant is not available, the system provides graceful degradation:

What Users See

Without Qdrant (Graceful Fallback):

🔍 Semantic Query: "machine types"

⚠️  Enhanced search unavailable - Qdrant not connected
📋 Showing standard cluster information instead

🚀 To enable semantic search:
   1. Install: docker run -p 6334:6333 qdrant/qdrant  
   2. Verify: curl http://localhost:6334/health

[Standard cluster data follows...]

With Qdrant (Full Functionality):

🔍 Semantic Query: "machine types" (confidence: 0.89)

🎯 Found 3 relevant clusters:
┌─────────────────┬──────────────────┬─────────────┬──────────────┐
│ Cluster         │ Machine Type     │ Workers     │ Confidence   │
├─────────────────┼──────────────────┼─────────────┼──────────────┤
│ ml-training     │ n1-highmem-8     │ 4           │ 0.92         │
│ analytics-prod  │ n1-standard-16   │ 8           │ 0.87         │
│ data-pipeline   │ n1-standard-4    │ 2           │ 0.84         │
└─────────────────┴──────────────────┴─────────────┴──────────────┘
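
The two behaviours shown above follow a common fallback pattern: attempt the semantic path and, if Qdrant cannot be reached, return standard data with a setup hint instead of an error. The sketch below illustrates that pattern only; `answer_query` and the fake search functions are hypothetical, and the server's real control flow may differ.

```python
SETUP_HINT = "docker run -p 6334:6333 qdrant/qdrant"


def answer_query(query, semantic_search, standard_listing):
    """Try semantic search first; fall back to the standard listing if Qdrant is down."""
    try:
        hits = semantic_search(query)
        return {"mode": "semantic", "results": hits}
    except ConnectionError:
        # Qdrant is unreachable: degrade instead of failing the request.
        return {
            "mode": "standard",
            "results": standard_listing(),
            "notice": "Enhanced search unavailable - Qdrant not connected",
            "setup_hint": SETUP_HINT,
        }


def fake_semantic_search_down(query):
    """Simulates a Qdrant outage."""
    raise ConnectionError("connection refused")


def fake_semantic_search_up(query):
    """Simulates a successful semantic lookup."""
    return [{"cluster": "ml-training", "confidence": 0.92}]


def fake_standard_listing():
    """Simulates the plain cluster list that never needs Qdrant."""
    return [{"cluster": "ml-training", "status": "RUNNING"}]


degraded = answer_query("machine types", fake_semantic_search_down, fake_standard_listing)
enhanced = answer_query("machine types", fake_semantic_search_up, fake_standard_listing)
print(degraded["mode"], enhanced["mode"])
```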

Core Functionality Preservation

Always Available (No Qdrant Required):

Enhanced with Qdrant:

Setup Guidance Integration

The system automatically provides setup guidance when semantic features are requested but Qdrant is unavailable:

💡 Semantic Search Setup:

1. 🐳 Start Qdrant:
   docker run -p 6334:6333 qdrant/qdrant

2. ✅ Verify Connection:
   curl http://localhost:6334/health

3. 🔄 Restart MCP Server:
   The server will automatically detect Qdrant and enable semantic features

4. 📖 Documentation:
   See docs/KNOWLEDGE_BASE_SEMANTIC_SEARCH.md for detailed setup

🎯 Best Practices

Query Optimization

Effective Query Patterns

// ✅ Good: Specific and focused
{ "query": "pip packages for machine learning" }

// ✅ Good: Infrastructure-focused
{ "query": "high-memory instances with SSD storage" }

// ❌ Avoid: Too vague
{ "query": "stuff" }

// ❌ Avoid: Too complex
{ "query": "show me all clusters with specific configurations that might be related to data processing or analytics workloads in production environments" }

Performance Tips

  1. Use Filters: Combine semantic queries with project/region filters
  2. Limit Results: Set appropriate limit values (5-10 for exploration)
  3. Cache Results: Semantic queries are cached for 5 minutes by default
  4. Specific Queries: More specific queries return better results
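
The five-minute caching mentioned in tip 3 can be pictured as a small time-to-live cache keyed by the query text. The server's actual cache is not shown in this document, so the `TTLCache` class below is purely a hypothetical sketch of the idea; the injectable clock exists only to make the expiry easy to demonstrate.

```python
import time


class TTLCache:
    """Minimal time-to-live cache: entries expire ttl_seconds after being stored."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}

    def set(self, key, value):
        """Store a value together with the time it was stored."""
        self._entries[key] = (self._clock(), value)

    def get(self, key):
        """Return the cached value, or None if it is missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]
            return None
        return value


# A fake clock lets the example jump forward in time without sleeping.
now = [0.0]
cache = TTLCache(ttl_seconds=300, clock=lambda: now[0])
cache.set("high memory configurations", ["ml-training"])
fresh = cache.get("high memory configurations")
now[0] = 301.0
stale = cache.get("high memory configurations")
print(fresh, stale)
```

Repeating the same query within the window is served from memory; after expiry the next call would go back to Qdrant.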

Monitoring and Maintenance

Health Checks

# Daily Qdrant health check
curl http://localhost:6334/health

# Check collection sizes
curl http://localhost:6334/collections/dataproc_knowledge | jq '.result.points_count'

# Monitor MCP server logs for Qdrant connectivity
grep -i "qdrant\|semantic" /path/to/mcp/logs

Performance Monitoring

🚨 Troubleshooting

Common Issues and Solutions

Issue: “Qdrant connection failed”

# Check if Qdrant is running
docker ps | grep qdrant

# If not running, start it
docker run -p 6334:6333 qdrant/qdrant

# Check port conflicts
lsof -i :6334

Issue: “Collection not found”

# Check existing collections
curl http://localhost:6334/collections

# Collections are auto-created, but you can manually create:
curl -X PUT http://localhost:6334/collections/dataproc_knowledge \
  -H "Content-Type: application/json" \
  -d '{"vectors": {"size": 384, "distance": "Cosine"}}'

Issue: “Poor search results”

Issue: “Semantic queries return empty results”

# Check if embeddings service is working
# Look for "TransformersEmbeddingService" in MCP server logs

# Verify collection has vectors
curl http://localhost:6334/collections/dataproc_knowledge | jq '.result.points_count'

Getting Help

  1. Check Logs: Enable debug logging with LOG_LEVEL=debug
  2. Verify Setup: Use the verification steps in the setup section
  3. Test Incrementally: Start with basic queries and add complexity
  4. Community Support: Check GitHub Issues for similar problems

🎉 Ready to explore your Dataproc infrastructure with natural language queries!

The knowledge base semantic search transforms complex infrastructure queries into simple, intuitive conversations. Start with basic queries and discover the power of semantic understanding in your data operations.