Knowledge Base Semantic Search π§
The Dataproc MCP Server includes an advanced knowledge base semantic search feature that enables natural language queries against stored cluster configurations, job data, and operational insights. This feature provides intelligent data extraction and search capabilities with graceful degradation when optional components are unavailable.
π― How Semantic Search Works
The following diagram illustrates the difference between standard queries and semantic search using a real example:
graph LR
A["π User Query:<br/>'show me clusters with machine learning packages'"] --> B{"Qdrant Available?"}
B -->|"β
Semantic Search ON"| C["π§ Semantic Processing"]
B -->|"β Standard Mode"| D["π Standard Processing"]
C --> C1["Vector Embedding Generation"]
C1 --> C2["Similarity Search in Knowledge Base"]
C2 --> C3["Confidence Scoring & Ranking"]
C3 --> C4["π― Intelligent Results"]
D --> D1["Keyword-based Filtering"]
D1 --> D2["Standard Data Retrieval"]
D2 --> D3["π Standard Results + Setup Guidance"]
C4 --> E["π Response Comparison"]
D3 --> E
E --> F["π WITH SEMANTIC SEARCH<br/><br/>π― Query: 'machine learning packages' (confidence: 0.89)<br/><br/>β¨ Found 3 relevant clusters:<br/><br/>π€ ml-training (confidence: 0.92)<br/> β’ ML Packages: tensorflow, torch<br/> β’ Machine: n1-highmem-8<br/><br/>π analytics-prod (confidence: 0.87)<br/> β’ ML Packages: sklearn, pandas<br/> β’ Machine: n1-standard-16<br/><br/>βοΈ data-pipeline (confidence: 0.84)<br/> β’ ML Packages: numpy, scipy<br/> β’ Machine: n1-standard-4<br/><br/>π‘ Sources: pip packages, init scripts, configs"]
E --> G["π WITHOUT SEMANTIC SEARCH<br/><br/>β οΈ Enhanced search unavailable - Qdrant not connected<br/>π Showing standard cluster information instead<br/><br/>π Quick Setup:<br/> docker run -p 6334:6333 qdrant/qdrant<br/><br/>π Standard Cluster List:<br/><br/>π’ ml-training β’ RUNNING β’ n1-highmem-8<br/>π’ analytics-prod β’ RUNNING β’ n1-standard-16<br/>π’ data-pipeline β’ RUNNING β’ n1-standard-4<br/>π’ web-cluster β’ RUNNING β’ n1-standard-2<br/>β test-cluster β’ STOPPED β’ n1-standard-1<br/><br/>π No package analysis available"]
style C fill:#e1f5fe
style C4 fill:#c8e6c9
style D fill:#fff3e0
style D3 fill:#ffecb3
style F fill:#e8f5e8
style G fill:#fff8e1
π Feature Overview
What is Knowledge Base Semantic Search?
The knowledge base semantic search transforms how you interact with your Dataproc infrastructure by:
- Natural Language Queries: Ask questions like βshow me clusters with pip packagesβ or βfind high-memory configurationsβ
- Intelligent Data Extraction: Automatically extracts and indexes meaningful information from cluster configurations
- Semantic Understanding: Uses vector embeddings to understand query intent beyond keyword matching
- Comprehensive Knowledge Base: Stores cluster configs, job patterns, error histories, and operational insights
Benefits Over Regular Queries
Traditional Queries | Semantic Queries |
---|---|
Exact keyword matching | Intent-based understanding |
Rigid syntax requirements | Natural language flexibility |
Limited context awareness | Rich contextual relationships |
Manual data filtering | Intelligent relevance scoring |
Shows all data | Shows relevant data with confidence |
Optional Enhancement with Graceful Degradation
π― Key Feature: This is an optional enhancement that doesnβt break core functionality:
- β With Qdrant: Full semantic search with confidence scores and vector similarity
- β Without Qdrant: Clean formatted output with helpful setup guidance
- β Core Operations: Always work regardless of Qdrant availability
- β Zero Dependencies: No required external services for basic functionality
π Setup Instructions
Prerequisites
- Node.js 18+ with the Dataproc MCP Server installed
- Docker (for Qdrant vector database)
- At least 2GB available RAM for optimal performance
Step 1: Install and Start Qdrant
# Pull and run Qdrant vector database
docker run -p 6334:6333 qdrant/qdrant
# Verify Qdrant is running
curl http://localhost:6334/health
Expected Response:
{"status":"ok"}
Step 2: Configure Response Filter
The semantic search uses the configuration in config/response-filter.json
:
{
"qdrant": {
"url": "http://localhost:6334",
"collectionName": "dataproc_knowledge",
"vectorSize": 384,
"distance": "Cosine"
}
}
Configuration Parameters:
url
: Qdrant server endpoint (default:http://localhost:6334
)collectionName
: Vector collection name (default:dataproc_knowledge
)vectorSize
: Embedding dimensions (384 for Transformers.js)distance
: Similarity metric (Cosine
,Euclidean
, orDot
)
Step 3: Verify Setup
Test the semantic search functionality:
# Check if Qdrant collections are created
curl http://localhost:6334/collections
# Test MCP server connection
# Use your MCP client to run: "List clusters with semantic query for machine types"
Step 4: Optional Port Configuration
If you need to use a different port:
{
"qdrant": {
"url": "http://localhost:6335",
"collectionName": "dataproc_knowledge",
"vectorSize": 384,
"distance": "Cosine"
}
}
Then start Qdrant with the custom port:
docker run -p 6335:6333 qdrant/qdrant
π Usage Examples
Basic Semantic Queries
Using query_cluster_data
Tool
// Natural language cluster data queries
{
"query": "pip packages installed on clusters",
"limit": 5
}
{
"query": "machine types and configurations",
"projectId": "my-project",
"region": "us-central1"
}
{
"query": "network configuration and subnets",
"clusterName": "my-cluster"
}
Using list_clusters
with Semantic Query
// Enhanced cluster listing with semantic filtering
{
"projectId": "my-project",
"region": "us-central1",
"semanticQuery": "high memory configurations"
}
{
"semanticQuery": "clusters with Jupyter notebooks",
"verbose": false
}
Using get_cluster
with Semantic Query
// Focused cluster details extraction
{
"projectId": "my-project",
"region": "us-central1",
"clusterName": "analytics-cluster",
"semanticQuery": "pip packages and Python libraries"
}
{
"projectId": "my-project",
"region": "us-central1",
"clusterName": "ml-cluster",
"semanticQuery": "machine types and worker configuration"
}
Advanced Query Examples
Infrastructure Analysis
{
"query": "clusters using preemptible instances for cost optimization",
"limit": 10
}
Component Discovery
{
"query": "Spark and Hadoop configurations with custom properties",
"projectId": "analytics-project"
}
Network and Security
{
"query": "service accounts and IAM configurations",
"region": "us-central1"
}
Performance Optimization
{
"query": "SSD disk configurations and storage optimization",
"limit": 8
}
βοΈ Configuration Details
Response Filter Configuration Structure
The config/response-filter.json
file controls all aspects of the semantic search:
{
"tokenLimits": {
"list_clusters": 500,
"get_cluster": 300,
"default": 400
},
"extractionRules": {
"list_clusters": {
"maxClusters": 10,
"essentialFields": [
"clusterName", "status", "machineType", "numWorkers"
],
"summaryFormat": "table"
}
},
"qdrant": {
"url": "http://localhost:6334",
"collectionName": "dataproc_knowledge",
"vectorSize": 384,
"distance": "Cosine"
},
"formatting": {
"useEmojis": true,
"compactTables": true,
"includeResourceLinks": true
}
}
Qdrant Collection Management
The system automatically manages two collections:
1. dataproc_knowledge
Collection
- Purpose: Stores extracted cluster knowledge and configurations
- Content: Machine types, pip packages, network configs, component lists
- Usage: Powers
query_cluster_data
and knowledge-based searches
2. dataproc_responses
Collection
- Purpose: Stores full API responses for token optimization
- Content: Complete cluster and job response data
- Usage: Enables resource URI access and response optimization
Configuration Options
Token Limits
{
"tokenLimits": {
"list_clusters": 500, // Max tokens for cluster lists
"get_cluster": 300, // Max tokens for single cluster
"submit_hive_query": 400, // Max tokens for query responses
"default": 400 // Default limit for other operations
}
}
Extraction Rules
{
"extractionRules": {
"list_clusters": {
"maxClusters": 10, // Limit clusters in response
"essentialFields": [...], // Key fields to include
"summaryFormat": "table" // Response format
},
"get_cluster": {
"essentialSections": [...], // Important config sections
"includeMetrics": false, // Performance data inclusion
"includeHistory": false // Historical data inclusion
}
}
}
Troubleshooting Configuration Issues
Qdrant Connection Problems
# Check if Qdrant is running
docker ps | grep qdrant
# Test connection
curl http://localhost:6334/health
# Check logs
docker logs $(docker ps -q --filter ancestor=qdrant/qdrant)
Collection Issues
# List collections
curl http://localhost:6334/collections
# Check collection info
curl http://localhost:6334/collections/dataproc_knowledge
Configuration Validation
# Validate JSON syntax
cat config/response-filter.json | jq .
# Check MCP server logs for configuration errors
# Look for "Qdrant" or "semantic" in log output
π Behavior Without Qdrant
Graceful Degradation Explanation
When Qdrant is not available, the system provides graceful degradation:
What Users See
Without Qdrant (Graceful Fallback):
π Semantic Query: "machine types"
β οΈ Enhanced search unavailable - Qdrant not connected
π Showing standard cluster information instead
π To enable semantic search:
1. Install: docker run -p 6334:6333 qdrant/qdrant
2. Verify: curl http://localhost:6334/health
[Standard cluster data follows...]
With Qdrant (Full Functionality):
π Semantic Query: "machine types" (confidence: 0.89)
π― Found 3 relevant clusters:
βββββββββββββββββββ¬βββββββββββββββββββ¬ββββββββββββββ¬βββββββββββββββ
β Cluster β Machine Type β Workers β Confidence β
βββββββββββββββββββΌβββββββββββββββββββΌββββββββββββββΌβββββββββββββββ€
β analytics-prod β n1-highmem-8 β 4 β 0.92 β
β ml-training β n1-standard-16 β 8 β 0.87 β
β data-pipeline β n1-standard-4 β 2 β 0.84 β
βββββββββββββββββββ΄βββββββββββββββββββ΄ββββββββββββββ΄βββββββββββββββ
Core Functionality Preservation
Always Available (No Qdrant Required):
- β Cluster creation and management
- β Job submission and monitoring
- β Profile management
- β Standard data retrieval
- β All MCP tool functionality
Enhanced with Qdrant:
- π Semantic search capabilities
- π Natural language queries
- π Intelligent data extraction
- π Confidence scoring
- π Vector similarity matching
Setup Guidance Integration
The system automatically provides setup guidance when semantic features are requested but Qdrant is unavailable:
π‘ Semantic Search Setup:
1. π³ Start Qdrant:
docker run -p 6334:6333 qdrant/qdrant
2. β
Verify Connection:
curl http://localhost:6334/health
3. π Restart MCP Server:
The server will automatically detect Qdrant and enable semantic features
4. π Documentation:
See docs/KNOWLEDGE_BASE_SEMANTIC_SEARCH.md for detailed setup
π― Best Practices
Query Optimization
Effective Query Patterns
// β
Good: Specific and focused
{ "query": "pip packages for machine learning" }
// β
Good: Infrastructure-focused
{ "query": "high-memory instances with SSD storage" }
// β Avoid: Too vague
{ "query": "stuff" }
// β Avoid: Too complex
{ "query": "show me all clusters with specific configurations that might be related to data processing or analytics workloads in production environments" }
Performance Tips
- Use Filters: Combine semantic queries with project/region filters
- Limit Results: Set appropriate
limit
values (5-10 for exploration) - Cache Results: Semantic queries are cached for 5 minutes by default
- Specific Queries: More specific queries return better results
Monitoring and Maintenance
Health Checks
# Daily Qdrant health check
curl http://localhost:6334/health
# Check collection sizes
curl http://localhost:6334/collections/dataproc_knowledge | jq '.result.points_count'
# Monitor MCP server logs for Qdrant connectivity
grep -i "qdrant\|semantic" /path/to/mcp/logs
Performance Monitoring
- Query Response Time: Should be < 2 seconds for most queries
- Collection Size: Monitor growth and consider cleanup policies
- Memory Usage: Qdrant typically uses 100-500MB for moderate datasets
π¨ Troubleshooting
Common Issues and Solutions
Issue: βQdrant connection failedβ
# Check if Qdrant is running
docker ps | grep qdrant
# If not running, start it
docker run -p 6334:6333 qdrant/qdrant
# Check port conflicts
lsof -i :6334
Issue: βCollection not foundβ
# Check existing collections
curl http://localhost:6334/collections
# Collections are auto-created, but you can manually create:
curl -X PUT http://localhost:6334/collections/dataproc_knowledge \
-H "Content-Type: application/json" \
-d '{"vectors": {"size": 384, "distance": "Cosine"}}'
Issue: βPoor search resultsβ
- Solution: Ensure you have indexed cluster data by running some
list_clusters
orget_cluster
operations first - Check: Verify collections have data:
curl http://localhost:6334/collections/dataproc_knowledge
Issue: βSemantic queries return empty resultsβ
# Check if embeddings service is working
# Look for "TransformersEmbeddingService" in MCP server logs
# Verify collection has vectors
curl http://localhost:6334/collections/dataproc_knowledge | jq '.result.points_count'
Getting Help
- Check Logs: Enable debug logging with
LOG_LEVEL=debug
- Verify Setup: Use the verification steps in the setup section
- Test Incrementally: Start with basic queries and add complexity
- Community Support: Check GitHub Issues for similar problems
π Related Documentation
- Quick Start Guide - Basic MCP server setup
- Configuration Guide - Detailed configuration options
- API Reference - Complete tool documentation
- Response Optimization Guide - Token optimization features
π Ready to explore your Dataproc infrastructure with natural language queries!
The knowledge base semantic search transforms complex infrastructure queries into simple, intuitive conversations. Start with basic queries and discover the power of semantic understanding in your data operations.