πŸ”§ Dataproc MCP Server Documentation

Production-ready Model Context Protocol server for Google Cloud Dataproc operations.

πŸ“š API Reference

Complete reference for all 17 Dataproc MCP Server tools with practical examples and usage patterns.

Overview

The Dataproc MCP Server provides 17 comprehensive tools organized into four categories: cluster management, job execution, profile management, and monitoring & utilities.

Authentication

For detailed authentication setup and best practices, refer to the Authentication Implementation Guide.

All tools support intelligent default parameters. When projectId and region are not provided, the server automatically uses configured defaults from config/default-params.json.
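
The merging itself happens inside the server's configuration layer; the sketch below is illustrative only (the helper and type names are assumptions, not part of the public API) and shows the documented behavior of default injection:

interface DefaultParams {
  projectId: string;
  region: string;
}

interface ToolArgs {
  projectId?: string;
  region?: string;
  [key: string]: unknown;
}

// Fill in projectId/region from the configured defaults when the caller
// omits them; explicitly provided arguments always win.
function applyDefaults(args: ToolArgs, defaults: DefaultParams): ToolArgs {
  return {
    ...args,
    projectId: args.projectId ?? defaults.projectId,
    region: args.region ?? defaults.region,
  };
}

// e.g. applyDefaults({ clusterName: 'my-analysis-cluster' }, defaults)
// => { clusterName: 'my-analysis-cluster', projectId: '...', region: '...' }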

Cluster Management Tools

1. start_dataproc_cluster

Creates a new Dataproc cluster with basic configuration.

Parameters:

Example:

{
  "tool": "start_dataproc_cluster",
  "arguments": {
    "clusterName": "my-analysis-cluster",
    "clusterConfig": {
      "masterConfig": {
        "numInstances": 1,
        "machineTypeUri": "n1-standard-4"
      },
      "workerConfig": {
        "numInstances": 3,
        "machineTypeUri": "n1-standard-2"
      }
    }
  }
}

Response:

{
  "content": [
    {
      "type": "text",
      "text": "Cluster my-analysis-cluster started successfully in region us-central1.\nCluster details:\n{\n  \"clusterName\": \"my-analysis-cluster\",\n  \"status\": {\n    \"state\": \"RUNNING\"\n  }\n}"
    }
  ]
}

2. create_cluster_from_yaml

Creates a cluster using a YAML configuration file.

Parameters:

Example:

{
  "tool": "create_cluster_from_yaml",
  "arguments": {
    "projectId": "my-project-123",
    "region": "us-central1",
    "yamlPath": "./configs/production-cluster.yaml",
    "overrides": {
      "clusterName": "prod-cluster-001"
    }
  }
}

3. create_cluster_from_profile

Creates a cluster using a predefined profile.

Parameters:

Example:

{
  "tool": "create_cluster_from_profile",
  "arguments": {
    "projectId": "my-project-123",
    "region": "us-central1",
    "profileName": "production/high-memory/analysis",
    "clusterName": "analytics-cluster-prod"
  }
}

4. list_clusters

Lists all Dataproc clusters in a project and region with intelligent response optimization.

Parameters:

Response Optimization:

Example (Optimized Response):

{
  "tool": "list_clusters",
  "arguments": {
    "filter": "status.state=RUNNING",
    "pageSize": 10
  }
}

Optimized Response:

{
  "content": [
    {
      "type": "text",
      "text": "Found 3 clusters in my-project-123/us-central1:\n\nβ€’ analytics-cluster-prod (RUNNING) - n1-standard-4, 5 nodes\nβ€’ data-pipeline-dev (RUNNING) - n1-standard-2, 3 nodes  \nβ€’ ml-training-cluster (CREATING) - n1-highmem-8, 10 nodes\n\nπŸ’Ύ Full details stored: dataproc://responses/clusters/list/abc123\nπŸ“Š Token reduction: 96.2% (7,651 β†’ 292 tokens)"
    }
  ]
}

Verbose Response:

{
  "tool": "list_clusters",
  "arguments": {
    "filter": "status.state=RUNNING",
    "pageSize": 10,
    "verbose": true
  }
}

Full Response (verbose=true):

{
  "content": [
    {
      "type": "text",
      "text": "Clusters in project my-project-123, region us-central1:\n{\n  \"clusters\": [\n    {\n      \"clusterName\": \"analytics-cluster-prod\",\n      \"status\": {\n        \"state\": \"RUNNING\",\n        \"stateStartTime\": \"2024-01-01T10:00:00Z\"\n      },\n      \"config\": {\n        \"masterConfig\": {\n          \"numInstances\": 1,\n          \"machineTypeUri\": \"n1-standard-4\"\n        },\n        \"workerConfig\": {\n          \"numInstances\": 4,\n          \"machineTypeUri\": \"n1-standard-4\"\n        }\n      }\n    }\n  ]\n}"
    }
  ]
}

5. get_cluster

Gets detailed information about a specific cluster with intelligent response optimization.

Parameters:

Response Optimization:

Example (Optimized Response):

{
  "tool": "get_cluster",
  "arguments": {
    "projectId": "my-project-123",
    "region": "us-central1",
    "clusterName": "my-analysis-cluster"
  }
}

Optimized Response:

{
  "content": [
    {
      "type": "text",
      "text": "Cluster: my-analysis-cluster (RUNNING)\nπŸ–₯️  Master: 1x n1-standard-4\nπŸ‘₯ Workers: 4x n1-standard-2\n🌐 Zone: us-central1-b\n⏰ Created: 2024-01-01 10:00 UTC\n\nπŸ’Ύ Full config: dataproc://responses/clusters/get/def456\nπŸ“Š Token reduction: 64.0% (553 β†’ 199 tokens)"
    }
  ]
}

Verbose Response:

{
  "tool": "get_cluster",
  "arguments": {
    "projectId": "my-project-123",
    "region": "us-central1",
    "clusterName": "my-analysis-cluster",
    "verbose": true
  }
}

6. delete_cluster

Deletes a Dataproc cluster.

Parameters:

Example:

{
  "tool": "delete_cluster",
  "arguments": {
    "projectId": "my-project-123",
    "region": "us-central1",
    "clusterName": "temporary-cluster"
  }
}

Job Execution Tools

7. submit_hive_query

Submits a Hive query to a Dataproc cluster.

Parameters:

Example:

{
  "tool": "submit_hive_query",
  "arguments": {
    "projectId": "my-project-123",
    "region": "us-central1",
    "clusterName": "analytics-cluster",
    "query": "SELECT customer_id, COUNT(*) as order_count FROM orders WHERE order_date >= '2024-01-01' GROUP BY customer_id ORDER BY order_count DESC LIMIT 100",
    "async": false,
    "queryOptions": {
      "timeoutMs": 300000,
      "properties": {
        "hive.exec.dynamic.partition": "true",
        "hive.exec.dynamic.partition.mode": "nonstrict"
      }
    }
  }
}

8. submit_dataproc_job

Submits a generic Dataproc job (Hive, Spark, PySpark, etc.) with enhanced local file staging support.

Parameters:

πŸ”§ LOCAL FILE STAGING:

The baseDirectory parameter in the local file staging system controls how relative file paths are resolved when using the template syntax {@./relative/path} or direct relative paths in job configurations.

Configuration: The baseDirectory parameter is configured in config/default-params.json with a default value of ".", which refers to the current working directory where the MCP server process is running (typically the project root directory).

Path Resolution Logic (a code sketch follows this list):

  1. Absolute Paths: If a file path is already absolute (starts with /), it’s used as-is
  2. Relative Path Resolution: For relative paths, the system:
    • Gets the baseDirectory value from configuration (default: ".")
    • Resolves the baseDirectory if it’s relative:
      • First tries to use DATAPROC_CONFIG_PATH environment variable’s directory
      • Falls back to process.cwd() (current working directory)
    • Combines baseDirectory with the relative file path

Template Syntax Support:

// Template syntax - recommended approach
{@./relative/path/to/file.py}
{@../parent/directory/file.jar}
{@subdirectory/file.sql}

// Direct relative paths (also supported)
"./relative/path/to/file.py"
"../parent/directory/file.jar"
"subdirectory/file.sql"

Practical Examples:

Example 1: Default Configuration (baseDirectory: ".")

Example 2: Config Directory Base

Example 3: Absolute Base Directory

Environment Variable Influence: The DATAPROC_CONFIG_PATH environment variable affects path resolution:

Best Practices:

  1. Use Template Syntax: Prefer {@./file.py} over direct relative paths for clarity
  2. Organize Files Relative to Project Root: With the default baseDirectory: ".", organize your files relative to the project root
  3. Consider Absolute Paths for External Files: For files outside the project structure, use absolute paths

Supported File Extensions:

Troubleshooting:

Debug Path Resolution: Enable debug logging to see the actual path resolution:

DEBUG=dataproc-mcp:* node build/index.js

Configuration Override: You can override the baseDirectory in your environment-specific configuration:

{
  "environment": "development",
  "parameters": {
    "baseDirectory": "./dev-scripts"
  }
}

Files are automatically staged to GCS and cleaned up after job completion.

Example - Spark Job:

{
  "tool": "submit_dataproc_job",
  "arguments": {
    "projectId": "my-project-123",
    "region": "us-central1",
    "clusterName": "spark-cluster",
    "jobType": "spark",
    "jobConfig": {
      "mainClass": "com.example.SparkApp",
      "jarFileUris": ["{@./spark-app.jar}"],
      "args": ["--input", "gs://my-bucket/input/", "--output", "gs://my-bucket/output/"],
      "properties": {
        "spark.executor.memory": "4g",
        "spark.executor.cores": "2"
      }
    },
    "async": true
  }
}

Example - PySpark Job with Local File Staging:

{
  "tool": "submit_dataproc_job",
  "arguments": {
    "projectId": "my-project-123",
    "region": "us-central1",
    "clusterName": "pyspark-cluster",
    "jobType": "pyspark",
    "jobConfig": {
      "mainPythonFileUri": "{@./test-spark-job.py}",
      "pythonFileUris": ["{@./utils/helper.py}", "{@/absolute/path/library.py}"],
      "args": ["--date", "2024-01-01"],
      "properties": {
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true"
      }
    }
  }
}

Example - Traditional PySpark Job (GCS URIs):

{
  "tool": "submit_dataproc_job",
  "arguments": {
    "projectId": "my-project-123",
    "region": "us-central1",
    "clusterName": "pyspark-cluster",
    "jobType": "pyspark",
    "jobConfig": {
      "mainPythonFileUri": "gs://my-bucket/scripts/data_processing.py",
      "pythonFileUris": ["gs://my-bucket/scripts/utils.py"],
      "args": ["--date", "2024-01-01"],
      "properties": {
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true"
      }
    }
  }
}

Local File Staging Process (a code sketch follows this list):

  1. Detection: Local file paths are automatically detected using template syntax
  2. Staging: Files are uploaded to the cluster’s staging bucket with unique names
  3. Transformation: Job config is updated with GCS URIs
  4. Execution: Job runs with staged files
  5. Cleanup: Staged files are automatically cleaned up after job completion
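
The detection and transformation steps can be pictured with the following sketch (the uploadToGcs helper is hypothetical; the actual staging service also generates unique names and performs cleanup):

// Illustrative sketch of the staging flow; uploadToGcs is a hypothetical
// helper, not part of the public API.
const TEMPLATE_PATTERN = /^\{@(.+)\}$/;

async function stageLocalFiles(
  uris: string[],
  stagingBucket: string,
  uploadToGcs: (localPath: string, bucket: string) => Promise<string>
): Promise<string[]> {
  return Promise.all(
    uris.map(async (uri) => {
      const match = uri.match(TEMPLATE_PATTERN);
      if (!match) {
        return uri; // already a gs:// URI, or a plain path handled elsewhere
      }
      // Upload the local file and substitute the resulting gs:// URI
      return uploadToGcs(match[1], stagingBucket);
    })
  );
}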

Supported File Extensions:

9. get_job_status

Gets the status of a Dataproc job.

Parameters:

Example:

{
  "tool": "get_job_status",
  "arguments": {
    "jobId": "job-12345-abcdef"
  }
}

Response:

{
  "content": [
    {
      "type": "text",
      "text": "Job status for job-12345-abcdef:\n{\n  \"status\": {\n    \"state\": \"DONE\",\n    \"stateStartTime\": \"2024-01-01T12:00:00Z\"\n  },\n  \"driverOutputResourceUri\": \"gs://bucket/output/\"\n}"
    }
  ]
}

10. get_query_results

Gets the results of a completed Hive query.

Parameters:

Example:

{
  "tool": "get_query_results",
  "arguments": {
    "projectId": "my-project-123",
    "region": "us-central1",
    "jobId": "hive-job-12345",
    "maxResults": 50
  }
}

11. get_job_results

Gets the results of a completed Dataproc job.

Parameters:

Example:

{
  "tool": "get_job_results",
  "arguments": {
    "projectId": "my-project-123",
    "region": "us-central1",
    "jobId": "spark-job-67890",
    "maxResults": 100
  }
}

12. cancel_dataproc_job

Cancels a running or pending Dataproc job with intelligent status handling and job tracking integration.

Parameters:

πŸ›‘ CANCELLATION WORKFLOW:

πŸ“Š STATUS HANDLING:

πŸ’‘ MONITORING: After cancellation, use get_job_status("jobId") to confirm the job reaches CANCELLED state.

Example:

{
  "tool": "cancel_dataproc_job",
  "arguments": {
    "jobId": "Clean_Places_sub_group_base_1_cleaned_places_13b6ec3f"
  }
}

Successful Cancellation Response:

{
  "content": [
    {
      "type": "text",
      "text": "πŸ›‘ Job Cancellation Status\n\nJob ID: Clean_Places_sub_group_base_1_cleaned_places_13b6ec3f\nStatus: 3\nMessage: Cancellation request sent for job Clean_Places_sub_group_base_1_cleaned_places_13b6ec3f."
    }
  ]
}

Job Already Completed Response:

{
  "content": [
    {
      "type": "text",
      "text": "Cannot cancel job Clean_Places_sub_group_base_1_cleaned_places_13b6ec3f in state: 'DONE'; cancellable states: '[PENDING, RUNNING]'"
    }
  ]
}

Use Cases:

Best Practices:

  1. Monitor job status before and after cancellation attempts
  2. Use with get_job_status to verify cancellation completion (see the polling sketch after this list)
  3. Handle gracefully when jobs are already in terminal states
  4. Consider dependencies before cancelling pipeline jobs
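
A small polling helper along these lines can verify that a cancelled job actually reaches a terminal state (a sketch only; the status lookup is supplied by the caller, for example a wrapper around get_job_status):

// Sketch: poll a caller-supplied status lookup until the job reaches a
// terminal state. The states listed are common Dataproc terminal states.
const TERMINAL_STATES = new Set(['CANCELLED', 'DONE', 'ERROR']);

async function waitForTerminalState(
  jobId: string,
  getState: (jobId: string) => Promise<string>,
  intervalMs = 10_000,
  maxAttempts = 30
): Promise<string> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const state = await getState(jobId);
    if (TERMINAL_STATES.has(state)) {
      return state;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} did not reach a terminal state in time`);
}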

Profile Management Tools

13. list_profiles

Lists available cluster configuration profiles.

Parameters:

Example:

{
  "tool": "list_profiles",
  "arguments": {
    "category": "production"
  }
}

Response:

{
  "content": [
    {
      "type": "text",
      "text": "Available profiles:\n[\n  {\n    \"id\": \"production/high-memory/analysis\",\n    \"name\": \"High Memory Analysis\",\n    \"category\": \"production\"\n  }\n]"
    }
  ]
}

14. get_profile

Gets details for a specific cluster configuration profile.

Parameters:

Example:

{
  "tool": "get_profile",
  "arguments": {
    "profileId": "development/small"
  }
}

15. list_tracked_clusters

Lists clusters that were created and tracked by this MCP server.

Parameters:

Example:

{
  "tool": "list_tracked_clusters",
  "arguments": {
    "profileId": "production/high-memory/analysis"
  }
}

Monitoring & Utilities

16. get_zeppelin_url

Gets the Zeppelin notebook URL for a cluster (if enabled).

Parameters:

Example:

{
  "tool": "get_zeppelin_url",
  "arguments": {
    "projectId": "my-project-123",
    "region": "us-central1",
    "clusterName": "jupyter-cluster"
  }
}

Response:

{
  "content": [
    {
      "type": "text",
      "text": "Zeppelin URL for cluster jupyter-cluster:\nhttps://jupyter-cluster-m.us-central1-a.c.my-project-123.internal:8080"
    }
  ]
}

17. check_active_jobs

πŸš€ Quick status check for all active and recent jobs with intelligent response optimization.

Parameters:

Response Optimization:

Example (Optimized Response):

{
  "tool": "check_active_jobs",
  "arguments": {
    "includeCompleted": true
  }
}

Optimized Response:

{
  "content": [
    {
      "type": "text",
      "text": "πŸš€ Active Jobs Summary:\n\n▢️  RUNNING (2):\nβ€’ hive-analytics-job (5m ago) - analytics-cluster\nβ€’ spark-etl-pipeline (12m ago) - data-pipeline-cluster\n\nβœ… COMPLETED (3):\nβ€’ daily-report-job (1h ago) - SUCCESS\nβ€’ data-validation (2h ago) - SUCCESS  \nβ€’ backup-process (3h ago) - SUCCESS\n\nπŸ’Ύ Full details: dataproc://responses/jobs/active/ghi789\nπŸ“Š Token reduction: 80.6% (1,626 β†’ 316 tokens)"
    }
  ]
}

Data Structures

Output Formats

type OutputFormat = 'text' | 'json' | 'csv' | 'unknown';

Job Output Options

interface JobOutputOptions extends ParseOptions {
  /**
   * Whether to use cache
   */
  useCache?: boolean;

  /**
   * Whether to validate file hashes
   */
  validateHash?: boolean;

  /**
   * Custom cache config overrides
   */
  cacheConfig?: Partial<CacheConfig>;
}

Parse Options

interface ParseOptions {
  /**
   * Whether to trim whitespace from values
   */
  trim?: boolean;

  /**
   * Custom delimiter for CSV parsing
   */
  delimiter?: string;

  /**
   * Whether to parse numbers in JSON/CSV
   */
  parseNumbers?: boolean;

  /**
   * Whether to skip empty lines
   */
  skipEmpty?: boolean;
}
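
For example, the options might be combined like this when requesting parsed job output (values shown are illustrative):

// Example option objects; the values are illustrative, not recommendations.
const parseOptions: ParseOptions = {
  trim: true,
  delimiter: ',',
  parseNumbers: true,
  skipEmpty: true,
};

const outputOptions: JobOutputOptions = {
  ...parseOptions,
  useCache: true,
  validateHash: false,
};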

Table Structure

The table structure used in the formatted output feature:

interface Table {
  /**
   * Array of column names
   */
  columns: string[];
  
  /**
   * Array of row objects, where each object has properties matching column names
   */
  rows: Record<string, any>[];
}
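
A populated Table value might look like this (sample data for illustration):

const example: Table = {
  columns: ['customer_id', 'order_count'],
  rows: [
    { customer_id: 'C-1001', order_count: 42 },
    { customer_id: 'C-1002', order_count: 17 },
  ],
};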

Formatted Output Feature

Overview

The formatted output feature enhances job results by providing a clean, readable ASCII table representation of the data alongside the structured data.

Output Structure

When a job produces tabular output, the result will include:

{
  // Job details...
  parsedOutput: {
    tables: [
      {
        columns: ["column1", "column2", ...],
        rows: [
          { "column1": "value1", "column2": "value2", ... },
          // More rows...
        ]
      },
      // More tables...
    ],
    formattedOutput: "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\nβ”‚ column1 β”‚ column2 β”‚\nβ”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\nβ”‚ value1  β”‚ value2  β”‚\nβ””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜"
  }
}

Usage

To access and display the formatted output:

const results = await getDataprocJobResults({
  projectId: 'your-project',
  region: 'us-central1',
  jobId: 'job-id',
  format: 'text',
  wait: true
});

if (results.parsedOutput && results.parsedOutput.formattedOutput) {
  console.log('Formatted Table Output:');
  console.log(results.parsedOutput.formattedOutput);
}

Multiple Tables

If the job produces multiple tables, they will be formatted separately with table numbers:

Table 1:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ column1 β”‚ column2 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ value1  β”‚ value2  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Table 2:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ column3 β”‚ column4 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ value3  β”‚ value4  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Implementation Details

The formatted output is generated using the table library with specific configuration options for clean formatting:

For more detailed implementation information, see the source code in src/services/output-parser.ts.
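
As a rough illustration only (the project's actual configuration is in src/services/output-parser.ts), rendering a Table with the table package looks approximately like this:

import { table } from 'table';

// Convert the Table structure into the row-of-arrays shape the table library
// expects, then render it. This uses the library's default configuration; the
// project applies its own formatting options.
function renderTable(t: Table): string {
  const data = [
    t.columns,
    ...t.rows.map((row) => t.columns.map((col) => String(row[col] ?? ''))),
  ];
  return table(data);
}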

Error Handling

The API includes comprehensive error handling for various scenarios:

Each error type includes detailed information to help diagnose and resolve issues.

Best Practices

Working with Formatted Output

  1. Check for existence: Always check if formattedOutput exists before using it
  2. Display as-is: The formatted output is already optimized for console display
  3. Preserve original data: Use the structured data in tables for programmatic processing
  4. Handle large outputs: For very large tables, consider implementing pagination in your UI

Performance Optimization

  1. Use caching: Enable the cache for frequently accessed job results
  2. Specify format: Explicitly specify the expected format when known
  3. Limit wait time: Set appropriate timeouts for waiting operations
  4. Use async mode: For long-running jobs, submit in async mode and check status separately

Error Handling

Common Error Responses

Invalid Parameters:

{
  "error": {
    "code": "INVALID_PARAMS",
    "message": "Input validation failed: clusterName: Cluster name must start with lowercase letter"
  }
}

Rate Limit Exceeded:

{
  "error": {
    "code": "RATE_LIMIT_EXCEEDED",
    "message": "Rate limit exceeded. Try again after 2024-01-01T12:01:00.000Z"
  }
}

Authentication Error:

{
  "error": {
    "code": "AUTHENTICATION_FAILED",
    "message": "Service account authentication failed: Permission denied"
  }
}

Best Practices

  1. Always check job status for long-running operations
  2. Use async mode for jobs that take more than a few minutes
  3. Implement retry logic for transient failures (see the sketch after this list)
  4. Clean up resources by deleting clusters when done
  5. Use appropriate cluster sizes for your workload
  6. Monitor costs by tracking cluster usage
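
For the retry recommendation above, a simple exponential-backoff wrapper might look like this (a sketch, assuming transient failures surface as thrown errors):

// Sketch: retry an async operation with exponential backoff.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1_000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // Backoff schedule: 1s, 2s, 4s, ...
        await new Promise((resolve) =>
          setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1))
        );
      }
    }
  }
  throw lastError;
}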

Rate Limits

Security Considerations

This API reference provides comprehensive documentation for all tools with practical examples and usage patterns.