By the end of this tutorial, you’ll have a fully functional Claude skill — a typed, error-handled Python function that Claude can reliably call as a tool — wired up from a raw API function to a working agent loop. If you’ve ever tried to build a Claude skill integration and ended up with a brittle mess of string parsing and silent failures, this is the guide that fixes that.
- Install dependencies — Set up the Anthropic SDK and supporting libraries
- Define your skill schema — Write a JSON schema Claude will use to understand and invoke your function
- Implement the skill function — Build the actual Python function with type safety and error handling
- Wire up the agent loop — Connect skill to Claude and handle the tool_use/tool_result cycle
- Test skill invocation — Validate that Claude calls the skill correctly under different prompts
- Add production guards — Timeouts, retries, and structured error responses
What a Claude Skill Actually Is
Anthropic uses the term “tool” in their API; the broader ecosystem calls them “skills.” They’re the same thing: a JSON schema definition paired with a callable function. When Claude decides it needs data or needs to perform an action, it emits a tool_use content block with the tool name and arguments. Your code runs the function, returns a tool_result, and Claude continues reasoning from there.
The important thing to understand is that Claude doesn’t call your function directly. It outputs structured JSON saying “I want to call this function with these arguments.” You interpret that output, run the function, and feed the result back. This indirection is both the power and the pitfall — it means you control validation, error handling, and rate limiting at every step.
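Concretely, the two halves of that exchange are plain JSON. Here is a sketch of the shapes as Python dicts (the `id` value is made up; real ids are opaque strings the API generates):

```python
# What Claude emits when it wants your function run (a tool_use content block):
tool_use_block = {
    "type": "tool_use",
    "id": "toolu_01A2B3",           # unique per invocation; echo it back
    "name": "get_current_weather",  # must match a name in your tools list
    "input": {"city": "London", "units": "metric"},
}

# What you send back on the next user turn (a tool_result content block):
tool_result_block = {
    "type": "tool_result",
    "tool_use_id": tool_use_block["id"],  # links the result to the invocation
    "content": '{"temperature": 18.5, "description": "partly cloudy"}',
}
```

Your code sits between these two blocks: parse the first, produce the second.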
If you haven’t seen Claude’s tool use mechanics before, the deep dive on Claude tool use with Python covers the underlying protocol well. This tutorial focuses on building a production-quality skill from scratch, not just the happy path.
Step 1: Install Dependencies
```bash
pip install anthropic pydantic httpx tenacity

# Pin versions to avoid breaking changes:
# anthropic==0.30.0 pydantic==2.7.0 httpx==0.27.0 tenacity==8.3.0
```
You need anthropic for the API client, pydantic for input validation (don’t skip this — Claude occasionally sends slightly malformed arguments), httpx if your skill calls external APIs, and tenacity for retry logic.
Step 2: Define Your Skill Schema
The schema is what Claude reads to understand what your function does and what arguments it accepts. Bad schemas produce bad invocations. Spend time here.
```python
WEATHER_SKILL = {
    "name": "get_current_weather",
    "description": (
        "Retrieve current weather conditions for a given city. "
        "Returns temperature in Celsius, weather description, humidity percentage, "
        "and wind speed in km/h. Use this when the user asks about current weather."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {
                "type": "string",
                "description": "City name, e.g. 'London' or 'New York'. Do not include country codes."
            },
            "units": {
                "type": "string",
                "enum": ["metric", "imperial"],
                "description": "Temperature units. Defaults to metric if not specified.",
                "default": "metric"
            }
        },
        "required": ["city"]
    }
}
```
A few things that actually matter here: the description field on the tool itself is critical — Claude uses it to decide when to call the skill. Vague descriptions lead to missed invocations or wrong invocations. The description on each property tells Claude what format to pass. If you say “City name, e.g. 'London'”, Claude will follow that pattern consistently.
Step 3: Implement the Skill Function
```python
import httpx
from pydantic import BaseModel, ValidationError
from typing import Any

# Pydantic model mirrors your schema — catches bad input before it reaches your API
class WeatherInput(BaseModel):
    city: str
    units: str = "metric"

class WeatherResult(BaseModel):
    temperature: float
    description: str
    humidity: int
    wind_speed: float
    city: str
    error: str | None = None  # Structured error field — never raise exceptions outward

def get_current_weather(raw_input: dict[str, Any]) -> dict[str, Any]:
    """
    Validate input, call the weather API, return a structured result.
    Always returns a dict — never raises. Claude needs a tool_result, not a traceback.
    """
    try:
        params = WeatherInput(**raw_input)
    except ValidationError as e:
        # Return a structured error Claude can reason about
        return {"error": f"Invalid input: {e.errors()[0]['msg']}", "city": raw_input.get("city", "unknown")}
    try:
        # Replace with your actual API key and endpoint
        response = httpx.get(
            "https://api.openweathermap.org/data/2.5/weather",
            params={"q": params.city, "units": params.units, "appid": "YOUR_API_KEY"},
            timeout=5.0  # Always set a timeout on external calls
        )
        response.raise_for_status()
        data = response.json()
        return WeatherResult(
            temperature=data["main"]["temp"],
            description=data["weather"][0]["description"],
            humidity=data["main"]["humidity"],
            wind_speed=data["wind"]["speed"],
            city=data["name"]
        ).model_dump()
    except httpx.TimeoutException:
        return {"error": "Weather API timed out after 5 seconds", "city": params.city}
    except httpx.HTTPStatusError as e:
        return {"error": f"Weather API returned {e.response.status_code}", "city": params.city}
    except Exception as e:
        return {"error": f"Unexpected error: {str(e)}", "city": params.city}
```
Critical pattern: never let your skill function raise an exception. If it does, your agent loop crashes. Instead, return a dict with an error key. Claude will read that, understand something went wrong, and can either retry with different arguments or tell the user what happened gracefully.
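Once you have several skills, you can enforce that never-raise contract in one place with a small decorator instead of repeating try/except in every handler. This is a sketch; `safe_skill` is our own helper, not part of any SDK:

```python
import functools
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)

def safe_skill(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
    """Guarantee a skill returns an error dict instead of raising."""
    @functools.wraps(fn)
    def wrapper(raw_input: dict[str, Any]) -> dict[str, Any]:
        try:
            return fn(raw_input)
        except Exception as e:
            # Log the traceback for yourself; return a clean message for Claude
            logger.exception("Skill %s raised unexpectedly", fn.__name__)
            return {"error": f"{type(e).__name__}: {e}"}
    return wrapper

@safe_skill
def flaky_skill(raw_input: dict[str, Any]) -> dict[str, Any]:
    raise RuntimeError("boom")
```

Any skill registered with this wrapper can fail internally without taking down the agent loop.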
Step 4: Wire Up the Agent Loop
This is where the skill actually connects to Claude. The loop runs until Claude either returns a final text response or hits your max-turn limit.
```python
import anthropic
import json
from typing import Any, Callable

client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_KEY")

# Map tool names to handler functions — add all your skills here
SKILL_REGISTRY: dict[str, Callable[[dict[str, Any]], dict[str, Any]]] = {
    "get_current_weather": get_current_weather,
}

def run_agent(user_message: str, max_turns: int = 5) -> str:
    """
    Run the Claude agent loop with skill invocation.
    Returns the final text response.
    """
    messages = [{"role": "user", "content": user_message}]
    for turn in range(max_turns):
        response = client.messages.create(
            model="claude-opus-4-5",  # Use Haiku for cheaper/faster dev testing: claude-haiku-4-5
            max_tokens=1024,
            tools=[WEATHER_SKILL],  # Pass all registered skills
            messages=messages
        )
        # Append Claude's full response to the conversation
        messages.append({"role": "assistant", "content": response.content})
        # If Claude is done, return the text
        if response.stop_reason == "end_turn":
            # Extract text from content blocks
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return ""
        # Handle tool_use stop reason
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type != "tool_use":
                    continue
                handler = SKILL_REGISTRY.get(block.name)
                if not handler:
                    # Unknown tool — return an error result
                    result = {"error": f"No handler registered for skill '{block.name}'"}
                else:
                    result = handler(block.input)  # block.input is already a dict
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(result)  # Serialize back to string for the API
                })
            # Feed all tool results back as a user turn
            messages.append({"role": "user", "content": tool_results})
            continue
        # Any other stop_reason (e.g. max_tokens): stop rather than send a malformed conversation
        break
    return "Max turns reached without a final response."
```
One thing the docs understate: you must append Claude’s full response content (including tool_use blocks) before appending tool results. If you skip that step or only append the text blocks, the API will throw a validation error about mismatched tool_use IDs. This trips up almost everyone on their first build.
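Spelled out as plain dicts, the conversation after one tool round must have exactly this shape (all values illustrative):

```python
messages = [
    {"role": "user", "content": "What's the weather in London?"},
    # Claude's FULL response content, including the tool_use block:
    {"role": "assistant", "content": [
        {"type": "text", "text": "I'll check the weather."},
        {"type": "tool_use", "id": "toolu_01", "name": "get_current_weather",
         "input": {"city": "London"}},
    ]},
    # Your tool results go back as the next *user* turn, ids matching 1:1:
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_01",
         "content": '{"temperature": 18.5}'},
    ]},
]
```

Drop the `tool_use` block from the assistant turn and the API rejects the request, because the `tool_use_id` in your result no longer refers to anything.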
Step 5: Test Skill Invocation
```python
import unittest
from unittest.mock import patch, MagicMock

import httpx  # needed for the timeout test below

class TestWeatherSkill(unittest.TestCase):
    def test_valid_input(self):
        with patch("httpx.get") as mock_get:
            mock_get.return_value = MagicMock(
                status_code=200,
                json=lambda: {
                    "main": {"temp": 18.5, "humidity": 65},
                    "weather": [{"description": "partly cloudy"}],
                    "wind": {"speed": 12.0},
                    "name": "London"
                }
            )
            result = get_current_weather({"city": "London"})
            self.assertEqual(result["city"], "London")
            self.assertIsNone(result.get("error"))

    def test_missing_required_field(self):
        result = get_current_weather({})  # No city provided
        self.assertIn("error", result)

    def test_api_timeout(self):
        with patch("httpx.get", side_effect=httpx.TimeoutException("timeout")):
            result = get_current_weather({"city": "Berlin"})
            self.assertIn("timed out", result["error"])

if __name__ == "__main__":
    unittest.main()
```
Test against the actual Claude agent loop too — use claude-haiku-4-5 during development to keep costs low. A full agent loop test with Haiku costs roughly $0.001–$0.003 per run depending on context length. Run 50 test cases and you’re looking at under $0.15.
Step 6: Add Production Guards
For anything beyond a prototype, you need retries with backoff, per-skill timeouts, and logging. This is especially true if your skills call external APIs that are occasionally flaky.
```python
import logging
from tenacity import RetryError, retry, stop_after_attempt, wait_exponential, retry_if_result

logger = logging.getLogger(__name__)

def _is_transient(result: dict) -> bool:
    # Retry only errors that can self-heal (timeouts, upstream 5xx);
    # validation errors won't change on a second attempt.
    # Keys off the error strings our skills return in Step 3.
    error = result.get("error") or ""
    return "timed out" in error or "returned 5" in error

def with_retry(skill_fn, raw_input: dict, max_attempts: int = 3) -> dict:
    """
    Wrap any skill function with retry logic for transient failures.
    Skills never raise by design, so we retry on the *result*, not on exceptions.
    """
    @retry(
        stop=stop_after_attempt(max_attempts),
        wait=wait_exponential(multiplier=1, min=1, max=10),
        retry=retry_if_result(_is_transient),
    )
    def _run():
        return skill_fn(raw_input)

    try:
        return _run()
    except RetryError as e:
        # All attempts returned a transient error; surface the last one
        result = e.last_attempt.result()
        logger.error(f"Skill {skill_fn.__name__} failed after {max_attempts} attempts: {result.get('error')}")
        return result
    except Exception as e:
        logger.error(f"Skill {skill_fn.__name__} raised unexpectedly: {e}")
        return {"error": f"Skill failed unexpectedly: {str(e)}"}
```

Note that the retry condition inspects the returned error dict rather than catching exceptions — since skill functions are built to never raise, an exception-based retry condition would never fire.
For more comprehensive patterns on handling transient failures in LLM pipelines, the article on LLM fallback and retry logic for production covers the broader picture including model-level fallbacks.
Common Errors
Error 1: “messages: roles must alternate between user and assistant”
This usually means you forgot to append Claude’s assistant message before appending tool results. The conversation structure must be user → assistant (with tool_use blocks) → user (with tool_results). Double-check that your messages.append({"role": "assistant", "content": response.content}) runs before you build and append the tool_results list.
Error 2: Claude calls the wrong skill or skips calling it entirely
Almost always a schema description problem. If your tool description is vague or overlaps with another tool, Claude hedges. Make the description explicit about when to use this tool vs alternatives. Also check that your required fields in the schema match what you actually need — if Claude is omitting an argument, it may not appear in required.
This is also where investing in good system prompts pays off. The guide on system prompts that actually work has patterns for guiding tool selection behavior specifically.
Error 3: Pydantic validation errors surfacing as 500s
Claude occasionally passes arguments that technically match the schema type but fail your business logic — a city name with special characters, a number outside your expected range. If your Pydantic model uses strict validators, these surface as ValidationError exceptions. The fix is wrapping the Pydantic instantiation in a try/except (as shown in Step 3) and returning a structured error. Never let validation exceptions propagate up to the agent loop. For patterns around preventing this class of failures more broadly, the article on reducing LLM hallucinations with structured outputs is worth reading alongside this one.
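One way to keep business-logic checks inside the same structured-error pattern is a Pydantic field validator. A sketch, assuming the city-name rule below is the business logic you care about (it's illustrative, not a real-world requirement):

```python
from pydantic import BaseModel, ValidationError, field_validator

class StrictWeatherInput(BaseModel):
    city: str
    units: str = "metric"

    @field_validator("city")
    @classmethod
    def city_must_be_plain(cls, v: str) -> str:
        # Illustrative business rule: reject obviously malformed city names
        if not v.replace(" ", "").replace("-", "").isalpha():
            raise ValueError("city must contain only letters, spaces, or hyphens")
        return v.strip()

def validate_input(raw: dict) -> dict:
    """Return validated params, or a structured error dict Claude can read."""
    try:
        return {"params": StrictWeatherInput(**raw).model_dump()}
    except ValidationError as e:
        return {"error": f"Invalid input: {e.errors()[0]['msg']}"}
```

The validator raises inside Pydantic, the wrapper converts it to an error dict, and nothing propagates to the agent loop.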
What to Build Next
Add skill chaining: build a second skill — say, get_forecast — and watch Claude decide to call get_current_weather first for context before calling the forecast. This multi-step planning behavior is where the tool use architecture starts to feel genuinely powerful. The natural extension after that is giving your agent persistent memory so it can remember which cities a user cares about across sessions — the persistent memory architecture guide covers exactly how to wire that up.
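A minimal second-skill schema to try chaining with might look like this (the `days` parameter and its bounds are assumptions — shape them to whatever forecast API you actually call):

```python
FORECAST_SKILL = {
    "name": "get_forecast",
    "description": (
        "Retrieve a multi-day weather forecast for a given city. "
        "Use this when the user asks about future weather, not current conditions."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'London'."},
            "days": {"type": "integer", "minimum": 1, "maximum": 7,
                     "description": "Number of days to forecast. Defaults to 3."},
        },
        "required": ["city"],
    },
}
```

Register its handler in `SKILL_REGISTRY` and pass both schemas in the `tools` list; the loop from Step 4 needs no other changes.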
Bottom Line: Who Should Build This Now
Solo founders and small teams: start with this exact pattern — one skill, one agent loop, Haiku for testing. Get it working end-to-end before adding complexity. The skill registry pattern in Step 4 scales cleanly to 10+ skills without architectural changes.
Teams with existing APIs: your existing internal APIs are the best candidates for skills. Wrap them in Pydantic models, add the JSON schema definition, and drop them into the registry. The main investment is writing good tool descriptions — budget an hour per skill to iterate on those.
Production systems: add the tenacity retry wrapper, proper logging with tool invocation metadata (tool name, latency, success/failure), and a circuit breaker for skills that call external APIs. The skill integration pattern shown here handles all of that cleanly when you layer on observability from the start.
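A minimal circuit breaker for a flaky external-API skill might look like this (a sketch; the thresholds and in-memory state are assumptions — use a shared store if you run multiple workers):

```python
import time

class CircuitBreaker:
    """Open the circuit after N consecutive failures; allow a retry after a cooldown."""
    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, skill_fn, raw_input: dict) -> dict:
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                # Circuit open: fail fast with a structured error Claude can relay
                return {"error": "Skill temporarily disabled after repeated failures"}
            self.opened_at = None  # cooldown elapsed: allow one probe attempt
        result = skill_fn(raw_input)
        if result.get("error"):
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
        else:
            self.failures = 0
        return result
```

Because skills return error dicts rather than raising, the breaker trips on the result, consistent with the never-raise pattern from Step 3.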
Frequently Asked Questions
How do I pass authentication credentials to a Claude skill?
Never pass credentials through Claude’s tool arguments — Claude sees everything in the tool input, and you don’t want API keys in your conversation history or logs. Instead, load credentials from environment variables inside the skill function itself, or use a closure to inject them at registration time. Your skill function closes over the API key; Claude only sees sanitized input parameters.
Can Claude call multiple skills in a single turn?
Yes. When Claude decides it needs multiple tools, it can emit multiple tool_use blocks in a single response. Your loop needs to iterate over all blocks in response.content, execute each skill, and return all results in a single tool_result user message. The code in Step 4 already handles this — the tool_results list collects all results before appending.
What’s the difference between Claude tools and MCP skills?
Claude tools (what this tutorial covers) are defined inline in your API call — you own the execution loop and the transport. MCP (Model Context Protocol) is a standardized protocol where skills live in separate servers that Claude can discover and invoke. MCP is better for shared, reusable skills across multiple agents; inline tools are simpler for single-agent use cases where you control everything.
How many tools can I register before performance degrades?
Anthropic doesn’t publish a hard limit, but in practice beyond 20-30 tools you’ll see Claude start making less reliable tool selection decisions — there’s too much for it to reason about efficiently. If you need more, group related skills behind a dispatcher skill, or use a dynamic tool retrieval system that only surfaces the 5-10 most relevant skills per query based on semantic similarity.
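A naive version of that retrieval step, with keyword overlap standing in for real semantic similarity (in practice you would replace `score` with embedding-based cosine similarity):

```python
def score(query: str, description: str) -> int:
    # Crude relevance proxy: count query words that appear in the tool description
    return len(set(query.lower().split()) & set(description.lower().split()))

def select_tools(query: str, all_tools: list[dict], k: int = 5) -> list[dict]:
    """Surface only the k most relevant tool schemas for this query."""
    ranked = sorted(all_tools, key=lambda t: score(query, t["description"]), reverse=True)
    return ranked[:k]

tools = [
    {"name": "get_current_weather", "description": "current weather for a city"},
    {"name": "get_stock_price", "description": "latest stock price for a ticker"},
]
top = select_tools("what is the weather in Paris", tools, k=1)
```

Pass `top` as the `tools` argument instead of the full registry and Claude only has to choose among plausible candidates.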
How do I test that Claude is calling my skill with the right arguments?
Log every tool_use block before executing it — capture block.name, block.input, and the result. Run your test prompts and inspect those logs. For automated testing, use Claude Haiku (roughly $0.001 per call) to run a suite of prompts and assert that the expected skill was invoked with arguments matching your expected patterns. Unit test the skill functions independently using mocks.
What happens if my skill takes too long and the agent times out?
The Anthropic API call itself will wait for your tool result — there’s no server-side timeout on your execution. The risk is your own infrastructure timing out. Always set explicit timeouts on any I/O inside your skill (as shown in Step 3 with timeout=5.0), and return an error dict when they trigger. If a skill legitimately takes 30+ seconds, consider making it async and returning a job ID that Claude can poll with a second skill call.
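A sketch of that submit-and-poll pattern with two skills and an in-memory job store (everything here is an assumption — swap the dict and thread for a real queue and worker in production):

```python
import threading
import time
import uuid

JOBS: dict[str, dict] = {}  # job_id -> {"status": ..., "result": ...}

def slow_work(city: str) -> dict:
    time.sleep(0.1)  # stand-in for a 30s+ computation
    return {"city": city, "temperature": 18.5}

def start_weather_job(raw_input: dict) -> dict:
    """Skill 1: kick off the work in the background, return a job id immediately."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "running", "result": None}

    def runner():
        JOBS[job_id] = {"status": "done", "result": slow_work(raw_input["city"])}

    threading.Thread(target=runner, daemon=True).start()
    return {"job_id": job_id, "status": "running"}

def check_weather_job(raw_input: dict) -> dict:
    """Skill 2: Claude polls this with the job_id from skill 1."""
    job = JOBS.get(raw_input.get("job_id", ""))
    if job is None:
        return {"error": "Unknown job_id"}
    return job
```

Register both skills and describe the polling relationship in their schemas; Claude will call the second one with the job id from the first on its own.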
Put this into practice
Try the MCP Integration Engineer agent — ready to use, no setup required.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

