# Building Safe Agents

This guide shows how to build AI agents with integrated Sentinel safety validation.
## Overview
Building safe AI agents requires validation at multiple points:
1. Input validation: Check user requests before processing
2. Thought validation: Monitor agent reasoning for safety concerns
3. Action validation: Validate actions before execution
4. Output validation: Check responses before returning to users
## Quick Start

### Basic Agent Pattern
```python
from sentinelseed import Sentinel
from sentinelseed.integrations.agent_validation import SafetyValidator


class SafeAgent:
    def __init__(self):
        self.sentinel = Sentinel(seed_level="standard")
        self.safety = SafetyValidator(
            provider="openai",
            seed_level="standard",
        )

    def run(self, user_input: str):
        # 1. Validate input
        input_check = self.sentinel.validate_request(user_input)
        if not input_check["should_proceed"]:
            return f"Cannot process request: {input_check['concerns']}"

        # 2. Process with LLM (seed injected)
        response = self.sentinel.chat(user_input)

        # 3. Validate output (done automatically if validate_response=True)
        if response.get("validation") and not response["validation"]["is_safe"]:
            return "Response blocked for safety reasons."

        return response["response"]


agent = SafeAgent()
result = agent.run("Help me with Python")
```
## Seed Integration
Add Sentinel's alignment seed to your agent's system prompt:
```python
from sentinelseed import Sentinel

sentinel = Sentinel(seed_level="standard")
seed = sentinel.get_seed()

# Add to system prompt
messages = [
    {"role": "system", "content": seed},
    {"role": "user", "content": user_input},
]
```
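The `messages` list can then be sent to your model provider as usual. For example, with the OpenAI Python SDK (the model name is illustrative, and `user_input` is whatever message your agent just received):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.chat.completions.create(
    model="gpt-4o-mini",   # illustrative model choice
    messages=messages,     # system seed + user message from the snippet above
)
print(completion.choices[0].message.content)
```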
### Seed Levels
| Level | Tokens | Use Case |
|---|---|---|
| `minimal` | ~360 | Latency-critical applications |
| `standard` | ~1K | General purpose (recommended) |
| `full` | ~1.9K | High-risk applications |
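The level is chosen at construction time; for example, using the same `Sentinel` constructor shown above:

```python
from sentinelseed import Sentinel

# Latency-critical path: smallest seed (~360 tokens)
fast_sentinel = Sentinel(seed_level="minimal")

# High-risk application: largest seed (~1.9K tokens)
strict_sentinel = Sentinel(seed_level="full")
```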
## Action Validation
Validate actions before execution:
```python
from sentinelseed.integrations.agent_validation import SafetyValidator

validator = SafetyValidator(provider="openai")


def execute_tool(tool_name: str, args: dict):
    # Validate before execution
    action_desc = f"{tool_name}({args})"
    check = validator.validate_action(
        action=action_desc,
        purpose="User requested this action",
    )

    if not check.should_proceed:
        return {
            "success": False,
            "reason": check.reasoning,
            "concerns": check.concerns,
        }

    # Execute the tool
    return run_tool(tool_name, args)
```
## Decorator Pattern

Use the `ExecutionGuard` decorator for automatic validation:
```python
from sentinelseed.integrations.agent_validation import ExecutionGuard

guard = ExecutionGuard(provider="openai")


@guard.protected
def run_shell_command(command: str):
    import subprocess
    return subprocess.run(command, shell=True, capture_output=True)


# Dangerous commands are automatically blocked
run_shell_command("ls -la")    # Allowed
run_shell_command("rm -rf /")  # Blocked
```
## Async Agents
For async agent frameworks:
```python
from sentinelseed.integrations.agent_validation import AsyncSafetyValidator

validator = AsyncSafetyValidator(provider="openai")


async def process_action(action: str):
    check = await validator.validate_action(action)
    if not check.should_proceed:
        raise ValueError(f"Blocked: {check.reasoning}")
    return await execute_action(action)
```
## Framework Integration

### LangChain
```python
from langchain_openai import ChatOpenAI
from sentinelseed.integrations.langchain import SentinelCallback

callback = SentinelCallback(
    seed_level="standard",
    on_violation="block",
    validate_input=True,
    validate_output=True,
)

llm = ChatOpenAI(callbacks=[callback])
response = llm.invoke("Your prompt")
```
### CrewAI
```python
from crewai import Agent, Task
from sentinelseed.integrations.crewai import SentinelCrew

researcher = Agent(role="Researcher", goal="...", backstory="...")
writer = Agent(role="Writer", goal="...", backstory="...")

research_task = Task(description="Research the topic", agent=researcher, expected_output="...")
write_task = Task(description="Write the report", agent=writer, expected_output="...")

crew = SentinelCrew(
    agents=[researcher, writer],
    tasks=[research_task, write_task],
    seed_level="standard",
    validate_outputs=True,
    block_unsafe=True,
)

result = crew.kickoff()
```
### LangGraph
```python
from langgraph.graph import StateGraph, MessagesState, START, END
from sentinelseed.integrations.langgraph import (
    SentinelSafetyNode,
    conditional_safety_edge,
)

# Create safety node
safety_node = SentinelSafetyNode(
    on_violation="block",
    check_input=True,
    check_output=False,
)

# Build graph (agent_node is your own agent function, defined elsewhere)
graph = StateGraph(MessagesState)
graph.add_node("safety_check", safety_node)
graph.add_node("agent", agent_node)

graph.add_edge(START, "safety_check")
graph.add_conditional_edges(
    "safety_check",
    conditional_safety_edge,
    {"continue": "agent", "blocked": END},
)
graph.add_edge("agent", END)

app = graph.compile()
```
## Memory Protection
Protect agent memory from injection attacks:
```python
from sentinelseed import MemoryIntegrityChecker

checker = MemoryIntegrityChecker(secret_key="your-secret")

# Sign memory entries
signed = checker.sign_entry({
    "content": "User requested transfer",
    "source": "discord",
    "timestamp": "2025-12-11T10:00:00Z",
})

# Verify before use
result = checker.verify_entry(signed)
if not result.valid:
    raise ValueError("Memory tampering detected")
```
## Fiduciary Validation

Ensure the agent acts in the user's best interest:
```python
from sentinelseed import FiduciaryValidator, UserContext

validator = FiduciaryValidator(strict_mode=True)

context = UserContext(
    goals=["save for retirement"],
    risk_tolerance="low",
)

result = validator.validate_action(
    action="Recommend high-risk investment",
    user_context=context,
)

if not result.compliant:
    for violation in result.violations:
        print(f"{violation.duty}: {violation.description}")
```
## Error Handling

### Fail-Open vs Fail-Closed
```python
from sentinelseed.integrations.agent_validation import SafetyValidator

# Fail-open (default): allow on validation errors
validator = SafetyValidator(fail_closed=False)

# Fail-closed: block on validation errors
validator = SafetyValidator(fail_closed=True)
```
### Exception Handling
```python
from sentinelseed.integrations.agent_validation import (
    TextTooLargeError,
    ValidationTimeoutError,
)

try:
    result = validator.validate_action(action)
except TextTooLargeError:
    # Handle oversized input
    pass
except ValidationTimeoutError:
    # Handle timeout
    pass
```
## Monitoring and Logging
Track validation statistics:
```python
# Get validation history
history = validator.get_history()

# Get statistics
stats = validator.get_stats()
print(f"Block rate: {stats['block_rate']:.2%}")
print(f"High risk: {stats['high_risk']}")
```
## Best Practices
1. Always validate user input before processing
2. Use appropriate seed level for your risk profile
3. Validate actions before execution, not after
4. Enable response validation for chat interfaces
5. Monitor block rates to detect attacks
6. Use fail-closed for high-stakes applications (a sketch combining this with block-rate monitoring follows the list)
7. Sign memory entries for long-running agents
8. Add fiduciary checks for financial/healthcare agents
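As a sketch of practices 5 and 6 together, reusing the `SafetyValidator` options from the Error Handling and Monitoring sections (the 10% alert threshold is an arbitrary example value):

```python
from sentinelseed.integrations.agent_validation import SafetyValidator

# Practice 6: fail-closed, so validator errors block rather than allow
validator = SafetyValidator(provider="openai", fail_closed=True)


def check_block_rate(alert_threshold: float = 0.10) -> None:
    # Practice 5: a rising block rate can indicate an attack campaign
    stats = validator.get_stats()
    if stats["block_rate"] > alert_threshold:
        print(f"WARNING: block rate {stats['block_rate']:.2%} exceeds {alert_threshold:.0%}")
```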
## Architecture Example
```text
   User Input
         │
         ▼
┌─────────────────┐
│ Input Validator │ ← Block jailbreak attempts
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Seed Injection  │ ← Add alignment seed
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    LLM Call     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Action Validator│ ← Validate planned actions
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Tool Execution  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Output Validator│ ← Validate final response
└────────┬────────┘
         │
         ▼
     Response
```
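The same pipeline can be expressed in code by combining the pieces from earlier sections. In this sketch, `plan_actions` and `run_tool` are placeholders for your own planning and tool-execution logic, not part of the Sentinel API:

```python
from sentinelseed import Sentinel
from sentinelseed.integrations.agent_validation import SafetyValidator

sentinel = Sentinel(seed_level="standard")
validator = SafetyValidator(provider="openai", fail_closed=True)


def run_pipeline(user_input: str) -> str:
    # Input Validator: block jailbreak attempts before any processing
    input_check = sentinel.validate_request(user_input)
    if not input_check["should_proceed"]:
        return f"Cannot process request: {input_check['concerns']}"

    # Seed Injection + LLM Call: sentinel.chat injects the alignment seed
    response = sentinel.chat(user_input)

    # Action Validator + Tool Execution: check each planned action before running it
    for tool_name, args in plan_actions(response):  # plan_actions: your own parser (placeholder)
        check = validator.validate_action(
            action=f"{tool_name}({args})",
            purpose=user_input,
        )
        if not check.should_proceed:
            return f"Action blocked: {check.reasoning}"
        run_tool(tool_name, args)  # run_tool: your own executor (placeholder)

    # Output Validator: block unsafe final text
    validation = response.get("validation")
    if validation and not validation["is_safe"]:
        return "Response blocked for safety reasons."
    return response["response"]
```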