Purpose: Quick reference for working on Sifaka
Last Updated: 2025-01-21
New to this repo? Run these 5 commands first:
# 1. Verify you're on a feature branch (NEVER work on main)
git status && git branch
# 2. Run all quality checks
pytest --cov=sifaka --cov-report=term-missing
mypy sifaka/
ruff check .
black .
# 3. Run specific critic test to verify environment
pytest tests/critics/test_reflexion.py -v
# 4. Check for any TODOs or placeholders (should be NONE)
grep -r "TODO\|FIXME\|NotImplementedError" sifaka/ || echo "✅ No placeholders found"
# 5. Verify coverage is >80%
pytest --cov=sifaka | tail -1
Sifaka: AI text improvement through research-backed critique with complete observability (v0.2.0-alpha)
Stack: Python 3.10+, PydanticAI 1.14+, provider-agnostic (OpenAI/Anthropic/Google/Groq)
Coverage: 85%+ test coverage, strict mypy, comprehensive examples
sifaka/
├── sifaka/
│ ├── core/
│ │ ├── config/ # Configuration management
│ │ └── engine/ # Improvement engine
│ ├── critics/
│ │ └── core/ # Critique implementations (Reflexion, Constitutional AI, etc.)
│ ├── storage/ # Storage backends (file, redis)
│ ├── tools/ # Utility tools
│ └── validators/ # Validation logic
├── examples/ # Usage examples
├── tests/ # Unit + integration tests
└── pyproject.toml # Dependencies and config
All critics implement research-backed techniques (Reflexion, Constitutional AI, Self-Refine).
from sifaka.critics.core import BaseCritic, CritiqueResult

class MyCritic(BaseCritic):
    """Implement a specific research-backed critique technique."""

    async def critique(self, text: str, context: dict) -> CritiqueResult:
        """Apply critique to text.

        Args:
            text: Text to critique
            context: Additional context for critique

        Returns:
            CritiqueResult with feedback and improvement suggestions
        """
        # Implementation following research methodology
        ...
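As a concrete illustration, a minimal critic following this interface might look like the sketch below. The stand-in `BaseCritic` and `CritiqueResult` stubs (and the `feedback`/`suggestions`/`needs_improvement` field names) are assumptions so the sketch runs on its own; the real classes live in `sifaka.critics.core`.

```python
from dataclasses import dataclass, field

# Stand-in stubs so the sketch is self-contained; the real
# BaseCritic and CritiqueResult live in sifaka.critics.core.
@dataclass
class CritiqueResult:
    feedback: str
    suggestions: list[str] = field(default_factory=list)
    needs_improvement: bool = True

class BaseCritic:
    async def critique(self, text: str, context: dict) -> CritiqueResult:
        ...

class LengthCritic(BaseCritic):
    """Toy critic: flags text that is shorter than a target length."""

    def __init__(self, min_words: int = 50) -> None:
        self.min_words = min_words

    async def critique(self, text: str, context: dict) -> CritiqueResult:
        words = len(text.split())
        if words >= self.min_words:
            return CritiqueResult("Length is adequate.", [], False)
        return CritiqueResult(
            f"Text has {words} words, below the {self.min_words}-word target.",
            ["Expand the weakest section with supporting detail."],
            True,
        )
```

Usage: `asyncio.run(LengthCritic().critique("draft text", {}))`. A real critic would replace the word count with an LLM critique call via PydanticAI.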
Must work with ANY LLM provider (OpenAI, Anthropic, Google, Groq).
# ✅ GOOD
from sifaka import improve_sync
result = improve_sync("Text to improve", provider="anthropic", model="claude-3-5-sonnet")
# ❌ BAD
from openai import OpenAI
client = OpenAI() # Hardcoded to OpenAI
All improvement operations must provide full audit trails.
result = improve_sync("Text to improve")

# Access complete trace
for iteration in result.trace:
    print(f"Iteration {iteration.number}: {iteration.improvement}")
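A minimal shape for such a trace can be sketched with dataclasses. Only the `number` and `improvement` field names come from the loop above; everything else is a hypothetical stand-in, not the actual `ImprovementResult` API.

```python
from dataclasses import dataclass, field

@dataclass
class TraceEntry:
    number: int        # iteration index, as used in the loop above
    improvement: str   # summary of what changed this iteration
    critique: str = "" # critic feedback that drove the change

@dataclass
class ImprovementResult:
    final_text: str
    trace: list[TraceEntry] = field(default_factory=list)

# Each improvement iteration appends an entry, so the full history survives.
result = ImprovementResult(
    final_text="Improved text.",
    trace=[TraceEntry(1, "Tightened the opening sentence.")],
)
for iteration in result.trace:
    print(f"Iteration {iteration.number}: {iteration.improvement}")
```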
All functions require type hints, no Any without justification.
Production-grade code only. Complete implementations or nothing.
If you start, you finish: implementation, tests, examples, and `__init__.py` exports.
All critics and validators use PydanticAI for type-safe LLM responses.
- pytest tests/, pytest --cov=sifaka, pytest -v
- black .
- ruff check .
- mypy sifaka/ (strict mode required)
- tests/critics/
- tests/integration/
- __init__.py files
- examples/ for new user-facing features

Core Architecture (Why: Affects all critique operations):
- sifaka/critics/core/ - Must follow research-backed patterns
- sifaka/core/engine/ - All critics depend on this
- sifaka/__init__.py (improve, improve_sync functions) - Breaking changes for users

Observability & Storage (Why: Audit trail integrity):
- sifaka/storage/ - Data persistence implications
- sifaka/validators/ - Quality control affected

Dependencies & Config (Why: Security and maintenance burden):
- pyproject.toml - Increases attack surface
- sifaka/core/config/ - System-wide effects
- README.md examples or API documentation - User-facing changes

CRITICAL SECURITY VIOLATION ⚠️:
Other Prohibitions:
- .env files or API keys (use environment variables)
- ~/.claude/ configuration files
- sifaka/ repository
- pyproject.toml

Detection Commands (Run before committing):
# Check for security violations
grep -r "API_KEY\|SECRET\|PASSWORD" sifaka/ tests/ examples/ && echo "🚨 CREDENTIALS FOUND" || echo "✅ No credentials"
# Check for code quality violations
grep -r "TODO\|FIXME" sifaka/ && echo "🚨 TODO comments found" || echo "✅ No TODOs"
# Check for incomplete features
grep -r "NotImplementedError\|pass # TODO" sifaka/ && echo "🚨 Placeholder code found" || echo "✅ No placeholders"
# Verify on feature branch
git branch --show-current | grep -E "^(main|master)$" && echo "🚨 ON MAIN BRANCH - CREATE FEATURE BRANCH" || echo "✅ On feature branch"
# Verify coverage >80%
pytest --cov=sifaka 2>&1 | grep "TOTAL" | awk '{if ($NF+0 < 80) print "🚨 COVERAGE " $NF " < 80%"; else print "✅ Coverage " $NF}'
Detection: New critic doesn't follow Reflexion/Constitutional AI methodology
Prevention: Copy existing critic as template (reflexion.py)
Fix: Implement critique following published research
Why It Matters: Research validity and credibility depend on authentic methodologies
Detection: Missing audit trail entries for operations
Prevention: Ensure all operations add to the trace
Fix: Add trace entries for each improvement iteration
Why It Matters: Complete observability is a core feature
Detection: from openai import OpenAI in critic code
Prevention: Use PydanticAI provider abstraction
Fix: Replace direct provider imports with PydanticAI
Why It Matters: Provider-agnostic design is a requirement
Detection: New critic not importable from sifaka
Prevention: Add to __init__.py exports in both critics/ and root
Fix: Add from .my_critic import MyCritic and update __all__
Why It Matters: Users can't use the critic if it isn't exported
Detection: Functions without Args/Returns/Example sections
Prevention: Write docstring before implementation
Fix: Add complete docstring with all sections
Why It Matters: Docstrings are user documentation
`Any` Type
Detection: grep -r "from typing import Any" sifaka/
Prevention: Use specific Pydantic models for type safety
Fix: Create Pydantic model for response structure
Why It Matters: Type safety prevents bugs
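The fix can be sketched as replacing an untyped payload with a specific model. The repo uses Pydantic models; a stdlib dataclass stands in below so the sketch is self-contained, and the field names are hypothetical.

```python
from dataclasses import dataclass

# ❌ BAD: an untyped payload hides the response structure
# def parse_response(raw) -> dict: ...

# ✅ GOOD: a specific model documents and enforces the shape
# (in the repo, use a pydantic.BaseModel for runtime validation)
@dataclass
class CritiqueResponse:
    feedback: str
    suggestions: list[str]
    needs_improvement: bool

response = CritiqueResponse(
    feedback="Claims need citations.",
    suggestions=["Cite a source for the second paragraph."],
    needs_improvement=True,
)
```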
Detection: pytest --cov=sifaka shows coverage <80%
Prevention: Write tests as you code
Fix: Add unit tests until coverage >80%
Why It Matters: Untested critics will break
When to Mock:
When to Use Real Dependencies:
Example:
# ✅ GOOD - Mock LLM call
@pytest.mark.asyncio
async def test_critic_mocked(mocker):
    mocker.patch("sifaka.critics.core.reflexion.Agent.run")
    critic = ReflexionCritic()
    # Test logic without hitting real API

# ✅ GOOD - Real Pydantic validation
def test_result_validation():
    result = ImprovementResult(final_text="improved", improvement_score=0.95)
    assert result.improvement_score == 0.95  # Real validation

# ❌ BAD - Using real API in tests
async def test_improve():
    result = await improve("text", provider="openai")  # Costs money!
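The mocking pattern above can also be written with stdlib `unittest.mock`. The sketch below patches a hypothetical agent attribute on a stub critic, so the test never hits a real API; the class names and `agent.run` call are stand-ins, not the real Sifaka internals.

```python
import asyncio
from unittest.mock import AsyncMock, patch

class StubAgent:
    """Stand-in for the LLM agent a real critic would call."""
    async def run(self, prompt: str) -> str:
        raise RuntimeError("would hit a real API")

class StubCritic:
    def __init__(self) -> None:
        self.agent = StubAgent()

    async def critique(self, text: str) -> str:
        return await self.agent.run(f"Critique: {text}")

def test_critic_mocked() -> None:
    critic = StubCritic()
    # Patch the agent call so no network request is made.
    with patch.object(critic.agent, "run", AsyncMock(return_value="Needs citations.")):
        feedback = asyncio.run(critic.critique("Draft text"))
    assert feedback == "Needs citations."

test_critic_mocked()
```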
# 1. Tests pass with coverage
pytest --cov=sifaka --cov-report=term-missing
if [ $? -ne 0 ]; then echo "🚨 TESTS FAILED OR COVERAGE <80%"; exit 1; fi
# 2. Type checking clean
mypy sifaka/
if [ $? -ne 0 ]; then echo "🚨 TYPE ERRORS - FIX BEFORE COMMIT"; exit 1; fi
# 3. Linting clean
ruff check .
if [ $? -ne 0 ]; then echo "🚨 LINT ERRORS - FIX BEFORE COMMIT"; exit 1; fi
# 4. Formatted
black .
# 5. No TODOs or placeholders
grep -r "TODO\|FIXME\|NotImplementedError" sifaka/ && echo "🚨 REMOVE TODOs" && exit 1
# 6. No credentials
grep -r "API_KEY\|SECRET\|PASSWORD" sifaka/ tests/ examples/ && echo "🚨 CREDENTIALS FOUND" && exit 1
# All checks passed
echo "✅ All checks passed - ready to commit"
git add <files>
git commit -m "Clear message"
Don’t flatter me. I know what AI sycophancy is and I don’t want your praise. Be concise and direct. Don’t use emdashes ever.
When to Analyze (Multiple Triggers):
Identify Failures:
Analyze Each Failure:
Update AGENTS.md (In Real-Time):
Priority Levels:
Example Pattern:
Failure: Committed TODO comments in production code (violated "No Partial Features" rule)
Detection: `grep -r "TODO" src/` before commit
Rule Update: Add pre-commit check pattern to Boundaries section
Priority: 🟡 IMPORTANT
Action Taken: Proposed rule update to user mid-session, updated AGENTS.md
Proactive Analysis:
git status and git branch
git checkout -b feature/my-feature
pytest tests/

pytest tests/ # Tests pass
mypy sifaka/ # Mypy clean
ruff check . # Ruff clean
black . # Black formatted
examples/ if user-facing

# 1. Create critic file
touch sifaka/critics/core/my_critic.py
# 2. Implement BaseCritic interface
# - critique() method
# - Research-backed methodology
# - Complete observability
# 3. Export in sifaka/critics/__init__.py
from .core.my_critic import MyCritic
__all__ = [..., "MyCritic"]
# 4. Export in sifaka/__init__.py
from .critics import MyCritic
__all__ = [..., "MyCritic"]
# 5. Write tests
touch tests/critics/test_my_critic.py
# 6. Add example
touch examples/my_critic_example.py
# 1. Create validator file
touch sifaka/validators/my_validator.py
# 2. Implement validation logic with type safety
# 3. Export in sifaka/validators/__init__.py
# 4. Write tests
touch tests/validators/test_my_validator.py
pytest tests/ # Run all tests (unit + integration)
pytest tests/unit/ # Run unit tests only (fast, mocked dependencies)
pytest tests/integration/ # Run integration tests only (end-to-end flows)
pytest -v # Run all tests with verbose output (shows test names and status)
pytest --cov=sifaka # Generate coverage report (requires >80% coverage)
pytest --cov=sifaka --cov-report=term-missing # Coverage with missing lines highlighted
pytest tests/critics/test_reflexion.py -v # Run specific test file with verbose output
pytest -k "test_improve" -v # Run tests matching pattern "test_improve"
pytest -q # Quiet mode (minimal output, only failures)
async def improve(text: str, iterations: int = 3, provider: str = "openai") -> ImprovementResult:
    """Improve text through iterative critique.

    Args:
        text: The text to improve
        iterations: Number of improvement iterations (default: 3)
        provider: LLM provider to use (openai, anthropic, google, groq)

    Returns:
        ImprovementResult with final text, trace, and metrics

    Raises:
        ValueError: If text is empty
        ConfigError: If provider configuration is invalid

    Example:
        >>> result = await improve("Write about AI safety", iterations=2)
        >>> print(result.final_text)
        >>> print(f"Improved by {result.improvement_score:.2f}")
    """
black . # Format code with black (line length 88, modifies files)
ruff check . # Lint code with ruff (checks style and potential bugs)
mypy sifaka/ # Type check in strict mode (all functions must have type hints)
TodoWrite enforcement (MANDATORY): For ANY task with 3+ distinct steps, use TodoWrite to track progress - even if the user doesn’t request it explicitly. This ensures nothing gets forgotten and provides visibility into progress for everyone working on the project.
Plan before executing: For complex tasks, create a plan first. Understand requirements, identify dependencies, then execute systematically.
Full data display: Show complete data structures, not summaries or truncations. Examples should display real, useful output (not “[truncated]” or “…”).
Debugging context: When showing debug output, include enough detail to actually debug - full prompts, complete responses, actual data structures. Truncating output defeats the purpose.
Verify usefulness: Before showing output, verify it’s actually helpful for the user’s goal. Test that examples demonstrate real functionality, not abstractions.
Auto-detect technical audiences: Code examples, technical docs, developer presentations → eliminate ALL marketing language automatically. Engineering contexts get technical tone (no superlatives like “blazingly fast”, “magnificent”, “revolutionary”).
Recognize audience immediately: Engineers get technical tone, no marketing language. Business audiences get value/ROI focus. Academic audiences get methodology and rigor. Adapt tone and content immediately based on context.
Separate material types: Code examples stay clean (no narratives or marketing). Presentation materials (openers, talking points) live in separate files. Documentation explains architecture and usage patterns.
Test output quality, not just functionality: Run code AND verify the output is actually useful. Truncated or abstracted output defeats the purpose of examples. Show real data structures, not summaries.
Verify before committing: Run tests and verify examples work before showing output. Test both functionality and usefulness.
Connect work to strategy: Explicitly reference project milestones, coverage targets, and strategic priorities when completing work. Celebrate milestones when achieved.
Iterate fast: Ship → test → get feedback → fix → commit. Don’t perfect upfront. Progressive refinement beats upfront perfection.
Proactive problem solving: Use tools like Glob to check file existence before execution. Anticipate common issues and handle them gracefully.
Parallel execution: Batch independent operations (multiple reads, parallel test execution) to improve efficiency.
Direct feedback enables fast iteration: Clear, immediate feedback on what’s wrong enables rapid course correction. Specific, actionable requests work better than vague suggestions.
Match user communication style: Some users prefer speed over process formality, results over explanations. Adapt communication style accordingly while maintaining quality standards.
Commit hygiene: Each meaningful change gets its own commit with clear message (what + why). This makes progress tracking and rollback easier.
Clean git workflow: Always check git status and git branch before operations. Use feature branches for all changes.
Questions? Check existing critics in sifaka/critics/core/ or README.md (user docs)
Last Updated: 2025-01-22