AI text improvement through research-backed critique with complete observability
Status: Alpha software (v0.3.0). Functional but early-stage. Best suited for evaluation, experimentation, and development.
The Problem: AI-generated text often needs refinement. How do you know if AI output is good enough? How can you systematically improve it without manual review of every output?
What Sifaka Provides:
- Research-backed critics that iteratively improve AI-generated text
- Validators that enforce length, content, and custom quality requirements
- A complete audit trail of every critique and revision (full observability)
- Support for multiple LLM providers (OpenAI, Anthropic, Google, Groq)
- Pluggable storage backends (file, Redis) for saving results
Use Case Example: Generate product descriptions for e-commerce. Sifaka:
1. Generates an initial draft with your chosen LLM provider
2. Critiques it using research-backed techniques
3. Validates it against your requirements (e.g. length, required terms)
4. Iterates until the text passes or `max_iterations` is reached
5. Records the full trace so you can audit every step
```bash
# Clone the repository
git clone https://github.com/sifaka-ai/sifaka
cd sifaka

# Install with uv (recommended)
uv pip install -e .

# Or with standard pip
pip install -e .
```
Configure your LLM provider API keys via environment variables or a `.env` file:
```bash
# OpenAI (default provider)
export OPENAI_API_KEY=sk-...

# Or Anthropic
export ANTHROPIC_API_KEY=sk-ant-...

# Or Google
export GOOGLE_API_KEY=...

# Or Groq
export GROQ_API_KEY=...
```
```python
import asyncio

from sifaka import improve

async def main():
    result = await improve("Write about renewable energy benefits")
    print(result.final_text)
    print(f"\nImprovement score: {result.improvement_score:.2f}")
    print(f"Iterations: {result.iteration}")

asyncio.run(main())
```
```python
from sifaka import improve_sync

result = improve_sync("Write about renewable energy benefits")
print(result.final_text)
```
Sifaka implements peer-reviewed critique techniques from academic research:
| Critic | Best For | Research Paper |
|---|---|---|
| SELF_REFINE | General improvement | Self-Refine (2023) |
| REFLEXION | Learning from mistakes | Reflexion (2023) |
| CONSTITUTIONAL | Safety & ethics | Constitutional AI (2022) |
| SELF_CONSISTENCY | Balanced perspectives | Self-Consistency (2022) |
| SELF_RAG | Fact-checking | Self-RAG (2023) |
| META_REWARDING | Self-evaluation | Meta-Rewarding (2024) |
| N_CRITICS | Multiple perspectives | N-Critics (2023) |
| STYLE | Tone & style | Custom implementation |
```python
result = await improve("Your text")

# Access complete audit trail
for iteration in result.trace:
    print(f"Iteration {iteration.number}")
    print(f"  Critique: {iteration.critique}")
    print(f"  Improvement: {iteration.improvement}")
    print(f"  Time: {iteration.processing_time:.2f}s")
```
```python
# OpenAI
result = await improve(text, provider="openai", model="gpt-4o-mini")

# Anthropic
result = await improve(text, provider="anthropic", model="claude-3-5-sonnet")

# Google
result = await improve(text, provider="google", model="gemini-1.5-flash")

# Groq (fast inference)
result = await improve(text, provider="groq", model="llama3-8b-8192")
```
```python
from sifaka.validators import LengthValidator, ContentValidator

result = await improve(
    "Write a product description",
    validators=[
        LengthValidator(min_length=100, max_length=200),
        ContentValidator(required_terms=["features", "benefits"])
    ]
)
```
```python
from sifaka import improve

result = await improve("AI is important for business.")
print(result.final_text)
# Output: "Artificial intelligence transforms business operations by automating..."
```
```python
from sifaka import improve
from sifaka.core.types import CriticType

# Single critic
result = await improve(
    "Explain quantum computing",
    critics=[CriticType.REFLEXION]
)

# Multiple critics
result = await improve(
    "Explain quantum computing",
    critics=[CriticType.REFLEXION, CriticType.SELF_REFINE]
)
```
```python
from sifaka.critics.style import StyleCritic

result = await improve(
    "We offer comprehensive solutions for your needs.",
    critics=[StyleCritic(
        style_description="Casual and friendly",
        style_examples=["Hey there!", "No worries!"]
    )]
)
```
```python
result = await improve(
    "The Great Wall of China is visible from space.",
    critics=[CriticType.SELF_RAG]
)
# Critiques factual accuracy and suggests corrections
```
```python
result = await improve(
    "Guide on pest control methods",
    critics=[CriticType.CONSTITUTIONAL]
)
# Evaluates against safety principles
```
```python
result = await improve(
    "Product launch announcement",
    critics=[CriticType.N_CRITICS]
)
# Gets feedback from technical expert, general audience, editor, skeptic perspectives
```
```python
# More iterations for higher quality
result = await improve(
    "Draft email to client",
    max_iterations=5  # Default is 3
)
```
```python
# Force improvements even if validation passes
result = await improve(
    "Good text that passes validation",
    force_improvements=True
)
```
```python
from sifaka import Config

config = Config(
    model="gpt-4",
    temperature=0.7,
    max_iterations=5,
    timeout_seconds=120
)

result = await improve("Your text", config=config)
```
```python
from sifaka.storage.file import FileStorage
from sifaka.storage.redis import RedisStorage

# File storage
result = await improve(
    "Your text",
    storage=FileStorage("./results")
)

# Redis storage
result = await improve(
    "Your text",
    storage=RedisStorage("redis://localhost:6379")
)
```
```python
from sifaka.core.exceptions import ValidationError, CriticError

try:
    result = await improve(text)
except ValidationError as e:
    print(f"Validation failed: {e}")
except CriticError as e:
    print(f"Critic error: {e}")
```
```python
import asyncio

texts = ["Text 1", "Text 2", "Text 3"]
tasks = [improve(text) for text in texts]
results = await asyncio.gather(*tasks)
```
```python
from sifaka.validators import BaseValidator

class CustomValidator(BaseValidator):
    async def validate(self, text: str) -> ValidationResult:
        # Your custom validation logic
        passed = "important_keyword" in text.lower()
        return ValidationResult(
            validator="custom",
            passed=passed,
            message="Must contain 'important_keyword'"
        )

result = await improve(text, validators=[CustomValidator()])
```
```python
# Technical accuracy + readability
result = await improve(
    "Technical documentation",
    critics=[CriticType.REFLEXION, CriticType.STYLE]
)

# Safety + factual accuracy
result = await improve(
    "Health advice article",
    critics=[CriticType.CONSTITUTIONAL, CriticType.SELF_RAG]
)

# Comprehensive review
result = await improve(
    "Important business document",
    critics=[
        CriticType.SELF_REFINE,
        CriticType.N_CRITICS,
        CriticType.META_REWARDING
    ]
)
```
```bash
# LLM Provider Keys
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
GROQ_API_KEY=...

# Optional: Default settings
SIFAKA_DEFAULT_MODEL=gpt-4o-mini
SIFAKA_MAX_ITERATIONS=3
SIFAKA_TEMPERATURE=0.7
```
```python
from sifaka import Config

config = Config(
    # Model settings
    model="gpt-4",              # LLM model to use
    temperature=0.7,            # Creativity (0.0-2.0)
    max_tokens=1000,            # Max response length

    # Critic settings
    critic_temperature=0.3,     # Lower = more consistent
    critic_context_window=3,    # Previous critiques to consider

    # Behavior settings
    max_iterations=3,           # Max improvement cycles
    force_improvements=False,   # Improve even if valid
    timeout_seconds=300,        # Overall timeout
)
```
```
┌─────────────────────────────────────────────┐
│          Sifaka Improvement Loop            │
└─────────────────────────────────────────────┘
                     │
                     ▼
       ┌──────────────────────────┐
       │   1. Generate/Modify     │
       │     (LLM Provider)       │
       └──────────────────────────┘
                     │
                     ▼
       ┌──────────────────────────┐
       │       2. Critique        │
       │  (Critics: Reflexion,    │
       │   Constitutional, etc.)  │
       └──────────────────────────┘
                     │
                     ▼
       ┌──────────────────────────┐
       │       3. Validate        │
       │  (Validators: Length,    │
       │   Content, Custom)       │
       └──────────────────────────┘
                     │
                     ▼
       ┌──────────────────────────┐
       │       4. Improve         │
       │   (Apply Suggestions)    │
       └──────────────────────────┘
                     │
                     ▼
         [Repeat or Return Result]
```
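The loop above can be sketched in plain Python. Everything below (the stand-in `generate`/`critique`/`validate` functions and the `Result` shape) is illustrative scaffolding, not Sifaka's actual internals:

```python
import asyncio
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins for the LLM provider, critics, and validators.
async def generate(text: str, feedback: Optional[str]) -> str:
    return text.upper() if feedback else text

async def critique(text: str) -> str:
    return "" if text.isupper() else "use stronger wording"

async def validate(text: str) -> bool:
    return len(text) > 0

@dataclass
class Result:
    final_text: str
    iteration: int

async def improve_loop(prompt: str, max_iterations: int = 3) -> Result:
    text, feedback = prompt, None
    for i in range(1, max_iterations + 1):
        text = await generate(text, feedback)   # 1. Generate/Modify
        feedback = await critique(text)         # 2. Critique
        valid = await validate(text)            # 3. Validate
        if valid and not feedback:
            return Result(text, i)              # Return result
        # 4. Improve: the critique feeds the next generation pass
    return Result(text, max_iterations)

result = asyncio.run(improve_loop("draft text"))
print(result.final_text)   # DRAFT TEXT
print(result.iteration)    # 2
```

The key design point the sketch captures: the critique is carried forward as feedback, so each pass conditions generation on what the previous pass got wrong.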
Core components:
- Engine (`core/engine/`): Orchestrates the improvement loop
- Critics (`critics/core/`): Research-backed critique implementations
- Validators (`validators/`): Quality checks and requirements
- Storage (`storage/`): File and Redis storage backends
- Config (`core/config/`): Configuration management

Q: Which LLM providers are supported?
A: OpenAI (GPT-4, GPT-3.5), Anthropic (Claude), Google (Gemini), Groq. Any OpenAI-compatible API also works.
Q: Do I need API keys for all providers?
A: No, only for the provider you want to use. Sifaka auto-detects available providers from environment variables.
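Auto-detection along these lines can be sketched with `os.environ`. The variable names come from this README's configuration section; the dict ordering and function name are illustrative, not Sifaka's actual API:

```python
import os

# Env-var names as listed in the configuration section of this README.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
    "groq": "GROQ_API_KEY",
}

def detect_providers():
    """Return the providers whose API key is set in the environment."""
    return [name for name, var in PROVIDER_KEYS.items() if os.environ.get(var)]

os.environ["GROQ_API_KEY"] = "gsk-example"  # simulate one configured key
print(detect_providers())  # includes "groq"; others only if their keys are set
```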
Q: Can I use multiple critics at once?
A: Yes! Combine critics for comprehensive review: `critics=[CriticType.SELF_REFINE, CriticType.REFLEXION]`
Q: How much does it cost?
A: Costs depend on your LLM provider, model choice, text length, iterations, and critic count. Typical improvements cost $0.001-0.01 per text with efficient models (GPT-3.5-turbo, Gemini Flash).
Q: Which critic should I use?
A: It depends on your goal:
- General improvement: SELF_REFINE
- Factual content: REFLEXION or SELF_RAG
- Tone and style: STYLE
- Safety-sensitive content: CONSTITUTIONAL
- Balanced or multiple perspectives: SELF_CONSISTENCY or N_CRITICS

Q: Can I create custom critics?
A: Yes! Implement the `CriticPlugin` interface (see `examples/` for reference implementations).
Q: How can I improve performance?
A: Use a faster model (e.g. via Groq or Gemini Flash) and reduce `max_iterations` to 1 or 2.

Q: Does Sifaka cache results?
A: Not by default. Use `FileStorage` or `RedisStorage` to save results, or implement custom caching.
Q: Why am I getting timeout errors?
A: Increase `timeout_seconds`, reduce `max_iterations`, or use faster models.
Q: Why isn’t my text improving?
A: Try different temperature settings (0.7-0.9), different critics, larger models, or check input text quality.
Q: How do I debug issues?
A: Enable logging with `logging.basicConfig(level=logging.DEBUG)` or use the Logfire integration.
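A minimal debug-logging setup along those lines; note that the logger name `"sifaka"` is an assumption about the library's logger naming, not confirmed by this README:

```python
import logging

# Route DEBUG-level records from all loggers to stderr with timestamps.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# Narrow to the library's own loggers if the global output is too noisy
# (logger name "sifaka" is assumed here).
logging.getLogger("sifaka").setLevel(logging.DEBUG)
logging.getLogger("sifaka").debug("debug logging enabled")
```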
Q: Can I use Sifaka in production?
A: Yes, with caution: it is alpha software. It does include production-oriented features such as error handling, timeouts, connection pooling, storage backends, and monitoring integration.
Q: Does Sifaka work with async frameworks?
A: Yes! Fully async, works with FastAPI, aiohttp, Django (async views), and any async Python framework.
Q: Is there a synchronous API?
A: Yes: `from sifaka import improve_sync`
For developers and contributors, see AGENTS.md for development guidelines. Common commands:
```bash
# Run tests
pytest tests/

# Type checking
mypy sifaka/

# Linting
ruff check .

# Formatting
black .

# Coverage
pytest --cov=sifaka
```
MIT License - see LICENSE file for details.
Contributions welcome! This is alpha software under active development.
If you use Sifaka in research, please cite the underlying papers:
```bibtex
@article{madaan2023self,
  title={Self-Refine: Iterative Refinement with Self-Feedback},
  author={Madaan, Aman and others},
  journal={arXiv preprint arXiv:2303.17651},
  year={2023}
}

@article{shinn2023reflexion,
  title={Reflexion: Language Agents with Verbal Reinforcement Learning},
  author={Shinn, Noah and others},
  journal={arXiv preprint arXiv:2303.11366},
  year={2023}
}

@article{bai2022constitutional,
  title={Constitutional AI: Harmlessness from AI Feedback},
  author={Bai, Yuntao and others},
  journal={arXiv preprint arXiv:2212.08073},
  year={2022}
}
```