Deterministic Prose Tokenizer for Python
Break English prose and Markdown into clean, structured data. Built for AI pipelines, writing tools, and text analysis where consistency and speed matter.
Zero dependencies, fully typed, and deterministic. Built for chunking text for RAG or building editorial guardrails in Python.
Overview
Identify headings, list items, and blockquotes as distinct blocks. Clean your text while keeping its structural meaning intact.
Lightweight design with zero runtime requirements. Small enough to run on edge servers or inside serverless functions.
Handle edge cases like 'U.S.A.', 'v1.0', and decimals without breaking sentences. Get reliable results every single time.
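For example, abbreviations, version strings, and decimals should survive sentence splitting intact. A minimal sketch (the expected count assumes the library's default English rules):

from prose_tokenizer import tokenize

doc = tokenize("Dr. Smith shipped v1.0 in the U.S.A. in Q1. Sales rose 2.5%.")
print(len(doc.sentences))  # expected: 2 ('Dr.', 'v1.0', 'U.S.A.' and the decimal do not split)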
Current workflow
Standard string splitting in Python is too limited for complex prose and Markdown content.
- Splitting on periods often breaks acronyms and abbreviations into the wrong parts (see the sketch after this list).
- Markdown syntax like headings and lists is often lost during basic text cleanup.
- Heavy NLP models like spaCy add too much weight for structural text tasks.
- Inconsistent chunking can hurt the quality of your RAG or AI pipelines.
- Handling edge cases by hand leads to messy regex and hard-to-maintain code.
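A minimal illustration of the acronym problem, using only the standard library:

# Naive period splitting shreds acronyms and decimals.
text = "The U.S.A. economy grew by 2.5% in Q1."
print(text.split("."))
# ['The U', 'S', 'A', ' economy grew by 2', '5% in Q1', '']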
Where it breaks
These gaps can slow down your development and lower the accuracy of your AI tools.
- Large dependencies bloat your Docker images and increase cold-start times.
- Unreliable sentence splitting causes lost context in LLM chunks.
- Manual text cleaning is slow and difficult to keep consistent across projects.
- Non-deterministic logic makes regression testing and debugging a nightmare.
The Python Tokenization Pipeline
prose-tokenizer provides a typed interface for repeatable text analysis.
- Scan for Markdown blocks like headings and lists before processing the prose.
- Apply rules to find real sentence boundaries while protecting initials and titles.
- Get clean word tokens while keeping contractions and hyphenated words together.
- Get total word counts, sentence lengths, and character metrics in one call.
Verified request
# Install the package
pip install prose-tokenizer

# Usage in your project
from prose_tokenizer import tokenize

content = """
### Q1 Review
The U.S.A. economy grew by 2.5% in Q1.
* Growth was driven by tech.
"""

doc = tokenize(content)
print(doc.counts.word_count)  # 15
print(doc.blocks[0].kind)     # "heading"
print(doc.sentences[0])       # "The U.S.A. economy grew by 2.5% in Q1."
Verified response
The library returns a typed document object with all your metrics and blocks.
{
  "counts": {
    "word_count": 15,
    "sentence_count": 2,
    "character_count": 82
  },
  "blocks": [
    {"kind": "heading", "text": "Q1 Review"},
    {"kind": "paragraph", "text": "The U.S.A. economy grew by 2.5% in Q1."},
    {"kind": "list_item", "text": "Growth was driven by tech."}
  ]
}
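The same structure is available directly on the returned object, so you rarely need the JSON form at all. A short sketch using only the attributes shown above:

# Walk the typed blocks instead of serializing to JSON.
for block in doc.blocks:
    print(block.kind, block.text)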
Output interpretation
The library is built for speed and clarity. It gives you the signals you need to build better tools.
- Markdown Support: Handles both ATX (#) and Setext (underlined) headings, plus all common list types.
- Safe Sentence Splitting: Protects common English abbreviations and initials.
- Fully Typed: Built with PEP 484 type hints for a better developer experience.
- Deterministic: The same text always yields the same structured result (see the check after this list).
- Zero Config: Works out of the box with standard English prose rules.
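Determinism makes regression testing trivial. A minimal sketch using only the documented entry point:

from prose_tokenizer import tokenize

sample = "The U.S.A. economy grew by 2.5% in Q1."
first = tokenize(sample)
second = tokenize(sample)

# Same input, same structured output, every run.
assert first.counts.word_count == second.counts.word_count
assert first.counts.sentence_count == second.counts.sentence_count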
Practical Usage: AI Preprocessing
Use prose-tokenizer to create better chunks for your AI and RAG pipelines.
- Load your raw Markdown or prose content into your Python script.
- Pass the text into the tokenizer to get a structured document object.
- Chunk your data by sentence or paragraph to keep related thoughts together.
- Remove Markdown noise while keeping the structural meaning for the AI.
- Feed clean, structured data into your vector database or LLM.
from prose_tokenizer import tokenize

doc = tokenize(large_text)  # large_text: your raw Markdown or prose string

# Chunk by paragraph for better context
for para in doc.paragraphs:
    save_to_vector_db(para)  # placeholder for your vector store's insert call
Choosing a Python Tokenizer
Pick the right tool for your specific text analysis or preprocessing task.
- Basic string splitting (str.split, regex): Fast but breaks on acronyms and loses all Markdown and structural data.
- Heavy NLP models (e.g., spaCy): Capable but slow and memory-heavy. Often overkill for basic structural tasks.
- prose-tokenizer: Fast, light, and stable. Built for the intersection of Markdown and English prose.
Keep Exploring
Use the Workflow Library to browse more guides, comparisons, and integration examples to continue your evaluation.
Clean up your Python text pipelines
Explore the package on GitHub or install via pip. Build deterministic text tools in Python.