Open Source Library

Deterministic Prose Tokenizer for Python

Break English prose and Markdown into clean, structured data. Built for AI pipelines, writing tools, and text analysis where consistency and speed matter.

Zero dependencies, fully typed, and deterministic. Built for chunking text for RAG or building editorial guardrails in Python.

Overview


Markdown Aware
Structural Logic

Identify headings, list items, and blockquotes as distinct blocks. Clean your text while keeping its structural meaning intact.

Dependency Free
Pure Python

Lightweight design with zero runtime requirements. Small enough to run on edge servers or inside serverless functions.

Safe Splitting
Edge-Case Segmentation

Handle edge cases like 'U.S.A.', 'v1.0', and decimals without breaking sentences. Get reliable results every single time.

Current workflow

Standard string splitting in Python is too limited for complex prose and Markdown content.

  1. Splitting on periods incorrectly breaks acronyms and abbreviations apart.
  2. Markdown structure such as headings and lists is often lost during basic text cleanup.
  3. Heavy NLP libraries like spaCy add too much weight for structural text tasks.
  4. Inconsistent chunking can hurt the quality of your RAG or AI pipelines.
  5. Handling edge cases by hand leads to messy regex and hard-to-maintain code.

Where it breaks

These gaps can slow down your development and lower the accuracy of your AI tools.

  • Large dependencies bloat your Docker images and increase cold-start times.
  • Unreliable sentence splitting causes lost context in LLM chunks.
  • Manual text cleaning is slow and difficult to keep consistent across projects.
  • Non-deterministic logic makes regression testing and debugging a nightmare.

The Python Tokenization Pipeline

prose-tokenizer provides a typed interface for repeatable text analysis.

Block Discovery

Scan for Markdown blocks like headings and lists before processing the prose.

Sentence Logic

Apply rules to find real sentence boundaries while protecting initials and titles.
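The kind of rule-based boundary detection described here can be pictured with a small pure-Python sketch. This is illustrative only, not the library's actual implementation; the abbreviation list and the version/decimal pattern are assumptions for the example:

```python
import re

# Hypothetical abbreviation list; the real rule set may differ.
ABBREVIATIONS = {"U.S.A.", "Dr.", "Mr.", "Mrs.", "vs.", "e.g.", "i.e."}

def split_sentences(text: str) -> list[str]:
    """Split on '.', '!' or '?' followed by whitespace and an uppercase
    letter or digit, unless the candidate boundary ends a protected
    abbreviation or a version/decimal token like 'v1.0'."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?](?=\s+[A-Z0-9])", text):
        end = match.end()
        tail = text[start:end].rsplit(None, 1)[-1]  # last word before the boundary
        if tail in ABBREVIATIONS or re.fullmatch(r"v?\d+(\.\d+)*\.?", tail):
            continue  # boundary falls inside a protected token; keep scanning
        sentences.append(text[start:end].strip())
        start = end
    remainder = text[start:].strip()
    if remainder:
        sentences.append(remainder)
    return sentences
```

Lookahead for an uppercase letter or digit keeps internal periods in "U.S.A." and "2.5%" from triggering a split at all; the abbreviation check catches the remaining cases like "Dr. Smith".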

Lexical Analysis

Get clean word tokens while keeping contractions and hyphenated words together.
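A token pattern that keeps contractions and hyphenated words whole might look like the following. This is an illustrative sketch, not the library's internal regex:

```python
import re

# Runs of letters/digits, allowing internal apostrophes and hyphens so
# "It's" and "well-known" survive as single tokens.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*")

def word_tokens(text: str) -> list[str]:
    """Return word tokens, dropping punctuation that is not word-internal."""
    return WORD_RE.findall(text)
```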

Full Metrics

Get total word counts, sentence lengths, and character metrics in one call.

Verified request

# Install the package
pip install prose-tokenizer

# Usage in your project
from prose_tokenizer import tokenize

content = """
### Q1 Review
The U.S.A. economy grew by 2.5% in Q1. 

*   Growth was driven by tech.
"""

doc = tokenize(content)

print(doc.counts.word_count)  # 15
print(doc.blocks[0].kind)     # "heading"
print(doc.sentences[0])       # "The U.S.A. economy grew by 2.5% in Q1."

Verified response

Structured Document Data

The library returns a typed document object with all your metrics and blocks.

{
  "counts": {
    "word_count": 15,
    "sentence_count": 2,
    "character_count": 82
  },
  "blocks": [
    {"kind": "heading", "text": "Q1 Review"},
    {"kind": "paragraph", "text": "The U.S.A. economy grew by 2.5% in Q1."},
    {"kind": "list_item", "text": "Growth was driven by tech."}
  ]
}
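The {"kind", "text"} shape above can be mimicked with a minimal classifier if you want to prototype without the library. The rules here are deliberately simplified assumptions (for example, Setext headings and nested lists are not handled):

```python
def classify_block(block: str) -> dict:
    """Classify one Markdown block into a {kind, text} dict (simplified rules)."""
    stripped = block.strip()
    if stripped.startswith("#"):
        # ATX heading: strip the leading hashes
        return {"kind": "heading", "text": stripped.lstrip("#").strip()}
    if stripped[:2] in ("* ", "- ", "+ "):
        # Bulleted list item: drop the marker
        return {"kind": "list_item", "text": stripped[1:].strip()}
    if stripped.startswith(">"):
        # Blockquote: drop the quote markers
        return {"kind": "blockquote", "text": stripped.lstrip(">").strip()}
    return {"kind": "paragraph", "text": stripped}
```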

Output interpretation

The library is built for speed and clarity. It gives you the signals you need to build better tools.

  • Markdown Support: Handles both standard and Setext headings, plus all list types.
  • Safe Sentence Splitting: Protects common English abbreviations and initials.
  • Fully Typed: Built with PEP 484 type hints for a better developer experience.
  • Deterministic: The same text always yields the same structured result.
  • Zero Config: Works out of the box with standard English prose rules.

Practical Usage: AI Preprocessing

Use prose-tokenizer to create better chunks for your AI and RAG pipelines.

  1. Load your raw Markdown or prose content into your Python script.
  2. Pass the text into the tokenizer to get a structured document object.
  3. Chunk your data by sentence or paragraph to keep related thoughts together.
  4. Remove Markdown noise while keeping the structural meaning for the AI.
  5. Feed clean, structured data into your vector database or LLM.

RAG Chunking Example
from prose_tokenizer import tokenize

# large_text is your raw Markdown or prose string
doc = tokenize(large_text)

# Chunk by paragraph for better context
for para in doc.paragraphs:
    save_to_vector_db(para)  # save_to_vector_db is your own storage helper
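For finer-grained control, sentences can be packed into size-bounded chunks instead of taking whole paragraphs. This helper is plain Python and independent of the library; the max_chars budget is an assumed parameter, not part of its API:

```python
def chunk_sentences(sentences: list[str], max_chars: int = 500) -> list[str]:
    """Pack consecutive sentences into chunks of at most max_chars,
    never splitting a sentence across chunks. A single sentence longer
    than the budget still becomes its own (oversized) chunk."""
    chunks, current = [], ""
    for sent in sentences:
        candidate = f"{current} {sent}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Because sentence order is preserved and boundaries are never cut mid-sentence, the same input always produces the same chunks, which keeps downstream embeddings reproducible.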

Choosing a Python Tokenizer

Pick the right tool for your specific text analysis or preprocessing task.

Basic Split

Fast but breaks on acronyms and loses all Markdown and structural data.

Heavy NLP

Capable but slow and memory-heavy. Often overkill for basic structural tasks.

prose-tokenizer

Fast, light, and stable. Built for the intersection of Markdown and English prose.

Keep Exploring

Browse the Workflow Library for more guides, comparisons, and integration examples to continue your evaluation.

Clean up your Python text pipelines

Explore the package on GitHub or install it with pip. Build deterministic text tools in Python.