Deterministic Prose Tokenizer for Modern Editorial Workflows
Break English prose and Markdown content into paragraphs, sentences, and words with deterministic rule-based segmentation. Built for writing analysis, AI pipelines, and editorial automation where consistency matters.
Stable, inspectable, and lightweight. Designed for environments where you need to map analysis back to original character offsets.
Overview
Handles headings, list items, and blockquotes as distinct structural containers, preventing syntax collisions and ensuring unrelated lines aren't merged into the same sentence.
Small footprint designed for portability. Runs across Node.js, current browsers, and edge environments with predictable performance.
Uses deterministic rules to handle common edge cases like 'U.S.A.', 'e.g.', decimals, and trailing punctuation without breaking segments.
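For example, a sentence containing an acronym and a decimal should come back as a single sentence rather than several fragments. A quick way to sanity-check this (the counts object used here follows the verified response shown further down):

import { tokenize } from '@veldica/prose-tokenizer';

// 'U.S.A.' and '2.5' contain periods, but neither should end a sentence.
const doc = tokenize('The U.S.A. economy grew by 2.5% in Q1.');
console.log(doc.counts.sentence_count); // expected: 1 — the acronym and decimal do not create boundaries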
Current workflow
Standard string splitting often fails when prose contains structured content or complex punctuation.
- Naive splitting on periods often breaks 'U.S.A.' or 'v1.0' into multiple incorrect segments (a quick demonstration follows this list).
- Missing Markdown context can treat heading hashes (###) as part of the first word token.
- List items may be merged into the preceding paragraph due to inconsistent newline handling.
- Character offsets are often lost, making it difficult to highlight specific segments in an editor UI.
- Heavy NLP models can add unnecessary latency for basic structural segmentation tasks.
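To make the first failure mode above concrete, here is what naive period splitting does to a sentence containing an acronym and a version number (plain string methods, no library involved):

// Naive splitting treats every period as a boundary, shredding the
// acronym and the version number into meaningless fragments.
const text = 'The U.S.A. shipped v1.0 today. Adoption grew by 2.5%.';
const fragments = text.split('.').map(s => s.trim()).filter(Boolean);
console.log(fragments.length); // 7 fragments instead of 2 sentences
// [ 'The U', 'S', 'A', 'shipped v1', '0 today', 'Adoption grew by 2', '5%' ]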
Where it breaks
These issues can complicate downstream analysis and AI retrieval pipelines.
- Inconsistent segments can lead to inaccurate readability or length metrics.
- AI chunking might lose context if sentences are sliced incorrectly mid-thought.
- Mapping analysis results back to the original source string requires manual offset math.
- Linguistic edge cases like abbreviations often require custom regex workarounds.
The Tokenization Pipeline
The Veldica Prose Tokenizer uses a transparent, multi-stage pipeline to ensure consistent results.
- Scan for Markdown-style blocks like headings, lists, and code blocks before processing prose.
- Apply segmentation rules to distinguish between terminal punctuation and internal decimals or abbreviations (a simplified sketch of this kind of rule follows the list).
- Decompose sentences into semantic tokens, identifying words, numeric values, and punctuation symbols.
- Calculate and preserve character-accurate offsets for every element in the resulting syntax tree.
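The second stage can be pictured as a pure boundary predicate. The sketch below is a simplified illustration of that kind of rule, not the package's actual internals; the helper name and abbreviation list are invented for the example.

// Hypothetical sketch: decide whether the period at `index` ends a sentence.
const ABBREVIATIONS = new Set(['e.g.', 'i.e.', 'etc.', 'dr.', 'u.s.a.']);

function isSentenceBoundary(text: string, index: number): boolean {
  if (text[index] !== '.') return false;
  const prev = text[index - 1] ?? '';
  const next = text[index + 1] ?? '';
  // '2.5' — a digit on both sides means a decimal, not a boundary.
  if (/\d/.test(prev) && /\d/.test(next)) return false;
  // 'U.S.A.' — a period followed directly by a letter sits inside an acronym.
  if (/[A-Za-z]/.test(next)) return false;
  // 'e.g. the report' — the preceding token is a known abbreviation.
  const tokenStart = text.lastIndexOf(' ', index) + 1;
  if (ABBREVIATIONS.has(text.slice(tokenStart, index + 1).toLowerCase())) return false;
  // Otherwise a period followed by whitespace or end of text ends the sentence.
  return next === '' || /\s/.test(next);
}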
Verified request
# Install the package
npm install @veldica/prose-tokenizer
// Usage in your project
import { tokenize } from '@veldica/prose-tokenizer';
const content = `
### Q1 Review
The U.S.A. economy grew by 2.5% in Q1.
* Growth was driven by tech.
* Inflation remains stable at 2.1%.
This is a fact.
`;
const doc = tokenize(content);
Verified response
The tokenizer returns a hierarchy of blocks with line metadata and aggregate counts.
{
"blocks": [
{
"text": "### Q1 Review",
"kind": "heading",
"line_start": 1,
"line_end": 1
},
{
"text": "The U.S.A. economy grew by 2.5% in Q1.",
"kind": "paragraph",
"line_start": 3,
"line_end": 3
},
{
"text": "* Growth was driven by tech.",
"kind": "list_item",
"line_start": 5,
"line_end": 5
}
],
"counts": {
"word_count": 21,
"sentence_count": 4,
"paragraph_count": 5,
"heading_count": 1,
"list_item_count": 2
}
}
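For readers working in TypeScript, the shape above can be written down as a small type. This is inferred from the sample response, not a definition shipped by the package; field names follow the JSON exactly.

// Sketch of the response shape, inferred from the sample above.
interface ProseBlock {
  text: string;
  kind: string; // 'paragraph', 'heading', 'list_item', and likely others such as blockquotes
  line_start: number;
  line_end: number;
}

interface TokenizeResult {
  blocks: ProseBlock[];
  counts: {
    word_count: number;
    sentence_count: number;
    paragraph_count: number;
    heading_count: number;
    list_item_count: number;
  };
}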
Output interpretation
The output is designed for easy traversal. It preserves structural hierarchy while exposing granular prose details.
- Structural Hierarchy: Content is grouped into blocks (kind: paragraph, heading, or list_item).
- Markdown Awareness: Headings and list items are identified and separated from standard prose paragraphs.
- Sentence Segmentation: Sentences are identified using deterministic logic, handling common abbreviations and punctuation.
- Lexical Tokens: The package provides splitWords() and isStopword() for full linguistic analysis (see the sketch after this list).
- Line Metadata: Every block includes line_start and line_end indices for mapping back to the source.
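A short sketch of how those lexical helpers could be combined. Both function names come from the package as listed above, but the exact signatures (a word array in, a boolean out) are assumptions here:

import { tokenize, splitWords, isStopword } from '@veldica/prose-tokenizer';

const doc = tokenize('The U.S.A. economy grew by 2.5% in Q1.');

// Assumption: splitWords(text) returns an array of word strings and
// isStopword(word) returns a boolean.
for (const block of doc.blocks) {
  const words = splitWords(block.text);
  const contentWords = words.filter(w => !isStopword(w));
  console.log(`${block.kind}: ${contentWords.length} content words (lines ${block.line_start}-${block.line_end})`);
}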
Practical Usage: Writing Analysis
Use the tokenizer as a stable first step in a content analysis workflow.
- Pass raw Markdown or prose into the tokenizer.
- Traverse the blocks to analyze specific sections (e.g., headings vs body).
- Calculate metrics like average sentence length or word density.
- Identify specific sentence 'hotspots' that are too long or complex.
- Feed clean, segmented chunks into downstream AI or database pipelines (a chunking sketch follows the example below).
import { tokenize, splitWords } from '@veldica/prose-tokenizer';
const doc = tokenize(content);
doc.sentences.forEach(s => {
const words = splitWords(s);
if (words.length > 30) {
console.log(`Long sentence detected: "${s.substring(0, 50)}..."`);
}
});
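For the last step in the list above, the blocks also give natural chunk boundaries for retrieval pipelines. A sketch of grouping consecutive blocks into chunks while carrying their line metadata; the character budget and chunk shape are arbitrary choices for this example, not part of the package:

// Group consecutive blocks into chunks, keeping line metadata so each
// chunk can be traced back to the source document.
const MAX_CHUNK_CHARS = 800; // arbitrary budget for this sketch

type Chunk = { text: string; line_start: number; line_end: number };
const chunks: Chunk[] = [];
let current: Chunk | null = null;

for (const block of doc.blocks) {
  if (current && current.text.length + block.text.length > MAX_CHUNK_CHARS) {
    chunks.push(current);
    current = null;
  }
  if (!current) {
    current = { text: block.text, line_start: block.line_start, line_end: block.line_end };
  } else {
    current.text += '\n' + block.text;
    current.line_end = block.line_end;
  }
}
if (current) chunks.push(current);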
Choosing a Segmentation Method
Standard string splitting and heavy NLP models both have tradeoffs for structural segmentation tasks.
- Naive string splitting: Simple but prone to errors. Splitting on '.' breaks on 'e.g.', 'v2.0', and 'Dr. Smith', and loses all character offset information.
- Heavy NLP models: Capable but may be overkill for structural tasks. Often requires a large runtime and can introduce non-deterministic behavior.
- Veldica Prose Tokenizer: Fast and deterministic. Minimal dependencies and specifically tuned for the intersection of Markdown and English prose.
Keep Exploring
Use the Workflow Library to browse more guides, comparisons, and integration examples to continue your evaluation.
Build better editorial tools
Explore the package on GitHub or install via NPM. We're building the tools we use for our own analysis pipelines.