Open Source Library

Deterministic Prose Tokenizer for Modern Editorial Workflows

Break English prose and Markdown content into paragraphs, sentences, and words with deterministic rule-based segmentation. Built for writing analysis, AI pipelines, and editorial automation where consistency matters.

Stable, inspectable, and lightweight. Designed for environments where you need to map analysis back to original character offsets.

Overview

The tokenizer segments English prose and Markdown into paragraphs, sentences, and words using deterministic, rule-based logic, so the same input always produces the same output. It is built for writing analysis, AI pipelines, and editorial automation where consistency matters.

Markdown Aware
Contextual Logic

Handles headings, list items, and blockquotes as distinct structural containers. This prevents syntax collisions and keeps unrelated lines from being merged into the same sentence.

Lightweight
Minimal Dependencies

Small footprint designed for portability. Runs across Node.js, current browsers, and edge environments with predictable performance.

Rule-Based
Abbreviation Handling

Uses deterministic rules to handle common edge cases like 'U.S.A.', 'e.g.', decimals, and trailing punctuation without breaking segments.

Current workflow

Standard string splitting often fails when prose contains structured content or complex punctuation.

  1. Naive splitting on periods often breaks 'U.S.A.' or 'v1.0' into multiple incorrect segments (see the sketch after this list).
  2. Missing Markdown context can treat heading hashes (###) as part of the first word token.
  3. List items may be merged into the preceding paragraph due to inconsistent newline handling.
  4. Character offsets are often lost, making it difficult to highlight specific segments in an editor UI.
  5. Heavy NLP models can add unnecessary latency for basic structural segmentation tasks.
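
For example, the first two failure modes are easy to reproduce with plain String.split(); this is a minimal standalone sketch using only built-in string methods, not the package:

// Naive period splitting mangles abbreviations and version numbers.
const text = 'The U.S.A. released v1.0 today. It works.';

const naive = text.split('.').map(s => s.trim()).filter(Boolean);
console.log(naive);
// => [ 'The U', 'S', 'A', 'released v1', '0 today', 'It works' ]
// Six fragments instead of two sentences, and every character offset
// into the original string has been discarded.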

Where it breaks

These issues can complicate downstream analysis and AI retrieval pipelines.

  • Inconsistent segments can lead to inaccurate readability or length metrics.
  • AI chunking might lose context if sentences are sliced incorrectly mid-thought.
  • Mapping analysis results back to the original source string requires manual offset math.
  • Linguistic edge cases like abbreviations often require custom regex workarounds.

The Tokenization Pipeline

The Veldica Prose Tokenizer runs a transparent, multi-stage pipeline to ensure consistent results.

Container Discovery

Scan for Markdown-style blocks like headings, lists, and code blocks before processing prose.

Sentence Boundary Detection

Apply segmentation rules to distinguish terminal punctuation from internal decimals and abbreviations (a rule of this kind is sketched after the pipeline stages below).

Lexical Analysis

Decompose sentences into semantic tokens, identifying words, numeric values, and punctuation symbols.

Offset Mapping

Calculate and preserve character-accurate offsets for every element in the resulting syntax tree.
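
The exact rules live inside the package, but a minimal sketch of this kind of deterministic boundary check might look like the following. The abbreviation list and function name here are illustrative assumptions, not the library's API:

// Illustrative only: decide whether the period at index i ends a sentence.
const ABBREVIATIONS = new Set(['u.s.a.', 'e.g.', 'i.e.', 'etc.', 'dr.', 'vs.']);

function isSentenceBoundary(text, i) {
  if (text[i] !== '.') return false;

  // A period followed by a non-space character (decimals like 2.5,
  // versions like v1.0, the internal dots of U.S.A.) never ends a sentence.
  if (i + 1 < text.length && !/\s/.test(text[i + 1])) return false;

  // A known abbreviation keeps its trailing period.
  const tail = text.slice(0, i + 1).split(/\s+/).pop().toLowerCase();
  if (ABBREVIATIONS.has(tail)) return false;

  return true;
}

isSentenceBoundary('Grew by 2.5% in Q1. Done.', 9);  // false (decimal point)
isSentenceBoundary('Grew by 2.5% in Q1. Done.', 18); // true  (end of sentence)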

Verified request

Request
# Install the package
npm install @veldica/prose-tokenizer

# Usage in your project
import { tokenize } from '@veldica/prose-tokenizer';

const content = `
### Q1 Review
The U.S.A. economy grew by 2.5% in Q1. 

*   Growth was driven by tech.
*   Inflation remains stable at 2.1%.

This is a fact.
`;

const doc = tokenize(content);

Verified response

Structured Output

The tokenizer returns a hierarchy of blocks with line metadata and aggregate counts.

{
  "blocks": [
    {
      "text": "### Q1 Review",
      "kind": "heading",
      "line_start": 1,
      "line_end": 1
    },
    {
      "text": "The U.S.A. economy grew by 2.5% in Q1.",
      "kind": "paragraph",
      "line_start": 3,
      "line_end": 3
    },
    {
      "text": "* Growth was driven by tech.",
      "kind": "list_item",
      "line_start": 5,
      "line_end": 5
    }
  ],
  "counts": {
    "word_count": 21,
    "sentence_count": 4,
    "paragraph_count": 5,
    "heading_count": 1,
    "list_item_count": 2
  }
}

Output interpretation

The output is designed for easy traversal. It preserves structural hierarchy while exposing granular prose details.

  • Structural Hierarchy: Content is grouped into blocks (kind: paragraph, heading, or list_item).
  • Markdown Awareness: Headings and list items are identified and separated from standard prose paragraphs.
  • Sentence Segmentation: Sentences are identified using deterministic logic, handling common abbreviations and punctuation.
  • Lexical Tokens: The package provides splitWords() and isStopword() for full linguistic analysis.
  • Line Metadata: Every block includes line_start and line_end indices for mapping back to the source.
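
As a sketch, the line metadata is enough to pull a block's original lines back out of the source string. This assumes line_start and line_end are 1-based and inclusive, as the sample output above suggests:

// Map each block back to its original source lines.
const lines = content.split('\n');

for (const block of doc.blocks) {
  const original = lines.slice(block.line_start - 1, block.line_end).join('\n');
  console.log(`[${block.kind}] lines ${block.line_start}-${block.line_end}:`, original);
}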

Practical Usage: Writing Analysis

Use the tokenizer as a stable first step in a content analysis workflow.

  1. Pass raw Markdown or prose into the tokenizer.
  2. Traverse the blocks to analyze specific sections (e.g., headings vs body).
  3. Calculate metrics like average sentence length or word density (see the sketch after this list).
  4. Identify specific sentence 'hotspots' that are too long or complex.
  5. Feed clean, segmented chunks into downstream AI or database pipelines.
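
Step 3, for instance, needs only the aggregate counts, assuming the counts object shown in the verified response above:

// Average sentence length from the aggregate counts.
const { word_count, sentence_count } = doc.counts;
const avgSentenceLength = word_count / sentence_count;
console.log(`Average sentence length: ${avgSentenceLength.toFixed(1)} words`);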

Editorial Analysis Example
import { tokenize, splitWords } from '@veldica/prose-tokenizer';

// `content` is the Markdown string from the request example above.
const doc = tokenize(content);

doc.sentences.forEach(s => {
  const words = splitWords(s);
  if (words.length > 30) {
    console.log(`Long sentence detected: "${s.substring(0, 50)}..."`);
  }
});
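
As a sketch of step 5, blocks can be packaged into self-contained chunks for a retrieval or database pipeline. The chunk shape below is our own convention for illustration, not part of the package:

// Illustrative: turn blocks into chunks ready for embedding or storage.
const chunks = doc.blocks.map((block, i) => ({
  id: `chunk-${i}`,
  kind: block.kind,                        // heading, paragraph, or list_item
  text: block.text,
  lines: [block.line_start, block.line_end],
}));

// Each chunk carries enough metadata to trace results back to the source.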

Choosing a Segmentation Method

Standard string splitting and heavy NLP models both have tradeoffs for structural segmentation tasks.

String.split()

Simple but prone to errors. Splitting on '.' breaks on 'e.g.', 'v2.0', and 'Dr. Smith', and loses all character offset information.

Heavy NLP (SpaCy/NLTK)

Capable but may be overkill for structural tasks. Often requires a large runtime and can introduce non-deterministic behavior.

Veldica Prose Tokenizer

Fast and deterministic. Minimal dependencies and specifically tuned for the intersection of Markdown and English prose.

Keep Exploring

Use the Workflow Library to browse more guides, comparisons, and integration examples to continue your evaluation.

Build better editorial tools

Explore the package on GitHub or install via NPM. We're building the tools we use for our own analysis pipelines.