How Pindu Works

8 min read  ·  Pindu Resources

Pindu supports extensive reading with a vocabulary-centric approach. The software combines a user-aware text generation engine with a tool-rich reader interface built on top of Anki's spaced repetition system. Together, these components facilitate the efficient growth of vocabulary and the steady improvement of reading comprehension in Chinese as a second language. This article explains how.

How Pindu Models Your Vocabulary

Before Pindu can generate or evaluate a reading passage, it needs to know what you know. So, every session begins by reading your Anki collection and converting it into a model of your vocabulary.

Reading your Anki collection

When you start a Pindu session, the add-on connects to a subset of your Anki collection. For each card in that subset, Pindu extracts the due date and stability value. From this data, Pindu determines two things about each word: its due status and its retrievability.

Pindu buckets the due status in a straightforward way. A word is either Due (Anki would show it to you today in a Classic review session), Scheduled (it's in your collection but not due yet), or Absent (it's not in your collection at all). These categories are useful for display (e.g., they map to the color coding you see in the reader), but they are not formally part of the vocabulary model.

Building the vocabulary model

A card's retrievability is a probability, from 0 to 1, that says how likely you are to recall the word at the moment Pindu reads your collection. Pindu computes retrievability for every word in your collection using the FSRS power forgetting curve, the same model that underlies Anki's most recent scheduler. The inputs are the word's stability and the time elapsed since you last reviewed it, both pulled from the Anki collection data. A word you reviewed yesterday with high stability might have a retrievability of 0.99, near certainty. A word you haven't seen in three months with low stability might sit at 0.4.
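Concretely, the FSRS curve is a power function of elapsed time over stability. The sketch below uses the published FSRS-4.5 constants; whether Pindu uses exactly these parameters is our assumption, and its actual implementation may differ:

```python
def retrievability(stability_days: float, elapsed_days: float) -> float:
    """FSRS-4.5 power forgetting curve. Stability is calibrated so that
    retrievability is exactly 0.9 when elapsed_days == stability_days."""
    DECAY = -0.5
    FACTOR = 19 / 81  # chosen so that (1 + FACTOR) ** DECAY == 0.9
    return (1 + FACTOR * elapsed_days / stability_days) ** DECAY

retrievability(100, 1)  # reviewed yesterday, high stability: ~0.999
retrievability(5, 90)   # three months ago, low stability:    ~0.44
```

Because this is a power law rather than an exponential, retrievability falls off quickly at first and then very slowly, which is why high-stability words keep most of their retrievability over long gaps.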

The collection of words and their retrievabilities forms the basis of Pindu's vocabulary model for that session. Beyond the words it can see in your Anki collection, Pindu also imputes your familiarity with certain additional words; see "Trivial compounds and edge cases" below for the details.

Using the vocabulary model

Pindu uses the continuous probabilities represented by retrievability throughout the app. They're used to evaluate how readable a passage is for you, to decide which words need to appear in generated text, and to drive the recall-based color modes in the reader.

Pindu also needs categorical labels for display and decision-making. Each word in your collection is classified as Confident (retrievability at or above a threshold) or Hesitant (below the threshold). Words not in your collection at all are Unseen. The retrievability threshold is configurable from the main Preparation screen in Pindu, but defaults to 95%. These three categories appear in the reader's "Recall Band" color mode, where Confident, Hesitant, and Unseen words are visually distinct, giving you an at-a-glance sense of how much of the passage you're expected to know.
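As a sketch, the classification is a single threshold comparison (the function and argument names here are illustrative, not Pindu's internals):

```python
def recall_band(retrievability, threshold=0.95):
    """Map a word's retrievability to Pindu's display category.
    None means the word is not in the Anki collection at all."""
    if retrievability is None:
        return "Unseen"
    # The threshold defaults to 95% but is configurable in Pindu.
    return "Confident" if retrievability >= threshold else "Hesitant"
```

A word sitting exactly at the threshold counts as Confident, matching the "at or above" rule described above.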

How Pindu Creates and Modifies Reading Passages

With the vocabulary model built, Pindu can generate and evaluate reading material. The generation engine is designed to calibrate text difficulty: passages should be comprehensible while still containing a controlled density of words that challenge you. The engine is general enough to offer several ways to get there, depending on whether you want to use AI-generated text, your own Anki card content, or text you've found yourself.

Readability as mean segment retrievability

How does Pindu decide whether a passage is comprehensible? It defines its own "readability" metric as the average retrievability across all the words in the passage. If a passage contains 100 words and the average retrievability across them is 0.88, then the readability score is 0.88. The extensive-reading literature usually suggests a comprehensibility of at least 95%, i.e., roughly 19 familiar words for every unfamiliar one. Pindu's calculation is a bit more nuanced, since word retrievabilities are not binary.

It's worth emphasizing that this metric is personal. The same passage will have different readability scores for different readers, because they depend on individual vocabulary models. A passage about economics might score 0.92 for a business-focused learner and 0.65 for someone who's been studying literature. This is a big reason that Pindu's approach is different from static graded readers that assign a single difficulty level to a text.

Four ways to source text

Pindu offers four pre-configured workflows that differ based on how the text is initially sourced:

  Workflow                  Text source
  ------------------------  --------------------------------------------------
  Study Cards (Existing)    Example sentences already in your Anki cards
  Study Cards (AI)          An LLM-generated sentence per card
  Study Collection          An LLM-generated passage on a topic of your choice
  Read Any Text             Text you paste in or import from a URL

Study Cards keeps each review focused on a single Anki card. Study Collection generates a long-form continuous passage from scratch on a topic you choose, using your Anki collection to steer the vocabulary. Read Any Text is for when you've already found something you want to read and just want to read it in Pindu to get the benefits of SRS integration and/or releveling.

Releveling and rewording

Text generation is a multi-stage pipeline, and the initial sourcing is only the first step. After the workflow has been seeded with input text from any of the above methods, that text can be optionally modified with an LLM to improve personalized readability and incorporate specific vocabulary.

Releveling rewrites the passage to meet a target readability score. Pindu evaluates the generated text against your vocabulary model, identifies words that are above your level, and iteratively asks the LLM to modify the passage, replacing unfamiliar words with ones you know while maintaining semantic similarity to the original text. This stage runs in a loop: generate a revision, measure readability, check for improvement, and repeat until the target is met. Rewording is a simpler process that adjusts the passage to include specific words from your Anki deck.
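The releveling loop can be sketched as follows. Here `rewrite` stands in for the LLM call, the 0.5 hard-word cutoff is our assumption, and all the names are hypothetical rather than Pindu's actual code:

```python
def relevel(words, vocab, target, rewrite, max_rounds=5):
    """Iteratively revise a passage until its readability meets a target.

    words:   the passage as a list of word segments
    vocab:   dict mapping word -> retrievability (missing = unseen)
    rewrite: callable(words, hard_words) -> revised word list,
             standing in for the LLM revision step
    """
    def score(ws):
        return sum(vocab.get(w, 0.0) for w in ws) / len(ws)

    best, best_score = words, score(words)
    for _ in range(max_rounds):
        if best_score >= target:
            break
        # Words below an (assumed) 0.5 cutoff are flagged as "above
        # your level" and passed to the LLM to be replaced.
        hard = [w for w in best if vocab.get(w, 0.0) < 0.5]
        candidate = rewrite(best, hard)
        if score(candidate) > best_score:  # keep only improvements
            best, best_score = candidate, score(candidate)
    return best, best_score
```

The "keep only improvements" check is what makes the loop safe to run repeatedly: a bad LLM revision is simply discarded and the previous best passage is retried.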

After the pipeline completes, Pindu can filter and rank candidate passages by due-word density or other quality metrics, selecting the best results for your session.

How Pindu Analyzes Chinese Text

Everything described above — vocabulary modeling, readability scoring, due-word matching — depends on a fundamental capability to split Chinese text into individual words, or "segments", for analysis. This is a harder problem than it may sound.

Segmentation and why it's hard in Chinese

Chinese text has no spaces. The string 我喜欢在图书馆看书 must be parsed into 我 / 喜欢 / 在 / 图书馆 / 看书 before any word-level analysis is possible. This is called segmentation, and it's non-trivial because the same sequence of characters can often be split multiple valid ways. Context, grammar, and convention all play a role.

Pindu uses a three-pass segmentation pipeline. First, phrase detection identifies idioms and set phrases (e.g., chengyu) and locks them as anchors. These multi-character units have meanings that can't be derived from their components, so they must be kept intact. Second, word segmentation processes the remaining text using jieba, a widely used open-source Chinese word segmenter, respecting the phrase anchors from the first pass. Third, character decomposition breaks each word into its component characters. The result is a three-level hierarchy; each level has its own use in the app.
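To make the first two passes concrete, here is a toy segmenter: locked phrases and lexicon entries are matched greedily, longest first. This is a simplification; Pindu's second pass uses jieba rather than greedy matching, and the third pass (character decomposition) is just splitting each segment into its characters:

```python
def segment(text, phrases, lexicon):
    """Toy stand-in for Pindu's segmentation pipeline.

    phrases: set of idioms/set phrases to lock as anchors (pass 1)
    lexicon: set of known words for ordinary segmentation (pass 2)
    Unknown characters fall out as single-character segments.
    """
    # Longest match first, so 图书馆 wins over any shorter prefix.
    units = sorted(phrases | lexicon, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for u in units:
            if text.startswith(u, i):
                out.append(u)
                i += len(u)
                break
        else:  # no unit matched: emit the character alone
            out.append(text[i])
            i += 1
    return out

segment("我喜欢在图书馆看书", set(), {"喜欢", "图书馆", "看书"})
# → ['我', '喜欢', '在', '图书馆', '看书']
```

Pass three is then `[list(w) for w in segments]`, giving the word and character levels of the hierarchy.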

What happens if segmentation fails? The biggest problem is that passages can end up with artificially low readability scores, since the segments being analyzed don't line up with the words in your vocabulary model. This is an ongoing area of experimentation for Pindu. If segmentation is wrong, you can manually select a sequence of characters in the reader and define it as a segment.

"Trivial compounds" and other edge cases in the vocabulary model

Chinese frequently forms compound words by adding common suffixes or particles to a base word. 人 (person) becomes 人们 (people); 学习 (to study) might appear as 学习者 (student/learner). These are what Pindu calls trivial compounds: words whose meaning is transparently derivable from a stem you already know plus a common morpheme. Pindu detects trivial compounds during analysis. If 人 is in your Anki deck but 人们 is not, Pindu recognizes that your knowledge of 人 likely extends to 人们 — you don't need a separate card for it. This detection feeds into both the readability calculation (trivial compounds inherit the stem's retrievability rather than defaulting to zero) and the reader display.

Pindu also assumes familiarity with the most basic Chinese vocabulary. Pindu maintains a small curated list (a few dozen words) of hyper-common segments — numbers, basic pronouns, and similar — that appear frequently in Chinese text but aren't included in any HSK vocabulary list. These were identified through segmentation and frequency analysis of Chinese web content and are treated as on par with HSK 1 or below. There's little reason to penalize a passage's readability score because 一个, 十八, or 不是 aren't in your collection.

These categories of implied vocabulary give rise to two additional retrievability categories in the vocabulary model beyond Confident, Hesitant, and Unseen. A word classified as Likely Confident is not in your Anki deck, but Pindu infers that you probably know it, either because it's a trivial compound of a Confident word, or because it belongs to the most basic vocabulary tier. Likely Hesitant means Pindu infers you've seen the stem but may struggle with it, because the underlying word is itself Hesitant. These inferred categories let Pindu give you a more accurate picture of what you know than strict deck membership would allow. In the reader, they're visually distinguished from tracked words so you can tell the difference between "I have a card for this" and "Pindu thinks I know this."
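A sketch of how these inferred categories might be derived. The suffix list and basic-tier set below are tiny illustrative samples taken from the examples in this article, not Pindu's real data, and the function name is ours:

```python
TRIVIAL_SUFFIXES = ("们", "者")          # e.g. 人 -> 人们, 学习 -> 学习者
BASIC_TIER = {"一个", "十八", "不是"}    # hyper-common, below-HSK-1 segments

def infer_band(word, deck, threshold=0.95):
    """Classify a word, falling back to inferred categories.
    deck maps tracked words to their retrievability."""
    if word in deck:  # tracked directly in Anki
        return "Confident" if deck[word] >= threshold else "Hesitant"
    if word in BASIC_TIER:  # assumed known regardless of the deck
        return "Likely Confident"
    for suffix in TRIVIAL_SUFFIXES:
        stem = word.removesuffix(suffix)
        if stem != word and stem in deck:
            # Trivial compound: inherit the stem's retrievability.
            return ("Likely Confident" if deck[stem] >= threshold
                    else "Likely Hesitant")
    return "Unseen"
```

The key design point is inheritance: 人们 never defaults to zero retrievability just because it lacks its own card; it borrows 人's value instead.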

How the Pindu Reader Works

The reader is where you spend most of your Pindu time. It displays the processed text in a clean, reading-optimized layout and provides scaffolding and tools to help you work through unfamiliar vocabulary and grammar without leaving the reading flow.

Color-coded status at a glance

In the default "Due Status" color mode, words are color-coded by their Anki scheduling status:

  • Blue: Due for review today (words Anki would show you in Classic mode)
  • Dark: Scheduled for review on a future date
  • Red: Not in your Anki collection

Two alternative color modes are available. Recall Band colors words by retrievability category (Confident, Hesitant, Unseen), giving a view based on predicted memory strength rather than queue position. Recall Gradient applies a continuous color scale from red (low retrievability) to green (high), letting you see the full spectrum of how well you know each word in the passage.

Click-to-mark: Hard, Again, or move on

When you encounter a word that gives you trouble, you click it:

  • No click: "Good." You read it without difficulty. Most words should be "good" and require no interaction if you are reading extensively with a high-readability passage.
  • One click: "Hard." You got it, but it was a struggle.
  • Two clicks: "Again." You missed it or couldn't recall it.

These map directly to Anki's native review ratings. "Good" advances the card on its normal schedule. "Hard" shortens the next interval. "Again" sends the card back to relearning. If the same word appears multiple times in a passage, clicking one instance marks all occurrences at the same rating.

The interface is designed to stay out of the way. You read. You click when something is hard. You keep reading. For most words, the overhead is zero.

On-demand tools: dictionary, TTS, translation, chat

A side panel provides comprehension scaffolding:

  • Dictionary lookup: English definitions, pinyin romanization, and a dictionary entry for the selected word. These are synced with the Anki note for that word if possible, otherwise they're filled with an LLM via API.
  • Text-to-speech: pronunciation of individual clicked words or the full displayed passage. This is synced with the Anki note if possible, otherwise it's generated by AI TTS models via API.
  • Translation: English translation of the full passage text, done via an LLM API.
  • Grammar chat: a chat assistant pre-loaded with the passage context, for asking questions about specific vocabulary or grammatical structures. This is powered by an LLM API.
  • Note creation and editing: add an unfamiliar word to your Anki collection on the spot, or edit the card for an existing word.

How Pindu Gives SRS Credit for Reading

The final step in a Pindu session is committing reviews to Anki. This is where reading becomes reviewing and powers the next session.

How marks map to Anki reviews

When you finish a passage and commit, Pindu submits the passage's words to Anki just as if you had reviewed each of them in a Classic review session. Anki's scheduler updates each card's due date and interval accordingly. Words you've reviewed in Pindu won't appear in today's Classic queue. Your remaining review burden decreases in proportion to how much you've read.

The mapping is direct: every due word you encountered in the passage gets a review. Words you didn't click get "Good" (full credit, normal interval progression). Words you clicked once get "Hard" (reduced interval). Words you double-clicked get "Again" (card returns to relearning). The same ratings as Classic review — just issued while reading connected prose instead of flipping isolated cards.
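The click-to-rating mapping can be sketched with Anki's internal ease values (1 = Again, 2 = Hard, 3 = Good, 4 = Easy). The function below is illustrative only; the real add-on submits reviews through Anki's scheduler:

```python
# No click -> Good, one click -> Hard, two clicks -> Again.
CLICKS_TO_EASE = {0: 3, 1: 2, 2: 1}

def reviews_to_commit(due_words, clicks):
    """Build the ease rating for every due word in the passage.
    clicks maps word -> click count; unclicked words default to 0,
    and anything beyond two clicks is capped at Again."""
    return {w: CLICKS_TO_EASE[min(clicks.get(w, 0), 2)] for w in due_words}
```

Note that the default is Good, which is what makes extensive reading cheap: the common case requires no interaction at all, yet still produces a full review.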

Why there's no "Easy" button

Anki's Classic review offers four rating buttons: Again, Hard, Good, and Easy. Pindu intentionally omits Easy. In Classic review, "Easy" aggressively extends a card's interval. That is appropriate when you've genuinely overlearned a word in isolation and want to push it further out, but recognizing a word fluently in a reading passage doesn't necessarily mean the knowledge has generalized. You understood it in this context, with surrounding words providing clues. Omitting "Easy" keeps intervals from inflating on the basis of contextual fluency that may not transfer to other settings. "Good" — the default for every word you read without difficulty — provides full SRS credit with normal interval progression.

Try Pindu

It's free! Build reading comprehension and vocabulary through extensive reading and spaced repetition.

Get the Anki Add-On
