How Pindu Works
How Pindu models your vocabulary, prepares calibrated Chinese text, and supports both intensive and extensive reading.
This article explains the nuts and bolts of Pindu. For more information on the underlying pedagogical approach, see the companion research piece.
A Pindu session moves through four stages (Figure 1): it builds a vocabulary model from your Anki collection, prepares calibrated reading passages, supports you while you read, and commits the session back to Anki as SRS reviews. Those reviews update the vocabulary model, and the cycle repeats the next time you open the app.
Stage 1: Model your vocabulary
Before Pindu can prepare a reading passage, it needs to know what you know. So, every session begins by converting your SRS (Anki) collection into a personalized model of your vocabulary, represented in Figure 2.
Converting your Anki collection into a vocabulary model
When you start a Pindu session, the add-on connects to your SRS (Anki) collection. Modern SRS software encodes card-specific scheduling data that can be used to calculate "retrievability": the likelihood, from 0-100%, that you'll successfully recall that card's content. Pindu uses retrievability as the basis for its vocabulary model. Depending on the particular SRS software, Pindu either extracts retrievability values directly or calculates them from other values like "stability" and "time until next review". (Pindu uses the FSRS power forgetting curve, the same model that underlies Anki's most recent scheduler.) Pindu categorizes each word in your collection for display and decision-making using a retrievability threshold: words above the threshold are Confident, while those below it are Hesitant. (The threshold is configurable but defaults to 95%.)
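As a concrete (if simplified) illustration, here is the FSRS-style power forgetting curve in Python. The 19/81 factor and -0.5 decay follow the FSRS-4.5 parameterization, so retrievability is 90% when the elapsed time equals the stability; the exact constants vary by FSRS version, and Pindu's internals may differ.

```python
# Sketch of retrievability under the FSRS power forgetting curve.
# Constants follow FSRS-4.5; exact values vary by FSRS version.
FACTOR = 19 / 81
DECAY = -0.5

def retrievability(days_elapsed: float, stability: float) -> float:
    """Probability of recalling a card `days_elapsed` days after its last review."""
    return (1 + FACTOR * days_elapsed / stability) ** DECAY

def categorize(r: float, threshold: float = 0.95) -> str:
    """Bucket a word for display: above the threshold is Confident."""
    return "Confident" if r > threshold else "Hesitant"
```

Note that `retrievability(10, 10)` returns 0.9: by construction, recall probability has decayed to 90% exactly when the elapsed time reaches the card's stability.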
Inferred familiarity in the model
Beyond the words tracked in your SRS (Anki) collection, Pindu also infers your knowledge of other words. These inferred words currently fall into three categories: trivial compounds, basic vocabulary, and numerals.
- Trivial compounds. Chinese frequently forms words by adding common suffixes or particles to a base word. For example, 人 (person) becomes 人们 (people); 学习 (to study) becomes 学习者 (student/learner). These are what Pindu calls trivial compounds, words whose meaning is transparently derivable from a stem plus a very common morpheme. Pindu propagates the retrievability of the stem to the compound.
- Basic vocabulary. Pindu maintains a small list (a few dozen words) of hyper-common segments — basic pronouns, common particles, and similar — that are so frequent in Chinese that they don't even appear in an HSK vocabulary list. (These words are identified through segmentation and frequency analysis of Chinese web content.) Pindu treats these words as already known to the user.
- Numerals. Similar to basic vocabulary, Arabic numerals (e.g., 3, 2024) are also recognized as universally known.
The inferred vocabulary gives rise to two additional categories in the vocabulary model. Likely Confident words may be trivial compounds of Confident words, may belong to the basic vocabulary tier, or may be numerals; Likely Hesitant words are trivial compounds of Hesitant words. These inferred categories let Pindu construct a more nuanced picture of what you know than strict deck membership would allow.
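A minimal sketch of these inference rules follows. The suffix list, basic-vocabulary set, and threshold below are illustrative assumptions, not Pindu's actual data:

```python
# Illustrative sketch of Pindu's inferred-familiarity rules.
BASIC_VOCAB = {"我", "你", "的", "了"}  # hyper-common words, treated as known
COMMON_SUFFIXES = ("们", "者")          # suffixes that form trivial compounds

def infer_category(word, srs_retrievability, threshold=0.95):
    """Return an inferred category for a word absent from the SRS model."""
    if word in BASIC_VOCAB or word.isdigit():
        return "Likely Confident"
    for suffix in COMMON_SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            r = srs_retrievability.get(stem)
            if r is not None:  # propagate the stem's retrievability
                return "Likely Confident" if r > threshold else "Likely Hesitant"
    return None  # no inference possible
```

For example, 人们 inherits the category of its stem 人: a well-known 人 makes 人们 Likely Confident, a shaky 人 makes it Likely Hesitant.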
Using the vocabulary model
Pindu uses the continuous probabilities represented by retrievability throughout the app: to evaluate how readable a passage is for you, to decide which words need to appear in generated text, and to drive display modes in the reader.
Stage 2: Prepare reading passages
Using the built vocabulary model, Pindu manages personalized reading material inside its text engine. The engine has two objectives: first to assess, and then to modify, the comprehensibility of the source text. The initial text can be anything: existing card content, novel LLM generation inspired by a user prompt, or imported text from an external source like a website or document.
Assessing readability
Pindu defines its own "readability" metric as a quantitative measure of comprehensibility, calculated as the average retrievability (defined via SRS; see previous section) across all the words in the passage. If a passage contains 100 words and the average retrievability across them is 0.85, then the readability score is 0.85, as illustrated in Figure 3. (The research that motivates this kind of metric is reviewed in a companion article.) This metric is completely personal: the same passage will have different readability scores for different readers, because the scores depend on the individual vocabulary models. A passage about economics might score 0.92 for a business-focused learner and 0.65 for someone who's been studying literature. This differentiates Pindu's approach from that of traditional, static graded readers.
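In code, the metric is just a mean over per-word retrievability values. This sketch scores words missing from the model as 0.0, which is an assumption for illustration rather than Pindu's documented behavior:

```python
def readability(words, model, unknown=0.0):
    """Average retrievability across a passage's words.

    `model` maps word -> retrievability; scoring absent words as `unknown`
    is an assumption, not Pindu's actual rule."""
    if not words:
        return 0.0
    return sum(model.get(w, unknown) for w in words) / len(words)
```

A passage whose words average 0.85 retrievability scores 0.85, exactly as in the example above; swap in a different reader's model and the same passage scores differently.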
Modifying readability
Text preparation is a multi-stage, iterative pipeline. After the workflow has been seeded with input text and assessed for readability, that text can optionally be releveled toward a target readability or reworded to incorporate targeted vocabulary.
- Releveling rewrites the passage to meet a target readability score. Pindu identifies words that are dragging down readability, and runs an LLM inside an agentic harness to modify the passage via word substitution and grammar simplification while maintaining semantic similarity to the original text (see Figure 4).
- Rewording is a simpler process that minimally modifies a passage to include specific words that you are studying.
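The releveling loop can be sketched as assess, substitute, reassess, repeating until the target is met. In this toy version a synonym table stands in for the LLM rewriting step; every name and value is illustrative, not Pindu's actual pipeline:

```python
# Toy sketch of the releveling loop described above.
def avg_retrievability(words, model):
    return sum(model.get(w, 0.0) for w in words) / len(words)

def relevel(words, model, target=0.9, max_rounds=5):
    SIMPLER = {"图书馆": "学校"}  # toy stand-in for LLM word substitution
    for _ in range(max_rounds):
        if avg_retrievability(words, model) >= target:
            break  # passage is already readable enough
        # replace low-retrievability words that have a simpler variant
        words = [SIMPLER.get(w, w) if model.get(w, 0.0) < target else w
                 for w in words]
    return words
```

The real system replaces the table lookup with an LLM constrained to preserve the passage's meaning, and re-scores the rewrite on each round.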
The challenges of segmentation
The evaluation and modification described above depend on a fundamental capability: splitting text into individual words, or "segments", for analysis. This is a harder problem than it may sound, especially in Chinese, which has no spaces between words. The string 我喜欢在图书馆看书 must be parsed into 我 / 喜欢 / 在 / 图书馆 / 看书 before any word-level analysis is possible. Segmentation is non-trivial because the same sequence of characters can often be split in multiple valid ways; context, grammar, and convention all play a role in determining the correct split. When segmentation fails, passages can end up with artificially low readability scores, because the malformed segments don't match entries in the user's vocabulary model.
Pindu uses a two-pass pipeline for most segmentation tasks. First, phrase detection identifies idioms and set phrases (e.g., chengyu) and locks them as anchors; these multi-character units have meanings that can't be derived from their components. Second, word segmentation processes the remaining text using jieba, a widely used open-source Chinese word segmenter, respecting the phrase anchors from the first pass.
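The two passes can be sketched in a self-contained way. Here a greedy longest-match dictionary segmenter stands in for jieba, and the phrase and word lists are illustrative:

```python
# Two-pass segmentation sketch: lock phrases first, then segment the rest.
PHRASES = ["一心一意"]  # chengyu / set phrases (pass 1)
WORDS = {"我", "喜欢", "在", "图书馆", "看书", "学习"}  # dictionary (pass 2)

def greedy_segment(text):
    """Longest-match dictionary segmentation (a simple stand-in for jieba)."""
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            # take the longest known word, or a single character as fallback
            if text[i:j] in WORDS or j == i + 1:
                out.append(text[i:j])
                i = j
                break
    return out

def segment(text):
    """Pass 1: split around phrase anchors; pass 2: segment the remainder."""
    for phrase in PHRASES:
        if phrase in text:
            before, _, after = text.partition(phrase)
            return segment(before) + [phrase] + segment(after)
    return greedy_segment(text) if text else []
```

Because anchors are extracted before the dictionary pass runs, an idiom like 一心一意 survives as one unit instead of being shredded into 一 / 心 / 一 / 意.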
Stage 3: Support your reading
The Pindu reader is where you spend most of your time in a session. It displays the processed text in a clean, reading-optimized layout and provides glosses, scaffolding, and tools to help you work through unfamiliar vocabulary and grammar without leaving the reading flow. Beyond providing reading support, the reader is also where you mark challenging words, the step that closes the loop with the SRS back-end.
Glossing and formatting for in-flow aid
The Pindu reader offers various options to make a particular text passage easier to read. In terms of glossing, users can enable "hover-to-translate" and "hover-for-pinyin" tooltip features. In terms of formatting, users can insert subtle spacing between segments to make word boundaries easier to parse. Words can also be color-coded by their Anki scheduling status or Pindu vocabulary state. With these options, users can read text with variable amounts of help.
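As an illustration, the formatting options above boil down to a state-to-color mapping plus a segment-spacing rule. The color choices and the thin-space separator here are assumptions, not Pindu's actual styling:

```python
# Hypothetical display helpers: color-code words by vocabulary state and
# insert subtle spacing between segments.
STATE_COLOR = {
    "Confident": "default",  # no highlight needed
    "Likely Confident": "gray",
    "Hesitant": "orange",
    "Likely Hesitant": "yellow",
}

def color_for(state: str) -> str:
    """Words with no known state get the most prominent color."""
    return STATE_COLOR.get(state, "red")

def render(segments: list[str]) -> str:
    """Join segments with a thin space (U+2009) to hint at word boundaries."""
    return "\u2009".join(segments)
```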
On-demand tools for convenient deep-diving
There are some cases where the user may want to break the reading flow and deep-dive on a word, phrase, or passage: review definitions, understand etymology, hear pronunciation, get cultural notes, etc. Pindu offers several built-in tools, accessible in the sidebar or via a right-click context menu.
- Dictionary lookup: English definitions, pinyin romanization, and a dictionary entry for the selected word. These are synced with the Anki note for that word if possible; otherwise, they're fetched via API.
- Text-to-speech: Pronunciation of individual words (or the full passage). This is synced with the Anki note if possible; otherwise, it's generated via API.
- Translation: English translation of the full passage text, via API.
- Grammar and word chat: An LLM chat assistant pre-loaded with the passage context, for asking questions about specific vocabulary or grammatical structures.
- Note creation and editing: Add an unfamiliar word to your Anki collection on the spot, or edit the card for an existing word.
Click-to-mark for SRS credit
While reading a passage in Pindu, your task is to mark challenging words as you go by clicking on them. Click a word once to mark it "Hard" (you had difficulty recalling the meaning or pronunciation); click twice to mark it "Again" (you couldn't recall it). Words you don't click are implicitly marked "Good". See the illustration in Figure 5. If the same word appears multiple times in a passage, clicking one instance marks all occurrences at the same rating.
The interface is designed to stay out of the way. You read. You click when something is hard. You keep reading. For most words, the overhead is zero.
Stage 4: Commit reviews to Anki
The final step in a Pindu session is committing reviews to Anki. This is where reading becomes reviewing and powers the next session. When you finish a passage and commit, Pindu submits the passage's words to Anki just as if you had reviewed each of them in a traditional review session. Anki's scheduler updates each card's due date and interval accordingly. Words you've reviewed in Pindu won't appear in today's traditional review queue. Your remaining review burden decreases in proportion to how much you've read.
The mapping is direct: every due word you encountered in the passage gets a review. Words you didn't click get "Good" (full credit, normal interval progression). Words you clicked once get "Hard" (reduced interval). Words you double-clicked get "Again" (card returns to relearning). The same ratings as a traditional Anki review, just issued while reading connected prose instead of flipping isolated cards.
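The click-to-rating mapping can be sketched directly. Numeric ease values follow Anki's convention (1 = Again, 2 = Hard, 3 = Good); filtering to due cards is omitted here for brevity:

```python
# No click -> Good(3), one click -> Hard(2), two clicks -> Again(1).
RATING = {0: 3, 1: 2, 2: 1}

def session_ratings(passage_words, clicks):
    """Map each word in the passage to the Anki ease it will be reviewed with.

    `clicks` counts clicks per word; one click covers all occurrences of that
    word, and extra clicks are capped at "Again"."""
    return {w: RATING[min(clicks.get(w, 0), 2)] for w in set(passage_words)}
```

So a passage where you clicked 书 once and left 人 alone commits 书 as Hard and 人 as Good, exactly as a traditional review session would.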