this document describes a tree-building procedure for organizing simplified chinese characters into semantically meaningful, size-balanced clusters suitable for incremental learning.
the method combines fixed parent-child relationships with dynamic structural adjustments to preserve subgroup integrity while managing group size. techniques include priority-based root selection, demotion of overloaded components, fan-out of large subtrees, and optional merging of small groups.
chunked character learning benefits from groupings that are:
input data consists of (parent, character, pronunciation) triples. the ingestion phase constructs:
children
: direct parent → child proposals
pinyin
: character → pronunciation
cand
: character → candidate parent list
pot
: parent → all characters listing it as a candidate
a group is only claimed if:
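as an illustration of the ingestion phase, the four maps can be built in a single pass over the triples. this is a minimal sketch, not the original implementation: the function name `ingest` is hypothetical, and it assumes that the first parent seen for a character is its direct proposal (recorded in `children`) while every listed parent counts as a candidate (recorded in `cand` and `pot`). the source does not say how the two are distinguished.

```python
from collections import defaultdict


def ingest(triples):
    """Build children, pinyin, cand and pot from (parent, character, pronunciation) triples."""
    children = defaultdict(list)  # parent -> directly proposed children
    pinyin = {}                   # character -> pronunciation
    cand = defaultdict(list)      # character -> candidate parents, in input order
    pot = defaultdict(list)       # parent -> characters listing it as a candidate

    for parent, char, pron in triples:
        pinyin.setdefault(char, pron)  # assumption: keep the first pronunciation seen
        if not parent or parent == char:
            continue  # assumption: a missing or self parent means "no proposal"
        if parent in cand[char]:
            continue  # ignore repeated triples
        if not cand[char]:
            children[parent].append(char)  # assumption: first parent is the direct proposal
        cand[char].append(parent)
        pot[parent].append(char)

    return dict(children), pinyin, dict(cand), dict(pot)


# illustrative triples only; not taken from the original corpus
sample = [
    ("青", "清", "qīng"),
    ("氵", "清", "qīng"),
    ("青", "请", "qǐng"),
    (None, "青", "qīng"),
]
children, pinyin, cand, pot = ingest(sample)
```

with this sample, 清 is proposed directly under 青 but also appears in the candidate pool of 氵, which is the situation the later claiming and assignment steps have to resolve.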
remaining characters are assigned to the best available parent based on:
a group exceeding max_size is scanned for direct children that form large enough subtrees
after normalization, the system outputs:
children_map
: parent → list of assigned children
pinyin_map
: character → pronunciation
roots
: characters not assigned as children (top-level groups)
size_map
: subtree size for each node
run on a corpus of 8141 simplified characters, the system produced:
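to make the output structures listed above concrete, the sketch below derives roots and size_map from a children_map. it is an illustration under stated assumptions rather than the original code: the helper name `summarize` is hypothetical, a subtree's size is assumed to include the node itself, and the assignment is assumed to be acyclic.

```python
def summarize(children_map):
    """Return (roots, size_map) for a parent -> assigned-children mapping."""
    assigned = {c for kids in children_map.values() for c in kids}
    nodes = set(children_map) | assigned
    # roots are the nodes that never appear as somebody's child
    roots = sorted(n for n in nodes if n not in assigned)

    size_map = {}

    def size(node):
        # assumption: subtree size counts the node itself plus all descendants
        if node not in size_map:
            size_map[node] = 1 + sum(size(c) for c in children_map.get(node, []))
        return size_map[node]

    for root in roots:
        size(root)
    return roots, size_map


# illustrative assignment only; not taken from the original results
children_map = {"古": ["故", "固", "胡"], "胡": ["湖", "糊"]}
roots, size_map = summarize(children_map)
```

here 古 is the only root, and its subtree size is the quantity that would be compared against max_size.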
example character selections for promotion as root and demotion.
朵 殳 圣 吴 召 奈 青 齐 步 𢀖 咅 否 音 至 亲 吉 㕛 台 另 古 去 妾 辛 尗 责 育 幸 舌 君 支 亘 旦 瓜
these characters were prioritized during parent selection. when eligible, they were allowed to form their own root groups and claim all their candidate children. the selection includes:
氵 忄 讠 饣 扌 刂 阝 扌 犭 纟 钅 忄 彳 衤 灬 罒 亻 冫 月 牜 礻 𧾷 口 土 人 木
these components were penalized or excluded from root status unless explicitly prioritized. demotion was applied to:
demoted forms were still permitted as internal components and as part of subtrees, but they were disfavored in root selection to preserve semantic coherence and manage group size.
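one way to picture how promotion and demotion interact is a three-tier ordering of root candidates. the sketch below is hypothetical: the names `PRIORITY`, `DEMOTED`, `root_rank` and `order_root_candidates` are not from the original system, the two sets are small subsets of the lists above, and breaking ties by candidate-pool size is an assumption.

```python
# small subsets of the promoted and demoted lists, for illustration
PRIORITY = {"青", "古", "台"}
DEMOTED = {"氵", "口", "木"}


def root_rank(char, pot):
    """Sort key: prioritized characters first, demoted components last."""
    if char in PRIORITY:
        tier = 0  # explicit priority wins even if the character is also demoted
    elif char in DEMOTED:
        tier = 2  # disfavored as a root, still usable inside subtrees
    else:
        tier = 1
    # assumption: within a tier, prefer candidates with more potential children
    return (tier, -len(pot.get(char, ())), char)


def order_root_candidates(pot):
    """Return the parents in pot in the order they would be tried as roots."""
    return sorted(pot, key=lambda c: root_rank(c, pot))


# illustrative candidate pools only
pot = {
    "氵": ["清", "湖", "沽"],
    "青": ["清", "情"],
    "古": ["沽", "故", "固"],
    "胡": ["湖"],
}
order = order_root_candidates(pot)
```

under this ordering 古 and 青 get the first chance to claim their candidates, 胡 follows, and 氵 is considered last, so a character such as 清 ends up grouped with 青 rather than in a large water-radical group.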