2025-05-04

hierarchical clustering of chinese characters for chunked learning

this document describes a tree-building procedure for organizing simplified chinese characters into semantically meaningful, size-balanced clusters suitable for incremental learning.

the method combines fixed parent-child relationships with dynamic structural adjustments to preserve subgroup integrity while managing group size. techniques include priority-based root selection, demotion of overloaded components, fan-out of large subtrees, and optional merging of small groups.

1. introduction

chunked character learning benefits from groupings that are:

semantically coherent
reasonably bounded in size
distinct and non-overlapping

this algorithm builds a directed tree over a character set using component data to generate such groupings, emphasizing:

cohesion: prioritizing complete subtrees under shared components
distinctiveness: avoiding semantic flattening under common radicals
distribution: balancing group sizes for manageable learning sessions

2. algorithm structure

2.1 data ingestion

input data consists of (parent, character, pronunciation) triples. the ingestion phase constructs:

children: direct parent → child proposals
pinyin: character → pronunciation
cand: character → candidate parent list
pot: parent → all characters listing it as a candidate
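
as a rough illustration, the four tables could be built in one pass over the triples. this is a minimal sketch, not the original implementation: the function name ingest, the use of an empty parent string for characters with no proposed parent, and the sample triples are all assumptions. in this simplified form each triple proposes exactly one parent, so children and pot end up holding the same pairs (one as an ordered list, one as a set).

```python
from collections import defaultdict

def ingest(triples):
    """Build lookup tables from (parent, character, pronunciation) triples."""
    children = defaultdict(list)  # parent -> direct child proposals, input order
    pinyin = {}                   # character -> pronunciation
    cand = defaultdict(list)      # character -> candidate parents, input order
    pot = defaultdict(set)        # parent -> characters naming it as a candidate
    for parent, char, reading in triples:
        pinyin[char] = reading
        # Assumption: an empty parent means "no parent proposed".
        if parent and parent != char:
            if char not in children[parent]:
                children[parent].append(char)
            if parent not in cand[char]:
                cand[char].append(parent)
            pot[parent].add(char)
    return children, pinyin, cand, pot

# Hypothetical sample input.
triples = [
    ("青", "清", "qīng"),
    ("氵", "清", "qīng"),
    ("青", "请", "qǐng"),
    ("讠", "请", "qǐng"),
    ("", "青", "qīng"),
]
children, pinyin, cand, pot = ingest(triples)
```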

2.2 parent assignment

phase 1: subtree claiming

priority forms and sufficiently connected parents claim all of their unassigned children at once.
a group is claimed only if either of the following holds:
- the parent is in the priority set
- the candidate group meets a minimum size, is not over capacity, and the parent is not on the demotion list
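
the claiming rule can be sketched as follows. this is an illustrative reading of the conditions above, not the original code: the function name claim_subtrees, the thresholds min_size and max_size, the processing order (priority parents first, then larger groups), and the sample data are assumptions. the sketch also assumes the component data contains no cycles.

```python
def claim_subtrees(pot, priority, demoted, min_size=3, max_size=40):
    """Phase 1 sketch: eligible parents claim all unassigned candidates at once."""
    parent_of = {}
    # Assumed order: priority parents first, then larger candidate groups.
    order = sorted(pot, key=lambda p: (p not in priority, -len(pot[p]), p))
    for parent in order:
        group = sorted(c for c in pot[parent]
                       if c not in parent_of and c != parent)
        if not group:
            continue
        fits = min_size <= len(group) <= max_size
        # Priority overrides both the size check and the demotion list.
        if parent in priority or (fits and parent not in demoted):
            for char in group:
                parent_of[char] = parent
    return parent_of

# Hypothetical sample data.
pot = {
    "青": {"清", "请", "晴"},
    "氵": {"清", "河"},
    "讠": {"请"},
    "日": {"晴"},
}
parent_of = claim_subtrees(pot, priority={"青"}, demoted={"氵", "讠"}, min_size=2)
```

here 青 claims its whole group first, so 氵 is left with only 河, which is below the minimum size and stays unassigned for phase 2.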

phase 2: greedy assignment

remaining characters are assigned to the best available parent based on:
- avoiding demoted components
- minimizing existing cluster size
- favoring phonetic or semantic coherence
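
one way to express these preferences is a lexicographic score per candidate parent. the sketch below is an assumption-heavy illustration: the names toneless and greedy_assign, the ordering of the three criteria, and the use of tone-stripped pinyin equality as a stand-in for phonetic coherence are not taken from the original. semantic coherence is not modeled at all.

```python
import unicodedata

def toneless(reading):
    """Strip tone marks from a pinyin syllable (crude phonetic key)."""
    decomposed = unicodedata.normalize("NFD", reading)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def greedy_assign(cand, pinyin, parent_of, demoted):
    """Phase 2 sketch: give each leftover character its best-scoring parent."""
    parent_of = dict(parent_of)  # do not mutate the caller's mapping
    sizes = {}
    for p in parent_of.values():
        sizes[p] = sizes.get(p, 0) + 1
    for char in sorted(cand):
        if char in parent_of:
            continue
        options = [p for p in cand[char] if p != char]
        if not options:
            continue

        def score(p):
            same_sound = (p in pinyin and char in pinyin
                          and toneless(pinyin[p]) == toneless(pinyin[char]))
            # Lower is better: non-demoted, then smaller cluster, then same sound.
            return (p in demoted, sizes.get(p, 0), not same_sound)

        best = min(options, key=score)
        parent_of[char] = best
        sizes[best] = sizes.get(best, 0) + 1
    return parent_of

# Hypothetical sample data.
cand = {"河": ["氵", "可"], "清": ["青", "氵"]}
pinyin = {"河": "hé", "可": "kě", "清": "qīng", "青": "qīng"}
assigned = greedy_assign(cand, pinyin, {"清": "青"}, demoted={"氵"})
```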

2.3 cluster adjustment

2.3.1 merging small roots

roots with fewer children than a minimum threshold are absorbed into compatible larger parents.
this avoids fragmentation and aligns residual characters with meaningful groupings.
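
a minimal sketch of the merge step, under stated assumptions: "compatible" is taken to mean a candidate parent of the small root that already has at least min_children children and is not one of the root's own descendants. the names merge_small_roots, descendants, and min_children are hypothetical.

```python
def descendants(children_map, node):
    """Collect every node below `node` in the tree."""
    seen, stack = set(), [node]
    while stack:
        for child in children_map.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def merge_small_roots(children_map, cand, min_children=3):
    """Attach under-sized roots beneath a larger candidate parent."""
    children_map = {p: list(cs) for p, cs in children_map.items()}
    assigned = {c for cs in children_map.values() for c in cs}
    roots = [p for p in children_map if p not in assigned]
    for root in roots:
        if len(children_map[root]) >= min_children:
            continue
        below = descendants(children_map, root)
        hosts = [p for p in cand.get(root, [])
                 if p != root and p not in below
                 and len(children_map.get(p, [])) >= min_children]
        if hosts:
            host = max(hosts, key=lambda p: len(children_map[p]))
            # The small root moves as a unit, keeping its own children.
            children_map[host].append(root)
    return children_map

# Hypothetical sample data.
children_map = {"古": ["故", "姑", "苦"], "胡": ["湖", "蝴"]}
cand = {"胡": ["古", "月"]}
merged = merge_small_roots(children_map, cand, min_children=3)
```

in this example 胡 has only two children, so it is absorbed under 古 while its own subgroup stays intact.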

2.3.2 fan-out of large roots

any root with size > max_size is scanned for direct children that form large enough subtrees
these children are promoted to root status
this avoids excessive cluster size while maintaining structural clarity
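
the fan-out step might look like the sketch below. it is a single pass (promoted children are not re-scanned), and it counts a subtree's size as the node plus all descendants; both choices, along with the names fan_out, subtree_size, and min_subtree, are assumptions rather than details from the original.

```python
def subtree_size(children_map, node):
    """Count the node itself plus all of its descendants (assumes no cycles)."""
    return 1 + sum(subtree_size(children_map, c)
                   for c in children_map.get(node, []))

def fan_out(children_map, roots, max_size=40, min_subtree=5):
    """Promote large direct children of oversized roots to root status."""
    children_map = {p: list(cs) for p, cs in children_map.items()}
    new_roots = []
    for root in roots:
        new_roots.append(root)
        if subtree_size(children_map, root) <= max_size:
            continue
        # Iterate over a copy because the child list is modified in the loop.
        for child in list(children_map.get(root, [])):
            if subtree_size(children_map, child) >= min_subtree:
                children_map[root].remove(child)
                new_roots.append(child)
    return children_map, new_roots

# Hypothetical sample data with deliberately small thresholds.
children_map = {
    "古": ["胡", "故", "姑"],
    "胡": ["湖", "蝴", "糊"],
    "故": ["做"],
}
new_map, new_roots = fan_out(children_map, ["古"], max_size=6, min_subtree=4)
```

with these toy thresholds, 古 starts at size 8, so its child 胡 (subtree size 4) is promoted and 古 shrinks to size 4.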

2.4 output structure

after normalization, the system outputs:

children_map: parent → list of assigned children
pinyin_map: character → pronunciation
roots: characters not assigned as children (top-level groups)
size_map: subtree size for each node
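
given a final children_map, the roots and size_map can be derived rather than stored separately. the sketch below shows one way to do that; the function name build_output, the convention that a subtree's size includes the node itself, and the treatment of unattached characters as size-1 roots are assumptions.

```python
def build_output(children_map, pinyin_map):
    """Derive roots and subtree sizes from the final parent -> children map."""
    assigned = {c for cs in children_map.values() for c in cs}
    # Assumption: characters that appear only in pinyin_map are singleton roots.
    nodes = set(children_map) | assigned | set(pinyin_map)
    roots = sorted(n for n in nodes if n not in assigned)
    size_map = {}

    def size(node):
        # Memoized count of the node plus its descendants (assumes no cycles).
        if node not in size_map:
            size_map[node] = 1 + sum(size(c) for c in children_map.get(node, []))
        return size_map[node]

    for root in roots:
        size(root)
    return {
        "children_map": children_map,
        "pinyin_map": pinyin_map,
        "roots": roots,
        "size_map": size_map,
    }

# Hypothetical sample data.
children_map = {"古": ["故", "胡"], "胡": ["湖"]}
pinyin_map = {"古": "gǔ", "故": "gù", "胡": "hú", "湖": "hú", "一": "yī"}
out = build_output(children_map, pinyin_map)
```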

3. results

run on a corpus of 8141 simplified characters, the system produced:

total root groups: 378
mean cluster size: ~22
median size: ~16
largest cluster: 69
priority roots retained complete subtrees
demoted radicals avoided dominance
small phonetic subgroups were preserved where appropriate
leftover groups with size < 3 were collected into a wildcard bucket
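
summary figures of this kind can be computed from the root sizes. the sketch below uses toy numbers, not the corpus results above, and it assumes the wildcard bucket is counted as a single group whose size is the sum of the groups it absorbs; the original may count it differently. the name summarize_groups is hypothetical.

```python
from statistics import mean, median

def summarize_groups(root_sizes, wildcard_below=3):
    """Compute group statistics, pooling tiny groups into one wildcard bucket."""
    kept = [s for s in root_sizes.values() if s >= wildcard_below]
    wildcard = sum(s for s in root_sizes.values() if s < wildcard_below)
    # Assumption: the wildcard bucket counts as one group.
    sizes = kept + ([wildcard] if wildcard else [])
    return {
        "groups": len(sizes),
        "mean": mean(sizes),
        "median": median(sizes),
        "largest": max(sizes),
        "wildcard": wildcard,
    }

# Toy root sizes for illustration only.
root_sizes = {"青": 12, "古": 8, "胡": 5, "一": 1, "乙": 2}
stats = summarize_groups(root_sizes)
```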

4. promoted and demoted characters

example characters selected for promotion to root status and for demotion.

promoted components

朵 殳 圣 吴 召 奈 青 齐 步 𢀖 咅 否 音 至 亲 吉 㕛 台 另 古 去 妾 辛 尗 责 育 幸 舌 君 支 亘 旦 瓜

these characters were prioritized during parent selection. when eligible, they were allowed to form their own root groups and claim all their candidate children. the selection includes:

commonly reused semantic or phonetic bicomponents
visually distinct forms that support chunked recognition
characters suitable as mnemonic anchors for subtrees

demoted components

氵 忄 讠 饣 扌 刂 阝 犭 纟 钅 彳 衤 灬 罒 亻 冫 月 牜 礻 𧾷 口 土 人 木

these components were penalized or excluded from root status unless explicitly prioritized. demotion was applied to:

small graphical affixes whose presence is not visually dominant
highly frequent radicals prone to producing oversized, semantically diffuse groups

demoted forms were still permitted as internal components and as part of subtrees, but they were disfavored in root selection to preserve semantic coherence and manage group size.

5. notes

the method does not guarantee minimal depth or uniqueness of internal structure; it emphasizes semantic group utility over strict hierarchy.
component demotion and prioritization are customizable.
grouping choices balance pedagogical value with structural constraints rather than linguistic etymology.