# hierarchical clustering of chinese characters for chunked learning
this document describes a tree-building procedure for organizing simplified chinese characters into semantically meaningful, size-balanced clusters suitable for incremental learning.

the method combines fixed parent-child relationships with dynamic structural adjustments to preserve subgroup integrity while managing group size. techniques include priority-based root selection, demotion of overloaded components, fan-out of large subtrees, and optional merging of small groups.
## 1. introduction
chunked character learning benefits from groupings that are:
* semantically coherent
* reasonably bounded in size
* distinct and non-overlapping

this algorithm uses component data to build a directed forest over a character set, with one tree per root group, emphasizing:
* cohesion: prioritizing complete subtrees under shared components
* distinctiveness: avoiding semantic flattening under common radicals
* distribution: balancing group sizes for manageable learning sessions
## 2. algorithm structure
### 2.1 data ingestion
input data consists of `(parent, character, pronunciation)` triples. the ingestion phase constructs:
* `children`: direct parent → child proposals
* `pinyin`: character → pronunciation
* `cand`: character → candidate parent list
* `pot`: parent → all characters listing it as candidate
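
the ingestion step can be sketched as below. this is a minimal illustration, not the original implementation: the function name `ingest` and the convention that an empty parent means "no parent proposed" are assumptions. the source does not spell out how `children` and `pot` differ, so the sketch fills both from the same triples and only varies the container type.

```python
from collections import defaultdict

def ingest(triples):
    """Build lookup tables from (parent, character, pronunciation) triples."""
    children = defaultdict(list)  # parent -> proposed children, in input order
    pinyin = {}                   # character -> pronunciation
    cand = defaultdict(list)      # character -> candidate parents, in input order
    pot = defaultdict(set)        # parent -> every character listing it as candidate
    for parent, char, reading in triples:
        pinyin[char] = reading
        # assumption: an empty parent or a self-reference proposes nothing
        if not parent or parent == char:
            continue
        if parent not in cand[char]:
            cand[char].append(parent)
            children[parent].append(char)
        pot[parent].add(char)
    return children, pinyin, cand, pot

# toy input; a real run would read these triples from the component data
triples = [
    ("青", "清", "qing1"),
    ("氵", "清", "qing1"),
    ("青", "请", "qing3"),
    ("讠", "请", "qing3"),
    ("", "青", "qing1"),
]
children, pinyin, cand, pot = ingest(triples)
```

keeping `cand` as an ordered list lets later phases use input order as a tie-break.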
### 2.2 parent assignment
#### phase 1: subtree claiming
* priority forms and sufficiently connected parents claim all of their unassigned children at once
* a group is claimed only if both of the following hold:
  * the parent is in the priority set, or the candidate group meets a minimum size and does not exceed capacity
  * the parent is not on the demotion list, unless it is also prioritized
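
a minimal sketch of the claiming rule follows. the names `claim_subtrees`, `min_size`, and `max_size`, the default thresholds, and the processing order (priority parents first, then larger candidate groups) are assumptions made for illustration. cycle checks are omitted.

```python
def claim_subtrees(pot, priority, demoted, min_size=3, max_size=40):
    """Phase 1: eligible parents claim all of their still-unassigned candidates."""
    parent_of = {}
    # assumed order: priority parents first, then larger candidate groups
    order = sorted(pot, key=lambda p: (p not in priority, -len(pot[p]), p))
    for parent in order:
        group = [c for c in sorted(pot[parent])
                 if c not in parent_of and c != parent]
        if not group:
            continue
        if parent in priority:
            eligible = True  # priority overrides size limits and demotion
        else:
            eligible = (parent not in demoted
                        and min_size <= len(group) <= max_size)
        if eligible:
            for c in group:
                parent_of[c] = parent
    return parent_of
```

characters left without a parent after this pass fall through to the greedy phase.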
#### phase 2: greedy assignment
* remaining characters are assigned to the best available parent based on:
  * avoiding demoted components
  * minimizing existing cluster size
  * favoring phonetic or semantic coherence
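
the greedy step could look like the sketch below. it assumes the three criteria are applied in the order listed, that pronunciations are stored as numbered-tone pinyin such as `qing1`, and that phonetic coherence means "same syllable ignoring tone". semantic coherence is not modeled here. all names are hypothetical.

```python
def greedy_assign(chars, cand, parent_of, demoted, pinyin):
    """Phase 2: give each unassigned character its best-scoring candidate parent."""
    sizes = {}
    for p in parent_of.values():
        sizes[p] = sizes.get(p, 0) + 1

    def base(reading):
        # strip a trailing tone digit: "qing1" -> "qing"
        return reading.rstrip("12345")

    for c in chars:
        if c in parent_of or not cand.get(c):
            continue
        own = base(pinyin.get(c, ""))

        def score(p):
            same_sound = bool(own) and base(pinyin.get(p, "")) == own
            # lower is better: non-demoted, then smaller cluster, then same sound
            return (p in demoted, sizes.get(p, 0), not same_sound)

        best = min(cand[c], key=score)  # ties fall back to candidate order
        parent_of[c] = best
        sizes[best] = sizes.get(best, 0) + 1
    return parent_of
```

because `min` returns the first of several equal scores, the order of `cand[c]` acts as the final tie-break.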
### 2.3 cluster adjustment
#### 2.3.1 merging small roots
* roots with fewer children than a minimum threshold are absorbed into compatible larger parents
* this avoids fragmentation and aligns residual characters with meaningful groupings
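
one way to sketch the merge is shown below. the source does not define "compatible", so the sketch assumes a compatible parent is one of the root's own candidate parents that already has at least `min_children` children and does not sit inside the root's subtree. function and parameter names are hypothetical.

```python
def merge_small_roots(parent_of, cand, roots, min_children=3):
    """Re-parent under-filled roots beneath a larger candidate parent."""
    kids = {}
    for c, p in parent_of.items():
        kids.setdefault(p, []).append(c)

    def under(node, top):
        # True if `top` is `node` itself or one of its ancestors
        while node is not None:
            if node == top:
                return True
            node = parent_of.get(node)
        return False

    for r in sorted(roots):
        if len(kids.get(r, [])) >= min_children:
            continue
        hosts = [p for p in cand.get(r, [])
                 if len(kids.get(p, [])) >= min_children and not under(p, r)]
        if hosts:
            # assumed preference: the largest eligible host
            host = max(hosts, key=lambda p: len(kids.get(p, [])))
            parent_of[r] = host
            kids.setdefault(host, []).append(r)
    return parent_of
```

roots with no eligible host are left in place.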
#### 2.3.2 fan-out of large roots
* any root with size > `max_size` is scanned for direct children that form large enough subtrees
* these children are promoted to root status
* this avoids excessive cluster size while maintaining structural clarity
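
the fan-out step can be sketched as follows. it assumes subtree size counts the node itself plus all descendants, and that a single pass over the initial roots is enough. `min_subtree` is a hypothetical name for the "large enough" threshold.

```python
def fan_out(parent_of, max_size=40, min_subtree=5):
    """Promote large direct children of oversized roots to roots of their own."""
    kids = {}
    for c, p in parent_of.items():
        kids.setdefault(p, []).append(c)

    def size(node):
        # assumed definition: the node itself plus all of its descendants
        return 1 + sum(size(k) for k in kids.get(node, []))

    roots = [p for p in kids if p not in parent_of]
    for r in roots:
        if size(r) <= max_size:
            continue
        for child in list(kids[r]):
            if size(child) >= min_subtree:
                del parent_of[child]   # the child becomes a root
                kids[r].remove(child)
    return parent_of
```

promoted children keep their own subtrees intact, so only the top-level grouping changes.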
### 2.4 output structure
after normalization, the system outputs:
* `children_map`: parent → list of assigned children
* `pinyin_map`: character → pronunciation
* `roots`: characters not assigned as children (top-level groups)
* `size_map`: subtree size for each node
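
as an illustration, the four outputs can be derived from a final child → parent assignment as sketched below. the function name and the `parent_of` input are assumptions, and subtree size is taken to include the node itself.

```python
def build_output(parent_of, pinyin):
    """Derive children_map, pinyin_map, roots and size_map from an assignment."""
    chars = set(pinyin) | set(parent_of) | set(parent_of.values())
    children_map = {}
    for c, p in parent_of.items():
        children_map.setdefault(p, []).append(c)
    for kids in children_map.values():
        kids.sort()
    roots = sorted(c for c in chars if c not in parent_of)

    size_map = {}

    def fill(node):
        # assumed definition: a node's size counts itself plus its descendants
        size_map[node] = 1 + sum(fill(k) for k in children_map.get(node, []))
        return size_map[node]

    for r in roots:
        fill(r)
    return {
        "children_map": children_map,
        "pinyin_map": dict(pinyin),
        "roots": roots,
        "size_map": size_map,
    }
```

under this definition the root sizes sum to the number of characters in the set.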
## 3. results
run on a corpus of 8141 simplified characters, the system produced:
* total root groups: 378
* mean cluster size: ~22
* median size: ~16
* largest cluster: 69

the run also showed that:
* priority roots retained complete subtrees
* demoted radicals did not dominate the grouping
* small phonetic subgroups were preserved where appropriate
* leftover groups with size < 3 were collected into a wildcard bucket
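
statistics of this kind can be computed from `roots` and `size_map` as in the sketch below. whether the reported figures count the wildcard bucket is not stated, so this version excludes it as an assumption. the function and key names are hypothetical, and the data in the usage lines is a toy example, not the reported run.

```python
from statistics import mean, median

def summarize(roots, size_map, wildcard_below=3):
    """Report cluster statistics, pooling very small groups into one bucket."""
    kept = [r for r in roots if size_map[r] >= wildcard_below]
    wildcard = sorted(r for r in roots if size_map[r] < wildcard_below)
    sizes = [size_map[r] for r in kept]
    return {
        "root_groups": len(kept),
        "mean_size": mean(sizes) if sizes else 0,
        "median_size": median(sizes) if sizes else 0,
        "largest": max(sizes, default=0),
        "wildcard": wildcard,
    }

# toy data for illustration only
stats = summarize(["青", "可", "山"], {"青": 5, "可": 3, "山": 1})
```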

## 4. promoted and demoted characters
the lists below give example characters selected for promotion to root status and for demotion.

### promoted components
```
朵 殳 圣 吴 召 奈 青 齐 步 𢀖 咅 否 音 至 亲 吉 㕛 台 另 古 去 妾 辛 尗 责 育 幸 舌 君 支 亘 旦 瓜
```

these characters were prioritized during parent selection. when eligible, they were allowed to form their own root groups and claim all their candidate children. the selection includes:

* commonly reused semantic or phonetic bicomponents
* visually distinct forms that support chunked recognition
* characters suitable as mnemonic anchors for subtrees

### demoted components
```
氵 忄 讠 饣 扌 刂 阝 犭 纟 钅 彳 衤 灬 罒 亻 冫 月 牜 礻 𧾷 口 土 人 木
```

these components were penalized or excluded from root status unless explicitly prioritized. demotion was applied to:

* small graphical affixes whose presence is not visually dominant
* highly frequent radicals prone to producing oversized, semantically diffuse groups

demoted forms were still permitted as internal components and as part of subtrees, but they were disfavored in root selection to preserve semantic coherence and manage group size.

## 5. notes
* the method does not guarantee minimal depth or uniqueness of internal structure; it emphasizes semantic group utility over strict hierarchy.
* component demotion and prioritization are customizable.
* grouping choices balance pedagogical value against structural constraints and do not necessarily follow linguistic etymology.