2020-10-03

measuring, increasing and decreasing subsequence repetition

calculating

unique-max

the maximum possible count of unique overlapping subsequences for one subsequence width in a parent sequence of size. size must be equal or greater than width.

size - (width - 1)

unique-all-max

like unique-max but for all possible widths

sum for n from width to size
  size - (n - 1)

repetition-max

the maximum possible count of subsequence repetitions for one subsequence width

unique-max - 1

repetition-all-max

like repetition-max but for all possible widths

unique-all-max - size

measuring

repetition

iterate over all sub-sequences or width in sequence
use a set data structure to find reoccuring sequences
count reoccurences

as a ratio

a ratio of repetition between 0 and 1

count / repetition-max

repetition-all

like for single width repetition but repeated for each possible width and summed

as a ratio

a ratio of repetition between 0 and 1, normalised by the number of possible widths

count / size

increase

replace rare sequences with more frequent ones, using only existing scalars

algorithm

approximation, might not reach target. works most reliable with width 2. can gradually increase repetition.

input
  $a: the main sequence to modify
  $width: the width of subsequences to consider
  $target: the desired result repetition
limits
  repetition-max
from $a create a collection $counted-sequences ((sequence count) ...)
sort by count ascending
for pairs of $counted-sequences
replace occurences of the first with the last, second with the second last, and so on

special cases

width = size

there can only be one sequence of this size, the main parent sequence

more repetition than distinct scalars

for example for maximum repetition, a random element can be selected to replace all other elements

large width

target might never be reached because there are no adequate patterns to replace rare ones, without affecting common ones

decrease

replace sequences with ones not yet included

algorithm

approximation, might not reach target. works most reliable with width 2

input
  $a: the main sequence to modify
  $width: the width of subsequences to consider
  $target: the desired result repetition
limits
  the minimum possible repetition
    unique-max-set: the maximum number of distinct selections of $width from the set of scalar values in $a
    min-repetition: unique-max - minimum(unique-max, unique-max-set)
sequence creation
  deduplicate $a to create a set in an array
  shuffle the set to create more evenly distributed patterns
  count an array of indices like digits of a number
  select from set at indices to create a new pattern
  repeat if the pattern is already included in $a
counting
  from $a create a collection $counted-sequences ((sequence count) ...)
  sort by count descending
replacing
  for elements of $counted-sequences
  replace occurences with a new sequence
  replace until all occurences are replaced or the difference to the target is zero
  recount and repeat until target reached

special cases

width = 1

sequences of width 1 are scalars and there is usually a different process for counting scalars compared to other sequences

notes

alternatives

especially for extreme target values, permutations (decrease) or setting ranges to the same value (increase) might be preferable

more definitive algorithms

the problem with the presented algorithms is that when one sequence is replaced, others are affected because sequences are overlapping. all overlapping combinations of all sequences might have to be considered

on float values

for use on floating point values, either convert to integers or compare with a minimum acceptable delta. the same values can be represented in different ways using floating point which is why equality can fail, and small differences might be from rounding errors

implementations

the approximation algorithms have been implemented in c for sph-sp