# measuring, increasing and decreasing subsequence repetition # calculating ## unique-max the maximum possible count of unique overlapping subsequences for one subsequence width in a parent sequence of size. size must be equal or greater than width. ~~~ size - (width - 1) ~~~ ## unique-all-max like unique-max but for all possible widths ~~~ sum for n from width to size size - (n - 1) ~~~ ## repetition-max the maximum possible count of subsequence repetitions for one subsequence width ~~~ unique-max - 1 ~~~ ## repetition-all-max like repetition-max but for all possible widths ~~~ unique-all-max - size ~~~ # measuring ## repetition ~~~ iterate over all sub-sequences or width in sequence use a set data structure to find reoccuring sequences count reoccurences ~~~ ## as a ratio a ratio of repetition between 0 and 1 ~~~ count / repetition-max ~~~ ## repetition-all like for single width repetition but repeated for each possible width and summed ### as a ratio a ratio of repetition between 0 and 1, normalised by the number of possible widths ~~~ count / size ~~~ # increase replace rare sequences with more frequent ones, using only existing scalars ## algorithm approximation, might not reach target. works most reliable with width 2. can gradually increase repetition. ~~~ input $a: the main sequence to modify $width: the width of subsequences to consider $target: the desired result repetition limits repetition-max from $a create a collection $counted-sequences ((sequence count) ...) sort by count ascending for pairs of $counted-sequences replace occurences of the first with the last, second with the second last, and so on ~~~ ## special cases ### width = size there can only be one sequence of this size, the main parent sequence ### more repetition than distinct scalars for example for maximum repetition, a random element can be selected to replace all other elements ### large width target might never be reached because there are no adequate patterns to replace rare ones, without affecting common ones # decrease replace sequences with ones not yet included ## algorithm approximation, might not reach target. works most reliable with width 2 ~~~ input $a: the main sequence to modify $width: the width of subsequences to consider $target: the desired result repetition limits the minimum possible repetition unique-max-set: the maximum number of distinct selections of $width from the set of scalar values in $a min-repetition: unique-max - minimum(unique-max, unique-max-set) sequence creation deduplicate $a to create a set in an array shuffle the set to create more evenly distributed patterns count an array of indices like digits of a number select from set at indices to create a new pattern repeat if the pattern is already included in $a counting from $a create a collection $counted-sequences ((sequence count) ...) sort by count descending replacing for elements of $counted-sequences replace occurences with a new sequence replace until all occurences are replaced or the difference to the target is zero recount and repeat until target reached ~~~ ## special cases ### width = 1 sequences of width 1 are scalars and there is usually a different process for counting scalars compared to other sequences # notes ## alternatives especially for extreme target values, permutations (decrease) or setting ranges to the same value (increase) might be preferable ## more definitive algorithms the problem with the presented algorithms is that when one sequence is replaced, others are affected because sequences are overlapping. all overlapping combinations of all sequences might have to be considered ## on float values for use on floating point values, either convert to integers or compare with a minimum acceptable delta. the same values can be represented in different ways using floating point which is why equality can fail, and small differences might be from rounding errors ## implementations the approximation algorithms have been implemented in c for [sph-sp](https://github.com/sph-mn/sph-sp)