In this section a new complexity measure, -complexity is studied. This measure is intended to express the total quantity of information included in a sequence counting the different -subsequences (belonging to the given set of sequences) of the investigated sequence. The background of the new complexity measure lies in biology and chemistry. Some natural sequences, as amino-acid sequences in proteins or nucleotid sequences in DNA-molecules have a winding structure [ 61 ], [ 66 ] and some bends can be cut forming new sequences. The distance parameter is the bound for the length of bends.
Let and be positive integers, be an alphabet, the set of -length words, the set of finite words and the set of finite nonempty words over . The length of a word is denoted by .
Let be nonnegative integers. Then the multialphabet is the multiset , the set of words and containing at most times , the set of words containing at most times . Alphabets can be considered as special multialphabets with infinite multiplicityes of the letters.
Nonempty sets of words — e.g. — are called languages. Set of -length elements of a language is denoted by .
Let and be positive integers, and words such that and . If , then is a
of , if and only if there exists a sequence with such that . If , then is a -subword of , if and only if there exists a sequence with such that . If for given and there exist several such sequences (or indices), then the sequence belonging to and is the lexicographically minimal of them. The differences are called jumps. 1-length words contain a jump of size zero.
According to this definition only nonempty words can be -subwords.
Now we classify the -subwords of a given word according to their length, position of their last letter and the length of their longest jump.
Definition 29.10 Let and be positive integers, a multialphabet, a multilanguage, a word, denote the -length subwords of belonging to , denote the -subwords of for which the first appearance ends with , the set of subwords of for which the largest jump in its first occurrence in equals to (everywhere ). The -complexity function of is
the -characteristic function of is
the -jump function of is
the total -complexity of is
The special case of was introduced by M. Heinz [ 110 ] in 1977 and studied later by many authors (see e.g. the papers [ 4 ], [ 74 ], [ 251 ]. The special case of was introduced in 1984 [ 117 ] and studied later e.g. in [ 125 ], [ 141 ].
Classification according to the last letter gives , , and and so .
Classification according to the largest jumps results , , and so .
Let . The multisubwords can contain at most one and so . Decreasing of some complexity values is natural since bounded multiplicityes exclude some subwords.
Definition 29.11 Let be a finite language over and be a word. is a -covering word of if and only if contains all elements of as a d-subword. The length of the shortest -covering word of is denoted by . Such words are called minimal. If , then we use instead of .
The special case is well-known and widely studied as superstring problem [ 290 ].
Definition 29.12 Let and be positive integers, be an alphabet and be a language over . The maximum of the -complexities of the elements of is denoted by . Words having complexity are called maximal. If , then we use instead of . Minimal and maximal words are called extremal.
Definition 29.16 The words consisting of identical letters are called homogeneous, the words consisting of different letters are called rainbow and the words consisting of copies of a given rainbow word (the last copy can be only a prefix) are called cyclical.
In this section we formulate some simple properties of the investigated complexity measures.
and in some cases the equality holds.
Proof. implies the relation. E.g. if and the first, second and last letters of a word are identical, then and generally if , then in (29.12) equality holds for any .
Special case of this assertion was recently proved by de Luca [Lu]. Since only words belonging to are counted as -subwords, the function is not always trapezoidal. E.g. if an alphabet, is a nonnegative integer, and a covering word of , then the function has local maxima, since now equals to for and equals zero otherwise.
and if for some finite alphabet , then the -characteristic function is an increasing function of that is
E.g. if is empty, then the functions and are only nondecreasing ones.
and if and , then
The lower bounds are sharp. The upper bound is sharp if and only if .
and if and , then
The lower bounds are sharp. The upper bound is sharp if and only if .
The lower bound is sharp if and only if is homogeneous. The upper bound is sharp.
The first proof of Lemma 29.24 is due to Heinz [ 110 ]. Shallit [ 251 ] has proved that in the case the upper bound can be achieved. M.-C. Anisiu [ 4 ] proposed an algorithm constructing for any positive an -length binary word with the following property: if , then either contains all -length binary words as subword or all of its -length subwords are different.
The jump function characterizes the decreasing of the complexity due to the decreasing of .
The bounds are sharp if and only if is a rainbow word.
If , then we have either two jumps of length or one . Due to the second jump we loss a further subword comparing with the previous case.
If , then we have place also for one or two -length jumps. The exponential part of the formula counts 2 subwords containing jumps of length and and counts twice 5 subwords containing two jumps of length so we loss 7 subwords.
If , then also one or two jumps of length at most are permitted. We loss 2 subwords containing jumps of length and , 6 subwords containing jumps of length and and all together 6 subwords containing two jumps both of length .
The following recurrence properties of the complexity functions and are useful in the later analysis. In this section always and , therefore the parameters and are omitted.
where , and
The next three lemmas contain useful results of the theory of linear recurrence equations. The first one [ 153 ] gives the solution of the -order inhomogeneous recurrence relation, the second one [ 124 ] characterizes the roots of the corresponding characteristic equations and the third one [ 124 ] sums the values of the solution function.
where the constants can be computed using the given initial values and the values can be estimated using the next lemma.
in decreasing order of their absolute values. Then
generates one-dimensional perfect arrays. Its inputs are the alphabet size and the window size . Its output is an -ary perfect sequence of length . The output begins with zeros and always continues with the maximal permitted element of the alphabet.
A natural implementation of Martin's algorithm can be found in the chapter Complexity of words of this book. The following effective implementation of
is due to M. Horváth and A. Iványi [
At first we present an algorithm -
which estimates the -complexity of words.
Input: the size of the alphabet,
the distance parameter,
the length of the investigated word,
the letters of the investigated word.
Output: the -complexity of the investigated word.
Working variables: position of the last occurrence of letters .
starting index of the summing.
TO2 3 4 5
TO6 7 8
IF11 12 13
The -complexity of the investigated word is greater or equal to the output of this algorithm. If all letters of are different or if is a 1-subword of the word or , then the output of the algorithm is the exact complexity of .
The running time of this algorithm is .
Input: the size of the alphabet,
the length of the window.
Output: an infinite superperfect word .
1 2 3
WHILE4 Continue to get an Eulerian circuit of De Bruijn graph 5 6
In line 4 of
we use algorithm
generates a word of given length having maximal subwordcomplexity.
Input: the length of the word to be generated;
: the size of the alphabet.
Output: : the letters of a maximal word .
Working variables: counters of the vertices of de Bruijn graph .
RETURN7 Add 0 to the Martin-word and according to the new word draw a path in the De Bruijn graph 8 Construct cycles in using the remaining edges following Martin principle 9 Order these cycles in decreasing order of length. Let . 10 Transform at first into a cyclical De Bruijn word, then back into a De Bruijn word ending at the starting vertex of . 11 Add to receiving the letters 12
IFthe length of the new word is greater or equal to 13
TO15 Insert in the constructed word 16
IFthe length of the new word is greater or equal to 17
The running time of
Due to the finiteness of the corresponding sets -extremal words of languages exist in all cases. Similar assertion does not hold for maximal and minimal words of families of languages.
Theorem 29.32 and 29.34 contain general bounds of -complexity, resp. 1-complexity. Theorem 29.35 characterizes the maximal -complexity for medium distance and implies the surprising corollary that binary alphabet is sufficient to get the same order of maximal subword complexity what can be achieved using an infinite alphabet.
Theorem 29.37 and 29.38 give the maximal complexity for different values of the parameters. Theorem 29.40 contains the condition of the existence of -superminimal words. Theorem 29.42 gives the complexity of cyclical words and together with Theorem 29.37 has the corollary that and have the same order of growth. Theorem 29.42 gives the that the -length words can be covered using only letters needed to cover the words and so illustrates the decreasing of the length of minimal -covering words due to the permitted jumps in the construction.
Theorem 29.32 (Iványi [ 124 ]) If and are positive integers, is an alphabet and , then
The lower bound is tight if and only if is homogeneous. The upper bound is sharp if and only if is a rainbow word and .
Proof. Since holds for , the lower bound is correct and for the equality holds. The upper bound can be proved using equation (29.21) and inequality or directly exploiting that an -element set has nonempty subsets. If and , then belongs to and contains different elements of as a -subword so the upper bound is also tight. From the other side if , then contains repeated letters and so , and if , then the first and last letters of do not form -subsequence and so .
and if , then
The lower bound is sharp if is homogeneous. The upper bound is sharp if and only if is a rainbow word.
Proof. We have places for the 2 delimiters determining a 1-subword of . If , then all 1-subwords of the word are different. From the other side if , then has to contain repeated letters not representing different 1-subwords.
where is the unique integer satisfying and so .
Proof. Summing the inequalities (29.25) for results
In the sum for small values of the exponential part, for large values of the second part gives the minimum. Therefore concrete calculations result the formula of the theorem as a bound. Algorithm
produces a suitable word for any possible pair of achieving this bound.
If , then computation of the constants in Theorem 29.35 requires only technical steps, otherwise there are algebraic difficulties. Z. Kása [ 141 ] has found another recurrence relation and gave explicit formulas for . Using the jump function we reprove his result and extend it for some smaller values of .
Theorem 29.38 (Kása [ 141 ]) If and are positive integers and , then
where and so
If , then according to Lemma 29.25 the sum of the jumping values of a rainbow word for results our formula.
The last 3 formulas also can be derived using Lemma 29.25.
Figure 29.6 shows the values of the jump function for rainbow words. The first and second columns are computed using the formulas and , the last columns using the formulas for and the medium values (printed by bold digits) using the recurrence relation. These data support the conjecture that the jump function is trapezoidal.
The next theorem shows that the maximal -complexity is exponential even over a binary alphabet and .
If , then
and is a minimal -covering word of .
29.9-1 Construct a superperfect word with maximal window size 4 for the alphabet .