29.9. 29.9 -complexity of one-dimensional arrays

In this section a new complexity measure, -complexity is studied. This measure is intended to express the total quantity of information included in a sequence counting the different -subsequences (belonging to the given set of sequences) of the investigated sequence. The background of the new complexity measure lies in biology and chemistry. Some natural sequences, as amino-acid sequences in proteins or nucleotid sequences in DNA-molecules have a winding structure [ 61 ], [ 66 ] and some bends can be cut forming new sequences. The distance parameter is the bound for the length of bends.

We use the basic concepts and notations of the theory of algorithms [ 57 ], formal languages [ 242 ] and graphs [ 27 ].

29.9.1. 29.9.1 Definitions

Let and be positive integers, be an alphabet, the set of -length words, the set of finite words and the set of finite nonempty words over . The length of a word is denoted by .

Let be nonnegative integers. Then the multialphabet is the multiset , the set of words and containing at most times , the set of words containing at most times . Alphabets can be considered as special multialphabets with infinite multiplicityes of the letters.

Nonempty sets of words — e.g. — are called languages. Set of -length elements of a language is denoted by .

Definition 29.9 Let and be positive integers, and words such that and . If , then is a d-subword of , if and only if there exists a sequence with such that . If , then is a -subword of , if and only if there exists a sequence with such that . If for given and there exist several such sequences (or indices), then the sequence belonging to and is the lexicographically minimal of them. The differences are called jumps. 1-length words contain a jump of size zero.

According to this definition only nonempty words can be -subwords.

Now we classify the -subwords of a given word according to their length, position of their last letter and the length of their longest jump.

Definition 29.10 Let and be positive integers, a multialphabet, a multilanguage, a word, denote the -length subwords of belonging to , denote the -subwords of for which the first appearance ends with , the set of subwords of for which the largest jump in its first occurrence in equals to (everywhere ). The -complexity function of is

the -characteristic function of is

the -jump function of is

the total -complexity of is

The special case of was introduced by M. Heinz [ 110 ] in 1977 and studied later by many authors (see e.g. the papers [ 4 ], [ 74 ], [ 251 ]. The special case of was introduced in 1984 [ 117 ] and studied later e.g. in [ 125 ], [ 141 ].

Example 29.1 Let be an alphabet, . Classification according to the length of the subwords results , , , , , , and so , , .

Classification according to the last letter gives , , and and so .

Classification according to the largest jumps results , , and so .

Let . The multisubwords can contain at most one and so . Decreasing of some complexity values is natural since bounded multiplicityes exclude some subwords.

Definition 29.11 Let be a finite language over and be a word. is a -covering word of if and only if contains all elements of as a d-subword. The length of the shortest -covering word of is denoted by . Such words are called minimal. If , then we use instead of .

The special case is well-known and widely studied as superstring problem [ 290 ].

Definition 29.12 Let and be positive integers, be an alphabet and be a language over . The maximum of the -complexities of the elements of is denoted by . Words having complexity are called maximal. If , then we use instead of . Minimal and maximal words are called extremal.

Definition 29.13 Let be a sequence of languages over some alphabet multialphabet with the property . Such sequences of languages are called family.

Definition 29.14 Let and be positive integers, an alphabet, a family of languages and be a word over . is a -superminimal word of the family if and only if the word is a -covering word of for .

Definition 29.15 Let be a positive integer, be an alphabet, be a positive integer and be a word over . is -supermaximal word of the alphabet if and only if the word is a -maximal word for .

Definition 29.16 The words consisting of identical letters are called homogeneous, the words consisting of different letters are called rainbow and the words consisting of copies of a given rainbow word (the last copy can be only a prefix) are called cyclical.

29.9.2. 29.9.2 Bounds of complexity measures

In this section we formulate some simple properties of the investigated complexity measures.

Lemmas 29.17, 29.18, 29.19 and 29.20 characterize the monotonity properties of the functions and .

Lemmas 29.21, 29.22, 29.23 and 29.24 give some bounds of the functions and .

Lemma 29.17 If is a positive integer, is a language and is a word, then the -complexity function is a monotone nondecreasing function of the distance , that is

and in some cases the equality holds.

Proof. implies the relation. E.g. if and the first, second and last letters of a word are identical, then and generally if , then in (29.12) equality holds for any .

Lemma 29.18 If and are positive integers and , then is a trapezoidal function of , that is there exist a positive integer and a nonnegative integer such that

and

Special case of this assertion was recently proved by de Luca [Lu]. Since only words belonging to are counted as -subwords, the function is not always trapezoidal. E.g. if an alphabet, is a nonnegative integer, and a covering word of , then the function has local maxima, since now equals to for and equals zero otherwise.

Lemma 29.19 is a language, then the -characteristic function is a nondecreasing function that is

and if for some finite alphabet , then the -characteristic function is an increasing function of that is

Lemma 29.20 If is a positive integer, is an alphabet and is a language, then the cumulated values of the functions and are increasing functions of that is

Proof. The corresponding and are nonempty sets.

E.g. if is empty, then the functions and are only nondecreasing ones.

Lemma 29.21 Let and , then

and if and , then

The lower bounds are sharp. The upper bound is sharp if and only if .

Lemma 29.22 Let and , then

and if and , then

The lower bounds are sharp. The upper bound is sharp if and only if .

Lemma 29.23 Let and , then

Lemma 29.24 If is a positive integer and , then

The lower bound is sharp if and only if is homogeneous. The upper bound is sharp.

The first proof of Lemma 29.24 is due to Heinz [ 110 ]. Shallit [ 251 ] has proved that in the case the upper bound can be achieved. M.-C. Anisiu [ 4 ] proposed an algorithm constructing for any positive an -length binary word with the following property: if , then either contains all -length binary words as subword or all of its -length subwords are different.

The jump function characterizes the decreasing of the complexity due to the decreasing of .

Lemma 29.25 Let and be positive integers, . Then

The bounds are sharp if and only if is a rainbow word.

Proof. If , then at most one jump of length is possible. We have positions to start the jump and can choose or not the remaining letters.

If , then we have either two jumps of length or one . Due to the second jump we loss a further subword comparing with the previous case.

If , then we have place also for one or two -length jumps. The exponential part of the formula counts 2 subwords containing jumps of length and and counts twice 5 subwords containing two jumps of length so we loss 7 subwords.

If , then also one or two jumps of length at most are permitted. We loss 2 subwords containing jumps of length and , 6 subwords containing jumps of length and and all together 6 subwords containing two jumps both of length .

29.9.3. 29.9.3 Recurrence relations

The following recurrence properties of the complexity functions and are useful in the later analysis. In this section always and , therefore the parameters and are omitted.

Lemma 29.26 If , then

where , and

Corollary 29.27 If is a crossbow word and , then

Corollary 29.28 If is an infinite cyclical word and , then

The next three lemmas contain useful results of the theory of linear recurrence equations. The first one [ 153 ] gives the solution of the -order inhomogeneous recurrence relation, the second one [ 124 ] characterizes the roots of the corresponding characteristic equations and the third one [ 124 ] sums the values of the solution function.

Lemma 29.29 If is a positive integer, then the solution of the inhomogeneous linear recurrence equation

is

where the constants can be computed using the given initial values and the values can be estimated using the next lemma.

Lemma 29.30 If , then let be the roots of the -order algebraic equation

in decreasing order of their absolute values. Then

and

Lemma 29.31 If

then

29.9.4. 29.9.4 Pseudocode of the algorithm Quick-Martin

Algorithm Martin generates one-dimensional perfect arrays. Its inputs are the alphabet size and the window size . Its output is an -ary perfect sequence of length . The output begins with zeros and always continues with the maximal permitted element of the alphabet.

A natural implementation of Martin's algorithm can be found in the chapter Complexity of words of this book. The following effective implementation of Martin is due to M. Horváth and A. Iványi [ 115 ].

29.9.5. 29.9.5 Pseudocode of algorithm - Complexity

At first we present an algorithm - Complexity which estimates the -complexity of words.

Input: the size of the alphabet,

the distance parameter,

the length of the investigated word,

the letters of the investigated word.

Output: the -complexity of the investigated word.

Working variables: position of the last occurrence of letters .

starting index of the summing.

d-Complexity( )

  1  
                     FOR
                   
                   
                  
                     TO
                   
                     2       3     4     5  
                     FOR
                   
                   
                  
                     TO
                   
                     6     7     8  
                     FOR
                   
                   
                  
                     TO
                   
                     9      10       
                     IF
                   
                    11            12      13  
                     RETURN
                   
                   
               

The -complexity of the investigated word is greater or equal to the output of this algorithm. If all letters of are different or if is a 1-subword of the word or , then the output of the algorithm is the exact complexity of .

The running time of this algorithm is .

29.9.6. 29.9.6 Pseudocode of algorithm Super

The following algorithm produces a superperfect word [ 124 ], [ 125 ], [ 283 ].

Input: the size of the alphabet,

the length of the window.

Output: an infinite superperfect word .

Super( )

  1     2     3  
                     WHILE
                   
                     4    Continue  to get an Eulerian circuit of De Bruijn graph    5     6  
                     RETURN
                   
                   
               

In line 4 of Complex we use algorithm Bruijn [ 125 ].

29.9.7. 29.9.7 Pseudocode of algorithm MaxSub

Algorithm MaxSub generates a word of given length having maximal subwordcomplexity.

Input: the length of the word to be generated;

: the size of the alphabet.

Output: : the letters of a maximal word .

Working variables: counters of the vertices of de Bruijn graph .

MaxSub( )

  1     2  
                     IF
                   
                     3       4  
                     Optimal-Martin(
                     
                     )
                     5  
                     IF
                   
                     6    
                     RETURN
                   
                     7  Add 0 to the Martin-word and according to the new word draw a path in the De Bruijn graph    8  Construct cycles in  using the remaining edges following Martin principle   9  Order these cycles in decreasing order of length. Let .  10  Transform  at first into a cyclical De Bruijn word, then back into a De Bruijn word  ending at the starting vertex of .  11  Add  to  receiving the letters   12  
                     IF
                   the length of the new word is greater or equal to   13    
                     RETURN
                    14  
                     FOR
                   
                   
                  
                     TO
                   
                    15    Insert  in the constructed word  16    
                     IF
                   the length of the new word is greater or equal to   17       
                     RETURN
                   
                   
               

The running time of MaxSub is .

29.9.8. 29.9.8 Construction and complexity of extremal words

Due to the finiteness of the corresponding sets -extremal words of languages exist in all cases. Similar assertion does not hold for maximal and minimal words of families of languages.

Theorem 29.32 and 29.34 contain general bounds of -complexity, resp. 1-complexity. Theorem 29.35 characterizes the maximal -complexity for medium distance and implies the surprising corollary that binary alphabet is sufficient to get the same order of maximal subword complexity what can be achieved using an infinite alphabet.

Theorem 29.37 and 29.38 give the maximal complexity for different values of the parameters. Theorem 29.40 contains the condition of the existence of -superminimal words. Theorem 29.42 gives the complexity of cyclical words and together with Theorem 29.37 has the corollary that and have the same order of growth. Theorem 29.42 gives the that the -length words can be covered using only letters needed to cover the words and so illustrates the decreasing of the length of minimal -covering words due to the permitted jumps in the construction.

Theorem 29.32 (Iványi [ 124 ]) If and are positive integers, is an alphabet and , then

The lower bound is tight if and only if is homogeneous. The upper bound is sharp if and only if is a rainbow word and .

Proof. Since holds for , the lower bound is correct and for the equality holds. The upper bound can be proved using equation (29.21) and inequality or directly exploiting that an -element set has nonempty subsets. If and , then belongs to and contains different elements of as a -subword so the upper bound is also tight. From the other side if , then contains repeated letters and so , and if , then the first and last letters of do not form -subsequence and so .

Corollary 29.33 If and are positive integers, then

and if , then

Theorem 29.34 If and are positive integers, is an alphabet and , then

The lower bound is sharp if is homogeneous. The upper bound is sharp if and only if is a rainbow word.

Proof. We have places for the 2 delimiters determining a 1-subword of . If , then all 1-subwords of the word are different. From the other side if , then has to contain repeated letters not representing different 1-subwords.

Theorem 29.35 If and are positive integers and , then

where is the unique integer satisfying and so .

Proof. Summing the inequalities (29.25) for results

In the sum for small values of the exponential part, for large values of the second part gives the minimum. Therefore concrete calculations result the formula of the theorem as a bound. Algorithm MaxSb produces a suitable word for any possible pair of achieving this bound.

In the binary case Shallit [ 251 ] has proved the existence of words with such complexity. Using alphabet containing letters and algorithm Super we constructed [ 124 ] infinite words with .

Corollary 29.36 If , then

Theorem 29.37 If and are positive integers, and , then

If , then computation of the constants in Theorem 29.35 requires only technical steps, otherwise there are algebraic difficulties. Z. Kása [ 141 ] has found another recurrence relation and gave explicit formulas for . Using the jump function we reprove his result and extend it for some smaller values of .

Theorem 29.38 (Kása [ 141 ]) If and are positive integers and , then

where and so

Proof. The formulas for and appeared in [ 124 ], the case is a special case of Theorem 29.35.

If , then according to Lemma 29.25 the sum of the jumping values of a rainbow word for results our formula.

The last 3 formulas also can be derived using Lemma 29.25.

Figure 29.6 shows the values of the jump function for rainbow words. The first and second columns are computed using the formulas and , the last columns using the formulas for and the medium values (printed by bold digits) using the recurrence relation. These data support the conjecture that the jump function is trapezoidal.

Figure 29.6.  Values of jumping function of rainbow words of length Values of jumping function of rainbow words of length 1,2,\ldots,10 ..

Values of jumping function of rainbow words of length 1,2,\ldots,10 .

Theorem 29.39 If is a positive integer, is an alphabet, for and , then has a 1-superminimal word if and only if .

The next theorem shows that the maximal -complexity is exponential even over a binary alphabet and .

Theorem 29.40 Let and be positive integers and be an infinite cyclical word over the alphabet and . Then

If , then

and so

Proof. In this case for the -characteristic function we get a Fibonacci-type recursion with initial values and . The sum of the corresponding characteristic values gives the total complexity.

Corollary 29.41 If and are positive integers, and , then

Theorem 29.42 If and are positive integers, and , then

and is a minimal -covering word of .

Corollary 29.43 If and are positive integers, and , then and is a minimal -covering word of .

Exercises

29.9-1 Construct a superperfect word with maximal window size 4 for the alphabet .