We developed a new algorithm to generate compact sequence sets covering all k-mers using joker characters.
Sequence sets covering all k-mers are instrumental in many biological technologies to measure interactions at an unbiased and universal manner. While increasing k is desirable, the number of k-mers increases exponentially with it. Since the space on the experimental device is limited, an inherent tradeoff exists between k and the size of the library. Here, we introduce a novel idea of using joker characters, which represent all letter in the alphabet. We develop a new algorithm to generate compact sequence sets that cover all k-mers using these joker characters in a limited fashion, to avoid introducing too much degeneracy. The algorithms is based on two steps: (i) a greedy heuristic; (ii) ILP formulation. Results show that our algorithm can produce sequence sets that are near optimal.