Sequence sets covering all k-mers are instrumental in many biological technologies to measure interactions at an unbiased and universal manner. While increasing k is desirable, the number of k-mers increases exponentially with it. Since the space on the experimental device is limited, an inherent tradeoff exists between k and the size of the library. Here, we introduce a novel idea of using joker characters, which represent all letter in the alphabet. We develop a new algorithm to generate compact sequence sets that cover all k-mers using these joker characters in a limited fashion, to avoid introducing too much degeneracy. The algorithms is based on two steps: (i) a greedy heuristic; (ii) ILP formulation. Results show that our algorithm can produce sequence sets that are near optimal.
If you would like to contact us about our work, please scroll down to the people section and click on one of the group leads' people pages, where you can reach out to them directly.