Minimizer schemes have found widespread use in genomic applications as a way to quickly predict the matching probability of large sequences. Most methods for minimizer schemes use randomized (or close to randomized) ordering of $k$-mers when finding minimizers, but recent work has shown that not all non-lexicographic orderings perform the same. One way to find $k$-mer orderings for minimizer schemes is through the use of universal $k$-mer sets, which are subsets of $k$-mers that are guaranteed to cover all windows. The smaller this set the fewer false positives (where two poorly aligned sequences being identified as possible matches) are identified. Current methods for creating universal $k$-mer sets are limited in the length of the $k$-mer that can be considered, and cannot compute sets in the range of lengths currently used in practice. We take some of the first steps in creating universal $k$-mer sets that can be used to construct minimizer orders for large values of $k$ that are practical. We do this using iterative extension of the $k$-mers in a set, and guided contraction of the set itself. We also show that this process will be guaranteed to never increase the number of distinct minimizers chosen in a sequence, and thus can only decrease the number of false positives over using the current sets on small $k$-mers.
This project was started by Fiyinfoluwa Gbosibo during a summer internship at CMU. She won best student paper award from ACM Sigbio for this work. Congratulations.