Repeated DNA Sequences

Find every 10-letter DNA substring that appears more than once via fixed-size sliding window + hash set.

Medium
Time O(n) Space O(n)
LeetCode
3 min read
sliding-window hash-set string

Problem#

Return all 10-letter-long substrings of a DNA string s that appear more than once.

Examples#

  • "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"["AAAAACCCCC", "CCCCCAAAAA"]

Constraints#

  • 1 <= s.length <= 10^5, characters A/C/G/T.

Approach#

Slide a fixed-size window of length 10. Use a seen set; when a 10-char substring is seen for the second time, record it in an output set (to dedupe further repeats).

Solution#

Repeated DNA Sequences
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
if (s.size() < 10) return {};
unordered_set<string> seen, out;
for (int i = 0; i + 10 <= (int)s.size(); ++i) {
string sub = s.substr(i, 10);
if (!seen.insert(sub).second) out.insert(sub);
}
return {out.begin(), out.end()};
}
};

Editorial#

The classic unordered_set::insert returns {iter, true} on insertion and {iter, false} on a hit — perfect for “first occurrence vs not”. The double-set pattern dedups outputs (a substring repeated 5 times only appears once in the result). For optimal performance, encode each 10-char substring as a 20-bit integer (2 bits per base) and use a single unordered_set<int>.

Complexity#

  • Time: O(n).
  • Space: O(n).

Concept revision#

Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.