Repeated DNA Sequences
Find every 10-letter DNA substring that appears more than once via fixed-size sliding window + hash set.
3 min read
sliding-window hash-set string
Problem#
Return all 10-letter-long substrings of a DNA string s that appear more than once.
Examples#
"AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"→["AAAAACCCCC", "CCCCCAAAAA"]
Constraints#
1 <= s.length <= 10^5, charactersA/C/G/T.
Approach#
Slide a fixed-size window of length 10. Use a seen set; when a 10-char substring is seen for the second time, record it in an output set (to dedupe further repeats).
Solution#
class Solution {public: vector<string> findRepeatedDnaSequences(string s) { if (s.size() < 10) return {}; unordered_set<string> seen, out; for (int i = 0; i + 10 <= (int)s.size(); ++i) { string sub = s.substr(i, 10); if (!seen.insert(sub).second) out.insert(sub); } return {out.begin(), out.end()}; }};func findRepeatedDnaSequences(s string) []string { if len(s) < 10 { return []string{} } seen := map[string]bool{} out := map[string]bool{} for i := 0; i+10 <= len(s); i++ { sub := s[i : i+10] if seen[sub] { out[sub] = true } else { seen[sub] = true } } res := make([]string, 0, len(out)) for k := range out { res = append(res, k) } return res}from typing import List
class Solution: def findRepeatedDnaSequences(self, s: str) -> List[str]: if len(s) < 10: return [] seen: set[str] = set() out: set[str] = set() for i in range(len(s) - 9): sub = s[i:i + 10] if sub in seen: out.add(sub) else: seen.add(sub) return list(out)function findRepeatedDnaSequences(s) { if (s.length < 10) return []; const seen = new Set(); const out = new Set(); for (let i = 0; i + 10 <= s.length; i++) { const sub = s.substring(i, i + 10); if (seen.has(sub)) out.add(sub); else seen.add(sub); } return Array.from(out);}class Solution { public List<String> findRepeatedDnaSequences(String s) { if (s.length() < 10) return new ArrayList<>(); Set<String> seen = new HashSet<>(); Set<String> out = new HashSet<>(); for (int i = 0; i + 10 <= s.length(); i++) { String sub = s.substring(i, i + 10); if (!seen.add(sub)) out.add(sub); } return new ArrayList<>(out); }}function findRepeatedDnaSequences(s: string): string[] { if (s.length < 10) return []; const seen = new Set<string>(); const out = new Set<string>(); for (let i = 0; i + 10 <= s.length; i++) { const sub = s.substring(i, i + 10); if (seen.has(sub)) out.add(sub); else seen.add(sub); } return Array.from(out);}Editorial#
The classic unordered_set::insert returns {iter, true} on insertion and {iter, false} on a hit — perfect for “first occurrence vs not”. The double-set pattern dedups outputs (a substring repeated 5 times only appears once in the result). For optimal performance, encode each 10-char substring as a 20-bit integer (2 bits per base) and use a single unordered_set<int>.
Complexity#
- Time: O(n).
- Space: O(n).
Concept revision#
Related#