Find Duplicate File in System

Group files by content using a hash map keyed on the content blob.

Medium
Time O(N * L) Space O(N * L)
LeetCode
4 min read
hash-map strings parsing

Problem#

Given a list of strings describing directories and their files in the form "path/dir name1(content1) name2(content2) ...", return groups of file paths sharing identical content. Singleton groups are excluded.

Examples#

  • ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd) 4.txt(efgh)","root/c/d 5.txt(efgh)"][["root/a/1.txt","root/c/3.txt"],["root/a/2.txt","root/c/4.txt","root/c/d/5.txt"]].

Constraints#

  • 1 <= paths.length <= 2 * 10^4, file names and contents are alphanumeric.

Hints#

Hint
Parse each entry into (path, file, content) triples, then bucket by content.

Approach#

Parse each input string: first token is the directory, the rest are name(content). Strip the content out (substring between parens) and bucket dir + "/" + name by content in a hash map. Return only buckets of size >= 2.

Solution#

Find Duplicate File in System
class Solution {
public:
vector<vector<string>> findDuplicate(vector<string>& paths) {
unordered_map<string, vector<string>> buckets;
for (auto& entry : paths) {
int sp = entry.find(' ');
string dir = entry.substr(0, sp);
int i = sp + 1, n = entry.size();
while (i < n) {
int lp = entry.find('(', i);
int rp = entry.find(')', lp);
string name = entry.substr(i, lp - i);
string content = entry.substr(lp + 1, rp - lp - 1);
buckets[content].push_back(dir + "/" + name);
i = rp + 2; // skip ") "
}
}
vector<vector<string>> ans;
for (auto& [_, v] : buckets) if (v.size() > 1) ans.push_back(std::move(v));
return ans;
}
};

Editorial#

The hash map bucketed by content gives the answer in one pass. Parsing the awkward “path file(content) file(content) …” format is straightforward with find() and substr() — the i = rp + 2 step jumps over the closing ) and the separating space. For very large files, store a SHA hash of the content instead of the content itself.

Complexity#

  • Time: O(N * L).
  • Space: O(N * L).

Concept revision#

Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.