this post was submitted on 07 Oct 2024

It's A Digital Disease!


This is a sub that aims to bring data hoarders together to share their passion with like-minded people.

The original post: /r/datahoarder by /u/CorvusRidiculissimus on 2024-10-06 19:06:36.

I'm working on a new deduplication program. It's still experimental, so I don't want to release it publicly yet, and it's currently Linux-only. You could probably compile it for Windows, though - the only part likely to give you trouble is the memory-mapped I/O. Anyone want to try it and evaluate?

It's a format-agnostic dedup. Short version: it looks for files that share long stretches of data. It'll pick up things like multiple edits of one document, multiple archives that all contain the same file (provided they use identical compression settings), or different versions of a file that differ in metadata but share the same underlying data. It scans a large collection of files, does some math, and returns a list of the ones that look similar.
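For anyone wondering what "looks for files which share long stretches of data" means in practice, here's a rough sketch of one common way to do it - hash chunks of each file and compare the overlap between files' chunk-hash sets. This is not the author's code; the chunk size, the similarity threshold, and the use of fixed-size (rather than content-defined) chunks are all my assumptions:

```python
# Sketch of chunk-hash similarity detection (hypothetical, not the linked tool).
# Fixed-size chunking only matches identical, aligned chunks; a real tool would
# use content-defined chunk boundaries so insertions don't shift everything.
import hashlib
from collections import defaultdict
from itertools import combinations

CHUNK = 4096  # bytes per chunk - an assumed value, not the real tool's setting

def chunk_hashes(data: bytes) -> set:
    """Hash each fixed-size chunk of a file's contents."""
    return {hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)}

def similar_pairs(files: dict, threshold: float = 0.5):
    """Return (name_a, name_b, jaccard) for file pairs sharing many chunks."""
    hashes = {name: chunk_hashes(data) for name, data in files.items()}
    pairs = []
    for a, b in combinations(hashes, 2):
        inter = hashes[a] & hashes[b]
        union = hashes[a] | hashes[b]
        if union and len(inter) / len(union) >= threshold:
            pairs.append((a, b, len(inter) / len(union)))
    return pairs
```

The Jaccard ratio (shared chunks over total distinct chunks) is one simple way to rank "these look similar" without caring about file formats at all.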

Anyway, code.

https://pastebin.com/GNYGhZCf

This is barely tested. It's probably full of memory leaks and such - it's a proof of concept. And yes, Rabin fingerprinting would speed it up a little, but I'm still struggling to learn the math behind it.
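For context on the Rabin remark: the appeal of a rolling hash is that the hash of a sliding window can be updated in constant time per byte instead of rehashing the whole window, which is what makes content-defined chunking cheap. Here's a toy Rabin-Karp-style polynomial rolling hash (true Rabin fingerprinting works over GF(2) polynomials; this simpler mod-prime variant just illustrates the O(1) slide, and isn't from the linked code):

```python
# Toy rolling hash: each yielded value is the hash of one window-sized
# substring, computed incrementally as the window slides by one byte.
BASE = 257
MOD = (1 << 61) - 1

def roll(data: bytes, window: int):
    """Yield the polynomial hash of every length-`window` substring of data."""
    h = 0
    top = pow(BASE, window, MOD)  # coefficient of the byte falling out of the window
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD   # shift window right, absorb new byte
        if i >= window:
            h = (h - data[i - window] * top) % MOD  # drop the oldest byte
        if i >= window - 1:
            yield h
```

A dedup tool would typically declare a chunk boundary wherever the rolling hash matches some bit pattern, so chunk edges depend on content rather than offsets.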

no comments (yet)