this post was submitted on 20 Jun 2025
 
The original post: /r/datahoarder by /u/QLaHPD on 2025-06-20 07:55:02.

Guys, I'm building a dataset of YouTube comments. I'm trying to make it as diverse as possible by pulling from as many types of channels as I can and, as you can imagine, lots and lots of the comments are duplicates or spam.

I know this topic isn't specific to r/DataHoarder, but I guess it's worth posting here too: should I keep all comments, or deduplicate and keep only the first copy of each?

I thought of these pros and cons:

Pros of keeping everything:

  • Spam information, which comes not from the content of any single comment but from meta-analysis over a batch of them.

Cons of keeping everything:

  • Redundant information and more storage usage ~~even if we have about 10% of the world's storage~~.

  • More processing is required later if you want to remove the duplicates before using the data.
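
One possible middle ground between the two options, sketched below in Python: keep only the first copy of each comment, but store how many times it appeared and on which channels, so the spam signal from the batch isn't lost. This is only a rough sketch under assumptions: the `text` and `channel` field names, the `DedupEntry` class, and the normalization rule (Unicode NFKC, case folding, collapsed whitespace) are hypothetical choices for illustration, not part of any existing dataset format. It only catches exact duplicates after normalization, not near-duplicates.

```python
import hashlib
import unicodedata
from dataclasses import dataclass, field


@dataclass
class DedupEntry:
    """First copy of a comment plus metadata about its duplicates."""
    first_comment: dict                      # the first occurrence, stored unchanged
    count: int = 0                           # how many times this text was seen
    channels: set = field(default_factory=set)  # channels where it appeared


def normalize(text: str) -> str:
    # Assumed normalization: NFKC, case folding, collapsed whitespace.
    return " ".join(unicodedata.normalize("NFKC", text).casefold().split())


def dedup_comments(comments):
    """Map a hash of the normalized text to a DedupEntry.

    `comments` is an iterable of dicts with hypothetical keys
    "text" and "channel".
    """
    index = {}
    for comment in comments:
        key = hashlib.sha256(normalize(comment["text"]).encode("utf-8")).hexdigest()
        entry = index.get(key)
        if entry is None:
            entry = index[key] = DedupEntry(first_comment=comment)
        entry.count += 1
        entry.channels.add(comment["channel"])
    return index


if __name__ == "__main__":
    sample = [
        {"text": "Check out my channel!", "channel": "A"},
        {"text": "check out  my channel!", "channel": "B"},
        {"text": "Great video", "channel": "A"},
    ]
    for entry in dedup_comments(sample).values():
        print(entry.count, sorted(entry.channels), entry.first_comment["text"])
```

With something like this, the stored dataset holds one row per unique comment, and the `count` and `channels` fields still let you spot text that is pasted across many channels.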

So what do you guys think?

Also, I will share it once it's finished, so if you have a list of YT channels you would like to see in it, leave it here too.
