The original post: /r/datahoarder by /u/QLaHPD on 2025-06-20 07:55:02.
Guys, I'm building a dataset of YouTube comments. I'm trying to make it as diverse as possible by covering as many types of channels as I can, and, as you can imagine, lots and lots of comments are duplicates or spam.
I know this topic isn't specific to r/DataHoarder, but I guess it's worth posting here too: should I keep all comments, or remove duplicates and leave only the first copy of each?
I thought of these pros and cons:
Pros of keeping everything:
- Spam information, which comes not from the content of the comments itself but from meta-analysis over a batch of them.
Cons of keeping everything:
- Redundant information, more storage usage ~~even if we have about 10% of the world's storage~~.
- Requires more processing later if you want to remove the duplicates before usage.
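
One possible middle ground, as a rough sketch only: keep just the first copy of each comment but store how many times it appeared, so part of the spam signal survives without the extra storage. This assumes each comment is a dict with a `text` field and that two comments count as duplicates when they match after lowercasing and collapsing whitespace. The names `normalize` and `dedupe_with_counts` are made up for illustration, not from any library.

```python
import hashlib


def normalize(text: str) -> str:
    # Assumption: duplicates are matched case-insensitively,
    # with runs of whitespace collapsed to a single space.
    return " ".join(text.lower().split())


def dedupe_with_counts(comments):
    """Keep the first copy of each comment and count its occurrences."""
    seen = {}  # hash of normalized text -> {"first": comment, "count": n}
    for comment in comments:
        key = hashlib.sha256(normalize(comment["text"]).encode("utf-8")).hexdigest()
        if key not in seen:
            seen[key] = {"first": comment, "count": 1}
        else:
            seen[key]["count"] += 1
    # Dicts preserve insertion order, so results follow first appearance.
    return list(seen.values())


comments = [
    {"id": 1, "text": "Nice video!"},
    {"id": 2, "text": "nice   video!"},
    {"id": 3, "text": "First"},
]
for entry in dedupe_with_counts(comments):
    print(entry["first"]["id"], entry["count"])
```

This only catches exact matches after normalization; near-duplicate spam with small variations would still need something fuzzier.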
So what do you guys think?
Also, I will share it once it's finished, so if you have a list of YT channels you would like to see in it, leave it here too.