this post was submitted on 01 Nov 2023

Data Hoarder


We are digital librarians. Among us are represented the various reasons to keep data -- legal requirements, competitive requirements, uncertainty of permanence of cloud services, distaste for transmitting your data externally (e.g. government or corporate espionage), cultural and familial archivists, internet collapse preppers, and people who do it themselves so they're sure it's done right. Everyone has their reasons for curating the data they have decided to keep (either forever or For A Damn Long Time (tm) ). Along the way we have sought out like-minded individuals to exchange strategies, war stories, and cautionary tales of failures.


I used to follow a blog, www.aordisco.com. It seems inactive in its original form: a few years ago it vanished due to a copyright strike or legal action. I just checked to see if I could recover any content and found, to my surprise, that it has reappeared. It features a wealth of info on 70s-80s music, and as I followed the weekly posts I built an iTunes playlist, which I later lost.

I’m currently rebuilding the playlist from the info on each page, which is rather tedious. I archived the whole site as HTML with SiteSucker.

TL;DR: is there a way to extract the text from an HTML archive into a single text file or similar?

top 2 comments
[–] dr100@alien.top 1 points 2 years ago

The HTML is already text. If you want to strip the markup and get roughly what a browser would display, just use a text browser such as lynx, with -dump if you wish:

       -dump  dumps the formatted output of the default document, or of those
              specified on the command line, to standard output. Unlike
              interactive mode, all documents are processed. This can be used
              in the following way:

                  lynx -dump http://www.subir.com/lynx.html

              Files specified on the command line are formatted as HTML if
              their names end with one of the standard web suffixes such as
              “.htm” or “.html”. Use the -force_html option to format files
              whose names do not follow this convention.
[–] recom273@alien.top 1 points 2 years ago

Yes, I realise it’s text, but I have 15 years of archives. SiteSucker has put them into year and month folders. Is there a way to pull the contents of those folders, via SiteSucker or some other process, and extract the text?
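One way to batch this without lynx is a short Python 3 script using only the standard library: walk every year/month subfolder, strip the tags from each .html file, and append the text to a single output file. This is a sketch, not SiteSucker functionality; the `archive/` and `playlist_source.txt` paths are hypothetical placeholders for your own archive folder and output file.

```python
#!/usr/bin/env python3
# Sketch: concatenate the visible text of every HTML file in a
# SiteSucker-style archive (year/month subfolders) into one text file.
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect text content, skipping <script> and <style> blocks."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    """Return the visible text of one HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)


def dump_archive(root: Path, out_path: Path) -> None:
    """Walk root recursively and write the text of every .htm/.html file."""
    with out_path.open("w", encoding="utf-8") as out:
        # rglob descends into the year/month subfolders automatically
        for page in sorted(root.rglob("*.htm*")):
            out.write(f"===== {page.relative_to(root)} =====\n")
            out.write(extract_text(page.read_text(encoding="utf-8",
                                                  errors="replace")))
            out.write("\n\n")


if __name__ == "__main__":
    # Hypothetical paths: point these at your archive and output file.
    dump_archive(Path("archive"), Path("playlist_source.txt"))
```

Each page's relative path is written as a header before its text, so you can still tell which month a song entry came from while rebuilding the playlist.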