The original post: /r/datahoarder by /u/Wizard_of_Od on 2024-08-05 03:43:21.
Dezoomify-rs is a great program, but it always seems to reencode. I was wondering if it was possible to grab all of the tiles at the maximum zoom level and losslessly join them together (like you can losslessly crop or rotate jpegs).
In the past I grabbed a few of the little tiles from the browser cache and IrfanView told me they were encoded at JPEG quality 84 with chroma subsampling (the default in Dezoomify is JPEG quality 80 with no chroma subsampling; it doesn't have an option for subsampling). From what I have read, if you have to reencode, it is best to reencode with the settings the file was originally created with. However, today I pulled out fragments from the browser cache (using MZCacheView) and noticed a problem. The image tiles had a .jfif extension (apparently a subset of JPEG) but nothing could parse them. IrfanView at least gives me an error message, "bogus Huffman table definition", which doesn't mean much to me (I'm not a coder). There is a JFIF marker near the start of the file, and the tiles from one of the images had some metadata, e.g. <rdf:li>Creekside Digital</rdf:li>, after which I assume the image data begins.
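For anyone wanting to triage cache fragments like these before opening them in a viewer, here is a minimal Python sketch. It only checks the generic JPEG start and end markers and looks for the JFIF tag near the start; it does not diagnose the Huffman table error itself, and the function names are made up for illustration.

```python
from pathlib import Path


def jpeg_sanity(data: bytes) -> dict:
    """Report a few cheap structural facts about a would-be JPEG file."""
    return {
        # Every JPEG stream begins with the SOI marker FF D8.
        "starts_with_soi": data[:2] == b"\xff\xd8",
        # JFIF files carry the ASCII tag "JFIF\0" in the first APP0 segment.
        "has_jfif_tag": b"JFIF\x00" in data[:64],
        # A complete stream ends with the EOI marker FF D9.
        "ends_with_eoi": data[-2:] == b"\xff\xd9",
    }


def check_file(path: str) -> dict:
    """Convenience wrapper: run the same checks on a file on disk."""
    return jpeg_sanity(Path(path).read_bytes())
```

If a fragment passes the first two checks but fails the last one, it may simply be incomplete, though that is only one possible explanation.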
I managed to get Dezoomify-rs to download the tiles by putting something like DirectoryName.iiif after the URL, but the tiles all seem to have been reencoded.
Also, there doesn't seem to be a way to force Dezoomify-rs to download as lossless (Png) without specifying a Filename followed by the .png extension. I want to maintain the automatic file naming functionality so I can download a batch in one sitting without having to specify filenames one by one.
I tried 2 Python scripts but they didn't work for me.
If anyone is able to download without reencoding from GAC, could you tell me what tool you are using and the exact syntax?
Update1: I just remembered that JPEG has a dimensional limit (65,535 x 65,535 pixels; by comparison, the newer WebP is limited to 16,383 x 16,383). GAC images at zoom levels 7 and 8 (relatively few are that large) could not be reconstituted as a single JPEG file.
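A rough way to check whether a given tile grid would fit in one file is sketched below in Python. The two limits are the format maximums; the tile side length and the grid counts in the example are hypothetical, since the post does not state them, and edge tiles are often smaller, so the result is only an upper bound.

```python
JPEG_MAX_SIDE = 65_535  # JPEG stores width and height in 16-bit fields
WEBP_MAX_SIDE = 16_383  # WebP stores width and height in 14-bit fields


def grid_upper_bound(cols: int, rows: int, tile_side: int) -> tuple:
    """Upper bound on the stitched image size, assuming square tiles."""
    return cols * tile_side, rows * tile_side


def fits_in(width: int, height: int, max_side: int) -> bool:
    """True if both dimensions are within the format's per-side limit."""
    return width <= max_side and height <= max_side


if __name__ == "__main__":
    # Hypothetical grid: 140 x 100 tiles of 512 px each.
    w, h = grid_upper_bound(cols=140, rows=100, tile_side=512)
    print(w, h, "JPEG ok:", fits_in(w, h, JPEG_MAX_SIDE))
    print(w, h, "WebP ok:", fits_in(w, h, WEBP_MAX_SIDE))
```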
I was able to get Dezoomify-rs to download the raw tiles to a specified directory by suffixing -c DirectoryName. I'm not sure what to do with them though. They have names like https_lh3.googleusercontent.com_ci_AL18g_SP6cLRt0FWKGWHxH_TRSc-uHNzi6LmyDPGx3NjZWx6cuXfwkmSGDlq1ANqscwbsyR93EZUdw=x1-y13-z5-tG2Gy1wuJECyG1JJMtpEX9j4DxJk . If I change the humongous extension to jfif I can open them in an image viewer (IrfanView told me the tile in question was 'JPEG, Adobe RGB (1998), quality: 89, subsampling O'). I'm not sure all of the tiles for every image on GAC have exactly these same parameters, but Dezoomify definitely seems to be throwing away the Adobe RGB colorspace information when it resaves.
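The x, y and z values embedded in those tile names can be pulled out with a short Python sketch like the one below. It assumes every tile name contains the same "=x…-y…-z…-t" pattern as the single example above, which is not confirmed for all images; the function names and the output naming scheme are made up for illustration. It copies rather than renames, so the original cache directory is left untouched.

```python
import re
import shutil
from pathlib import Path

# Assumed pattern, based on the one example name: "...=x1-y13-z5-t..."
TILE_RE = re.compile(r"=x(\d+)-y(\d+)-z(\d+)-t")


def tile_coords(name: str):
    """Return (x, y, z) parsed from a tile name, or None if it does not match."""
    m = TILE_RE.search(name)
    return tuple(int(g) for g in m.groups()) if m else None


def friendly_name(name: str):
    """Build a short, sortable file name such as 'z5_y0013_x0001.jpg'."""
    coords = tile_coords(name)
    if coords is None:
        return None
    x, y, z = coords
    return f"z{z}_y{y:04d}_x{x:04d}.jpg"


def copy_tiles(src_dir: str, dst_dir: str) -> int:
    """Copy every recognisable tile into dst_dir under its friendly name."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = 0
    for path in Path(src_dir).iterdir():
        new_name = friendly_name(path.name) if path.is_file() else None
        if new_name:
            shutil.copyfile(path, dst / new_name)
            copied += 1
    return copied
```

Sorting the copied files by name then gives the tiles in row order within a zoom level, which is the order a stitching tool would typically want.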
Even if I can find a tool to losslessly reassemble the full images, I will still have to manually rename each image. There is no descriptive metadata in the tiles that I can see (e.g. Artist / Title / Date); with most audio files I can rename them from the metadata if need be using something like Mp3Tag.