This isn't even really related to GenAI.
Besides the OCR, there appears to be all sorts of image-to-text metadata recorded; in the Nadella demo, the journalist supposedly did a search and got results for terms that were neither typed at the time nor visible in the stored screenshots.
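My guess is that's embedding-based semantic indexing over the OCR text and metadata rather than literal keyword matching. A toy sketch of the effect (not their actual pipeline; the model choice and the fake OCR snippets here are my own invention):

```python
# Toy illustration of semantic search over OCR'd screenshot text.
# Nothing here is Recall's real pipeline; model and data are made up.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Pretend these snippets were extracted from stored screenshots via OCR.
snippets = [
    "Flight itinerary: SEA -> SFO, departing 9:05 AM",
    "Quarterly budget spreadsheet - marketing spend",
    "Recipe: slow-cooked pork shoulder with fennel",
]

query = "travel plans"  # never typed verbatim, never visible in any screenshot
scores = util.cos_sim(model.encode(query), model.encode(snippets))[0]
best = scores.argmax().item()
print(snippets[best], float(scores[best]))  # matches the itinerary anyway
```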
Also, I thought they might be doing something image-to-text-to-image-again related (which, I read somewhere, was what Bing Copilot did when you asked it to edit an image) to save space, rather than storing eleventy billion multi-monitor screenshots forever; rough sketch of that round-trip below.
Edit: in the demo the results included the actual screenshots.
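For what it's worth, the "store a caption, regenerate pixels on demand" idea would look roughly like this with off-the-shelf models (pure illustration; no claim this is what Bing Copilot or Recall actually does, and the checkpoint names are just common public ones):

```python
# Toy version of the "compress to a caption, regenerate the image later" idea.
from transformers import pipeline
from diffusers import StableDiffusionPipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
generator = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# "Compress": keep only a short text description of the screenshot.
caption = captioner("screenshot.png")[0]["generated_text"]

# "Decompress": hallucinate an image back from the caption on demand.
# The obvious catch: you get a plausible image, not the original pixels,
# which is exactly why the demo showing real screenshots kills this theory.
regenerated = generator(caption).images[0]
regenerated.save("screenshot_regenerated.png")
```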
Honestly, the evident plethora of poor programming practices is the least notable thing about all this; using roided autocomplete to cut corners was never going to be a well-calculated decision. It's always the cherry on top of a shit-cake.