osmarks.net research teams designed code to download and embed every image ever posted to Reddit and not since deleted (excluding NSFW content, ads, etc.), using streaming processing to avoid persisting intractable amounts of data to disk. Unfortunately, the embeddings themselves must still be stored, so an estimated 0.8 TB of storage is required, along with a month of compute time. Due to the unanticipated complexity of high-performance, high-recall vector indexing on osmarks.net compute budgets, the project required more development timeslices than predicted, but it has been completed.
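
The streaming idea can be illustrated with a minimal Python sketch: each image is embedded as soon as it arrives and then dropped, so only the compact embedding is kept. This is not the project's actual code. `fetch_images`, `fake_embed`, `stream_embeddings`, and the 8-dimensional float16 embedding are hypothetical stand-ins chosen for illustration; a real pipeline would download over HTTP and run an actual image embedding model.

```python
import hashlib
import struct
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical size; a real embedding model would produce far more dimensions.
EMBED_DIM = 8


def fake_embed(image_bytes: bytes) -> List[float]:
    """Stand-in for a real image embedding model (hypothetical)."""
    digest = hashlib.sha256(image_bytes).digest()
    return [b / 255.0 for b in digest[:EMBED_DIM]]


def stream_embeddings(
    images: Iterable[Tuple[str, bytes]],
    embed: Callable[[bytes], List[float]] = fake_embed,
) -> Iterator[Tuple[str, bytes]]:
    """Yield (image_id, packed_embedding) one at a time.

    The image bytes go out of scope after each iteration, so they are
    never accumulated in memory or written to disk.
    """
    for image_id, image_bytes in images:
        vector = embed(image_bytes)
        # Pack as little-endian float16 ('e'): 2 bytes per dimension.
        yield image_id, struct.pack(f"<{len(vector)}e", *vector)


def fetch_images() -> Iterator[Tuple[str, bytes]]:
    """Hypothetical image source; a real one would download each image."""
    for i in range(5):
        yield f"img{i}", f"pixels-{i}".encode() * 1000


# Only the embeddings are retained; here a dict stands in for on-disk storage.
stored = dict(stream_embeddings(fetch_images()))
bytes_per_embedding = EMBED_DIM * 2
```

Because both the source and the embedding step are generators, peak memory is bounded by a single image rather than the whole dataset, and the persistent footprint grows only with the number of embeddings times their fixed size.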