@@ -13,3 +13,5 @@ The job of a search engine is to retrieve useful information for users. This is
* Anna's Archive is ~0.5PB. This contains a substantial fraction of books and papers. These are plausibly higher-quality than the general internet.
-* We do need general internet data for breadth of knowledge etc. This runs to PB (Common Crawl etc). Apparently billions of pages per month.
+* {We do need general internet data for breadth of knowledge etc. This runs to PB (Common Crawl etc). Apparently billions of pages per month.
+* It is possible that new entrants can no longer scrape the web at scale. Much of the web is useless, so this is "fine", but Reddit still holds some real knowledge, as do obscure blogs.
+}
* {There is lots of alpha in weird corners of Twitter and also Discord. It would be useful to scrape these, though people would complain.