Existing popular search engines (Google, Bing) are sort of bad. Almost nobody is trying to fix this. [[https://exa.ai/|Exa]] is, but appears to have pivoted toward [[LLM agent]] systems, where they do not have an obvious comparative advantage, and makes odd technical decisions such as using IVF/PQ (see [[Vector Indexing]]), apparently hosting things on AWS, and running off disk rather than out of RAM (limiting query throughput: see https://x.com/exaailabs/status/1859690195463045630).

= What should a search engine do?

The job of a search engine is to retrieve useful information for users. This is mostly inferred from/determined by the query, but most modern engines also make some use of user history and data.

How can a search engine work better?

*. Better user intent modelling.
*. Better information sources to retrieve from.
*. Better search, search indexing and understanding of information sources.

= Information sources

* Anna's Archive is ~0.5PB. This contains a substantial fraction of books and papers, which are plausibly higher-quality than the general internet.
* We do need general internet data for breadth of knowledge etc. This runs to PBs (Common Crawl etc.) - apparently billions of pages per month.
* {There is lots of alpha in weird corners of Twitter and also Discord. It would be useful to scrape these, though people would complain.
* Also IRC, but that's logged worse. }
* {Images, PDFs, etc. contain useful knowledge which hasn't been integrated properly into most things. We *need* these.
* Common Crawl doesn't even get PDFs, because they're complicated to process!
* Obscure papers, product user manuals, shiny reports from organizations. }

= Indexing

* Google/Bing/etc. are plausibly still primarily keyword-based, though they have used neural reranking since at least 2019. This is not ideal for most (?) queries, which care about something being "the same sort of thing".
* Exa uses (mostly?) "Neural PageRank", i.e. contrastive link text/link target modelling. Rationale: the link text (or the text around the link, or the whole link-source document? probably mostly the former) roughly describes the kind of thing the link points to. See the sketch after this list.
* {Could also do contrastive link co-occurrence modelling. Rationale: things referenced in the same document are likely semantically related.
* This generalizes nicely to images too (Neural PageRank is like CLIP with captions). Could probably natively train in the same embedding space.
* We benefit from contrastive advances like SigLIP, [[https://arxiv.org/abs/2005.10242]]. }
* "Links" aren't actually trivial. Would need to do substantial work to e.g. find reference targets in poorly digitized papers.
* We need OCR to understand PDFs and images properly (even with a native multimodal encoder, OCR is probably necessary for training). For some reason there are no good open-source solutions. This could maybe be fixed with a synthetic-data approach (generate corrupted documents, train on those).
* "Documents" can be quite long, and we want to be able to find things in e.g. a book with ~paragraph granularity whilst still understanding the context of the book. Consider ColBERT-style late interaction ([[https://arxiv.org/abs/2004.12832]]) or hierarchical systems? It would be somewhat cursed, but we could index an entire book as one vector and then postprocess-select paragraphs.
* So much tacit knowledge is in videos. Oh no. Maybe we can get away with an autotranscriber and frame extraction.
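To make the contrastive link text/link target ("Neural PageRank") idea concrete, here is a minimal sketch of an in-batch InfoNCE/CLIP-style objective over (link text, linked document) pairs. This is an illustration, not Exa's actual (unpublished) training setup; the split into two encoders, the batch construction and the `temperature` value are assumptions.

```python
# Minimal sketch of contrastive link-text -> link-target training
# (CLIP/InfoNCE-style, in-batch negatives). Illustrative only.
import torch
import torch.nn.functional as F

def contrastive_link_loss(anchor_emb: torch.Tensor,
                          target_emb: torch.Tensor,
                          temperature: float = 0.05) -> torch.Tensor:
    # anchor_emb: [B, D] embeddings of link texts (or text around the link).
    # target_emb: [B, D] embeddings of the documents those links point to.
    # Row i of both tensors comes from the same (link text, target) pair.
    a = F.normalize(anchor_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = (a @ t.T) / temperature                  # [B, B] cosine similarities
    labels = torch.arange(a.shape[0], device=a.device)
    # Symmetric in-batch loss: each link text should score its own target
    # highest (and vice versa); the other rows in the batch act as negatives.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

# Usage (encoders are whatever text models get trained; names are hypothetical):
#   anchor_emb = anchor_encoder(link_texts)
#   target_emb = target_encoder(target_docs)
#   loss = contrastive_link_loss(anchor_emb, target_emb); loss.backward()
```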
= Filtering

* Most content on the internet is bad, but it's hard to know which. Storing the bad things increases costs unnecessarily.
* Manually label on a few dimensions, then bootstrap a classifier with active learning ([[MemeThresher]])?
* Bootstrap from known-probably-good sources using the link graph or semantic similarity.

= User intent modelling

* Few-word queries are a very narrow pipe.
* Heavy per-user customization? Give multiple result lists and "show more like this". Still low-bitrate. Could optimize across multiple requests.
* "Show me things related to this document" (including e.g. the user's own draft code, papers, essays).
* Refine the query with extra detail.

= Cost

A good high-performance vector index uses ~5TB of RAM per billion documents (this can be cut down decently if the embedding vectors are shorter). DRAM is a bit under £3/GB now, so ~£15000 per billion documents for the index alone; see the worked figures below. DiskANN etc. instead run off disk (~£0.1/GB), trading much lower throughput for very large cost savings; Optane (<£1/GB, highly variable) is another option. Server hardware is also needed, but the main cost is RAM.

Model training has fixed costs of roughly £10k for a big BERT/CLIP/etc. finetune - worse if training from scratch or using really long context. We probably do need at least one copy (in text and in the original format, for later redesigns) of all documents used, though this can go on cold storage at ~£0.02/GB.
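As a sanity check on the index figures above, a quick back-of-the-envelope script using this note's own assumptions (~5TB of index per billion documents; the per-GB prices are rough estimates, not quotes):

```python
# Rough vector-index hardware cost per billion documents, using the
# per-GB prices assumed above (estimates, not quotes).
INDEX_GB_PER_BILLION_DOCS = 5_000   # ~5TB of index per 1e9 documents

price_per_gb = {                    # approximate £/GB
    "DRAM (in-RAM index)":   3.00,
    "Optane (upper bound)":  1.00,
    "disk (DiskANN-style)":  0.10,
}

for medium, price in price_per_gb.items():
    cost = INDEX_GB_PER_BILLION_DOCS * price
    print(f"{medium:>22}: ~£{cost:,.0f} per billion documents")

# DRAM comes to ~£15,000 per billion documents, matching the estimate above;
# an on-disk index drops that to ~£500 at the cost of much lower query throughput.
```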