Improving Search with OpenAI's Embedding API

Building the Paul Graham Essay Corpus

Before we can use OpenAI's embedding models and API to generate embeddings for Paul Graham's essays, we need to prepare the text! Let's start by building our corpus of essays, which we'll do by scraping the content from his website at paulgraham.com. If you'd like to familiarize yourself with his writing before diving in, here's my favorite of his essays: Schlep Blindness.

Paul Graham RSS Feed

The first step is to procure a list of URLs for all the essays. Thankfully, the late Aaron Swartz created a fantastic RSS feed with links to all of Paul Graham's essays. An RSS feed is a simple XML file that provides a standardized interface for programmatically monitoring changes to a website, typically used by blogs and news sites to syndicate their content. For example, you can access the New York Times homepage RSS feed, which provides an updated list of the latest articles on their homepage.

We can start by using the feedparser Python package to parse the RSS feed. Next, we'll extract the title and URL of each essay in the feed as a list of Python dictionaries.

Python
import feedparser

# parse the RSS feed using feedparser
feed = feedparser.parse("http://www.aaronsw.com/2002/feeds/pgessays.rss")

# list comprehension to create a dict with title and url for each essay
# we'll also ignore any essays that are hosted on turbifycdn
essays = [
	{"title": entry["title"], "url": entry["link"]}
	for entry in feed["entries"]
	if 'turbifycdn' not in entry["link"]
]

print(essays)
Output
[
	{'title': 'Superlinear Returns', 'url': 'http://www.paulgraham.com/superlinear.html'},
	{'title': 'How to Do Great Work', 'url': 'http://www.paulgraham.com/greatwork.html'},
	{'title': 'How to Get New Ideas', 'url': 'http://www.paulgraham.com/getideas.html'}
	...
]
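
As a quick sanity check before we start scraping, we can count how many essays the feed gave us and peek at one entry to confirm the title and url fields look right. This step isn't strictly necessary, but it's a cheap way to catch problems early:

Python
# number of essays pulled from the RSS feed
print(len(essays))

# peek at a single entry to confirm it has the fields we expect
print(essays[0])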

Essay Scraping

Next, we need to scrape the content for each essay. Most people would approach this task using a Python package like BeautifulSoup or Scrapy, but there is a much simpler alternative for news-like articles and blogs: Newspaper3k. Newspaper3k is a Python package that can automatically extract the main text content from a news article or blog post without needing to mess around with HTML or the DOM.

Using Newspaper3k's Article class, we can write a simple scrape_essay function to extract a list of paragraph strings from an essay, given its URL on Paul Graham's website. In this function, we first download and parse the essay using built-in Newspaper3k functionality (article.download() and article.parse()). Then, we split the full text of the essay (given by article.text) into paragraphs and retain every paragraph with at least 10 words. We also exclude footnotes that start with the [ character.

Python
from newspaper import Article

def scrape_essay(essay_url: str) -> list[str]:

	# download and parse essay using Newspaper3k
	article = Article(essay_url)
	article.download()
	article.parse()
	essay = article.text

	# split essay into paragraphs and filter out short paragraphs and footnotes
	paragraphs = [
		paragraph.strip()
		for paragraph in essay.splitlines()
		if len(paragraph.strip().split()) >= 10
		and not paragraph.strip().startswith('[')
	]

	return paragraphs
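
Before scraping all 200+ essays, it's worth spot-checking scrape_essay on a single URL, for example the Superlinear Returns essay from the feed output above, and eyeballing the first paragraph it returns:

Python
# quick spot-check of the scraper on a single essay
paragraphs = scrape_essay("http://www.paulgraham.com/superlinear.html")

print(len(paragraphs))
print(paragraphs[0])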

The final step is to build our corpus by calling the scrape_essay function on every essay that we retrieved from the RSS feed. We can do this with two nested for-loops where we'll save each paragraph as a document, along with the title and URL of the essay that it came from and its position in the essay.

Python
corpus = []

for essay in essays:
	print(f"Loading essay: {essay['title']}")
	paragraphs = scrape_essay(essay["url"])

	for i, paragraph in enumerate(paragraphs):
		corpus.append({
			"essay_title": essay["title"],
			"essay_url": essay["url"],
			"paragraph_index": i,
			"text": paragraph
		})
Output
Loading essay: Superlinear Returns
Loading essay: How to Do Great Work
Loading essay: How to Get New Ideas
Loading essay: The Need to Read
Loading essay: What You (Want to)* Want
Loading essay: Alien Truth
...

Our final corpus covers all 217 essays that Paul Graham has written since 2001, for a total of 7,533 paragraphs (each paragraph is one document in our corpus), 486,185 words, and 2,739,261 characters.
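
These totals will drift as new essays are published, so if you'd like to verify them against your own scrape, a rough tally over the corpus could look something like this:

Python
# tally essays, paragraphs, words, and characters in the corpus
num_essays = len({doc["essay_url"] for doc in corpus})
num_paragraphs = len(corpus)
num_words = sum(len(doc["text"].split()) for doc in corpus)
num_chars = sum(len(doc["text"]) for doc in corpus)

print(num_essays, num_paragraphs, num_words, num_chars)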