May 26, 2009
Topsy is a new search engine, and is now open to the public!
The first search engines saw the Web as a collection of documents; they indexed those documents and searched them for keywords. They ranked results based on how well a document matched the search terms. Google took a new approach and saw the Web as a network of documents. Modern search engines rank results based not only on how well a document matches the search terms, but also on how often it is cited by other documents.
Today, with the explosion of social platforms such as Twitter, social networks and blogging communities, more and more links to interesting information originate from people, in the conversation streams of the social web. Topsy is designed to mine the collective signal from the social web in real time. To measure the relative importance of each search result, Topsy examines the links being cited, the descriptions of these links and the influence of each person citing a link. Topsy augments traditional search engines by finding interesting and relevant information that people are talking about. If other people are discussing it, Topsy thinks you might find it interesting and relevant too!
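To make the idea concrete, here is a minimal sketch of ranking links by who is citing them. It is an illustration only, not Topsy's actual algorithm: the `rank` function, the sample citations and the influence scores are all hypothetical, and the description match is a plain keyword overlap.

```python
from collections import defaultdict

# Hypothetical citation records: (author, url, description of the link).
citations = [
    ("alice", "http://example.com/a", "great guide to search ranking"),
    ("bob", "http://example.com/a", "search ranking explained"),
    ("carol", "http://example.com/b", "search tips"),
    ("spammer", "http://example.com/c", "search search search"),
]

# Hypothetical influence scores in [0, 1]; computing them is a separate problem.
influence = {"alice": 0.9, "bob": 0.6, "carol": 0.3, "spammer": 0.01}


def rank(query, citations, influence):
    """Score each URL by summing the influence of the people whose
    description of the link shares at least one word with the query."""
    terms = set(query.lower().split())
    scores = defaultdict(float)
    for author, url, description in citations:
        words = set(description.lower().split())
        if terms & words:
            # Unknown authors contribute nothing to the score.
            scores[url] += influence.get(author, 0.0)
    # Highest total influence first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


results = rank("search ranking", citations, influence)
```

In this toy example the link cited by two influential people comes first, and the link cited only by a low-influence account comes last.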
“Search powered by the social web” is a simple phrase that hides many interesting problems. Topsy’s approach is an attempt at solving these problems with a range of new technologies – and initially, a dataset based exclusively on the conversations taking place in the wonderful Twitter community.
The social web is not a network of documents. It is conversation in a network of people. The social web generates a stream of citations of things – documents, videos, pictures, etc. – that people are talking about. Searching through this stream of citations means separating the network of people from the things they discuss. Being able to do this is key to Topsy’s approach.
Identifying people as distinct from what they talk about allows Topsy to implement features such as Trackback pages, where you can see what people are saying about a particular web page, and User pages, where you can see the links individual people have been talking about.
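One way to picture this separation is as two indexes built from the same stream of citations: one keyed by link, one keyed by person. The sketch below is a simplified illustration under that assumption; the `Citation` type, `build_indexes` function and sample data are hypothetical names, not Topsy's data model.

```python
from collections import defaultdict
from typing import NamedTuple


class Citation(NamedTuple):
    author: str  # who is talking
    url: str     # what they are talking about
    text: str    # what they said


def build_indexes(citations):
    """Split a stream of citations into a per-link view and a per-person view."""
    by_url = defaultdict(list)     # "Trackback page": what people say about a link
    by_author = defaultdict(list)  # "User page": links a person has talked about
    for c in citations:
        by_url[c.url].append(c)
        if c.url not in by_author[c.author]:
            by_author[c.author].append(c.url)
    return by_url, by_author


stream = [
    Citation("alice", "http://example.com/a", "worth reading"),
    Citation("bob", "http://example.com/a", "disagree with this"),
    Citation("alice", "http://example.com/b", "nice photos"),
]
by_url, by_author = build_indexes(stream)
```

Here `by_url["http://example.com/a"]` holds both comments about that page, while `by_author["alice"]` lists the two links she has cited.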
The conversations on the social web are not in the form of full web pages. They are much smaller: tweets, individual blog posts, comments or reviews. They are mixed up – many people may be saying many things on the same web page. And they come in a continuously updating stream, so freshness matters. Who is saying what, and when, is something that Topsy tries to capture. This allows Topsy to show search results based on conversations taking place in the past month, week, day or sometimes even hour.
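As a rough sketch of what a time window means in practice, the snippet below keeps only the citations newer than a cutoff. The `WINDOWS` table, the `within_window` function and the choice of 30 days for a month are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical window sizes; a "month" is approximated as 30 days.
WINDOWS = {
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),
}


def within_window(citations, window, now):
    """Return the citations whose timestamp falls inside the given window.

    Each citation is a (timestamp, author, url) tuple.
    """
    cutoff = now - WINDOWS[window]
    return [c for c in citations if c[0] >= cutoff]


now = datetime(2009, 5, 26, 12, 0)
stream = [
    (now - timedelta(minutes=30), "alice", "http://example.com/a"),
    (now - timedelta(hours=5), "bob", "http://example.com/a"),
    (now - timedelta(days=3), "carol", "http://example.com/b"),
    (now - timedelta(days=20), "dave", "http://example.com/c"),
    (now - timedelta(days=60), "erin", "http://example.com/d"),
]
past_day = within_window(stream, "day", now)
```

With this sample stream, `past_day` contains only the two citations made in the last 24 hours.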
Identifying authors also allows Topsy to compute the influence of individuals, and rank links in search results based on the influence of people talking about those links. Computing influence is not just a good way of finding relevant search results. It is essential to filter out spam, which is unfortunately a significant part of the social web. Our next post, on how Topsy computes influence, is the first of many occasional posts we’ll publish about our technology.