This article is part of the research project which I carried out as part of my MSc programme with the University Of Glasgow. The blog is subject to the copyright which is held by the University Of Glasgow UK. This blog is intended to give an insight into the user profile construction which is useful in personalizing search engines.
Search personalization is gaining it’s usage in the recent years. Search personalization involves constructing user profiles which represents the user’s domain of interests. Search can be personalized depending on the user’s long term interest (very usefull in personalization) and short term action history which is used in session-wise personalization. Long term interests can be extracted from user’s behaviour exhibited in social media where as short term interests can be extracted from search engine results click through analysis, cookies and queries fired through the search engine by the user.
In this blog, we will consider the user profile construction capturing the user’s long term interests through a social bookmarking site called Delicious.com. Users bookmark and collaborate wide range of web URLs and articles along with the tags which are an accurate description of the the URL or the article. Research has shown that tags and the bookmarks in Delicious.com accurately represents the user’s interests. Part of my project involved designing and implementing user profile construction using clustering algorithm for various users of Delicous.com.
Users of delcious.com exhibit varied interests. These interests can be represented as clustered documents containing the tags and its bookmarks. Normally a general user may exhibit 5 to 6 domains of interests (say news, sports, physics, health). We can create clusters of documents where each cluster represents user’s particular domain of interest. “Fast Greed” algorithm can be used to create document clusters. The algorithm can be used to construct user profiles on the fly when the user logs into the system so that the freshness of the user’s interests can be maintained.
User profile construction using the Fast Greed algorithm in 9 simple steps:
Step 1 : Get raw user profile parsed from Heterec collection (Hetrec collection is a delicious.com collection of users bookmarked document along with tags publised in 2011 in Hetrec conference in Australia).
Step 2 : Construct bipartite graph using the raw profile retrieved from Step 1 (Bipartite graph is a 2D matrix representing user’s document and its tag).
Step 3 : Use Fast Greedy Algorithm (FGA) to construct document clusters where each cluster represents the user’s domain of interest.
Step 4 : For each of the clusters generated in Step 4, get the tags associated with the documents within the clusters along with the frequency of occurrences of these tags within the clusters. These cluster IDs along with the tags servers as signature of the cluster.
Step 5 : Return the cluster signatures as user profile.
(Step 3 : FGA) : Fast greedy algorithm is the fastest know clustering algorithm:
Step 1) Get the bipartite graph constructed in the above algorithn and generate a document-document (DD) network matrix. DD network matrix is a two dimensional matrix of documents, where association between documents can be calculated. Two documents are associated if they have similar tagging features in common.
Step 2) Using the document-document matrix with an assumption that each document is a cluster, join two clusters if they have edges between them and yields maximum modularity Q (Modularity is the measure of ratio of association of documents within a cluster and association between clusters. The real equation is out of the scope of this blog). Q is calculated between each pair of related clusters and clusters are joined together if they yield maximum value of Q.
Step 3) Repeat Step 2 until no edges exists between clusters or number of clusters formed reaches pre-defined limit.
Step 4) Return the list of clusters where each cluster contains related documents indicating a user’s domain of interests.
Sunil Ganiger is a Senior Consultant with Compassites Software. He is an MSc graduate from the University Of Glasgow UK. His current technology domain is .NET. His academic interests include search engine personalization and reading general computer science research papers. His non academic interests include watching advertisements and movies.