Extracting key-words and key-phrases is a common task in Enterprise Search and Information Retrieval. Key-phrases are commonly used by search engines and other indices to categorize texts, build facets, or locate specific data in documents.
We have a problem
Despite the fact that key-phrase extraction is a common task both in industry and in academy, the precision and recall values of the state-of-the-art systems are usually under 30% and 40%. These measures have often been calculated by using a set of key-phrases extracted by humans as a baseline. Precision is the fraction of true positives over the sum of true positives and false positives, recall is the fraction of true positives over the sum of true positives and false negatives. Although human annotators do their best to extract key-phrases, it is not an easy task and sometimes there is no full agreement between two humans on which phrases are the most relevant in a given text. The picture looks to be miserable, but we can see key-phrase extractor almost everywhere. Why? First, the relatively low precision and recall values are not that bad, even they are pretty good compared to other Information Retrieval systems. Second, users find most of the key-phrases natural, or at least somehow informative.
Designing our new key-phrase extracting algorithm was a long journey. We wanted a more or less language independent solution, since in Europe, chances are high that you have to deal with multiple languages. First, we studied data-driven approaches that usually employ a language model (basically huge frequency tables of words or n-grams) and compares the word or n-gram frequencies of a given text to this model. Although these approaches are very good, you have to build a new model for every new language. Also, comparing frequencies can be computationally intensive, even if you want to extract key-phrases in real time.
Finally, we found TextRank, an unsupervised algorithm for key-phrase extraction. TextRank runs on texts and it requires no other inputs. It is a graph-based algorithm inspired by Google’s PageRank. The basic idea is that you can build a graph from the words of a text. Let every word be a node in the graph, you can draw an edge between two words, if they are neighbors (i.e. occurring in a sequence, one before the other) or there is no more than one word between them. This way you can build up a graph, so you can compute the PageRank value of each node. PageRank is a good measure, because it combines the authority (how many node links to it) and hubness (well connected nodes that make authority nodes accessible for many-many less important nodes).
Our linguistic team modified TextRank, so now it takes some linguistic information into consideration (e.g. we are weighting the edges and we are using a directed graph). Although we haven’t reached better evaluation results than the state-of-the-art systems, we felt that our solution gives very natural key-phrases.
Think like a human and evaluate your results
Graph-based models are becoming popular models of human cognition. Studies showed that at least a significant portion of language can be modeled by graph structures. E.g. Griffiths et al. describes an experiment that asked participants to respond with the first word that came into their mind that starts with a specific letter. From a technical point of view, think of this task as giving suggestions based on the initial character. At first, one can think that the most frequent word starting with that character would be great to predict the answers. It has been found that word association networks are better at this task. Griffiths and his fellows built a network from a word association database and computed the PageRank value for each of its nodes to show that. Other experiments suggests that we organize semantic information into networks (Tenenbaum), and language development can be described in terms of growing language networks in our mind (Ke and Yao). Language graphs have been applied to measure the coherence of speech of patients with thought disorders. From this point of view, TextRank can be interpreted as a way of determining the central elements of a text.
We can’t build a product on our intuition, so we designed an online study which asked participants to rate key-phrases on a scale from “Totally relevant” to “Totally irrelevant” and we found 7.6% percent of them are totally relevant, 46.4% are rather relevant, 32.4% somehow relevant, 13.2% are rather irrelevant and 0.4% are totally irrelevant.
It seems that we designed an experiment just to support our claims. This is partly true. We think the good old precision and recall values are good if you can make a baseline. But asking humans to find a few key-phrases is asking them to rank alternatives. TextRank and our modification of the algorithm do the same thing, it ranks every word according to its PageRank value. During our evaluation task we would like to know if humans can accept automatically extracted phrases, i.e. although humans can accept them as key-phrases they render them to a lower rank.