Keywords and keyphrases (multi-word units) are widely used in large document collections. They describe the content of single documents and provide a kind of semantic metadata that is useful for a wide variety of purposes. The task of assigning keyphrases to a document is called keyphrase indexing. For example, academic papers are often accompanied by a set of keyphrases freely chosen by the author. In libraries, professional indexers select keyphrases from a controlled vocabulary (also called Subject Headings) according to defined cataloguing rules. On the Internet, digital libraries and other repositories of data (flickr, del.icio.us, blog articles etc.) also use keyphrases (here called content tags or content labels) to organize their data and provide thematic access to it.
Kea 5.1 is an algorithm for extracting keyphrases from text documents. It can be used either for free indexing or for indexing with a controlled vocabulary.
1. Documents - Kea gets a directory name and processes all documents in this directory that have the extension ".txt". The default language is English; the language and the encoding can be changed as long as a corresponding stopword file and a stemmer are provided.
2. Thesaurus - If a vocabulary is provided, Kea matches the documents' phrases against this file. For processing SKOS files stored as RDF files, Kea uses the Jena API. For free indexing, use the option "-v none".
3. Extracting Candidates - Here Kea extracts n-grams of a predefined length (e.g. 1 to 3 words) that do not start or end with a stopword. In controlled indexing, it only collects those n-grams that match thesaurus terms. If the thesaurus defines relations between non-allowed terms (non-descriptors) and allowed terms (descriptors), it replaces each non-descriptor by an equivalent descriptor.
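The candidate extraction step can be sketched roughly as follows. This is a minimal illustration and not Kea's own code (Kea is a Java program): the stopword list, the `toy_stem` function, and the `vocabulary` dictionary are hypothetical stand-ins for the stopword file, the stemmer, and the thesaurus. Splitting at punctuation so that n-grams do not cross phrase boundaries is an assumption of this sketch. Thesaurus matching here compares pseudo-phrases, that is, phrases with stopwords removed and the remaining words stemmed and sorted.

```python
import re

# Hypothetical, tiny stopword list; Kea reads stopwords from a file.
STOPWORDS = {"a", "an", "and", "of", "the", "in", "for", "to", "is"}


def toy_stem(word):
    """Crude stand-in for a real stemmer: strips a plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def pseudo_phrase(phrase):
    """Remove stopwords, then stem and sort the remaining words."""
    words = [w for w in phrase.lower().split() if w not in STOPWORDS]
    return " ".join(sorted(toy_stem(w) for w in words))


def extract_candidates(text, max_len=3, vocabulary=None):
    """Return n-grams of 1..max_len words that do not start or end
    with a stopword.

    vocabulary (optional) maps the pseudo-phrase of every thesaurus
    term, descriptor or non-descriptor, to its descriptor. When it is
    given, only matching n-grams are kept, replaced by the descriptor.
    """
    candidates = []
    # Assumption: n-grams do not cross punctuation marks.
    for segment in re.split(r"[^\w\s'-]+", text.lower()):
        words = segment.split()
        for start in range(len(words)):
            for n in range(1, max_len + 1):
                gram = words[start:start + n]
                if len(gram) < n:
                    break  # ran past the end of the segment
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                phrase = " ".join(gram)
                if vocabulary is None:
                    candidates.append(phrase)  # free indexing
                else:
                    descriptor = vocabulary.get(pseudo_phrase(phrase))
                    if descriptor is not None:
                        candidates.append(descriptor)
    return candidates
```

With a vocabulary in which "corn" is a non-descriptor for the descriptor "maize", the text "Corn needs water" would yield the candidates "maize" and "water", provided "water" is also a thesaurus term.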
In the above diagram, pseudo-phrase matching means removing stopwords from the phrase, and then stemming and ordering the remaining words.
4. Features - For each candidate phrase Kea computes four feature values:
- TFxIDF is a measure describing the specificity of a term for the document under consideration, compared to all other documents in the corpus. Candidate phrases with a high TFxIDF value are more likely to be keyphrases.
- First occurrence is computed as the percentage of the document preceding the first occurrence of the term in the document. Terms that tend to appear at the start or at the end of a document are more likely to be keyphrases.
- Length of a phrase is the number of its component words. Two-word phrases are usually preferred by human indexers.
- Node degree of a candidate phrase is the number of phrases in the candidate set that are semantically related to this phrase. This is computed with the help of the thesaurus. Phrases with high degree are more likely to be keyphrases.
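The four features can be illustrated with a short sketch. The function names, the argument layout, and the `related` dictionary (a hypothetical mapping from a thesaurus term to the set of terms it is semantically related to) are assumptions made for this example. The TFxIDF formula shown is one common variant, not necessarily the exact one Kea implements.

```python
import math


def tf_idf(count_in_doc, doc_len, docs_with_phrase, num_docs):
    """Term frequency in this document times inverse document frequency.

    Assumes docs_with_phrase >= 1 (the phrase occurs somewhere in the corpus).
    """
    tf = count_in_doc / doc_len
    idf = math.log2(num_docs / docs_with_phrase)
    return tf * idf


def first_occurrence(doc_words, phrase_words):
    """Fraction of the document that precedes the phrase's first occurrence."""
    n = len(phrase_words)
    for i in range(len(doc_words) - n + 1):
        if doc_words[i:i + n] == phrase_words:
            return i / len(doc_words)
    return 1.0  # phrase does not occur in the document


def phrase_length(phrase):
    """Number of component words."""
    return len(phrase.split())


def node_degree(phrase, candidates, related):
    """Number of other candidates that the thesaurus relates to this phrase."""
    neighbours = related.get(phrase, set())
    return len(neighbours & (set(candidates) - {phrase}))
```

For instance, a phrase that first appears at word 3 of an 8-word document has a first-occurrence value of 3/8.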
5. Building the model - Before being able to extract keyphrases from new documents, Kea first needs to create a model that learns the extraction strategy from manually indexed documents. This means that for each document in the input directory there must be a file with the extension ".key" and the same name as the corresponding document. This file should contain the manually assigned keyphrases, one per line.
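The expected layout of the training directory can be made concrete with a small loader. This is only a sketch of the file convention described above. The function name `load_training_pairs` and the UTF-8 default are assumptions of this example and not part of Kea.

```python
from pathlib import Path


def load_training_pairs(directory, encoding="utf-8"):
    """Pair every "name.txt" document with its "name.key" keyphrase file.

    Returns a dict: document name -> (document text, list of keyphrases).
    """
    pairs = {}
    for txt_path in sorted(Path(directory).glob("*.txt")):
        key_path = txt_path.with_suffix(".key")
        if not key_path.exists():
            # Training requires manually assigned keyphrases for each document.
            raise FileNotFoundError(f"missing keyphrase file: {key_path}")
        text = txt_path.read_text(encoding=encoding)
        # One keyphrase per line; blank lines are ignored.
        keyphrases = [
            line.strip()
            for line in key_path.read_text(encoding=encoding).splitlines()
            if line.strip()
        ]
        pairs[txt_path.stem] = (text, keyphrases)
    return pairs
```

A directory containing "report.txt" and "report.key" would produce a single entry named "report".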
Given the list of candidate phrases (3.), Kea marks those that were manually assigned as positive examples and all the rest as negative examples. By analyzing the feature values (4.) for positive and negative candidate phrases, a model is computed, which reflects the distribution of feature values for each of the two classes.
6. Extracting keyphrases - When extracting keyphrases from new documents, Kea takes the model (5.) and the feature values for each candidate phrase and computes its probability of being a keyphrase. Phrases with the highest probabilities are selected into the final set of keyphrases. The user can specify the number of keyphrases to select.
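Model building and extraction can be sketched with a small probabilistic classifier. The text above does not prescribe a learning scheme, so a naive Bayes model over discretised feature values is assumed here as one simple choice. All names, the number of bins, and the add-one smoothing are assumptions of this sketch.

```python
from collections import Counter


def train(examples, bins=3):
    """Count feature values separately for positive and negative examples.

    examples: list of (features, is_keyphrase), where features is a tuple
    of already discretised values in range(bins).
    """
    num_features = len(examples[0][0])
    model = {
        "bins": bins,
        "prior": Counter(),
        "counts": {
            label: [Counter() for _ in range(num_features)]
            for label in (True, False)
        },
    }
    for features, label in examples:
        model["prior"][label] += 1
        for i, value in enumerate(features):
            model["counts"][label][i][value] += 1
    return model


def probability(model, features):
    """P(keyphrase | features) under naive Bayes with add-one smoothing."""
    total = sum(model["prior"].values())
    scores = {}
    for label in (True, False):
        n = model["prior"][label]
        score = (n + 1) / (total + 2)
        for i, value in enumerate(features):
            seen = model["counts"][label][i][value]
            score *= (seen + 1) / (n + model["bins"])
        scores[label] = score
    return scores[True] / (scores[True] + scores[False])


def top_keyphrases(model, candidates, k):
    """candidates: dict phrase -> feature tuple. Return the k most probable."""
    ranked = sorted(
        candidates,
        key=lambda phrase: probability(model, candidates[phrase]),
        reverse=True,
    )
    return ranked[:k]
```

Here `k` plays the role of the user-specified number of keyphrases: the candidates are ranked by probability and the first `k` are returned.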