org.apache.nutch.analysis.lang
Class LanguageIndexingFilter
java.lang.Object
org.apache.nutch.analysis.lang.LanguageIndexingFilter
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, IndexingFilter, Pluggable
public class LanguageIndexingFilter
- extends Object
- implements IndexingFilter
An IndexingFilter that adds a lang (language) field to the document.
It tries to find the language of the document by:
- First, checking whether HTMLLanguageParser added some language information
- Then, checking whether a Content-Language HTTP header can be found
- Finally, analyzing the document content
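The lookup order above can be sketched as a simple fallback chain. This is an illustrative sketch only, not the Nutch implementation: the class LanguageFallbackSketch, the method resolveLanguage, and the pluggable detector function are hypothetical names, and the three inputs stand in for the value left by HTMLLanguageParser, the Content-Language header, and the document text.

```java
import java.util.function.Function;

/** Hypothetical stand-in for the three-step lookup; not part of Nutch. */
class LanguageFallbackSketch {

    /**
     * Returns the first usable language value, trying the sources in order:
     * parser-provided value, Content-Language header value, then content analysis.
     * Returns null when no source yields a value.
     */
    static String resolveLanguage(String parserLang,
                                  String headerLang,
                                  String text,
                                  Function<String, String> detector) {
        // Step 1: language information assumed to come from the HTML parser.
        if (isUsable(parserLang)) {
            return parserLang.trim();
        }
        // Step 2: value assumed to come from the Content-Language HTTP header.
        if (isUsable(headerLang)) {
            return headerLang.trim();
        }
        // Step 3: fall back to analyzing the content with the supplied detector.
        if (isUsable(text)) {
            String detected = detector.apply(text);
            if (isUsable(detected)) {
                return detected.trim();
            }
        }
        return null;
    }

    /** A value is usable when it is non-null and not blank. */
    private static boolean isUsable(String s) {
        return s != null && !s.trim().isEmpty();
    }

    public static void main(String[] args) {
        // Dummy detector (an assumption for the demo): always answers "de".
        Function<String, String> detector = t -> "de";

        System.out.println(resolveLanguage("fr", "en", "Guten Tag", detector)); // prints fr
        System.out.println(resolveLanguage(null, "en", "Guten Tag", detector)); // prints en
        System.out.println(resolveLanguage(null, " ", "Guten Tag", detector));  // prints de
    }
}
```

The cheaper, more explicit sources are consulted first, so the comparatively expensive content analysis runs only when neither the parser nor the header supplies a value.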
- Author:
- Sami Siren, Jerome Charron
Method Summary
Document | filter(Document doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks)
  Adds fields or otherwise modifies the document that will be indexed for a parse.
org.apache.hadoop.conf.Configuration | getConf()
void | setConf(org.apache.hadoop.conf.Configuration conf)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LanguageIndexingFilter
public LanguageIndexingFilter()
- Constructs a new Language Indexing Filter.
filter
public Document filter(Document doc,
Parse parse,
org.apache.hadoop.io.Text url,
CrawlDatum datum,
Inlinks inlinks)
throws IndexingException
- Description copied from interface: IndexingFilter
- Adds fields or otherwise modifies the document that will be indexed for a parse. Unwanted documents can be removed from indexing by returning a null value.
- Specified by:
- filter in interface IndexingFilter
- Parameters:
- doc - document instance for collecting fields
- parse - parse data instance
- url - page url
- datum - crawl datum for the page
- inlinks - page inlinks
- Returns:
- modified (or a new) document instance, or null (meaning the document
should be discarded)
- Throws:
IndexingException
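The return contract described for filter (a modified document, or null to discard) can be illustrated with a minimal sketch. The types here are hypothetical stand-ins, not Nutch classes: a document is reduced to a field map, SimpleFilter mimics only the return convention of IndexingFilter, and runFilters is an assumed caller that stops as soon as one filter returns null.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of the "return null to discard" contract; not Nutch code. */
class FilterChainSketch {

    /** Simplified filter: returns the (possibly modified) document, or null to drop it. */
    interface SimpleFilter {
        Map<String, String> filter(Map<String, String> doc);
    }

    /** Applies each filter in order and stops as soon as one discards the document. */
    static Map<String, String> runFilters(List<SimpleFilter> filters, Map<String, String> doc) {
        for (SimpleFilter f : filters) {
            doc = f.filter(doc);
            if (doc == null) {
                return null; // document is not indexed
            }
        }
        return doc;
    }

    /** Two demo filters: one adds a lang field, one rejects documents without a title. */
    static List<SimpleFilter> demoFilters() {
        List<SimpleFilter> filters = new ArrayList<>();
        filters.add(d -> { d.put("lang", "en"); return d; });
        filters.add(d -> d.containsKey("title") ? d : null);
        return filters;
    }

    public static void main(String[] args) {
        Map<String, String> withTitle = new HashMap<>();
        withTitle.put("title", "Home");

        Map<String, String> kept = runFilters(demoFilters(), withTitle);
        System.out.println(kept.get("lang"));                               // prints en
        System.out.println(runFilters(demoFilters(), new HashMap<>()) == null); // prints true
    }
}
```

A caller written this way must check for null before passing the document on to the indexer.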
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
- Specified by:
- setConf in interface org.apache.hadoop.conf.Configurable
getConf
public org.apache.hadoop.conf.Configuration getConf()
- Specified by:
- getConf in interface org.apache.hadoop.conf.Configurable
Copyright © 2006 The Apache Software Foundation