org.apache.nutch.crawl
Class CrawlDbMerger
java.lang.Object
org.apache.hadoop.conf.Configured
org.apache.nutch.crawl.CrawlDbMerger
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool
public class CrawlDbMerger
- extends org.apache.hadoop.conf.Configured
- implements org.apache.hadoop.util.Tool
This tool merges several CrawlDbs into one, optionally filtering
URLs through the current URLFilters to skip prohibited
pages.
To use this tool only for filtering, specify a single CrawlDb
in the arguments.
If more than one CrawlDb contains information about the same URL,
only the most recent version is retained, as determined by the
value of CrawlDatum.getFetchTime().
However, metadata from all versions is accumulated,
with newer values taking precedence over older values.
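The merge rule described above can be illustrated with a small, self-contained sketch. It does not use the Nutch or Hadoop classes: MergeSketch, Entry, and merge are hypothetical names introduced only for this example, and plain string maps stand in for the CrawlDatum metadata. How the real tool resolves ties between equal fetch times is not stated here; the sketch assumes that the version listed later wins.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergeSketch {

    /** Hypothetical stand-in for one CrawlDb record; not the Nutch CrawlDatum class. */
    static final class Entry {
        final long fetchTime;
        final Map<String, String> metadata;

        Entry(long fetchTime, Map<String, String> metadata) {
            this.fetchTime = fetchTime;
            this.metadata = metadata;
        }
    }

    /**
     * Merges all versions of one URL: the result carries the fetch time of
     * the most recent version, and metadata from every version is
     * accumulated with newer values overwriting older ones.
     */
    static Entry merge(List<Entry> versions) {
        if (versions.isEmpty()) {
            throw new IllegalArgumentException("at least one version is required");
        }
        // Sort oldest first; the sort is stable, so on equal fetch times the
        // version listed later is treated as newer (an assumption of this sketch).
        List<Entry> sorted = new ArrayList<>(versions);
        sorted.sort(Comparator.comparingLong((Entry e) -> e.fetchTime));

        // Applying putAll from oldest to newest lets newer values win.
        Map<String, String> merged = new LinkedHashMap<>();
        for (Entry e : sorted) {
            merged.putAll(e.metadata);
        }
        Entry newest = sorted.get(sorted.size() - 1);
        return new Entry(newest.fetchTime, merged);
    }

    public static void main(String[] args) {
        Entry older = new Entry(1000L, Map.of("score", "0.5", "lang", "en"));
        Entry newer = new Entry(2000L, Map.of("score", "0.9"));
        Entry result = merge(List.of(newer, older));
        System.out.println(result.fetchTime + " " + result.metadata);
    }
}
```

In this sketch the merged record keeps the newer fetch time and the newer "score" value, while the "lang" key, present only in the older version, is carried over.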
- Author:
- Andrzej Bialecki
Method Summary
static org.apache.hadoop.mapred.JobConf createMergeJob(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path output, boolean normalize, boolean filter)
static void main(String[] args)
void merge(org.apache.hadoop.fs.Path output, org.apache.hadoop.fs.Path[] dbs, boolean normalize, boolean filter)
int run(String[] args)
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
CrawlDbMerger
public CrawlDbMerger()
CrawlDbMerger
public CrawlDbMerger(org.apache.hadoop.conf.Configuration conf)
merge
public void merge(org.apache.hadoop.fs.Path output,
org.apache.hadoop.fs.Path[] dbs,
boolean normalize,
boolean filter)
throws Exception
- Throws:
Exception
createMergeJob
public static org.apache.hadoop.mapred.JobConf createMergeJob(org.apache.hadoop.conf.Configuration conf,
org.apache.hadoop.fs.Path output,
boolean normalize,
boolean filter)
main
public static void main(String[] args)
throws Exception
- Parameters:
args -
- Throws:
Exception
run
public int run(String[] args)
throws Exception
- Specified by:
run
in interface org.apache.hadoop.util.Tool
- Throws:
Exception
Copyright © 2006 The Apache Software Foundation