org.apache.nutch.crawl
Class Generator

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.nutch.crawl.Generator
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool

public class Generator
extends org.apache.hadoop.conf.Configured
implements org.apache.hadoop.util.Tool

Generates a subset of a crawl db to fetch.


Nested Class Summary
static class Generator.CrawlDbUpdater
          Update the CrawlDB so that the next generate won't include the same URLs.
static class Generator.DecreasingFloatComparator
           
static class Generator.HashComparator
          Sort fetch lists by hash of URL.
static class Generator.PartitionReducer
           
static class Generator.Selector
          Selects entries due for fetch.
static class Generator.SelectorEntry
           
static class Generator.SelectorInverseMapper
           
 
Field Summary
static String CRAWL_GEN_CUR_TIME
           
static String CRAWL_GEN_DELAY
           
static String CRAWL_GENERATE_FILTER
           
static String CRAWL_TOP_N
           
static String GENERATE_MAX_PER_HOST
           
static String GENERATE_MAX_PER_HOST_BY_IP
           
static String GENERATE_UPDATE_CRAWLDB
           
static org.apache.commons.logging.Log LOG
           
 
Constructor Summary
Generator()
           
Generator(org.apache.hadoop.conf.Configuration conf)
           
 
Method Summary
 org.apache.hadoop.fs.Path generate(org.apache.hadoop.fs.Path dbDir, org.apache.hadoop.fs.Path segments, int numLists, long topN, long curTime)
          Generate fetchlists in a segment.
 org.apache.hadoop.fs.Path generate(org.apache.hadoop.fs.Path dbDir, org.apache.hadoop.fs.Path segments, int numLists, long topN, long curTime, boolean filter, boolean force)
          Generate fetchlists in a segment.
static String generateSegmentName()
           
static void main(String[] args)
          Generate a fetchlist from the crawldb.
 int run(String[] args)
           
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Field Detail

CRAWL_GENERATE_FILTER

public static final String CRAWL_GENERATE_FILTER
See Also:
Constant Field Values

GENERATE_MAX_PER_HOST_BY_IP

public static final String GENERATE_MAX_PER_HOST_BY_IP
See Also:
Constant Field Values

GENERATE_MAX_PER_HOST

public static final String GENERATE_MAX_PER_HOST
See Also:
Constant Field Values

GENERATE_UPDATE_CRAWLDB

public static final String GENERATE_UPDATE_CRAWLDB
See Also:
Constant Field Values

CRAWL_TOP_N

public static final String CRAWL_TOP_N
See Also:
Constant Field Values

CRAWL_GEN_CUR_TIME

public static final String CRAWL_GEN_CUR_TIME
See Also:
Constant Field Values

CRAWL_GEN_DELAY

public static final String CRAWL_GEN_DELAY
See Also:
Constant Field Values

LOG

public static final org.apache.commons.logging.Log LOG
Constructor Detail

Generator

public Generator()

Generator

public Generator(org.apache.hadoop.conf.Configuration conf)
Method Detail

generate

public org.apache.hadoop.fs.Path generate(org.apache.hadoop.fs.Path dbDir,
                                          org.apache.hadoop.fs.Path segments,
                                          int numLists,
                                          long topN,
                                          long curTime)
                                   throws IOException
Generate fetchlists in a segment. Whether URLs are filtered is read from the crawl.generate.filter property in the configuration files. If the property is not set, URLs are filtered.

Parameters:
dbDir - Crawl database directory
segments - Segments directory
numLists - Number of reduce tasks
topN - Number of top URLs to be selected
curTime - Current time in milliseconds
Returns:
Path to generated segment or null if no entries were selected
Throws:
IOException - When an I/O error occurs
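
The default described above (filter unless the property says otherwise) can be illustrated with a minimal sketch. It uses java.util.Properties as a stand-in for the configuration object; the class GenerateFilterSketch and the method shouldFilter are hypothetical names for illustration only and are not part of Nutch.

```java
import java.util.Properties;

// Hypothetical stand-in for the configuration lookup. Nutch reads the value
// from its own configuration object, not from java.util.Properties.
final class GenerateFilterSketch {
    // Property name quoted in the method description above.
    static final String FILTER_KEY = "crawl.generate.filter";

    // Returns true when URL filtering should be applied.
    // A missing property falls back to true, matching the documented default.
    static boolean shouldFilter(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(FILTER_KEY, "true"));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(shouldFilter(conf)); // property absent

        conf.setProperty(FILTER_KEY, "false");
        System.out.println(shouldFilter(conf)); // property explicitly disabled
    }
}
```

In this sketch, only an explicit "false" value disables filtering; an absent key behaves the same as "true".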

generate

public org.apache.hadoop.fs.Path generate(org.apache.hadoop.fs.Path dbDir,
                                          org.apache.hadoop.fs.Path segments,
                                          int numLists,
                                          long topN,
                                          long curTime,
                                          boolean filter,
                                          boolean force)
                                   throws IOException
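Generate fetchlists in a segment. In this overload the filter flag is passed explicitly instead of being read from the crawl.generate.filter property.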

Returns:
Path to generated segment or null if no entries were selected.
Throws:
IOException - When an I/O error occurs
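
The role of topN and the null return value can be pictured with a simplified in-memory sketch. This is not the Nutch implementation, which runs as a MapReduce job and applies further criteria such as whether an entry is due for fetch. The class TopNSketch, the method selectTopN, and the use of a Map of URL to score are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, in-memory illustration of "select the top N URLs by score".
final class TopNSketch {
    // Returns up to topN URLs ordered by decreasing score,
    // or null when no entries were selected (mirroring the documented return value).
    static List<String> selectTopN(Map<String, Float> scores, long topN) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(scores.entrySet());
        // Highest score first.
        entries.sort(Map.Entry.<String, Float>comparingByValue().reversed());

        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Float> entry : entries) {
            if (selected.size() >= topN) {
                break;
            }
            selected.add(entry.getKey());
        }
        return selected.isEmpty() ? null : selected;
    }

    public static void main(String[] args) {
        Map<String, Float> scores = Map.of(
                "http://a.example/", 0.1f,
                "http://b.example/", 0.9f,
                "http://c.example/", 0.5f);
        System.out.println(selectTopN(scores, 2));
    }
}
```

Callers of the real method should likewise check for null before using the returned segment path.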

generateSegmentName

public static String generateSegmentName()
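
This page does not describe the format of the generated name. Nutch segment directories are commonly named with a timestamp; assuming a yyyyMMddHHmmss pattern (an assumption, not confirmed by this page), a name of that shape could be produced as in the sketch below. SegmentNameSketch and its format method are hypothetical names.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch of a timestamp-based segment name.
// The yyyyMMddHHmmss pattern is an assumption, not taken from this page.
final class SegmentNameSketch {
    private static final DateTimeFormatter PATTERN =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Formats the given local date-time as a 14-digit name.
    static String format(LocalDateTime time) {
        return time.format(PATTERN);
    }

    public static void main(String[] args) {
        System.out.println(format(LocalDateTime.now()));
    }
}
```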

main

public static void main(String[] args)
                 throws Exception
Generate a fetchlist from the crawldb.

Throws:
Exception

run

public int run(String[] args)
        throws Exception
Specified by:
run in interface org.apache.hadoop.util.Tool
Throws:
Exception


Copyright © 2006 The Apache Software Foundation