Packages that use CrawlDatum

| Package | Description |
|---|---|
| org.apache.nutch.analysis.lang | Text document language identifier. |
| org.apache.nutch.crawl | Crawl control code. |
| org.apache.nutch.fetcher | The Nutch robot. |
| org.apache.nutch.indexer | Maintains Lucene full-text indexes. |
| org.apache.nutch.indexer.basic | A basic indexing plugin. |
| org.apache.nutch.indexer.more | An indexing plugin that adds "more" metadata fields to documents. |
| org.apache.nutch.microformats.reltag | A microformats Rel-Tag parser/indexer/querier plugin. |
| org.apache.nutch.protocol | |
| org.apache.nutch.protocol.file | Protocol plugin which supports retrieving local file resources. |
| org.apache.nutch.protocol.ftp | Protocol plugin which supports retrieving documents via the FTP protocol. |
| org.apache.nutch.protocol.http | Protocol plugin which supports retrieving documents via the HTTP protocol. |
| org.apache.nutch.protocol.http.api | Common API used by HTTP plugins (http, httpclient). |
| org.apache.nutch.protocol.httpclient | Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web servers as well as proxy servers. |
| org.apache.nutch.scoring | |
| org.apache.nutch.scoring.opic | |
| org.apache.nutch.tools | |
| org.apache.nutch.tools.compat | |
| org.creativecommons.nutch | Sample plugins that parse and index Creative Commons metadata. |

Uses of CrawlDatum in org.apache.nutch.analysis.lang

Methods in org.apache.nutch.analysis.lang with parameters of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| Document | LanguageIndexingFilter.filter(Document doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) | |

Uses of CrawlDatum in org.apache.nutch.crawl

Fields in org.apache.nutch.crawl declared as CrawlDatum:

| Modifier and Type | Field | Description |
|---|---|---|
| CrawlDatum | Generator.SelectorEntry.datum | |

Methods in org.apache.nutch.crawl that return CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| CrawlDatum | FetchSchedule.forceRefetch(org.apache.hadoop.io.Text url, CrawlDatum datum, boolean asap) | Resets fetchTime, fetchInterval, modifiedTime and the page signature, so that it forces refetching. |
| CrawlDatum | AbstractFetchSchedule.forceRefetch(org.apache.hadoop.io.Text url, CrawlDatum datum, boolean asap) | Resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and the page signature, so that it forces refetching. |
| CrawlDatum | CrawlDbReader.get(String crawlDb, String url, org.apache.hadoop.conf.Configuration config) | |
| CrawlDatum | FetchSchedule.initializeSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum) | Initializes fetch schedule related data. |
| CrawlDatum | AbstractFetchSchedule.initializeSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum) | Initializes fetch schedule related data. |
| static CrawlDatum | CrawlDatum.read(DataInput in) | |
| CrawlDatum | DefaultFetchSchedule.setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) | |
| CrawlDatum | AdaptiveFetchSchedule.setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) | |
| CrawlDatum | FetchSchedule.setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) | Sets the fetchInterval and fetchTime on a successfully fetched page. |
| CrawlDatum | AbstractFetchSchedule.setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) | Sets the fetchInterval and fetchTime on a successfully fetched page. |
| CrawlDatum | FetchSchedule.setPageGoneSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime) | Specifies how to schedule refetching of pages marked as GONE. |
| CrawlDatum | AbstractFetchSchedule.setPageGoneSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime) | Specifies how to schedule refetching of pages marked as GONE. |
| CrawlDatum | FetchSchedule.setPageRetrySchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime) | Adjusts the fetch schedule if fetching needs to be retried due to transient errors. |
| CrawlDatum | AbstractFetchSchedule.setPageRetrySchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime) | Adjusts the fetch schedule if fetching needs to be retried due to transient errors. |
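
These FetchSchedule methods are normally driven by the Generator and CrawlDb update jobs, but they can also be exercised directly. Below is a minimal sketch in that spirit, assuming FetchSchedule's Configurable setup, NutchConfiguration.create() for configuration loading, and the FetchSchedule.STATUS_MODIFIED constant; treat it as illustrative rather than verbatim Nutch code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.DefaultFetchSchedule;
import org.apache.nutch.crawl.FetchSchedule;
import org.apache.nutch.util.NutchConfiguration;

public class FetchScheduleSketch {
  public static void main(String[] args) {
    // NutchConfiguration.create() loads nutch-default.xml / nutch-site.xml.
    Configuration conf = NutchConfiguration.create();

    FetchSchedule schedule = new DefaultFetchSchedule();
    schedule.setConf(conf); // FetchSchedule implementations are Configurable

    Text url = new Text("http://example.com/");
    CrawlDatum datum = new CrawlDatum();

    // Give the datum its initial schedule data (see table above).
    datum = schedule.initializeSchedule(url, datum);

    // Pretend a fetch just succeeded: update fetchInterval and fetchTime.
    long now = System.currentTimeMillis();
    datum = schedule.setFetchSchedule(url, datum, now - 86400000L, 0L,
        now, now, FetchSchedule.STATUS_MODIFIED); // constant assumed

    // Force the page to be refetched as soon as possible.
    datum = schedule.forceRefetch(url, datum, true);
  }
}
```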

Methods in org.apache.nutch.crawl that return types with arguments of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| org.apache.hadoop.mapred.RecordWriter<org.apache.hadoop.io.Text,CrawlDatum> | CrawlDbReader.CrawlDatumCsvOutputFormat.getRecordWriter(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.mapred.JobConf job, String name, org.apache.hadoop.util.Progressable progress) | |

Methods in org.apache.nutch.crawl with parameters of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| long | FetchSchedule.calculateLastFetchTime(CrawlDatum datum) | Calculates the last fetch time of the given CrawlDatum. |
| long | AbstractFetchSchedule.calculateLastFetchTime(CrawlDatum datum) | Returns the last fetch time of the CrawlDatum. |
| CrawlDatum | FetchSchedule.forceRefetch(org.apache.hadoop.io.Text url, CrawlDatum datum, boolean asap) | Resets fetchTime, fetchInterval, modifiedTime and the page signature, so that it forces refetching. |
| CrawlDatum | AbstractFetchSchedule.forceRefetch(org.apache.hadoop.io.Text url, CrawlDatum datum, boolean asap) | Resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and the page signature, so that it forces refetching. |
| static boolean | CrawlDatum.hasDbStatus(CrawlDatum datum) | |
| static boolean | CrawlDatum.hasFetchStatus(CrawlDatum datum) | |
| CrawlDatum | FetchSchedule.initializeSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum) | Initializes fetch schedule related data. |
| CrawlDatum | AbstractFetchSchedule.initializeSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum) | Initializes fetch schedule related data. |
| void | Generator.Selector.map(org.apache.hadoop.io.Text key, CrawlDatum value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.FloatWritable,Generator.SelectorEntry> output, org.apache.hadoop.mapred.Reporter reporter) | Selects and inverts the subset due for fetch. |
| void | CrawlDbReader.CrawlDbTopNMapper.map(org.apache.hadoop.io.Text key, CrawlDatum value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.FloatWritable,org.apache.hadoop.io.Text> output, org.apache.hadoop.mapred.Reporter reporter) | |
| void | CrawlDbFilter.map(org.apache.hadoop.io.Text key, CrawlDatum value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) | |
| void | CrawlDbReader.CrawlDbStatMapper.map(org.apache.hadoop.io.Text key, CrawlDatum value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,org.apache.hadoop.io.LongWritable> output, org.apache.hadoop.mapred.Reporter reporter) | |
| void | CrawlDatum.set(CrawlDatum that) | Copies the contents of another instance into this instance. |
| CrawlDatum | DefaultFetchSchedule.setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) | |
| CrawlDatum | AdaptiveFetchSchedule.setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) | |
| CrawlDatum | FetchSchedule.setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) | Sets the fetchInterval and fetchTime on a successfully fetched page. |
| CrawlDatum | AbstractFetchSchedule.setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) | Sets the fetchInterval and fetchTime on a successfully fetched page. |
| CrawlDatum | FetchSchedule.setPageGoneSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime) | Specifies how to schedule refetching of pages marked as GONE. |
| CrawlDatum | AbstractFetchSchedule.setPageGoneSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime) | Specifies how to schedule refetching of pages marked as GONE. |
| CrawlDatum | FetchSchedule.setPageRetrySchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime) | Adjusts the fetch schedule if fetching needs to be retried due to transient errors. |
| CrawlDatum | AbstractFetchSchedule.setPageRetrySchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime) | Adjusts the fetch schedule if fetching needs to be retried due to transient errors. |
| boolean | FetchSchedule.shouldFetch(org.apache.hadoop.io.Text url, CrawlDatum datum, long curTime) | Indicates whether the page is due for selection in the current fetchlist. |
| boolean | AbstractFetchSchedule.shouldFetch(org.apache.hadoop.io.Text url, CrawlDatum datum, long curTime) | Indicates whether the page is due for selection in the current fetchlist. |
| void | CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter.write(org.apache.hadoop.io.Text key, CrawlDatum value) | |
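
CrawlDatum.read(DataInput), set(CrawlDatum) and the hasDbStatus/hasFetchStatus helpers in the table above compose into a simple Writable round trip. A sketch follows; the two-argument (status, fetchInterval) convenience constructor is an assumption about this API version.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.nutch.crawl.CrawlDatum;

public class CrawlDatumRoundTrip {
  public static void main(String[] args) throws IOException {
    // (status, fetchInterval) constructor assumed; 30 days in seconds.
    CrawlDatum datum =
        new CrawlDatum(CrawlDatum.STATUS_DB_UNFETCHED, 30 * 24 * 3600);

    // CrawlDatum is a Hadoop Writable, so it serializes via write()/read().
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    datum.write(new DataOutputStream(bytes));
    CrawlDatum copy = CrawlDatum.read(
        new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

    // set() copies the contents of another instance into this instance.
    CrawlDatum third = new CrawlDatum();
    third.set(copy);

    System.out.println(CrawlDatum.hasDbStatus(third));    // true: a DB status
    System.out.println(CrawlDatum.hasFetchStatus(third)); // false
  }
}
```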

Method parameters in org.apache.nutch.crawl with type arguments of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| void | CrawlDbFilter.map(org.apache.hadoop.io.Text key, CrawlDatum value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) | |
| void | Injector.InjectMapper.map(org.apache.hadoop.io.WritableComparable key, org.apache.hadoop.io.Text value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) | |
| void | Generator.CrawlDbUpdater.map(org.apache.hadoop.io.WritableComparable key, org.apache.hadoop.io.Writable value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) | |
| void | Injector.InjectReducer.reduce(org.apache.hadoop.io.Text key, Iterator<CrawlDatum> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) | |
| void | Generator.CrawlDbUpdater.reduce(org.apache.hadoop.io.Text key, Iterator<CrawlDatum> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) | |
| void | CrawlDbReducer.reduce(org.apache.hadoop.io.Text key, Iterator<CrawlDatum> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) | |
| void | CrawlDbMerger.Merger.reduce(org.apache.hadoop.io.Text key, Iterator<CrawlDatum> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) | |
| void | Generator.PartitionReducer.reduce(org.apache.hadoop.io.Text key, Iterator<Generator.SelectorEntry> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) | |
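
All of the reduce() entries above follow the same old-API Hadoop pattern: values arrive through an Iterator<CrawlDatum> whose backing instance may be reused, so a datum must be copied with set() before being kept. A hypothetical reducer in that style (not one of the classes listed here):

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

/** Keeps, per URL, the CrawlDatum with the highest score. Illustrative only. */
public class HighestScoreReducer extends MapReduceBase
    implements Reducer<Text, CrawlDatum, Text, CrawlDatum> {

  public void reduce(Text key, Iterator<CrawlDatum> values,
      OutputCollector<Text, CrawlDatum> output, Reporter reporter)
      throws IOException {
    CrawlDatum best = null;
    while (values.hasNext()) {
      CrawlDatum d = values.next();
      if (best == null || d.getScore() > best.getScore()) {
        best = new CrawlDatum();
        best.set(d); // Hadoop reuses the value object, so copy it
      }
    }
    if (best != null) output.collect(key, best);
  }
}
```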

Uses of CrawlDatum in org.apache.nutch.fetcher

Methods in org.apache.nutch.fetcher that return CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| CrawlDatum | FetcherOutput.getCrawlDatum() | |

Method parameters in org.apache.nutch.fetcher with type arguments of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| void | Fetcher2.run(org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,CrawlDatum> input, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,NutchWritable> output, org.apache.hadoop.mapred.Reporter reporter) | |

Constructors in org.apache.nutch.fetcher with parameters of type CrawlDatum:

| Constructor | Description |
|---|---|
| FetcherOutput(CrawlDatum crawlDatum, Content content, ParseImpl parse) | |
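
FetcherOutput is a small container pairing a CrawlDatum with fetched Content and an optional parse. A sketch of constructing one, assuming the usual Content constructor (url, base, bytes, contentType, metadata, conf) and that a null ParseImpl is acceptable when parsing is deferred to a separate job:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.fetcher.FetcherOutput;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class FetcherOutputSketch {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    String url = "http://example.com/";

    // Content(url, base, bytes, contentType, metadata, conf) assumed.
    Content content = new Content(url, url, "<html/>".getBytes(),
        "text/html", new Metadata(), conf);

    CrawlDatum datum = new CrawlDatum();
    // Parse may be null when parsing happens in a later job.
    FetcherOutput out = new FetcherOutput(datum, content, null);

    System.out.println(out.getCrawlDatum()); // the datum carried alongside
  }
}
```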

Uses of CrawlDatum in org.apache.nutch.indexer

Methods in org.apache.nutch.indexer with parameters of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| Document | IndexingFilter.filter(Document doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) | Adds fields or otherwise modifies the document that will be indexed for a parse. |
| Document | IndexingFilters.filter(Document doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) | Runs all defined filters. |
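
An IndexingFilter receives the Lucene Document, the Parse, the URL, the CrawlDatum and the Inlinks, and returns the (possibly modified) document, or null to drop it from the index. A sketch of a filter that stores the datum's fetch time, assuming the Lucene 2.x Field API used by this generation of Nutch; the field name "fetchTime" is illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

/** Adds the datum's fetch time as a stored, unindexed field. Illustrative. */
public class FetchTimeIndexingFilter implements IndexingFilter {
  private Configuration conf;

  public Document filter(Document doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    doc.add(new Field("fetchTime", Long.toString(datum.getFetchTime()),
        Field.Store.YES, Field.Index.NO));
    return doc; // returning null would discard the document
  }

  // IndexingFilter is Configurable in this API.
  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}
```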

Uses of CrawlDatum in org.apache.nutch.indexer.basic

Methods in org.apache.nutch.indexer.basic with parameters of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| Document | BasicIndexingFilter.filter(Document doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) | |

Uses of CrawlDatum in org.apache.nutch.indexer.more

Methods in org.apache.nutch.indexer.more with parameters of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| Document | MoreIndexingFilter.filter(Document doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) | |

Uses of CrawlDatum in org.apache.nutch.microformats.reltag

Methods in org.apache.nutch.microformats.reltag with parameters of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| Document | RelTagIndexingFilter.filter(Document doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) | |

Uses of CrawlDatum in org.apache.nutch.protocol

Methods in org.apache.nutch.protocol with parameters of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| ProtocolOutput | Protocol.getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum) | Returns the Content for a fetchlist entry. |
| RobotRules | Protocol.getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum) | Retrieves the robot rules applicable to this URL. |
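
Protocol implementations are looked up by URL scheme rather than instantiated directly. A sketch of fetching one URL through the plugin framework, assuming ProtocolFactory.getProtocol(String) for the lookup and a fresh CrawlDatum standing in for a real fetchlist entry:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.ProtocolFactory;
import org.apache.nutch.protocol.ProtocolOutput;
import org.apache.nutch.util.NutchConfiguration;

public class ProtocolSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    String url = "http://example.com/";

    // The factory picks the plugin (http, file, ftp, ...) by URL scheme.
    Protocol protocol = new ProtocolFactory(conf).getProtocol(url);

    ProtocolOutput output =
        protocol.getProtocolOutput(new Text(url), new CrawlDatum());

    Content content = output.getContent();
    System.out.println(output.getStatus() + " " + content.getContentType());
  }
}
```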

Uses of CrawlDatum in org.apache.nutch.protocol.file

Methods in org.apache.nutch.protocol.file with parameters of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| ProtocolOutput | File.getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum) | |
| RobotRules | File.getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum) | |

Constructors in org.apache.nutch.protocol.file with parameters of type CrawlDatum:

| Constructor | Description |
|---|---|
| FileResponse(URL url, CrawlDatum datum, File file, org.apache.hadoop.conf.Configuration conf) | |

Uses of CrawlDatum in org.apache.nutch.protocol.ftp

Methods in org.apache.nutch.protocol.ftp with parameters of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| ProtocolOutput | Ftp.getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum) | |
| RobotRules | Ftp.getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum) | |

Constructors in org.apache.nutch.protocol.ftp with parameters of type CrawlDatum:

| Constructor | Description |
|---|---|
| FtpResponse(URL url, CrawlDatum datum, Ftp ftp, org.apache.hadoop.conf.Configuration conf) | |

Uses of CrawlDatum in org.apache.nutch.protocol.http

Methods in org.apache.nutch.protocol.http with parameters of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| protected Response | Http.getResponse(URL url, CrawlDatum datum, boolean redirect) | |

Constructors in org.apache.nutch.protocol.http with parameters of type CrawlDatum:

| Constructor | Description |
|---|---|
| HttpResponse(HttpBase http, URL url, CrawlDatum datum) | |

Uses of CrawlDatum in org.apache.nutch.protocol.http.api

Methods in org.apache.nutch.protocol.http.api with parameters of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| ProtocolOutput | HttpBase.getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum) | |
| protected abstract Response | HttpBase.getResponse(URL url, CrawlDatum datum, boolean followRedirects) | |
| RobotRules | HttpBase.getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum) | |

Uses of CrawlDatum in org.apache.nutch.protocol.httpclient

Methods in org.apache.nutch.protocol.httpclient with parameters of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| protected Response | Http.getResponse(URL url, CrawlDatum datum, boolean redirect) | Fetches the URL with a configured HTTP client and gets the response. |

Uses of CrawlDatum in org.apache.nutch.scoring

Methods in org.apache.nutch.scoring that return CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| CrawlDatum | ScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) | Distributes score value from the current page to all its outlinked pages. |
| CrawlDatum | ScoringFilters.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) | |

Methods in org.apache.nutch.scoring with parameters of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| CrawlDatum | ScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) | Distributes score value from the current page to all its outlinked pages. |
| CrawlDatum | ScoringFilters.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) | |
| float | ScoringFilter.generatorSortValue(org.apache.hadoop.io.Text url, CrawlDatum datum, float initSort) | Prepares a sort value for sorting and selecting the top N scoring pages during fetchlist generation. |
| float | ScoringFilters.generatorSortValue(org.apache.hadoop.io.Text url, CrawlDatum datum, float initSort) | Calculates a sort value for Generate. |
| float | ScoringFilter.indexerScore(org.apache.hadoop.io.Text url, Document doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) | Calculates a Lucene document boost. |
| float | ScoringFilters.indexerScore(org.apache.hadoop.io.Text url, Document doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) | |
| void | ScoringFilter.initialScore(org.apache.hadoop.io.Text url, CrawlDatum datum) | Sets an initial score for newly discovered pages. |
| void | ScoringFilters.initialScore(org.apache.hadoop.io.Text url, CrawlDatum datum) | Calculates a new initial score, used when adding newly discovered pages. |
| void | ScoringFilter.injectedScore(org.apache.hadoop.io.Text url, CrawlDatum datum) | Sets an initial score for newly injected pages. |
| void | ScoringFilters.injectedScore(org.apache.hadoop.io.Text url, CrawlDatum datum) | Calculates a new initial score, used when injecting new pages. |
| void | ScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text url, CrawlDatum datum, Content content) | Takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata. |
| void | ScoringFilters.passScoreBeforeParsing(org.apache.hadoop.io.Text url, CrawlDatum datum, Content content) | |
| void | ScoringFilter.updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) | Calculates a new score for the CrawlDatum during CrawlDb update, based on the initial value of the original CrawlDatum and on score values contributed by inlinked pages. |
| void | ScoringFilters.updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) | Calculates the updated page score during CrawlDb.update(). |

Method parameters in org.apache.nutch.scoring with type arguments of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| CrawlDatum | ScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) | Distributes score value from the current page to all its outlinked pages. |
| CrawlDatum | ScoringFilters.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) | |
| void | ScoringFilter.updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) | Calculates a new score for the CrawlDatum during CrawlDb update, based on the initial value of the original CrawlDatum and on score values contributed by inlinked pages. |
| void | ScoringFilters.updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) | Calculates the updated page score during CrawlDb.update(). |
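
ScoringFilters is the aggregate entry point that runs every configured ScoringFilter in turn. A sketch of the inject-then-generate scoring flow, assuming the ScoringFilters(Configuration) constructor and that these calls may throw ScoringFilterException:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.scoring.ScoringFilterException;
import org.apache.nutch.scoring.ScoringFilters;
import org.apache.nutch.util.NutchConfiguration;

public class ScoringSketch {
  public static void main(String[] args) throws ScoringFilterException {
    Configuration conf = NutchConfiguration.create();
    ScoringFilters scfilters = new ScoringFilters(conf);

    Text url = new Text("http://example.com/");
    CrawlDatum datum = new CrawlDatum();

    // Injected seeds get their initial score from the configured filters.
    scfilters.injectedScore(url, datum);

    // Generator uses this sort value to pick top-N pages for the fetchlist.
    float sort = scfilters.generatorSortValue(url, datum, 1.0f);
    System.out.println("sort value: " + sort);
  }
}
```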

Uses of CrawlDatum in org.apache.nutch.scoring.opic

Methods in org.apache.nutch.scoring.opic that return CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| CrawlDatum | OPICScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) | Gets a float value from Fetcher.SCORE_KEY, divides it by the number of outlinks and applies it. |

Methods in org.apache.nutch.scoring.opic with parameters of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| CrawlDatum | OPICScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) | Gets a float value from Fetcher.SCORE_KEY, divides it by the number of outlinks and applies it. |
| float | OPICScoringFilter.generatorSortValue(org.apache.hadoop.io.Text url, CrawlDatum datum, float initSort) | Uses getScore(). |
| float | OPICScoringFilter.indexerScore(org.apache.hadoop.io.Text url, Document doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) | Dampens the boost value by scorePower. |
| void | OPICScoringFilter.initialScore(org.apache.hadoop.io.Text url, CrawlDatum datum) | Sets the score to 0.0f (unknown value); inlink contributions will bring it to a correct level. |
| void | OPICScoringFilter.injectedScore(org.apache.hadoop.io.Text url, CrawlDatum datum) | Sets the score to the value defined in the configuration, 1.0f by default. |
| void | OPICScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text url, CrawlDatum datum, Content content) | Stores the float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY. |
| void | OPICScoringFilter.updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List inlinked) | Increases the score by the sum of inlinked scores. |

Method parameters in org.apache.nutch.scoring.opic with type arguments of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| CrawlDatum | OPICScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) | Gets a float value from Fetcher.SCORE_KEY, divides it by the number of outlinks and applies it. |
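
The distributeScoreToOutlinks description above is the heart of OPIC: the page's cached score is split evenly across its outlinks. The arithmetic, reduced to a hypothetical standalone snippet (not the actual OPICScoringFilter source):

```java
/** Illustrates the OPIC share computation only. */
public class OpicShareSketch {
  public static void main(String[] args) {
    float pageScore = 1.0f; // value stashed under Fetcher.SCORE_KEY at parse time
    int allCount = 10;      // total outlinks found on the page
    float share = pageScore / allCount;
    // distributeScoreToOutlinks() then raises each target CrawlDatum's
    // score by this share, e.g. target.setScore(target.getScore() + share).
    System.out.println("each outlink receives " + share);
  }
}
```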

Uses of CrawlDatum in org.apache.nutch.tools

Method parameters in org.apache.nutch.tools with type arguments of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| void | FreeGenerator.FG.reduce(org.apache.hadoop.io.Text key, Iterator<Generator.SelectorEntry> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) | |

Uses of CrawlDatum in org.apache.nutch.tools.compat

Methods in org.apache.nutch.tools.compat with parameters of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| void | CrawlDbConverter.map(org.apache.hadoop.io.WritableComparable key, CrawlDatum value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) | |

Method parameters in org.apache.nutch.tools.compat with type arguments of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| void | CrawlDbConverter.map(org.apache.hadoop.io.WritableComparable key, CrawlDatum value, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,CrawlDatum> output, org.apache.hadoop.mapred.Reporter reporter) | |

Uses of CrawlDatum in org.creativecommons.nutch

Methods in org.creativecommons.nutch with parameters of type CrawlDatum:

| Modifier and Type | Method | Description |
|---|---|---|
| Document | CCIndexingFilter.filter(Document doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) | |