org.apache.nutch.util
Class URLUtil

java.lang.Object
  extended by org.apache.nutch.util.URLUtil

public class URLUtil
extends Object

Utility class for URL analysis


Constructor Summary
URLUtil()
           
 
Method Summary
static String chooseRepr(String src, String dst, boolean temp)
          Given two urls (source and destination of the redirect), returns the representative one.
static String getDomainName(String url)
          Returns the domain name of the url.
static String getDomainName(URL url)
          Returns the domain name of the url.
static DomainSuffix getDomainSuffix(String url)
          Returns the DomainSuffix corresponding to the last public part of the hostname
static DomainSuffix getDomainSuffix(URL url)
          Returns the DomainSuffix corresponding to the last public part of the hostname
static String[] getHostSegments(String url)
          Partitions the hostname of the url by "."
static String[] getHostSegments(URL url)
          Partitions the hostname of the url by "."
static boolean isSameDomainName(String url1, String url2)
          Returns whether the given urls have the same domain name.
static boolean isSameDomainName(URL url1, URL url2)
          Returns whether the given urls have the same domain name.
static void main(String[] args)
          For testing
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

URLUtil

public URLUtil()
Method Detail

getDomainName

public static String getDomainName(URL url)
Returns the domain name of the url. The domain name of a url is the substring of the url's hostname without subdomain names. As an example,
getDomainName(new URL("http://lucene.apache.org/"))
will return
apache.org


getDomainName

public static String getDomainName(String url)
                            throws MalformedURLException
Returns the domain name of the url. The domain name of a url is the substring of the url's hostname without subdomain names. As an example,
getDomainName("http://lucene.apache.org/")
will return
apache.org

Throws:
MalformedURLException

isSameDomainName

public static boolean isSameDomainName(URL url1,
                                       URL url2)
Returns whether the given urls have the same domain name. As an example,
isSameDomainName(new URL("http://lucene.apache.org"), new URL("http://people.apache.org/"))
will return true.

Returns:
true if the domain names are equal

isSameDomainName

public static boolean isSameDomainName(String url1,
                                       String url2)
                                throws MalformedURLException
Returns whether the given urls have the same domain name. As an example,
isSameDomainName("http://lucene.apache.org", "http://people.apache.org/")
will return true.

Returns:
true if the domain names are equal
Throws:
MalformedURLException
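
The domain-name examples above can be approximated without Nutch on the classpath. The following is a minimal sketch, not the URLUtil implementation: the class DomainNameSketch, its methods, and the three-entry SUFFIXES set are hypothetical stand-ins for a real public-suffix table, chosen only so that the documented examples can be reproduced.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class DomainNameSketch {
    // Hypothetical, tiny stand-in for a real public-suffix table.
    private static final Set<String> SUFFIXES =
        new HashSet<>(Arrays.asList("org", "com", "co.uk"));

    // Returns the longest known suffix of the host plus the label before it.
    static String domainName(String url) throws MalformedURLException {
        String host = new URL(url).getHost().toLowerCase(Locale.ROOT);
        String[] labels = host.split("\\.");
        // Try candidate suffixes from longest to shortest.
        for (int i = 1; i < labels.length; i++) {
            String candidate =
                String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (SUFFIXES.contains(candidate)) {
                return labels[i - 1] + "." + candidate;
            }
        }
        return host; // no known suffix: fall back to the whole host
    }

    // Two urls share a domain name when their extracted domain names match.
    static boolean sameDomainName(String url1, String url2)
            throws MalformedURLException {
        return domainName(url1).equals(domainName(url2));
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(domainName("http://lucene.apache.org/"));
        System.out.println(sameDomainName("http://lucene.apache.org",
                                          "http://people.apache.org/"));
    }
}
```

With this sketch, the first call yields apache.org and the second yields true, matching the examples in the method descriptions. A suffix that is missing from the hypothetical table makes the sketch fall back to the full hostname.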

getDomainSuffix

public static DomainSuffix getDomainSuffix(URL url)
Returns the DomainSuffix corresponding to the last public part of the hostname


getDomainSuffix

public static DomainSuffix getDomainSuffix(String url)
                                    throws MalformedURLException
Returns the DomainSuffix corresponding to the last public part of the hostname

Throws:
MalformedURLException

getHostSegments

public static String[] getHostSegments(URL url)
Partitions the hostname of the url by "."


getHostSegments

public static String[] getHostSegments(String url)
                                throws MalformedURLException
Partitions the hostname of the url by "."

Throws:
MalformedURLException
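
As an illustration of what partitioning a hostname by "." amounts to, here is a minimal sketch built only on java.net.URL. The class HostSegmentsSketch and its hostSegments method are hypothetical names, and the sketch makes no claim about how URLUtil treats special cases such as numeric IP addresses.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class HostSegmentsSketch {
    // Splits the host part of the url on literal dots.
    static String[] hostSegments(String url) throws MalformedURLException {
        String host = new URL(url).getHost();
        return host.split("\\."); // "." must be escaped: split takes a regex
    }

    public static void main(String[] args) throws MalformedURLException {
        // prints: lucene | apache | org
        System.out.println(
            String.join(" | ", hostSegments("http://lucene.apache.org/")));
    }
}
```

Note that this naive split would also break an address such as 127.0.0.1 into four segments, which may not be what a caller wants.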

chooseRepr

public static String chooseRepr(String src,
                                String dst,
                                boolean temp)
Given two urls (source and destination of the redirect), returns the representative one.

Implements the algorithm described here:
How does the Yahoo! webcrawler handle redirects?

The algorithm is as follows:

  1. Choose target url if either url is malformed.
  2. When a page in one domain redirects to a page in another domain, choose the "target" URL.
  3. When a top-level page in a domain presents a permanent redirect to a page deep within the same domain, choose the "source" URL.
  4. When a page deep within a domain presents a permanent redirect to a page deep within the same domain, choose the "target" URL.
  5. When a page in a domain presents a temporary redirect to another page in the same domain, choose the "source" URL.

Parameters:
src - Source url of redirect
dst - Destination url of redirect
temp - Flag to indicate if redirect is temporary
Returns:
Representative url (either src or dst)
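
The five rules listed for chooseRepr can be sketched as a short decision chain. This is an illustration under stated assumptions, not the URLUtil source: the class ChooseReprSketch and its helpers are hypothetical, the domain is approximated as the last two host labels rather than looked up in a suffix table, a "top-level page" is taken to mean an empty path or "/", and any same-domain permanent redirect not covered by rule 3 falls through to the target.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

public class ChooseReprSketch {
    // Assumption: the domain is the last two host labels (no suffix table).
    static String naiveDomain(URL u) {
        String host = u.getHost().toLowerCase(Locale.ROOT);
        String[] p = host.split("\\.");
        int n = p.length;
        return n < 2 ? host : p[n - 2] + "." + p[n - 1];
    }

    // Assumption: a "top-level page" has an empty path or the path "/".
    static boolean isRoot(URL u) {
        String path = u.getPath();
        return path.isEmpty() || path.equals("/");
    }

    static String chooseRepr(String src, String dst, boolean temp) {
        URL s;
        URL d;
        try {
            s = new URL(src);
            d = new URL(dst);
        } catch (MalformedURLException e) {
            return dst; // rule 1: either url is malformed
        }
        if (!naiveDomain(s).equals(naiveDomain(d))) {
            return dst; // rule 2: redirect crosses domains
        }
        if (temp) {
            return src; // rule 5: temporary redirect within one domain
        }
        if (isRoot(s) && !isRoot(d)) {
            return src; // rule 3: permanent, top-level page to deep page
        }
        // rule 4: permanent, deep page to deep page; cases the list does
        // not mention also land here (an assumption of this sketch).
        return dst;
    }

    public static void main(String[] args) {
        System.out.println(
            chooseRepr("http://example.com/", "http://example.com/a/b", false));
    }
}
```

The cross-domain check comes before the temporary-redirect check because rule 5 applies only to redirects within a single domain.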

main

public static void main(String[] args)
For testing



Copyright © 2006 The Apache Software Foundation