2007-11-16
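A problem with Heritrix multithreading
Keywords: Heritrix
I am crawling from a single machine, so I want to hash Heritrix's links across multiple threads. I wrote an ELFHashQueueAssignmentPolicy to do the hashing, but on the first run the crawl only resolves 30 dns: tasks and then ends on its own. On the second or third run, however, the multiple threads do work.
I have also already changed the corresponding places in the Heritrix.properties file and in AbstractFrontier. I hope you can take a look for me, thank you.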
/*******************************************************************************
* File description:
*
* Project: WebCrawler
* File: ELFHashAssignmentPolicy.java
* Package: com.hotct.heritrixExt.common.frontier
*
* Author: zhangzhenxin
* Created at: 03:50:01 PM
* Created on: 2007-10-30
******************************************************************************/
package com.hotct.heritrixExt.common.frontier;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.commons.httpclient.URIException;
import org.archive.crawler.datamodel.CandidateURI;
import org.archive.crawler.framework.CrawlController;
import org.archive.crawler.frontier.HostnameQueueAssignmentPolicy;
import org.archive.crawler.frontier.QueueAssignmentPolicy;
import org.archive.net.UURI;
import org.archive.net.UURIFactory;
/**
 * <h>Type description</h>
 *
 * @author zhangzhenxin
 * @date 2007-10-30
 */
public class ELFHashAssignmentPolicy extends QueueAssignmentPolicy {
    private static final Logger logger = Logger
            .getLogger(ELFHashAssignmentPolicy.class.getName());
    private static String DEFAULT_CLASS_KEY = "default...";
    private static final String DNS = "dns";

    /**
     * Returns the queue key for a URI. dns: URIs are keyed by the host of
     * the URI that triggered them; every other URI is keyed by the ELF hash
     * of its full string, modulo 100.
     */
    @Override
    public String getClassKey(CrawlController controller, CandidateURI cauri) {
        String uri = cauri.getUURI().toString();
        String scheme = cauri.getUURI().getScheme();
        String candidate = null;
        try {
            if (scheme.equals(DNS)) {
                if (cauri.getVia() != null) {
                    // Special handling for DNS: treat as being
                    // of the same class as the triggering URI.
                    // When a URI includes a port, this ensures
                    // the DNS lookup goes atop the host:port
                    // queue that triggered it, rather than
                    // some other host queue
                    UURI viaUuri = UURIFactory.getInstance(cauri.flattenVia());
                    candidate = viaUuri.getAuthorityMinusUserinfo();
                    // adopt scheme of triggering URI
                    scheme = viaUuri.getScheme();
                } else {
                    candidate = cauri.getUURI().getReferencedHost();
                }
            } else {
                // Non-DNS URIs: hash the whole URI string into one of 100 keys
                long hash = ELFHash(uri);
                candidate = Long.toString(hash % 100);
            }
            if (candidate == null || candidate.length() == 0) {
                candidate = DEFAULT_CLASS_KEY;
            }
        } catch (URIException e) {
            logger.log(Level.INFO,
                    "unable to extract class key; using default", e);
            candidate = DEFAULT_CLASS_KEY;
        }
        return candidate.replace(':', '#');
    }

    /** ELF hash of a string, masked so the result is never negative. */
    public static long ELFHash(String str) {
        long hash = 0;
        long x = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return (hash & 0x7FFFFFFF);
    }
}
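
As a rough illustration of the non-DNS branch above, here is a standalone sketch with no Heritrix dependency. It reuses the same hash and the same modulus of 100 to print the queue key a few URIs would get. The class name ElfHashBucketDemo, the queueKey helper, and the example.com URLs are made up for this illustration and are not part of the crawler code.

```java
public class ElfHashBucketDemo {
    // Same modulus as the policy class above (hash % 100).
    static final int QUEUE_COUNT = 100;

    // Copy of the ELFHash method from the policy class.
    static long elfHash(String str) {
        long hash = 0;
        long x = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return hash & 0x7FFFFFFF;
    }

    // Hypothetical helper: the key the non-DNS branch would produce.
    static String queueKey(String uri) {
        return Long.toString(elfHash(uri) % QUEUE_COUNT);
    }

    public static void main(String[] args) {
        // Made-up URIs, two of them on the same host.
        String[] uris = {
            "http://example.com/",
            "http://example.com/news/1.html",
            "http://www.example.org/index.html"
        };
        for (String uri : uris) {
            System.out.println(queueKey(uri) + " <- " + uri);
        }
    }
}
```

Because the whole URI string is hashed, two URIs on the same host will usually end up under different keys, while the dns: URIs in the class above are still keyed by host.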