2010-06-03

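Changing how Nutch parses robots.txt

Last night the improved Nutch was finally deployed on the company intranet. It crawled all night, and this morning I found the server's CPU at 100%, with the Nutch crawler still fetching the repository-related pages of every project under the Redmine management platform. Good grief: the robots.txt must be missing the corresponding rules. The robots.txt is configured as follows: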

User-agent: *
Disallow: /issues/gantt
Disallow: /issues/calendar
Disallow: /activity
Disallow: /redmine/repositories/
Disallow: /redmine/projects/redmine/repository
Disallow: /redmine/projects/redmine/issues
Disallow: /redmine/projects/redmine/activity
Disallow: /redmine/issues/gantt
Disallow: /redmine/issues/calendar
Disallow: /redmine/activity

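As we know, the repository browser URL of each project under Redmine is http://bj.ossxp.com/redmine/projects/<PROJECTNAME>/repository. Do we really have to add a rule for every single project? I looked at two earlier blog posts by Wang Sheng (robots.txt reference 1, robots.txt reference 2) and the relevant material on Wikipedia. This sentence from Wikipedia in particular caught my eye.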

The robots.txt patterns are matched by simple substring comparisons, so care should be taken to make sure that patterns matching directories have the final '/' character appended, otherwise all files with names starting with that substring will match, rather than just those in the directory intended.

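If the string after Disallow: is matched as a substring, doesn't that mean it can be any part of the path? In that case, would Disallow: /repository be enough to keep search engines from crawling the Redmine repositories? A colleague responsible for search looked at the Nutch code and quickly located the relevant code in RobotRulesParser.java: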

public class RobotRulesParser implements Configurable {
    // ...
    public boolean isAllowed(String path) {
      // ...
      int pos= 0;
      int end= entries.length;
      while (pos < end) {
        if (path.startsWith(entries[pos].prefix))
          return entries[pos].allowed;
        pos++;
      }

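It uses startsWith, of all things. In other words, the Disallow path in robots.txt is matched against the URL path from its beginning, so there is simply no way to simplify the robots.txt rules this way. Let's change it.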
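To make the limitation concrete, here is a minimal sketch of prefix matching applied to a Redmine repository URL path. The class name, the helper method, and the project name "foo" are made up for illustration. Only the startsWith() test is taken from the Nutch excerpt above.

```java
// Hypothetical demo class: only the startsWith() test mirrors the Nutch excerpt.
class PrefixMatchDemo {

    // Returns true when the rule matches the path under prefix matching.
    static boolean blockedByPrefix(String path, String rule) {
        return path.startsWith(rule);
    }

    public static void main(String[] args) {
        // "foo" is a made-up project name used only for illustration.
        String path = "/redmine/projects/foo/repository";

        // The short rule does not match, because the path does not begin with it.
        System.out.println(blockedByPrefix(path, "/repository"));                      // false

        // Only a rule spelled out from the site root matches.
        System.out.println(blockedByPrefix(path, "/redmine/projects/foo/repository")); // true
    }
}
```

This behaviour agrees with the second half of the Wikipedia sentence quoted earlier ("names starting with that substring"), which is why one rule per project would be needed with an unmodified Nutch.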

$ git co t/nutch_robots_parse
Switched to branch 't/nutch_robots_parse'
$ tg patch
From: Cui Rui <cuirui@bj.ossxp.com>
Subject: change the nutch robots.txt parse

change the nutch robots.txt parse

Signed-off-by: Cui Rui <cuirui@bj.ossxp.com>

---
 .../nutch/protocol/http/api/RobotRulesParser.java  |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java b/src/plugin/lib-http/src/java/org/apache/nutch/protocol/
index 7dd1373..5fd5839 100644
--- a/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java
+++ b/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java
@@ -180,7 +180,7 @@ public class RobotRulesParser implements Configurable {
 int pos= 0;
 int end= entries.length;
 while (pos < end) {
-        if (path.startsWith(entries[pos].prefix))
+        if (path.indexOf(entries[pos].prefix)>=0)
 return entries[pos].allowed;
 pos++;
 }
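The effect of the one-line change can be sketched outside Nutch as follows. This is a simplified stand-in, not the Nutch class: the Entry type, the class name, the sample paths, and the "allowed when no rule matches" default are assumptions made for the example. Only the indexOf(...) >= 0 test and the first-match-wins loop come from the patch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for the patched matching loop.
class SubstringRulesDemo {

    // Assumed shape of a rule entry: a path fragment plus an allow/deny flag.
    static final class Entry {
        final String prefix;
        final boolean allowed;

        Entry(String prefix, boolean allowed) {
            this.prefix = prefix;
            this.allowed = allowed;
        }
    }

    // The first entry whose fragment occurs anywhere in the path decides the result.
    static boolean isAllowed(String path, List<Entry> entries) {
        for (Entry e : entries) {
            if (path.indexOf(e.prefix) >= 0) {
                return e.allowed;
            }
        }
        return true; // assumption: a path matched by no rule is allowed
    }

    public static void main(String[] args) {
        // Corresponds to a robots.txt containing only "Disallow: /repository".
        List<Entry> rules = Arrays.asList(new Entry("/repository", false));

        // Project names "foo" and "bar" are made up for illustration.
        System.out.println(isAllowed("/redmine/projects/foo/repository", rules));          // false
        System.out.println(isAllowed("/redmine/projects/bar/repository/browse", rules));   // false
        System.out.println(isAllowed("/redmine/projects/foo/wiki", rules));                // true

        // Side effect: any path that merely contains the fragment is blocked too.
        System.out.println(isAllowed("/redmine/projects/foo/wiki/repository-howto", rules)); // false
    }
}
```

The last call shows the trade-off of this choice: substring matching makes one rule cover every project, but it also blocks unrelated pages whose path happens to contain the same fragment, so short rules need some care.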

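With the modified Nutch, the Disallow rules read from robots.txt before a site is crawled are matched in the new way. A simple, once-and-for-all robots.txt along the lines of the Disallow: /repository rule considered above then becomes possible, and we no longer need to worry that forgetting to update robots.txt after creating a new project will let the company's nightly search engine produce something like a denial-of-service attack. But what about Google, Baidu, and Bing? How do they implement this? I will run some tests (test 1, test 2, test 3, test 4, test 5) and add the results here in a month. That said, Redmine can simply disable repository browsing for anonymous users, and then search engine crawling can no longer slow the site down.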
