Chapter 10 Nutch
Published: 2019-06-23


http://lucene.apache.org/nutch/

How to Setup Nutch and Hadoop

http://wiki.apache.org/nutch/NutchHadoopTutorial

  1. Download

    $ cd /usr/local/src/
    $ wget http://apache.etoak.com/lucene/nutch/nutch-1.0.tar.gz
    $ tar zxvf nutch-1.0.tar.gz
    $ sudo cp -r nutch-1.0 ..
    $ cd ..
    $ sudo ln -s nutch-1.0 apache-nutch
  2. Create the seed file myurl

    $ cd apache-nutch
    $ mkdir urls
    $ vim urls/myurl
    http://netkiller.8800.org/
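For scripted installs, the seed file from step 2 can also be written without opening an editor. The following is a minimal sketch, assuming it is run from the Nutch install directory and that one start URL per line is all the file needs:

```shell
# Create the seed directory and write one start URL per line.
# Assumes the current directory is the Nutch install directory.
mkdir -p urls
printf '%s\n' 'http://netkiller.8800.org/' > urls/myurl

# Show what the crawler will read as its start URLs.
cat urls/myurl
```

Further start URLs can be appended with `>>` instead of `>`.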
  3. Configure crawl-urlfilter.txt

    Edit conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the domain you want to crawl.

    $ cp conf/crawl-urlfilter.txt conf/crawl-urlfilter.txt.old
    $ vim conf/crawl-urlfilter.txt

    # accept hosts in MY.DOMAIN.NAME
    +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

    Change it to:

    # accept hosts in MY.DOMAIN.NAME
    +^http://([a-z0-9]*\.)*netkiller.8800.org/
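Before starting a crawl it can help to check which URLs the filter rule from step 3 would let through. The sketch below tests the same regular expression with `grep -E`. This only approximates the Java regex engine that Nutch uses, and the `check_url` helper is a hypothetical name introduced for illustration:

```shell
# Same pattern as the '+' rule, without the leading '+'
# (in crawl-urlfilter.txt a leading '+' means "accept URLs matching this regex").
pattern='^http://([a-z0-9]*\.)*netkiller.8800.org/'

# Print "accept" or "reject" for one URL.
# grep -E stands in for Nutch's Java regex matching here.
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo accept
  else
    echo reject
  fi
}

check_url 'http://netkiller.8800.org/'
check_url 'http://www.netkiller.8800.org/index.html'
check_url 'http://example.com/'
```

The dots in `netkiller.8800.org` are unescaped, so each one matches any character. Writing the host as `netkiller\.8800\.org` makes the rule stricter.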
  4. http.agent.name

    $ vim conf/nutch-site.xml

    <property>
      <name>http.agent.name</name>
      <value>Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1) Gecko/20090624 Firefox/3.5</value>
      <description>HTTP 'User-Agent' request header. MUST NOT be empty -
      please set this to a single word uniquely related to your organization.
      NOTE: You should also check other related properties:
      http.robots.agents
      http.agent.description
      http.agent.url
      http.agent.email
      http.agent.version
      and set their values appropriately.
      </description>
    </property>

    <property>
      <name>http.agent.description</name>
      <value></value>
      <description>Further description of our bot- this text is used in
      the User-Agent header. It appears in parenthesis after the agent name.
      </description>
    </property>

    <property>
      <name>http.agent.url</name>
      <value>http://netkiller.8800.org/robot.html</value>
      <description>A URL to advertise in the User-Agent header. This will
      appear in parenthesis after the agent name. Custom dictates that this
      should be a URL of a page explaining the purpose and behavior of this
      crawler.
      </description>
    </property>

    <property>
      <name>http.agent.email</name>
      <value>openunix@163.com</value>
      <description>An email address to advertise in the HTTP 'From' request
      header and User-Agent header. A good practice is to mangle this
      address (e.g. 'info at example dot com') to avoid spamming.
      </description>
    </property>
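Since http.agent.name must not be empty, a quick check of the configured value can catch a mistake before the crawl starts. This is a rough sketch only. `get_property` is a hypothetical helper, it assumes the name and value elements each sit on their own line in the usual name/value property layout, and it reads a temporary sample file rather than the real conf/nutch-site.xml:

```shell
# Print the <value> of one property from a Hadoop-style config file.
# Assumes <name> and <value> are on separate lines, name first.
get_property() {
  awk -v name="$2" '
    index($0, "<name>" name "</name>") { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      exit
    }
  ' "$1"
}

# Hypothetical sample file standing in for conf/nutch-site.xml.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1) Gecko/20090624 Firefox/3.5</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://netkiller.8800.org/robot.html</value>
  </property>
</configuration>
EOF

agent=$(get_property "$sample" http.agent.name)
if [ -n "$agent" ]; then
  echo "http.agent.name is set: $agent"
else
  echo "http.agent.name is empty (it must not be)"
fi
```

For anything beyond a quick check, a real XML parser is a safer choice than line-based matching.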
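  5. Run the following command to start the crawl

    $ bin/nutch crawl urls -dir crawl -depth 3 -threads 5

    General form:

    bin/nutch crawl <urls> -dir <dirnames> -depth 2 -threads 4 >&logs/logs1.log

    urls              directory holding the files of URLs to crawl, i.e. /nutch/urls
    -dir dirnames     directory in which the fetched pages are saved
    -depth depth      link depth of the crawl
    -delay delay      delay between requests to different hosts, in seconds
    -threads threads  number of threads to start
    -topN 50          maximum number of pages fetched at each depth level

    $ nohup bin/nutch crawl /usr/local/apache-nutch/urls -dir /usr/local/apache-nutch/crawl -depth 5 -threads 50 -topN 50 > /tmp/nutch.log &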
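To size a crawl from step 5 it helps to know how -depth and -topN interact. Per the Nutch tutorial, -topN caps the pages retrieved at each level, so one run fetches at most depth × topN pages. The sketch below does that arithmetic, with `max_pages` as a hypothetical helper name:

```shell
# Upper bound on pages fetched by one crawl run:
# each of the -depth rounds fetches at most -topN pages.
max_pages() {
  depth=$1
  topn=$2
  echo $((depth * topn))
}

# Values from the nohup example: -depth 5 -topN 50
max_pages 5 50
```

With -depth 5 and -topN 50 the bound is 250 pages. The real count is often lower, because a level may find fewer than topN new URLs.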
  6. Deploy

    $ cd /usr/local/apache-tomcat/conf/Catalina/localhost
    $ vim nutch.xml

    searcher.dir

    $ vim /usr/local/apache-tomcat/webapps/nutch/WEB-INF/classes/nutch-site.xml

    <property>
      <name>searcher.dir</name>
      <value>/usr/local/apache-nutch/crawl</value>
    </property>

    test

    http://172.16.0.1:8080/nutch/
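If the search page comes up but returns no results, one thing to check is whether the directory named in searcher.dir actually holds a finished crawl. The following sketch looks for the subdirectories a Nutch 1.0 crawl typically leaves behind. The list of names is an assumption and may differ between versions, and `check_crawl_dir` is a hypothetical helper:

```shell
# Return 0 if the directory looks like a finished crawl.
# The subdirectory names are an assumption based on typical Nutch 1.0 output.
check_crawl_dir() {
  for sub in crawldb linkdb segments index; do
    if [ ! -d "$1/$sub" ]; then
      echo "missing: $1/$sub"
      return 1
    fi
  done
  echo "ok: $1"
}

# Path taken from the searcher.dir value; prints a "missing" line if the
# crawl has not been run on this machine.
check_crawl_dir /usr/local/apache-nutch/crawl || true
```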

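Original source: Netkiller series notes (Netkiller 系列 手札)
Author: 陈景峯
To repost, please contact the author, and be sure to credit the original source and author and to include this notice.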
