Nutch + Hadoop Distributed Deployment (2022)
Nutch 1.3 + Hadoop distributed deployment (personally tested)

1. Make sure Hadoop starts up normally.

2. Download the Nutch 1.3 package and unpack it to a directory of your choice.

3. Crawl configuration. Nutch 1.3 has two conf directories: NUTCH_HOME/conf and runtime/local/conf. runtime/local/conf holds the configuration used for local crawling; NUTCH_HOME/conf is used for distributed crawling. The rest of this guide focuses on distributed crawling.

4. Distributed crawling. Run the distributed crawl command with runtime/deploy/bin/nutch (a distributed crawl must be launched from runtime/deploy; runtime/local is for local crawls only). Give the script execute permission first:

   chmod +x bin/nutch

5. Copy the Hadoop environment. Copy these six files from HADOOP_HOME/conf to NUTCH_HOME/conf:

   core-site.xml
   hadoop-env.sh
   hdfs-site.xml
   mapred-site.xml
   masters
   slaves

6. Configure nutch-site.xml and nutch-default.xml. Setting http.agent.name is enough:

   <property>
     <name>http.agent.name</name>
     <value>MyCrawl001</value>
   </property>

7. Configure regex-urlfilter.txt so that dynamic pages are crawled:

   # skip file: ftp: and mailto: urls
   -^(file|ftp|mailto):

   # skip image and other suffixes we can't yet parse
   -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

   # skip URLs containing certain characters as probable queries, etc.
   +[?*!@=]

   # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
   -.*(/[^/]+)/[^/]+\1/[^/]+\1/

   # accept anything else
   +.

   The "+" on the query-character rule is what admits dynamic pages; the stock file rejects such URLs with "-".

8. lib/native. This is the most important point when configuring Nutch 1.3 on a Hadoop cluster. Below is the README.txt found in NUTCH_HOME/lib/native:

   These libraries are purely optional, and if they are missing Hadoop will use corresponding pure Java components. The impact of native compression becomes noticeable with larger datasets and weaker CPU-s - if you notice that the CPU is routinely saturated when a job is sorting or reducing, then using these libs may help.

   Installation instructions
   =========================
   You can obtain the necessary files from a distribution package of Hadoop, e.g. hadoop-0.20.2.tar.gz. Unpack this archive, and copy the content of lib/native here, so that the layout looks like this:

   /lib/native/Linux-amd64-64/.
   /lib/native/Linux-i386-32/.

   Local runtime
   -------------
   The build process will include these native libraries when preparing the /runtime/local environment for running in local mode. /runtime/local/bin/nutch knows how to use these libs - if they are found and correctly used you should see lines like this in your logs:

   Distributed runtime
   -------------------
   If you want to use this component in an existing Hadoop cluster (when using /runtime/deploy artifacts) you need to make sure these files are placed in Hadoop/lib/native directory on each node, and then restart the cluster. If you installed the cluster from a distribution package of Hadoop then these libraries should already be in the right place and you shouldn't need to do anything else.

   In short, you can copy the Linux-amd64-64 and Linux-i386-32 directories from HADOOP_HOME/lib/native to NUTCH_HOME/lib/native. (Strictly, the English text says "you need to make sure these files are placed in Hadoop/lib/native directory on each node, and then restart the cluster", that is, make sure the files are present on every node and restart the cluster. I copied them.)

9. Run the crawl:

   runtime/deploy/bin/nutch crawl hdfs://server0:9000/user/suse/urls -dir crawl -depth 200 -threads 200 -topN 1000

10. If the map-reduce jobs run to completion, the configuration works:

   11/08/22 16:33:26 INFO mapred.JobClient: Reduce input records=48148
   11/08/22 16:33:26 INFO crawl.CrawlDb: CrawlDb update: finished at 2011-08-22 16:33:26, elapsed: 00:00:39
   11/08/22 16:33:26 INFO crawl.Generator: Generator: starting at 2011-08-22 16:33:26
   11/08/22 16:33:26 INFO crawl.Generator: Generator: Select
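
The preparation in steps 4, 5, 8 and 9 can be collected into a few shell functions. This is only a minimal sketch: the function names (copy_hadoop_conf, copy_native_libs, make_launcher_executable, crawl_command) are hypothetical helpers made up for illustration and are not part of Nutch or Hadoop, and the directory layout is assumed to be the one described in the steps above. crawl_command only prints the command from step 9 rather than running it, so it can be checked against your own namenode address first.

```shell
# Sketch of steps 4, 5, 8 and 9. The function names are hypothetical helpers,
# not part of Nutch or Hadoop; paths follow the layout described in the guide.

# The six Hadoop configuration files listed in step 5.
HADOOP_CONF_FILES="core-site.xml hadoop-env.sh hdfs-site.xml mapred-site.xml masters slaves"

# Step 5: copy the Hadoop configuration into NUTCH_HOME/conf.
# Usage: copy_hadoop_conf HADOOP_HOME NUTCH_HOME
copy_hadoop_conf() {
  mkdir -p "$2/conf" || return 1
  for f in $HADOOP_CONF_FILES; do
    if [ ! -f "$1/conf/$f" ]; then
      echo "missing Hadoop config file: $1/conf/$f" >&2
      return 1
    fi
    cp "$1/conf/$f" "$2/conf/" || return 1
  done
}

# Step 8: copy the native libraries from HADOOP_HOME/lib/native
# to NUTCH_HOME/lib/native.
# Usage: copy_native_libs HADOOP_HOME NUTCH_HOME
copy_native_libs() {
  if [ ! -d "$1/lib/native" ]; then
    echo "no native libraries under $1/lib/native" >&2
    return 1
  fi
  mkdir -p "$2/lib/native" || return 1
  cp -R "$1/lib/native/." "$2/lib/native/"
}

# Step 4: give the distributed launcher execute permission.
# Usage: make_launcher_executable NUTCH_HOME
make_launcher_executable() {
  chmod +x "$1/runtime/deploy/bin/nutch"
}

# Step 9: print (do not run) the crawl command with the options used above.
# Usage: crawl_command NUTCH_HOME SEED_URL_DIR
crawl_command() {
  echo "$1/runtime/deploy/bin/nutch crawl $2 -dir crawl -depth 200 -threads 200 -topN 1000"
}
```

For example, after sourcing the file, copy_hadoop_conf "$HADOOP_HOME" "$NUTCH_HOME" stops at the first missing configuration file, and crawl_command "$NUTCH_HOME" hdfs://server0:9000/user/suse/urls prints the command line from step 9 for review.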