This blog collects information on working with Nutch during our master's project.
Wednesday, May 4, 2011
Nutch 2.0
A really good tutorial on running Nutch 2.0 can be found here: http://techvineyard.blogspot.com/2010/12/build-nutch-20.html
Tuesday, April 19, 2011
Problem with Luke and Nutch 1.2
I just downloaded the current Luke version from the website and tried it. It fails with the error "format version: 2 expected 1 or lower", so the downloaded Luke 0.9.9 does not support the index written by Nutch 1.2.
After a short search I found this post, which says to get version 1.0.1 from Google Code.
That does the job. Alternatively, you can downgrade the Lucene version used by your Nutch to 2.9.
Getting Nutch running on Windows
I used the following tutorial; up to the Intranet Crawling part it is pretty straightforward.
Then the first error message appears in Cygwin: JAVA_HOME is not set. Fix it by executing: 'export JAVA_HOME=/cygdrive/c/Path/To/The/JRE/'.
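The JAVA_HOME fix can be checked before starting Nutch. The following is a minimal sketch, assuming a POSIX shell such as the one Cygwin provides. The function name check_java_home is hypothetical, and the path is the placeholder from above, not a real location.

```shell
# check_java_home is a hypothetical helper, not part of Nutch or Cygwin.
# It succeeds if the given directory contains an executable bin/java
# (or bin/java.exe, as on Windows).
check_java_home() {
    [ -n "$1" ] || return 1
    [ -x "$1/bin/java" ] || [ -x "$1/bin/java.exe" ]
}

# Placeholder path from the post; replace it with your own JRE directory.
export JAVA_HOME="/cygdrive/c/Path/To/The/JRE"

if check_java_home "$JAVA_HOME"; then
    echo "JAVA_HOME looks usable: $JAVA_HOME"
else
    echo "JAVA_HOME does not contain bin/java: $JAVA_HOME" >&2
fi
```

Note that the export only lasts for the current Cygwin session, so it has to be repeated in a new shell or placed in a startup file.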
The next error message is "No http.agent.name set". Consulting the original tutorial shows that you have to set it in conf/nutch-default.xml.
Afterwards the crawl starts, apart from a small warning: "agent.name should be first in robot.agents".
The first run ended in a "path not found" exception, which is strange because about 10 segments worked properly. (Apparently it wasn't a good idea to change the depth to 10; with depth 3 it runs through.)
Finally, 'bin/nutch crawl urls.txt -dir crawl -depth 10 -topN 100' did the job. Deploying on Tomcat runs without any problems, but don't forget to restart after changing the properties.
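Since the depth value seemed to make the difference between a failing and a working run, it can help to keep it as a parameter. This is a minimal sketch, assuming it is run from the Nutch directory. run_crawl and NUTCH_BIN are hypothetical names, and the defaults of depth 3 and topN 100 are simply the values mentioned above.

```shell
# run_crawl is a hypothetical wrapper around the crawl command from the post.
# Usage: run_crawl <seed list> [depth] [topN]
# NUTCH_BIN is an assumed variable for overriding the launcher path.
run_crawl() {
    seeds=$1
    depth=${2:-3}
    topn=${3:-100}
    "${NUTCH_BIN:-bin/nutch}" crawl "$seeds" -dir crawl -depth "$depth" -topN "$topn"
}
```

With this, 'run_crawl urls.txt' starts with depth 3, and 'run_crawl urls.txt 10' repeats the crawl with depth 10 once the shallow run has worked.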
PS: It is easier to just copy the Nutch war file into the webapps directory, wait for the auto-deploy, and change the settings in webapps/nutch/WEB-INF/classes/nutch-default.xml.
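The copy step from the PS could look like this. It is a sketch under assumptions: deploy_war is a hypothetical helper, the war is renamed to nutch.war so that Tomcat unpacks it to webapps/nutch as in the path above, and auto-deploy is assumed to be enabled.

```shell
# deploy_war is a hypothetical helper, not part of Nutch or Tomcat.
# Usage: deploy_war <path to war file> <Tomcat directory>
deploy_war() {
    war=$1
    tomcat=$2
    [ -f "$war" ] || { echo "war file not found: $war" >&2; return 1; }
    [ -d "$tomcat/webapps" ] || { echo "no webapps directory in: $tomcat" >&2; return 1; }
    # Copy as nutch.war so the unpacked application ends up in webapps/nutch.
    cp "$war" "$tomcat/webapps/nutch.war"
}
```

Once Tomcat has unpacked the archive, edit webapps/nutch/WEB-INF/classes/nutch-default.xml and restart Tomcat so the changed properties are picked up.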
Monday, April 18, 2011
Interesting sites related to Nutch
Crawl Script
How to implement re-crawling
Nutch Homepage
How Nutch maps to "Map and Reduce"
What is MapReduce?
http://www.slideshare.net/abial/nutch-webscale-search-engine-toolkit
Talks on Search, Lucene and Performance
Slides on Nutch
Paper on Nutch performance and use cases
Tutorial on Nutch
How to set up a Hadoop cluster?
Post on a plugin for self-made language detection
Plugin.xml showing all extension points
Scaling of Nutch and Lucene
Talk on Hadoop and fellow Apache projects
Nice interview with Doug Cutting
Work in Progress...