Wednesday, May 4, 2011

Nutch 2.0

A really good tutorial to run Nutch 2.0 can be found here: http://techvineyard.blogspot.com/2010/12/build-nutch-20.html

It is very important to make the changes mentioned in the tutorial, such as changing the Gora version and updating the Ivy files together with the Ivy library.


Tuesday, April 19, 2011

Problem with Luke and Nutch 1.2

I just tried to download the current Luke version from the website. It fails with the error "format version: 2 expected 1 or lower", so the downloaded Luke 0.9.9 does not support Nutch 1.2.
After a short search I found this post, which says to get the 1.0.1 version from Google Code.
That does the job. As an alternative, you can downgrade the Lucene version of your Nutch to 2.9.

Getting Nutch running on Windows

I used the following tutorial; up to the Intranet Crawling part it is pretty straightforward.
Then the first error message appears in Cygwin: JAVA_HOME is not set. Fix it by executing: 'export JAVA_HOME=/cygdrive/c/Path/To/The/JRE/'.
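A minimal sketch of that fix for the Cygwin shell, with a small sanity check added. The path is the placeholder from this post, not a real location, and the check for bin/java is my own addition, so treat it as an assumption about how your JRE is laid out.

```shell
# Placeholder path from the post; replace it with your real JRE location.
export JAVA_HOME=/cygdrive/c/Path/To/The/JRE/

# Strip a trailing slash before building the path to the java executable.
java_dir="${JAVA_HOME%/}/bin"

# Warn if no java executable is found there (java.exe on Windows).
if [ -x "$java_dir/java" ] || [ -x "$java_dir/java.exe" ]; then
  echo "JAVA_HOME looks usable: $JAVA_HOME"
else
  echo "warning: no java executable found in $java_dir"
fi
```

The export only lasts for the current shell session. Putting the export line into ~/.bashrc makes it stick for new Cygwin windows.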

Next comes the error message "No http.agent.name set". Consulting the original tutorial shows that you have to set it in conf/nutch-default.xml.
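To check whether the property is actually filled in, something like the following could help. It is only a sketch: show_agent_name is a helper name I made up, and it assumes the <value> line directly follows the <name> line in the file, as it does in the stock configuration.

```shell
# show_agent_name CONF_FILE  (hypothetical helper, not part of Nutch)
# Prints the http.agent.name <name> line plus the line after it,
# which is normally the <value> line.
show_agent_name() {
  grep -A 1 '<name>http.agent.name</name>' "$1"
}

# Example call from the Nutch directory:
# show_agent_name conf/nutch-default.xml
```

If the printed <value> line is empty, the crawler will stop again with the same "No http.agent.name set" message.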

Afterwards the crawl starts, apart from a small warning: "agent.name should be first in robot.agents".

The first run ends in a "path not found" exception, which is strange because about 10 segments work properly. (Apparently it wasn't a good idea to change the depth to 10; with depth 3 it runs through.)

Finally 'bin/nutch crawl urls.txt -dir crawl -depth 10 -topN 100' did the job. Deploying on Tomcat works without any problems, but don't forget to restart after changing the properties.
PS: It is easier to just copy the Nutch war file into the webapps directory, wait for the auto-deploy, and change the settings in webapps/nutch/WEB-INF/classes/nutch-default.xml.
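The copy step from the PS could look roughly like this. deploy_war is a hypothetical helper, and the war file name and the CATALINA_HOME variable in the example call are assumptions about your setup. The file is copied as nutch.war because Tomcat names the unpacked directory after the war file, which is what gives the webapps/nutch path.

```shell
# deploy_war WAR_FILE WEBAPPS_DIR  (hypothetical helper)
# Copies the war as nutch.war so that Tomcat auto-deploys it to WEBAPPS_DIR/nutch.
deploy_war() {
  war_file="$1"
  webapps_dir="$2"
  if [ ! -f "$war_file" ]; then
    echo "war file not found: $war_file" >&2
    return 1
  fi
  if [ ! -d "$webapps_dir" ]; then
    echo "webapps directory not found: $webapps_dir" >&2
    return 1
  fi
  cp "$war_file" "$webapps_dir/nutch.war"
}

# Example call; both paths are assumptions, adjust them to your installation:
# deploy_war nutch-1.2.war "$CATALINA_HOME/webapps"
```

After the auto-deploy has unpacked the war, edit the settings under webapps/nutch/WEB-INF/classes/ and restart Tomcat.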