Tuesday, April 19, 2011

Problem with Luke and Nutch 1.2

I just tried the current Luke version from the website. It fails with the error "format version: 2 expected 1 or lower", so the downloaded Luke 0.9.9 does not support Nutch 1.2.
After a short search I found this post, which says to get version 1.0.1 from Google Code.
That does the job. As an alternative, you can downgrade the Lucene version used by your Nutch to 2.9.

Getting Nutch running on Windows

I used the following tutorial. Up to the Intranet Crawling part it is pretty straightforward.
Then the first error message appears in Cygwin: JAVA_HOME is not set. Fix it by executing: 'export JAVA_HOME=/cygdrive/c/Path/To/The/JRE/'.
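The JAVA_HOME fix can be combined with a quick sanity check, so a wrong path is noticed before Nutch starts. This is only a sketch: the path is the placeholder from above, and the check for bin/java or bin/java.exe is my assumption about how the JRE directory is laid out.

```shell
# Placeholder JRE location from the post; replace it with your real install path.
export JAVA_HOME="/cygdrive/c/Path/To/The/JRE"

# Report whether a java executable exists under that path (assumed layout: bin/java or bin/java.exe).
if [ -x "$JAVA_HOME/bin/java" ] || [ -x "$JAVA_HOME/bin/java.exe" ]; then
    echo "JAVA_HOME looks usable: $JAVA_HOME"
else
    echo "No java executable found under $JAVA_HOME/bin" >&2
fi
```

Putting the export line into ~/.bashrc should save retyping it in every new Cygwin shell.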

Next comes the error message "No http.agent.name set". A look at the original tutorial shows that you have to set it in conf/nutch-default.xml.
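To see whether the property really has a value after editing, a small helper can print the entry from the config file. This is a sketch with a made-up function name (show_agent_name); it assumes the usual layout where the value line directly follows the name line.

```shell
# Hypothetical helper: print the http.agent.name entry of a Nutch config file.
# Shows the <name> line plus the two lines after it, where <value> is expected.
show_agent_name() {
    grep -A 2 '<name>http.agent.name</name>' "$1"
}

# Example call from the Nutch directory:
#   show_agent_name conf/nutch-default.xml
```

An empty value element in the output means the agent name is still unset. The same call works for conf/nutch-site.xml, which is where Nutch conventionally keeps local overrides.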

Afterwards the crawl starts, apart from a small warning: "agent.name should be first in robot.agents".

The first run ends in a "path not found" exception, which is strange because about 10 segments were processed properly. (Apparently it wasn't a good idea to change the depth to 10; with depth 3 it runs through.)

Finally, 'bin/nutch crawl urls.txt -dir crawl -depth 10 -topN 100' did the job. Deploying on Tomcat works without any problems, but don't forget to restart after changing the properties.
PS: It is easier to just copy the Nutch war file into the webapps directory, wait for the auto-deploy, and then change the settings in webapps/nutch/WEB-INF/classes/nutch-default.xml.
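The copy step from the PS can be sketched as a small function. The function name, the war file name nutch-1.2.war, and the use of CATALINA_HOME are assumptions for illustration. Tomcat names the deployed directory after the war file, so the copy is renamed to nutch.war to end up in webapps/nutch.

```shell
# Hypothetical helper: copy a Nutch war into Tomcat's webapps directory as nutch.war,
# so that auto-deploy unpacks it to webapps/nutch.
deploy_nutch_war() {
    war_file="$1"       # e.g. nutch-1.2.war (assumed file name)
    tomcat_home="$2"    # e.g. "$CATALINA_HOME" (assumed to point at the Tomcat install)
    cp "$war_file" "$tomcat_home/webapps/nutch.war"
}

# Example call:
#   deploy_nutch_war nutch-1.2.war "$CATALINA_HOME"
```

Once Tomcat has unpacked the war, the settings file from the PS should appear under webapps/nutch/WEB-INF/classes/.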