Tuesday, April 19, 2011

Getting Nutch running on Windows

I used the following tutorial; up to the Intranet Crawling part it is pretty straightforward.
Then comes the first error message in Cygwin: JAVA_HOME not set. Fix it by executing 'export JAVA_HOME=/cygdrive/c/Path/To/The/JRE/'.
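
For reference, here is a minimal sketch of that fix with a quick sanity check added. The path is the same placeholder as above and has to be replaced with the real JRE location; the check for bin/java or bin/java.exe is my own addition, not something the tutorial asks for.

```shell
# Placeholder path from the post; replace it with the real JRE/JDK location.
export JAVA_HOME=/cygdrive/c/Path/To/The/JRE/

# Strip a trailing slash so the paths below look clean.
java_dir="${JAVA_HOME%/}/bin"

# Sanity check (an addition, not part of the tutorial): is there a java binary?
if [ -x "$java_dir/java" ] || [ -x "$java_dir/java.exe" ]; then
  echo "JAVA_HOME looks usable: $JAVA_HOME"
else
  echo "warning: no java executable found in $java_dir"
fi
```

The export only lasts for the current Cygwin session, so it has to be repeated (or put into ~/.bashrc) for every new shell.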

Next comes the error message "No http.agent.name set". Going back to the original tutorial, one learns that the value has to be set in conf/nutch-default.xml.
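
As a rough sketch, the property entry could look like the fragment generated below. The agent name MyTestCrawler is a made-up example, and the fragment is written to a scratch file so that nothing in conf/ gets overwritten; the <property> element then has to be merged by hand into the config file. I edited nutch-default.xml directly, but Nutch also ships a conf/nutch-site.xml meant for overrides, which is probably the cleaner place.

```shell
# Hypothetical agent name, pick your own.
AGENT_NAME="MyTestCrawler"

# Scratch file only; merge the <property> block into the Nutch config by hand.
OUT="${TMPDIR:-/tmp}/nutch-agent-property.xml"

cat > "$OUT" <<EOF
<property>
  <name>http.agent.name</name>
  <value>$AGENT_NAME</value>
</property>
EOF

# Show the generated fragment.
cat "$OUT"
```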

After that the crawl starts, apart from a small warning: "agent.name should be first in robot.agents".
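
My reading of the warning, and this is an assumption, is that it refers to the http.robots.agents property, a comma-separated list in which the value of http.agent.name should come first. The small check below only illustrates that rule with made-up values; it does not read the Nutch config.

```shell
# Hypothetical values; in practice both come from the Nutch config file.
AGENT_NAME="MyTestCrawler"
ROBOTS_AGENTS="MyTestCrawler,*"

# Take everything before the first comma, i.e. the first list entry.
first_agent="${ROBOTS_AGENTS%%,*}"

if [ "$first_agent" = "$AGENT_NAME" ]; then
  echo "ok: $AGENT_NAME is listed first"
else
  echo "warning: $AGENT_NAME should be listed first, found $first_agent"
fi
```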

The first run ended in a "path not found" exception, which is strange because about 10 segments worked properly. (Apparently it wasn't a good idea to change the depth to 10; with depth 3 it runs through.)
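
To play with the depth, the crawl call can be parameterised. This is only a sketch: urls.txt and the crawl output directory are the names I used in my setup, the variable names are invented here, and the command is only run if bin/nutch exists in the current directory.

```shell
# Crawl parameters; depth 3 is the value that ran through for me.
DEPTH=3
TOPN=100

CRAWL_CMD="bin/nutch crawl urls.txt -dir crawl -depth $DEPTH -topN $TOPN"
echo "$CRAWL_CMD"

# Only run the crawl when started from the Nutch install directory.
if [ -x bin/nutch ]; then
  $CRAWL_CMD
else
  echo "bin/nutch not found; run this from the Nutch install directory"
fi
```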

Finally 'bin/nutch crawl urls.txt -dir crawl -depth 10 -topN 100' did the job. Deploying on Tomcat runs without any problems, but don't forget to restart after changing the properties.
PS: It is easier to just copy the Nutch war file into the webapps directory, wait for the auto-deploy, and change the settings in webapps/nutch/WEB-INF/classes/nutch-default.xml.
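
The copy step from the PS could be sketched like this. It assumes that CATALINA_HOME points at the Tomcat installation (the fallback path is a placeholder), that the war file sits in the current directory, and that it is copied under the name nutch.war so it unpacks to webapps/nutch; the exact war file name depends on the Nutch release.

```shell
# Assumption: CATALINA_HOME is the Tomcat install directory (placeholder fallback).
CATALINA_HOME="${CATALINA_HOME:-/path/to/tomcat}"
WEBAPPS="$CATALINA_HOME/webapps"
CONFIG="$WEBAPPS/nutch/WEB-INF/classes/nutch-default.xml"

# Look for a Nutch war file in the current directory.
WAR=""
for f in nutch*.war; do
  if [ -e "$f" ]; then
    WAR="$f"
    break
  fi
done

if [ -n "$WAR" ] && [ -d "$WEBAPPS" ]; then
  # Copy as nutch.war so the webapp ends up in webapps/nutch.
  cp "$WAR" "$WEBAPPS/nutch.war"
  echo "copied $WAR; wait for the auto-deploy, then edit $CONFIG and restart Tomcat"
else
  echo "no war file or no webapps directory found; nothing copied"
fi
```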

