Experiences working with Nutch

Monday, September 3, 2012

Hadoop could not obtain block

I just got an exciting new exception:

java.lang.IllegalStateException: hdfs://mypath0/part-03511
at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator$1.apply(SequenceFileDirValueIterator.java:131)
at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator$1.apply(SequenceFileDirValueIterator.java:1)
at com.google.common.collect.Iterators$9.transform(Iterators.java:845)
at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
at com.google.common.collect.Iterators$6.hasNext(Iterators.java:583)
at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
at org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(ClusterClassifier.java:208)
at org.apache.mahout.clustering.iterator.CIMapper.setup(CIMapper.java:28)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.io.IOException: Could not obtain block: blk_-3246019420168585051_14555 file=/mypath0/part-03511
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:2269)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2063)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2224)
at java.io.DataInputStream.readFully(DataInputStream.java:178)

If you look into your datanode log, you may find something like this:

2012-03-03 14:28:55,095 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(112.28.22.231:50010, storageID=DS-314214910-127.0.0.2-50010-1346671568278, infoPort=50075, ipcPort=50020):DataXceiver
java.io.IOException: xceiverCount 275 exceeds the limit of concurrent xcievers 256
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:92)
at java.lang.Thread.run(Thread.java:662)

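This points to a configuration issue on your datanode: it has hit its limit of 256 concurrent xceivers, so it refuses further block requests and the client ends up with "Could not obtain block". Try setting the xcievers property to a value higher than 256 in your hdfs-site.xml.
Check the Hadoop book page for detailed information!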
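
For reference, here is a minimal sketch of what that change could look like in hdfs-site.xml. I am assuming a Hadoop 1.x cluster (which is what the stack trace above suggests), where the property is called dfs.datanode.max.xcievers (the misspelling is part of the name). The value 4096 is only a commonly used choice, not something the log tells you; pick whatever fits your cluster.

```xml
<?xml version="1.0"?>
<!-- hdfs-site.xml on every datanode (sketch, Hadoop 1.x property name assumed) -->
<configuration>
  <property>
    <!-- Upper bound on concurrent DataXceiver threads per datanode.
         The log above shows the limit of 256 being exceeded. -->
    <name>dfs.datanode.max.xcievers</name>
    <!-- Assumed example value; anything comfortably above 256 -->
    <value>4096</value>
  </property>
</configuration>
```

The datanodes have to be restarted to pick up the new value. In later Hadoop releases the property was renamed to dfs.datanode.max.transfer.threads, so check which name your version expects.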

Wednesday, May 4, 2011

Nutch 2.0

A really good tutorial on running Nutch 2.0 can be found here: http://techvineyard.blogspot.com/2010/12/build-nutch-20.html

It is very important to make the changes mentioned in the tutorial, such as changing the Gora version and updating the Ivy files along with the Ivy library.

Tuesday, April 19, 2011

Problem with Luke and Nutch 1.2

I just tried the current Luke version from the website. It fails with the error "format version: 2 expected 1 or lower", so the downloaded Luke 0.9.9 does not support Nutch 1.2.
After a short search I found this post, which says to get version 1.0.1 from Google Code.
That does the job. Alternatively, you can downgrade the Lucene version used by your Nutch to 2.9.

Getting Nutch running on Windows

I used the following tutorial, which is pretty straightforward up to the Intranet Crawling part.
Then the first error message appears in Cygwin, "JAVA_HOME not set", which you can fix by executing: 'export JAVA_HOME=/cygdrive/c/Path/To/The/JRE/'.

The next error message is "No http.agent.name set". The original tutorial explains that you have to set this property in conf/nutch-default.xml.

Afterwards the crawl starts, apart from a small warning: "agent.name should be first in robot.agents".
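
Both messages above can be handled with a small configuration override. This is a minimal sketch, assuming you put it in conf/nutch-site.xml (which overrides the values in nutch-default.xml) instead of editing the defaults directly. The agent name MyTestCrawler is a made-up placeholder, and http.robots.agents is the property I take the warning to be referring to.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (sketch; MyTestCrawler is a hypothetical agent name) -->
<configuration>
  <property>
    <!-- Fixes "No http.agent.name set": the name your crawler sends to servers -->
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
  </property>
  <property>
    <!-- Addresses the warning: list the agent name first, keep * at the end -->
    <name>http.robots.agents</name>
    <value>MyTestCrawler,*</value>
  </property>
</configuration>
```

Keeping the overrides in nutch-site.xml means the shipped defaults stay untouched, which makes it easier to see later what you actually changed.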

The first run ends in a "path not found" exception, which is strange because about 10 segments work properly. (Apparently it wasn't a good idea to change the depth to 10; with depth 3 it runs through.)

Finally, 'bin/nutch crawl urls.txt -dir crawl -depth 10 -topN 100' did the job. Deploying on Tomcat runs without any problems, but don't forget to restart it after changing the properties.
PS: It is easier to just copy the Nutch WAR file into the webapps directory, wait for the auto-deploy, and change the settings in webapps/nutch/WEB-INF/classes/nutch-default.xml.