20 Apache Nutch Reviews
Navom S.
Multidepth crawling capabilities are really good. Data extraction from web pages is remarkable. Review collected by and hosted on G2.com.
Based on MapReduce, hence slower. Adding customisations meant writing plugins and rebuilding, and there is no support for dependency injection.
Provides an in-depth list of features, HTML tags, and sitemaps.
Didn't have a lot of documentation at the time I was using it, which made it hard to use.
Imtiaz S.
Easy to use.
Can crawl almost all kinds of content.
Excellent plugin system.
Supports different storage backends.
Hard to master. Steep learning curve.
Poor documentation. Much of it is outdated or broken.
Difficult to set up for a production system.
I used Apache Nutch for crawling under Cygwin. It was configured in a few easy steps and helped collect the desired data.
I didn't see any disadvantages, to be honest.
Nutch supports distributed fetching and Hadoop, so fetching, storage, and indexing can be spread across multiple machines.
Another attractive point is its plug-in framework, which makes it convenient to extend web content parsing, data collection, querying, clustering, filtering, and other functions. Because of this framework, Nutch plug-in development is very easy, and third-party plug-ins keep emerging, which has greatly enhanced Nutch's functionality and reputation.
Nutch's crawler customization ability is relatively weak.
Secondary development on the Nutch crawler costs a lot of compilation and debugging time.
Justin C.
I love how easy to configure and run it is and how it performs at scale. Storing in Hadoop is a breeze.
Not quite as easy to use as tools like Scrapy.
HTTP proxy support, so my IP does not get blocked.
Nice file size filter with advanced control of network bandwidth.
I heard that many big companies and government agencies are using Nutch in production.
Nutch has parallel reducers that make use of multiple network connections and multi-core CPUs.
I wish Nutch had built-in rate limiting support.
Implemented in Java, which is a bit memory hungry.
Fetching and parsing are done separately by default; this reduces the risk of an error corrupting the fetch/parse stage of a Nutch crawl.
* Plugins have been overhauled as a direct result of removal of legacy Lucene dependency for indexing and search.
* The number of plugins for processing various document types being shipped with Nutch has been refined.
The only parser plugins shipped with Nutch now are Feed (RSS/Atom), HTML, Ext, JavaScript, SWF, Tika & ZIP.
Nutch has had scoring plugins for quite a while and has supported things like adaptive fetch schedules. All of the Nutch data is held in databases that can be interrogated through the command-line tools and Java. There is also an emerging REST interface, along with work to create a Python client for it.
Nutch doesn't have to be batch mode.
So let's say that, as a Nutch crawl administrator, your client has tasked you with the following: "Get me domain specific material from a database such as NTIS." (NTIS, the National Technical Information Service, serves as the largest central resource for government-funded scientific, technical, engineering, and business related information available today.) What this really translates to is the following: