Tom White's Blog
User-Friendly XML Config
Posted by tomwhite on October 27, 2005 at 02:49 PM

There's been a bit of a backlash against XML config files lately. The Ruby on Rails community has a crisp putdown: avoid "doing XML sit-ups". And the arrival of Java annotations in J2SE 5.0 seemed to muddy the waters somewhat: do I still need to use XML config? Well, yes: annotations and XML config are really designed for different things (as Dennis Sosnoski points out).

So if XML config files are going to continue to be a part of a Java developer's life, it's worth making them a little easier to work with. A trick that I noticed in one of Nutch's XML config files was the use of an xml-stylesheet processing instruction:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
    <nutch-conf>
      ...
    </nutch-conf>

The stylesheet, ...

MapReduce
Posted by tomwhite on September 25, 2005 at 10:36 PM

Doug Cutting has done it again. The creator of Lucene and Nutch has implemented (with Mike Cafarella and others) a distributed platform for high-volume data processing called MapReduce. MapReduce is the brainchild of Google and is very well documented by Jeffrey Dean and Sanjay Ghemawat in their paper MapReduce: Simplified Data Processing on Large Clusters.

In essence, it allows massive data sets to be processed in a distributed fashion by breaking the processing into many small computations of two types: a map operation that transforms the input into an intermediate representation, and a reduce operation that recombines the intermediate representation into the final output. This processing model is ideal for the operations a search engine indexer like Nutch or Google needs to perform - like computing inlinks for URLs, or building inverted indexes - and it will transform Nutch into a scalable, distributed search engine.

Nutch MapReduce takes advantage of the Nutch Distributed File System (NDFS) - itself inspired by another Google Labs project, the Google File System. NDFS provides a fault-tolerant environment for working with very large files using cheap commodity hardware.

Currently MapReduce is a part of Nutch, but it has been proposed that it and NDFS be moved into a separate project. However, it is perfectly possible to use the MapReduce functionality in Nutch for your own data processing. In this blog entry, I'll briefly describe how to get started.
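
To make the two operations concrete, here is a minimal single-machine sketch in plain Java of the inlink computation mentioned above. It does not use the Nutch MapReduce API at all: the class name `InlinkSketch`, the methods `map`, `reduce` and `run`, and the sample page names are all made up for illustration. A real MapReduce run would spread the map and reduce calls across many machines, whereas this just loops in memory.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: not part of Nutch.
public class InlinkSketch {

    // Map step: for one page and its outlinks, emit (target, source) pairs.
    static List<Map.Entry<String, String>> map(String page, List<String> outlinks) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String target : outlinks) {
            pairs.add(new SimpleEntry<>(target, page));
        }
        return pairs;
    }

    // Reduce step: combine every source emitted for one URL into a sorted
    // inlink list. The url argument is unused here but shows the usual shape.
    static List<String> reduce(String url, List<String> sources) {
        List<String> inlinks = new ArrayList<>(sources);
        Collections.sort(inlinks);
        return inlinks;
    }

    // Driver: map every input record, group the intermediate pairs by key,
    // then reduce each group.
    static Map<String, List<String>> run(Map<String, List<String>> linkGraph) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, List<String>> page : linkGraph.entrySet()) {
            for (Map.Entry<String, String> pair : map(page.getKey(), page.getValue())) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up link graph: page -> pages it links to.
        Map<String, List<String>> graph = new TreeMap<>();
        graph.put("a.html", Arrays.asList("b.html", "c.html"));
        graph.put("b.html", Arrays.asList("c.html"));
        graph.put("c.html", Arrays.asList("a.html"));
        // Prints {a.html=[c.html], b.html=[a.html], c.html=[a.html, b.html]}
        System.out.println(run(graph));
    }
}
```

The grouping in `run` stands in for the step between map and reduce where the intermediate pairs are collected by key. Because each `map` call looks at only one page and each `reduce` call at only one URL, the work can in principle be split up freely, which is what makes the model suit large clusters.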