Tom White's Blog


Tom White is lead Java developer at Kizoom, a leading U.K. software company in the delivery of personalized travel information. He has been writing Java full time since 1996, and writing about Java since 2003 for O'Reilly and IBM's developerWorks. He is currently interested in distributed computing with Jini, and text engineering with Java. Outside programming Tom enjoys making his young daughters laugh, and watching 1930s Hollywood films.

User-Friendly XML Config

Posted by tomwhite on October 27, 2005 at 02:49 PM

There's been a bit of a backlash against XML config files lately. The Ruby on Rails community has a crisp putdown: avoid "doing XML sit-ups". And the arrival of Java annotations in J2SE 5.0 seemed to muddy the waters somewhat - do I still need to use XML config? Well, yes: annotations and XML config are really designed for different things (as Dennis Sosnoski points out).

So if XML config files are going to continue to be a part of a Java developer's life, it's worth making them a little easier to work with. A trick that I noticed in one of Nutch's XML config files was the use of an xml-stylesheet processing instruction to render the file in a nice table when viewed in a browser:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<nutch-conf>
...
</nutch-conf>

The stylesheet, nutch-conf.xsl, simply transforms the configuration XML into an HTML table. You can view the result here. The technique is used elsewhere too - quite a few RSS feeds, such as the BBC's, do the same. In the context of Java XML config, I think it's a good way to make large config files more approachable. If you're new to a project, it's easy to open the file in a browser and quickly scan through it. It's certainly easier than grokking the raw XML. So if you're writing an XML config file - please add some style!
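
To make the idea concrete, here is a minimal sketch of the same transformation run from Java with the standard javax.xml.transform API instead of a browser. The property/name/value/description layout, the example.* property names and the stylesheet itself are assumptions made up for illustration; they are not copied from the real Nutch config or from nutch-conf.xsl.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ConfigToHtml {

    // Hypothetical config file: the element layout and property names are
    // illustrative assumptions, not the real Nutch configuration.
    static final String CONFIG_XML =
          "<?xml version=\"1.0\"?>"
        + "<nutch-conf>"
        + "<property><name>example.fetch.timeout</name><value>10000</value>"
        + "<description>Timeout in milliseconds.</description></property>"
        + "<property><name>example.fetch.threads</name><value>4</value>"
        + "<description>Number of worker threads.</description></property>"
        + "</nutch-conf>";

    // Hypothetical stylesheet: one table row per <property> element.
    static final String STYLESHEET =
          "<?xml version=\"1.0\"?>"
        + "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"html\" indent=\"no\"/>"
        + "<xsl:template match=\"/nutch-conf\">"
        + "<html><body><table border=\"1\">"
        + "<tr><th>name</th><th>value</th><th>description</th></tr>"
        + "<xsl:for-each select=\"property\">"
        + "<tr><td><xsl:value-of select=\"name\"/></td>"
        + "<td><xsl:value-of select=\"value\"/></td>"
        + "<td><xsl:value-of select=\"description\"/></td></tr>"
        + "</xsl:for-each>"
        + "</table></body></html>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the XSLT stylesheet to the XML and returns the resulting HTML.
    static String toHtml(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toHtml(CONFIG_XML, STYLESHEET));
    }
}
```

In a browser none of this Java is needed: the xml-stylesheet processing instruction alone tells the browser which stylesheet to apply. The sketch is only handy if you want to generate the HTML view as part of a build, or check a stylesheet from a test.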

MapReduce

Posted by tomwhite on September 25, 2005 at 10:36 PM

Doug Cutting has done it again. The creator of Lucene and Nutch has implemented (with Mike Cafarella and others) a distributed platform for high-volume data processing called MapReduce.

MapReduce is the brainchild of Google and is very well documented by Jeffrey Dean and Sanjay Ghemawat in their paper MapReduce: Simplified Data Processing on Large Clusters. In essence, it allows massive data sets to be processed in a distributed fashion by breaking the processing into many small computations of two types: a map operation that transforms the input into an intermediate representation, and a reduce function that recombines the intermediate representation into the final output. This processing model is ideal for the operations a search engine indexer like Nutch or Google needs to perform - like computing inlinks for URLs, or building inverted indexes - and it will transform Nutch into a scalable, distributed search engine.
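
As a rough illustration of the model, here is a single-process sketch of the inlink-counting example in plain Java. It does not use the Nutch MapReduce API; the class, the method names and the sample URLs are all hypothetical, and the grouping step that a real implementation would distribute across machines is just an in-memory map here.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InlinkCount {

    // Map step: for one page, emit a (targetUrl, 1) pair per outgoing link.
    static List<Map.Entry<String, Integer>> map(String sourceUrl, List<String> outlinks) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String target : outlinks) {
            emitted.add(new SimpleEntry<>(target, 1));
        }
        return emitted;
    }

    // Reduce step: sum every count that was emitted for one target URL.
    static int reduce(String targetUrl, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: map each input, group the intermediate pairs by key, then
    // reduce each group. A distributed system would spread all three
    // phases over many machines; this sketch runs them in one loop each.
    static Map<String, Integer> run(Map<String, List<String>> pages) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, List<String>> page : pages.entrySet()) {
            for (Map.Entry<String, Integer> pair : map(page.getKey(), page.getValue())) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> inlinks = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            inlinks.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return inlinks;
    }

    // Hypothetical input: each page URL mapped to the URLs it links to.
    static Map<String, List<String>> samplePages() {
        Map<String, List<String>> pages = new TreeMap<>();
        pages.put("a.example", Arrays.asList("b.example", "c.example"));
        pages.put("b.example", Arrays.asList("c.example"));
        pages.put("c.example", Arrays.asList("a.example"));
        return pages;
    }

    public static void main(String[] args) {
        // prints {a.example=1, b.example=1, c.example=2}
        System.out.println(run(samplePages()));
    }
}
```

The point of the split is that map and reduce never share state: each map call sees one input record and each reduce call sees one key with its values, which is what lets a framework run them in parallel and rerun them after a failure.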

Nutch MapReduce takes advantage of the Nutch Distributed File System (NDFS) - itself inspired by another Google Labs project, the Google File System. NDFS provides a fault-tolerant environment for working with very large files using cheap commodity hardware.

Currently MapReduce is a part of Nutch, but it has been proposed that it and NDFS be moved into a separate project. However, it is perfectly possible to use the MapReduce functionality in Nutch for your own data processing. In this post, I'll briefly describe how to get started.

Articles

Did You Mean: Lucene?
All modern search engines attempt to detect and correct spelling errors in users' search queries. This article shows you one way of adding a "did you mean" suggestion facility to your own search applications using the Lucene Spell Checker. Aug. 9, 2005

How To Build a ComputeFarm
Parallel computing allows some programs to run faster by dividing them up into smaller pieces and running these pieces on multiple processors. ComputeFarm is an open source Java framework for developing and running parallel programs. Apr. 21, 2005
