The Source for Java Technology Collaboration



Tom White's Blog

Tom White is lead Java developer at Kizoom, a leading U.K. software company in the delivery of personalized travel information. He has been writing Java full time since 1996, and writing about Java since 2003 for O'Reilly and IBM's developerWorks. He is currently interested in distributed computing with Jini, and text engineering with Java. Outside programming, Tom enjoys making his young daughters laugh and watching 1930s Hollywood films.



User-Friendly XML Config

Posted by tomwhite on October 27, 2005 at 02:49 PM

There's been a bit of a backlash against XML config files lately. The Ruby on Rails community has a crisp putdown: avoid "doing XML sit-ups". And the arrival of Java annotations in J2SE 5.0 seemed to muddy the waters somewhat: do I still need to use XML config? Well, yes: annotations and XML config are really designed for different things (as Dennis Sosnoski points out).

So if XML config files are going to continue to be a part of a Java developer's life, it's worth making them a little easier to work with. A trick that I noticed in one of Nutch's XML config files was the use of an xml-stylesheet processing instruction to render the file in a nice table when viewed in a browser:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<nutch-conf>
...
</nutch-conf>

The stylesheet, nutch-conf.xsl, simply transforms the configuration XML into an HTML table. You can view the result here. It's a technique that is used elsewhere - quite a few RSS feeds do it, for example, such as the BBC's. In the context of Java XML config, I think it's a good way to make large config files more approachable. If you're new to a project, it's easy to open the file in a browser and quickly scan through it. It's certainly easier than grokking the raw XML. So if you're writing an XML config file - please add some style!
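
To give a feel for what such a stylesheet involves, here is a minimal sketch that applies a small XSLT stylesheet to a config document using the standard javax.xml.transform API, producing the same kind of HTML table a browser would render. The config layout (property elements with name, value and description children), the property names and the stylesheet itself are assumptions made up for illustration; they are not copied from Nutch's actual nutch-conf.xsl.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ConfigTableSketch {

    // Hypothetical config file: the property names and values are invented.
    // The xml-stylesheet instruction is what a browser acts on; the JAXP
    // transform below ignores it and is handed the stylesheet explicitly.
    static final String CONFIG =
          "<?xml version=\"1.0\"?>\n"
        + "<?xml-stylesheet type=\"text/xsl\" href=\"nutch-conf.xsl\"?>\n"
        + "<nutch-conf>\n"
        + "  <property>\n"
        + "    <name>example.fetch.threads</name>\n"
        + "    <value>10</value>\n"
        + "    <description>Number of fetcher threads (made-up example).</description>\n"
        + "  </property>\n"
        + "  <property>\n"
        + "    <name>example.agent.name</name>\n"
        + "    <value>MyCrawler</value>\n"
        + "    <description>Agent name sent with requests (made-up example).</description>\n"
        + "  </property>\n"
        + "</nutch-conf>\n";

    // Assumed stylesheet: one table row per <property> element.
    static final String STYLESHEET =
          "<?xml version=\"1.0\"?>\n"
        + "<xsl:stylesheet version=\"1.0\"\n"
        + "    xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">\n"
        + "  <xsl:output method=\"html\"/>\n"
        + "  <xsl:template match=\"/nutch-conf\">\n"
        + "    <html><body><table border=\"1\">\n"
        + "      <tr><th>name</th><th>value</th><th>description</th></tr>\n"
        + "      <xsl:for-each select=\"property\">\n"
        + "        <tr>\n"
        + "          <td><xsl:value-of select=\"name\"/></td>\n"
        + "          <td><xsl:value-of select=\"value\"/></td>\n"
        + "          <td><xsl:value-of select=\"description\"/></td>\n"
        + "        </tr>\n"
        + "      </xsl:for-each>\n"
        + "    </table></body></html>\n"
        + "  </xsl:template>\n"
        + "</xsl:stylesheet>\n";

    // Applies the stylesheet to the config and returns the resulting HTML.
    static String render(String configXml, String stylesheetXml)
            throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(stylesheetXml)));
        StringWriter html = new StringWriter();
        transformer.transform(
            new StreamSource(new StringReader(configXml)),
            new StreamResult(html));
        return html.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render(CONFIG, STYLESHEET));
    }
}
```

In practice you would not need any Java at all: saving the stylesheet next to the config file and opening the config in a browser is enough. Running the transform in code is just a convenient way to check the stylesheet while you write it.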



MapReduce

Posted by tomwhite on September 25, 2005 at 10:36 PM

Doug Cutting has done it again. The creator of Lucene and Nutch has implemented (with Mike Cafarella and others) a distributed platform for high-volume data processing called MapReduce.

MapReduce is the brainchild of Google and is very well documented by Jeffrey Dean and Sanjay Ghemawat in their paper MapReduce: Simplified Data Processing on Large Clusters. In essence, it allows massive data sets to be processed in a distributed fashion by breaking the processing into many small computations of two types: a map operation that transforms the input into an intermediate representation, and a reduce function that recombines the intermediate representation into the final output. This processing model is ideal for the operations a search engine indexer like Nutch or Google needs to perform - like computing inlinks for URLs, or building inverted indexes - and it will transform Nutch into a scalable, distributed search engine.
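
To make the two kinds of operation concrete, here is a toy, single-process sketch of the inlink computation mentioned above. It is not Nutch's MapReduce API: the class, the method names and the example URLs are all invented for illustration, and the grouping step that a real framework performs across many machines is simulated here with an in-memory map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InlinkCountSketch {

    // map: one input record (a page and its outlinks) becomes a list of
    // intermediate (target, source) pairs, one per outlink.
    static List<String[]> map(String page, List<String> outlinks) {
        List<String[]> pairs = new ArrayList<>();
        for (String target : outlinks) {
            pairs.add(new String[] {target, page});
        }
        return pairs;
    }

    // reduce: all sources collected for one target become a single
    // output value, here the number of inlinks.
    static int reduce(String target, List<String> sources) {
        return sources.size();
    }

    // Runs map over every record, groups the intermediate pairs by key
    // (the step a distributed framework does for you), then reduces each group.
    static Map<String, Integer> run(Map<String, List<String>> webGraph) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, List<String>> record : webGraph.entrySet()) {
            for (String[] pair : map(record.getKey(), record.getValue())) {
                grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
            }
        }
        Map<String, Integer> inlinkCounts = new TreeMap<>();
        for (Map.Entry<String, List<String>> group : grouped.entrySet()) {
            inlinkCounts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return inlinkCounts;
    }

    public static void main(String[] args) {
        // Made-up three-page web graph: page -> pages it links to.
        Map<String, List<String>> graph = new TreeMap<>();
        graph.put("a.example", Arrays.asList("b.example", "c.example"));
        graph.put("b.example", Arrays.asList("c.example"));
        graph.put("c.example", Arrays.asList("a.example"));
        System.out.println(run(graph));
        // prints {a.example=1, b.example=1, c.example=2}
    }
}
```

The point of the split is that each map call looks at one record and each reduce call looks at one key, so both can be farmed out to many machines independently; only the grouping in between needs coordination.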

Nutch MapReduce takes advantage of the Nutch Distributed File System (NDFS) - itself inspired by another Google Labs project, the Google File System. NDFS provides a fault-tolerant environment for working with very large files using cheap commodity hardware.

Currently MapReduce is a part of Nutch, but it has been proposed that it and NDFS be moved into a separate project. However, it is perfectly possible to use the MapReduce functionality in Nutch for your own data processing. In this blog entry, I'll briefly describe how to get started.

Continue Reading...




Articles

Did You Mean: Lucene?
All modern search engines attempt to detect and correct spelling errors in users' search queries. This article shows you one way of adding a "did you mean" suggestion facility to your own search applications using the Lucene Spell Checker. Aug. 9, 2005

How To Build a ComputeFarm
Parallel computing allows some programs to run faster by dividing them up into smaller pieces and running these pieces on multiple processors. ComputeFarm is an open source Java framework for developing and running parallel programs. Apr. 21, 2005



