Howard Chu <hyc@symas.com>:
Hm. I use my own man2html http://highlandsun.com/hyc/man2html.c which gives pretty good looking output for us. Really, if you're developing a tool that claims to read troff input, it has to actually do so. I mean, the point of tools such as this is to be able to convert existing documents without modifying them, isn't it?
There's a subtle difference between tools that translate purely at a presentation level and tools that do content analysis.
A purely presentation-level translator such as your man2html is indeed less likely to be thrown by weird troff markup. And much of the time it will produce markup that doesn't look bad, especially on a relatively small collection of pages written in a consistent house style.
But there are things such a tool cannot do, and they become more important when you are translating a very large corpus of man pages with many different authors, such as an entire Linux distribution's man-page tree.
Here is an example: the treatment of file paths in FILES sections. Some man-page authors mark them up in bold, some in italic, and some give them no highlighting at all.
A purely presentation-level translator will simply translate any font change from troff to HTML. In the resulting output, filenames will have three different visual signatures. Readers will thus have to work a little harder than they should to recognize filenames.
A tool that does content analysis, on the other hand, can recognize the presentation-level cliches that mean "this is a filename", such as a line in a FILES section beginning with .B or .I and containing a /, and wrap the name in a DocBook <filename> tag pair. The DocBook stylesheet then ensures that every filename is visually marked in the *same* way in the generated HTML, rather than in three different ways.
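To make that concrete, here is a minimal sketch in Python of the kind of recognition rule I mean. The regular expression and the /etc/passwd entries are illustrative assumptions, not lifted from any real tool; actual rules have to cope with much messier input.

    import re

    # Assumed cliche: a FILES-section line that opens with a .B or .I
    # font request and whose argument contains a slash.
    FILENAME_CLICHE = re.compile(r'^\.(?:B|I)\s+(\S*/\S*)\s*$')

    def lift_files_line(line):
        # Map one FILES-section line to DocBook if it matches the cliche;
        # pass anything we don't recognize through untouched.
        m = FILENAME_CLICHE.match(line)
        if m:
            return "<filename>%s</filename>" % m.group(1)
        return line

    # Hypothetical entries; a presentation-level translator would emit
    # <b>/etc/passwd</b> and <i>/etc/passwd</i> here instead.
    print(lift_files_line(".B /etc/passwd"))
    print(lift_files_line(".I /etc/passwd"))

Note that the bold and italic variants both land on the same element, so it is the stylesheet, not the original author's whim, that decides how filenames finally render.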
Now multiply this effect by all the different things that can be recognized by content analysis: Unix error codes, program listings, command synopses, C function prototypes, and references to other manual pages.
The effect is a higher-quality and more visually uniform translation. The gain in quality increases with the diversity of the author population; for very large collections like an entire Linux distribution's man-page tree, it is quite significant.
The tradeoff is that, while a presentation-level translator will cheerfully produce a visual garble from ill-formed troff, a tool with a real parser and a content analyzer will have a few more cases in which it just can't cope at all.
Not *many* cases, mind you; in the decade I've been developing mine, perhaps 2-3% of pages. But these are worth fixing anyway, because markup that ill-formed is likely to break third-party man-page readers as well. Nothing but troff itself interprets troff perfectly.