A Word 8 converter for Unix
What is it
MSWordView is a program that understands the Microsoft Word 8 binary file format (Office 97).
It currently converts Word documents into HTML, which can then be read with a browser.
MSWordView is being actively worked on and will be pretty bleeding edge for the next few weeks, so bear with me.
Current Features include
- ability to understand fastsaved files as well as non-fastsaved files.
- conversion of Word heading paragraph styles into the
appropriate heading levels of HTML.
- support for font attributes such as italic, bold, underline, subscript, superscript, font size, animated text, all caps[1], small caps[1], font face[1] and colour, converted into HTML tags.
- conversion of Word tables into HTML tables; features now include background colour and background patterns, and table width and height are also supported.
- conversion of the MS Symbol and Wingdings fonts into GIF pics for HTML output, so maths done directly
in Word shows up fairly well. Note that this does not cover Equation Editor, which is an OLE embedded type.
- encoding of non-Western-European languages into UTF-8, which should work
with at least Netscape.
- conversion of footnotes to html linked text.
- understands headers and footers, including odd, even and title pages.
- understands sections, so that section numbering restarts if need be
and is the right type of numbering (i.e. 1 vs i vs a), and so that sections get the right
headers and footers.
- conversion of lists and multilevel lists.
- extraction of some pictures is now supported: GIFs/JPGs/PNGs inserted through
the Insert->Picture->From File mechanism work, as do some other methods.
- paragraph alignment through centering or right justification is supported; other kinds of
indentation are still not supported.
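The "1 vs i vs a" numbering mentioned in the sections feature above can be illustrated with a small sketch. This is not mswordview's own code (which is C); the function names and the style strings "arabic", "roman" and "letter" are made up for this example, and the rule that letters repeat after z (aa, bb, ...) is an assumption about the usual word processor convention.

```python
# Hypothetical sketch: render a number in one of three numbering styles.
ROMAN_PAIRS = [
    (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
    (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
    (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
]


def to_roman(n: int) -> str:
    """Lower-case Roman numeral for 1 <= n <= 3999."""
    if not 1 <= n <= 3999:
        raise ValueError("Roman numerals only cover 1..3999")
    parts = []
    for value, symbol in ROMAN_PAIRS:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


def to_letter(n: int) -> str:
    """a..z, then aa, bb, ... (assumed convention: the letter is repeated)."""
    if n < 1:
        raise ValueError("letter numbering starts at 1")
    letter = chr(ord("a") + (n - 1) % 26)
    repeats = (n - 1) // 26 + 1
    return letter * repeats


def format_number(n: int, style: str) -> str:
    """Format n as 'arabic' (1), 'roman' (i) or 'letter' (a)."""
    if style == "arabic":
        return str(n)
    if style == "roman":
        return to_roman(n)
    if style == "letter":
        return to_letter(n)
    raise ValueError("unknown numbering style: " + style)
```

A converter would pick the style once per section and restart the counter at 1 whenever the section asks for it.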
Currently Unsupported Features include
- all caps, small caps and font face aren't done when the language cannot be guaranteed
to be an ASCII-based (Western European) language.
- no Office Draw stuff, no graphics that aren't GIF/PNG/JPG at the moment, and no Equation Editor or other OLE embedded types.
- output is not WYSIWYG. Watch out for stuff where a heading level in Word
has had its point size shrunk to look like ordinary text; that's going
to look wrong in HTML with default options (use -h if you get this). Fonts
might also look too large in the default output; use -f pointsize (e.g. -f 12) for this problem.
- page numbering works off hard page breaks, the ones the user puts in with Insert Break, so
page numbering might not be exactly as you would want it to be.
- indentation of lists doesn't work, for the same reason that there's no
real layout being preserved from doc to HTML.
- fully correct conversion of tab stops and other formatting done by the user with whitespace; again,
indentation done this way is just going to be broken in HTML no matter what anyone does.
- Word 6 and 7 etc. aren't currently supported, just Word 8. mswordview can't understand
these formats as they're somewhat different.
I will be working on the unsupported features, but as it's already fairly useful, I'm releasing it. Also, it only does Word 8, not Word 6 and/or
Word 7; I will be adding Word 6 capabilities to it as well and, if I get lucky, Word 7.
This is to be considered early beta software, as there's loads to be done and many
bits and bobs to be fixed and supported.
What do you need
Just the source
Web Gateway
Demo mswordview here. Don't use this to convert information you wouldn't want me to
see, because if the conversion doesn't work, I'll be using the file you convert
to try and extend what mswordview can support, which will require me to read
it. This script is broken for non-ASCII languages: mswordview supports them,
but the UTF-8 is getting stripped somewhere in the web interface to it.
More Info
MSWordView used to use laola to break the Word file up into its
OLE streams, but now uses custom C code that is included in the distribution. After that, the Word specification that Microsoft has made available
is followed to extract the text and paragraph properties, e.g. whether we are in a table or not.
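As a rough illustration of the very first step, recognising that a file is an OLE container before splitting it into streams, here is a minimal sketch. It is not the C code shipped with mswordview; the function name is made up for this example. It relies only on two well-known header fields of the OLE compound file format: the 8-byte signature at offset 0 and the 16-bit little-endian sector shift at offset 30.

```python
import struct

# The 8-byte signature that starts every OLE compound file.
OLE_SIGNATURE = bytes.fromhex("d0cf11e0a1b11ae1")


def ole_sector_size(data: bytes):
    """Return the sector size in bytes if data starts with an OLE header.

    Returns None when the data is too short or the signature does not match.
    (Hypothetical helper, for illustration only.)
    """
    if len(data) < 32 or data[:8] != OLE_SIGNATURE:
        return None
    # The sector shift is a little-endian 16-bit value at offset 30;
    # the sector size is 2 ** shift (a shift of 9 gives 512 bytes).
    (sector_shift,) = struct.unpack_from("<H", data, 30)
    return 1 << sector_shift
```

Passing the first 512 bytes of a .doc file is enough; a real decoder would go on to walk the sector allocation tables and the directory to find the individual streams.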
How to Obtain Microsoft Office File Formats
The MS Office file formats (Word, Excel, Powerpoint, Office Binder and
Office Drawing) are all freely available from the MS web site provided you
are a member of the MS Developer Network (MSDN). Joining MSDN to gain
access to these specifications is free.
Simply go to the following address:
http://msdn.microsoft.com
From the list on the left of the screen, select MSDN Library Online.
If you are not a member of the MS Developer Network you will need to join - it's free.
Once you have subscribed to the MSDN, you can obtain online copies of the file formats. To do this, follow these steps:
1. On the MSDN World Wide Web site, click MSDN Library Online.
2. Under Member Area, click the Library Online tab.
3. Double-click Microsoft Office Development.
4. Double-click Office.
5. Double-click Microsoft Office 97 Binary File Formats.
6. Select the format you are interested in (Word, Excel, Powerpoint, etc.)
There is a definite need for converters for the other MS Office products. For
this converter in particular, MS Office Draw support is needed, so go out there and work
on it.
Other Decoders and related projects
There already exist a few attempts at Word converters:
- laola (originally used by mswordview) includes one called elser; it doesn't handle Word 8, but can do Word 6 and 7.
- word2x, which is for Word 6 and doesn't do fastsaves.
- catdoc, which doesn't do fastsaves or tables, also for Word 6.
All these converters are almost magical in how far they managed to go without access to the
Microsoft format specification, and their code was terribly useful in figuring out some things.
Sun has something which displays
Word files on screen, though it doesn't print.
Corel's word processor for Linux has a very good
converter for Word 6/7/8 built in. It has had a few mistakes in conversion, but unlike the current mswordview it retains
formatting very, very well.
Use wine and the MS 16-bit Word viewer; here's a howto.
The filters project.
A word macro investigation tool
- Source: latest version as of Thu Oct 29 18:28
Warning: mswordview no longer outputs to standard output by default.
Remember this is a work in progress; it's not finished yet and may show bugs.
Known Bugs
I reckon that there are loads of problems with more complex
docs, and there are stacks of codes I haven't implemented yet. Often unknown
graphics are spat out, which are incorrect; if the graphic name says unknown,
then it's an unsupported graphic type.
Here's my CHANGELOG; keep track of it for news, updates,
what I'm working on, etc.
Mailing List
An incredibly low-volume mailing list for announcements has been set up for mswordview (Aug 24th 1998).
To subscribe, send email to mswordview-subscribe@makelist.com
To unsubscribe, send email to mswordview-unsubscribe@makelist.com
The address of the list itself is mswordview@makelist.com
The list archive is at
http://www.findmail.com/list/mswordview/
What would be nice to get
- the Word 7 (Office 95), 4, 3 & 2 formats; I have the others.
- someone to implement decoders for Excel, Access, PowerPoint & Office Draw.
- there's someone working on Excel, I see here.
- a converter from WMF to something else useful; there's a GIMP plugin that
would be a good starting point.
- a converter for Equation Editor stuff to TeX, or something else.
- a nice logo; I can't draw too well.
- some sponsorship :-), I'd love some of that cool Mindstorms Lego.