Preserving formatting in 'doc -> html -> docx' conversion
Posted by Ross  · 29-01-2016 - 02:18

I'm using phpdocx to let users upload a doc or docx file. The file is converted to html, the user edits the html, and the edited html is then exported back to a docx file.

During this double conversion process, formatting is always lost.  Sometimes it is minor but sometimes it is major and renders the final result useless.

The biggest problems occur when the initial file is a .doc.  I notice fewer issues when using .docx, but even with .docx some of the basics, like font sizes, are changed.  For example, in my original docx file the body text is set to 11, but in the final output the body font size ends up being 12.  The font sizes on headings also change.  Various section headings, the table of contents, etc. are all wrong.  Indents are missing.  The list goes on...

I'm using transformDocument on the original file to convert it to html. Then I'm using embedHTML to create a new docx file from the html file.
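
For reference, the round trip described above can be sketched roughly as follows. This is only an illustration, not my actual code: the function name, the file names and the include path are made up, and I'm assuming the new file is saved with createDocx after embedHTML. In the real application the html is edited by the user between the two steps, usually in a separate request.

```php
<?php
// Sketch of the doc/docx -> html -> docx round trip.
// $docx is expected to be a phpdocx CreateDocx instance.
// The function name and all paths are hypothetical.
function roundTripThroughHtml($docx, string $sourcePath, string $htmlPath, string $outputName): void
{
    // Step 1: convert the uploaded doc/docx file to html.
    $docx->transformDocument($sourcePath, $htmlPath);

    // Step 2: in the real application the user edits the html here.
    $html = file_get_contents($htmlPath);
    if ($html === false) {
        throw new RuntimeException("Could not read $htmlPath");
    }

    // Step 3: build a new docx file from the (edited) html.
    $docx->embedHTML($html);
    $docx->createDocx($outputName); // assumed save step

}

// Usage (assumes phpdocx is installed; the include path varies by version):
// require_once 'classes/CreateDocx.inc';
// roundTripThroughHtml(new CreateDocx(), 'contract.docx', 'contract.html', 'contract_edited');
```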

Is this preventable at all, or is it just the nature of the conversion process?  Could this be set up differently?  It should be noted that some of the doc/docx files may be 100+ pages long with various formatting settings within; they are mainly legal documents.

I'm also using LibreOffice with the conversion plugin.