Preserving formatting in 'doc -> html -> docx' conversion
Topic closed:
Please note this is an old forum thread. Information in this post may be out of date and/or erroneous.
Every phpdocx version includes new features and improvements. Previously unsupported features may have been added to newer releases, or past issues may have been corrected.
We encourage you to download the current phpdocx version and check the Documentation available.

Posted by Ross  · 29-01-2016 - 02:18

I'm using phpdocx to let users upload a doc or docx file. The file is converted to HTML, the user edits the HTML, and the edited HTML is then exported back to a docx file.

During this double conversion, formatting is always lost. Sometimes the loss is minor, but sometimes it is major and renders the final result useless.

The biggest issue is when .doc is used as the initial file. I notice fewer issues when using .docx, but even with .docx some basic things like font sizes are changed. For example, in my original docx file the body text is set to 11, but in the final output the body font size ends up being 12. The font sizes of headings also change. Various section headings, the table of contents, etc. are all wrong. Indents are missing. The list goes on...

I'm using transformDocument on the original file to convert it to html. Then I'm using embedHTML to create a new docx file from the html file.

Is this preventable at all, or is it just the nature of the conversion process? Could this be set up differently? Note that some of the doc/docx files may be 100+ pages long with varied formatting settings, mainly legal documents.

I'm also using LibreOffice with the conversion plugin.

Posted by Ross  · 29-01-2016 - 02:34

Edited by Ross · 29-01-2016 - 04:26

Here is an example of the code I'm using to achieve this:

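<?php
require_once 'path/phpdocx/classes/CreateDocx.inc';

$docx = new CreateDocx();

$docx->enableCompatibilityMode();

// Step 1: convert the uploaded document to HTML.
$docx->transformDocument('test.docx', 'test.html');

// Step 2: embed the (user-edited) HTML file into a new document.
$html = 'http://www.domainexample.com/path/test.html';
$docx->embedHTML($html, array('isFile' => true));

// Step 3: write the result as a docx file.
$docx->createDocx('change/test');
?>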

Everything works as intended, except that the formatting of the final change/test.docx is way off from the initial test.docx.

Edit: When viewing the HTML file in my browser, it actually looks like it preserves more formatting (such as indents) but those are all lost when converting it back to docx.

Posted by admin  · 29-01-2016 - 12:52

Hello,

Are all the tags and CSS styles you're using supported? We recommend checking the supported HTML tags and CSS styles at http://www.phpdocx.com/documentation/introduction/html-to-word-PHP; you may need to clean or repair the HTML/CSS before adding it with embedHTML. For example, to get an exact font size you must set it in pt.

Alternatively, you could use the conversion plugin with LibreOffice 5 to transform the HTML back to DOCX.

Regards.
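
The "clean or repair the HTML/CSS" suggestion above could look something like the following. This is only a minimal sketch: it assumes the intermediate HTML declares font sizes in px (the thread does not confirm this), and pxFontSizesToPt is a hypothetical helper, not part of phpdocx. It rewrites every px font-size declaration as pt using the standard CSS ratio of 1px = 0.75pt. The phpdocx calls are left as comments so the helper can be run on its own.

```php
<?php
// Hypothetical helper (not part of phpdocx): rewrite "font-size: Npx"
// declarations as pt so the size is stated in an absolute unit.
function pxFontSizesToPt(string $html): string
{
    $result = preg_replace_callback(
        '/(font-size\s*:\s*)(\d+(?:\.\d+)?)px/i',
        function (array $m): string {
            // CSS defines 96px = 72pt, so 1px = 0.75pt.
            $pt = round((float) $m[2] * 0.75, 2);
            return $m[1] . $pt . 'pt';
        },
        $html
    );

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

// Possible use in the workflow from the thread (assumes phpdocx is loaded
// and that the cleaned HTML is passed to embedHTML as a string):
// $html = pxFontSizesToPt(file_get_contents('test.html'));
// $docx->embedHTML($html);
// $docx->createDocx('change/test');
```

Other units (em, %, keyword sizes such as "medium") and styles that are not in the supported list would need their own handling, so treat this as a starting point rather than a complete cleanup step.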