Cookbook

Tips to convert HTML to Word

The embedHTML method and its counterpart for templates, replaceVariableByHTML, convert HTML with CSS to Word while preserving the original content and styles as closely as possible. To get the closest match to the original HTML and avoid errors, follow the good practices described here.
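As a rough illustration of the two entry points, the sketch below wraps each call in a small helper. It assumes the call shapes embedHTML($html, $options) and replaceVariableByHTML($variable, $type, $html, $options); the helper names are hypothetical, and the signatures should be checked against the installed phpdocx version.

```php
<?php
// Hypothetical helpers. $docx is expected to be a phpdocx CreateDocx object
// and $template a CreateDocxFromTemplate object; both signatures are assumed
// from the phpdocx documentation.

function addHtmlToDocument($docx, string $html): void
{
    // Converts the HTML and appends the result to the document body.
    $docx->embedHTML($html);
}

function fillPlaceholderWithHtml($template, string $variable, string $html): void
{
    // Assumption: 'block' replaces the paragraph holding the placeholder,
    // while 'inline' would replace only the placeholder itself.
    $template->replaceVariableByHTML($variable, 'block', $html);
}
```

Passing the document object in, rather than creating it inside the helper, keeps the sketch independent of how the library is loaded in a given project.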

Supported tags and styles

phpdocx supports nearly all the HTML tags and CSS styles that have an equivalent in MS Word.

On our website you can find the HTML to DOCX documentation and the complete list of compatible tags and styles.

When working with HTML 5 tags (such as 'section' or 'main') and an old version of Tidy, you may need to upgrade to the latest release of Tidy to set styles correctly. Otherwise, some styles may not be applied to these tags.

Besides these HTML tags and CSS styles, when importing HTML you can also assign existing Word styles to classes, ids or specific tags with the wordStyles option.

The HTML Extended and CSS Extended features let you use custom HTML tags and CSS styles to invoke the library methods, and thus add contents and styles that are not available in standard HTML. Thanks to this functionality, you can use HTML to insert headers, footers, comments, TOCs, page numbers, WordFragments and many other contents and styles.
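A minimal sketch of the wordStyles option follows. The mapping format (class, id and tag selectors as array keys, Word style names as values) is an assumption based on the phpdocx documentation; the helper name and the style names MyQuoteStyle and MyTableStyle are hypothetical and must match styles that already exist in the target document.

```php
<?php
// Sketch only: selector keys and style names are illustrative assumptions.
function embedWithWordStyles($docx, string $html): void
{
    $options = [
        'wordStyles' => [
            '.quote'   => 'MyQuoteStyle',  // elements with class="quote"
            '#summary' => 'MyQuoteStyle',  // the element with id="summary"
            '<table>'  => 'MyTableStyle',  // every table tag
        ],
    ];

    // $docx is expected to be a phpdocx CreateDocx object.
    $docx->embedHTML($html, $options);
}
```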

Tidy, incorrect tagging, accents and other non-ASCII characters

For a proper HTML import, tags and styles must be correctly opened and closed; in other words, the structure of the code must be valid. phpdocx uses the PHP extension Tidy (http://php.net/manual/en/book.tidy.php) to correct the HTML and generate valid markup. You can install this extension on any operating system that runs PHP.

To import HTML with accents, we also recommend installing the PHP mbstring extension so that the character encoding can be detected automatically.

Warning

If the Tidy extension is not installed, errors may occur: CSS styles may appear as text in the document, the HTML may be imported incorrectly, or accents and other non-ASCII characters may not be displayed.
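One way to catch a missing extension before importing any HTML is to check for it at startup. The sketch below uses only PHP's built-in extension_loaded function; the helper name is hypothetical.

```php
<?php
// Returns the names of the recommended extensions that are not loaded.
function missingHtmlImportExtensions(): array
{
    $recommended = ['tidy', 'mbstring'];

    return array_values(array_filter(
        $recommended,
        static fn (string $name): bool => !extension_loaded($name)
    ));
}

$missing = missingHtmlImportExtensions();
if ($missing !== []) {
    // Written to the PHP error log so it does not leak into generated output.
    error_log('Missing PHP extensions for HTML import: ' . implode(', ', $missing));
}
```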

Defining widths in tables

To assign widths to table columns correctly, it is advisable to define the width of the table as well as the width of its cells. You can choose between percentage values and fixed widths, the latter being the recommended choice. You cannot combine both, e.g., a 10% width for one column and 400px for another.
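The sketch below builds a one-row table that follows this advice: every cell gets a fixed width in pixels, and the table itself gets an explicit width. Setting the table width to the sum of the cell widths is an assumption made here for consistency, not a documented requirement; the function name and values are illustrative.

```php
<?php
// $cells maps the cell text to its fixed width in pixels.
function fixedWidthTableRow(array $cells): string
{
    // Assumption: the table width is the sum of its column widths.
    $total = array_sum($cells);

    $html = '<table style="width: ' . $total . 'px;"><tr>';
    foreach ($cells as $text => $width) {
        // Only fixed units are used; percentages are never mixed in.
        $html .= '<td style="width: ' . $width . 'px;">'
            . htmlspecialchars((string) $text)
            . '</td>';
    }

    return $html . '</tr></table>';
}
```

The returned string can then be passed to embedHTML like any other HTML fragment.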

Divide and Optimize

Although the import of HTML and CSS is heavily optimized, transforming thousands of lines with varied tagging and styles may affect performance.

To achieve the best possible performance, divide the code you are importing. For example, instead of adding a 10000-line HTML file with a single embedHTML call, you could divide it into five HTML files and call embedHTML once for each.

This simple step can significantly reduce CPU and memory consumption.
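The division can be sketched as follows, assuming the content is already available as a list of self-contained HTML fragments (for example, one per section or chapter). Cutting a file blindly by line count could split a tag in two, so the fragments are kept whole and only grouped into batches. The function name is hypothetical.

```php
<?php
// Embeds the fragments in batches and returns the number of embedHTML calls.
// $docx is expected to be a phpdocx CreateDocx object.
function embedInBatches($docx, array $fragments, int $batchSize): int
{
    $calls = 0;

    // max(1, ...) guards against a zero or negative batch size.
    foreach (array_chunk($fragments, max(1, $batchSize)) as $batch) {
        // Each batch is a smaller HTML document converted on its own.
        $docx->embedHTML(implode("\n", $batch));
        $calls++;
    }

    return $calls;
}
```

The right batch size depends on the content, so it is worth measuring memory and time with a few values rather than assuming one.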

phpdocx 9 performance improvements

phpdocx 9 included several changes in the HTML to Word classes that, on average, use 60% less RAM and run 15% faster than previous versions.

Extra blank spaces added to the beginning of paragraphs

HTML to DOCX methods use PHP Tidy to repair HTML contents automatically. A few versions of PHP Tidy don't work correctly when the default wrap value is 0 (no wrapping), and add extra line breaks to the HTML, so a blank space may appear at the beginning of paragraphs.

phpdocx 10 and newer releases use a very high wrap value (9999999999) to work around this bug in specific PHP Tidy versions. The disableWrapValue option tells phpdocx not to apply its own wrap value, so that the value set in the PHP Tidy config file is used instead.
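A minimal sketch of the option in use, assuming disableWrapValue is a boolean passed in the embedHTML options array; the helper name is hypothetical.

```php
<?php
// $docx is expected to be a phpdocx CreateDocx object (version 10 or newer).
function embedWithTidyConfigWrap($docx, string $html): void
{
    // Assumption: true stops phpdocx from applying its own high wrap value,
    // so the wrap setting from the PHP Tidy config file takes effect.
    $docx->embedHTML($html, ['disableWrapValue' => true]);
}
```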