Convert HTML to Word with PHP

  • Apr 24, 2012

Warning

This post is outdated. For up-to-date information about converting HTML to Word with PHP, please refer to HTML to Word.

One of the features most requested by our PHPDocX users is the possibility of generating Word documents from HTML.
Since the launch of version 2.5 of PHPDocX we have at our disposal two new methods, embedHTML() and replaceTemplateVariableByHTML(), that allow you to convert HTML into Word with a high degree of customization.
The configuration options for both methods include:


  • Extracting the HTML from:
      • An external URL.
      • An internal HTML file.
      • A string of HTML code.
  • Selecting different containers within the whole HTML code for the HTML to Word conversion.
  • Embedding, or not, the images included in the HTML code.
  • Embedding the HTML into a template via replaceTemplateVariableByHTML().
  • Using different styles:
      • The ones included in the CSS stylesheet used by the HTML, or written inline in the HTML code itself.
      • The Word styles included in the template used by PHPDocX.
      • Or a combination of both.




Moreover, this conversion is obtained by direct translation of the HTML code into WordprocessingML (the native Word format), so the result is fully compatible with OpenOffice (and all its variants), the Microsoft compatibility pack for Word 2003 and, most importantly, with the conversion to PDF, DOC, ODT and RTF included in the library.
I guess that at this point you would like to have a clearer idea of how the HTML to Word conversion methods work and to see some real examples of use and results.
Let's start with a very simple example and build upon it.
First, let us insert a simple HTML string into a Word document:


require_once 'path_to_phpdocx/classes/CreateDocx.inc';

$docx = new CreateDocx();

// A plain HTML string to embed in the document body
$html = '<p>A very simple HTML example.</p>';
$docx->embedHTML($html);

// Writes embedHTML_1.docx
$docx->createDocx('embedHTML_1');


As you can check from the result, the string of HTML code has been included in the Word document.
Let us now go one step further and include some HTML code while making use of native Word styles:


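require_once 'path_to_phpdocx/classes/CreateDocx.inc';

$docx = new CreateDocx();

// cell_1_1 spans two rows and two columns, so the first two body rows omit those cells
$myHTML = '<p>We include a table with rowspans and colspans using the embedHTML method.</p>
<table>
  <tr>
    <td>header 1</td>
    <td>header 2</td>
    <td>header 3</td>
    <td>header 4</td>
  </tr>
  <tr>
    <td rowspan="2" colspan="2">cell_1_1</td>
    <td>cell_1_3</td>
    <td>cell_1_4</td>
  </tr>
  <tr>
    <td>cell_2_3</td>
    <td>cell_2_4</td>
  </tr>
  <tr>
    <td>cell_3_1</td>
    <td>cell_3_2</td>
    <td>cell_3_3</td>
    <td>cell_3_4</td>
  </tr>
</table>';

// Apply one of the Word table styles included in the template used by PHPDocX
$docx->embedHTML($myHTML, array('tableStyle' => 'MediumGrid3-accent5PHPDOCX'));
$docx->createDocx('embedHTML_2');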


So you obtain (download Word document):

You may also extract the HTML content from an existing file. Rather than illustrate that functionality with embedHTML(), we will use its equivalent, replaceTemplateVariableByHTML(), which is the method to use when Word templates come into play:


require_once 'classes/CreateDocx.inc';

$docx = new CreateDocx();
$docx->addTemplate('testHTML2mdc.docx');

$docx->replaceTemplateVariableByHTML('ADDRESS', 'inline', 'C/ Matías Turrión 24, Madrid 28043 Spain', array('isFile' => false, 'parseDivsAsPs' => true, 'downloadImages' => false));
$docx->replaceTemplateVariableByHTML('CHUNK_1', 'block', 'http://www.2mdc.com/PHPDOCX/example.html', array('isFile' => true, 'parseDivsAsPs' => true, 'filter' => 'capa_bg_bottom', 'downloadImages' => true));
$docx->replaceTemplateVariableByHTML('CHUNK_2', 'block', 'http://www.2mdc.com/PHPDOCX/example.html', array('isFile' => true, 'parseDivsAsPs' => false, 'filter' => 'lateral', 'downloadImages' => true));

$docx->createDocx('webpage');



So from a very simple template and a standard web page, one may get this Word document.

You may notice that, to reproduce the effect of the floating div on the right of the web page, we have inserted a table into the template and extracted the corresponding HTML "columns" by their id, using the filter option of the replaceTemplateVariableByHTML() method, because this is the option that guarantees the closest fit to the original format.
We would also like to comment on the values accepted by the "filter" option.

"filter" can be an array or a single string, depending on how many different types of elements we wish to include in the final Word document.
In either case, each value is a string with or without certain prefixes/suffixes:

  • "#" to select an HTML element with a certain id. For example, #main_content will only extract the HTML within the div (or any other HTML element) tagged as id="main_content".
  • "." to select HTML elements with a certain class. For example, .news will only extract the HTML within the divs (or any other HTML elements) tagged as class="news".
  • "< >" to select HTML elements with a certain tag. For example, <p> will only extract the HTML within paragraphs.
  • If the prefix/suffix is omitted, PHPDocX will understand that it should extract all content with that id, class or tag. For example, p will extract all the HTML whose id, class or tag name is p.
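To make these selection rules more concrete, here is a minimal sketch in plain PHP, using only the standard DOM extension, of how a filter value could be mapped to elements of an HTML string. The helpers filterToXPath() and extractByFilter() are hypothetical names introduced for illustration; they are not part of PHPDocX and do not claim to reproduce how the library implements the option internally.

```php
<?php
// Hypothetical helper (not part of PHPDocX): translate a filter value
// into an XPath expression following the prefix/suffix rules above.
function filterToXPath(string $filter): string
{
    $first = substr($filter, 0, 1);
    if ($first === '#') {
        // "#name": element with that id
        $id = substr($filter, 1);
        return "//*[@id='$id']";
    }
    if ($first === '.') {
        // ".name": elements whose class list contains that class
        $class = substr($filter, 1);
        return "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";
    }
    if ($first === '<' && substr($filter, -1) === '>') {
        // "<name>": elements with that tag name
        $tag = substr($filter, 1, -1);
        return "//$tag";
    }
    // No prefix/suffix: match by id, class or tag name
    return "//*[@id='$filter']"
        . " | //*[contains(concat(' ', normalize-space(@class), ' '), ' $filter ')]"
        . " | //$filter";
}

// Hypothetical helper: return the outer HTML of every element selected
// by a single filter string or by an array of filter strings.
function extractByFilter(string $html, $filters): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate imperfect real-world HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $chunks = array();
    foreach ((array) $filters as $filter) {
        foreach ($xpath->query(filterToXPath($filter)) as $node) {
            $chunks[] = $doc->saveHTML($node);
        }
    }
    return $chunks;
}

// Usage: select by class, by tag, and by a combination of filters
$sample = '<div id="main_content"><p class="news">Hello</p></div><p>Bye</p>';
print_r(extractByFilter($sample, '.news'));
print_r(extractByFilter($sample, '<p>'));
print_r(extractByFilter($sample, array('#main_content', '.news')));
```

The sketch interpolates the filter value directly into the XPath expression, which is acceptable for trusted values like the ones above but would need escaping for untrusted input.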



Obviously the possibilities are, in principle, practically unlimited, and we do not want to bore you here with excessive detail.
If you want to check all the available options you may have a look at:

API documentation for embedHTML().
API documentation for replaceTemplateVariableByHTML().
PHPDocX tutorial.