Filtering HTML options for the embedding of HTML into Word

  • Jun 26, 2012

This information is outdated. Since PHPDocX 3.0 one can also use XPath expressions for filtering the embedded HTML. Please refer to the HTML to Word documentation for up-to-date info.
We have received a few questions about how to use the filter option available in the embedHTML and replaceTemplateVariableByHTML methods, so now seems a good time to extend the explanations in the tutorial and API documentation with some simple examples.

We will use this simple HTML page as the HTML source for our examples.

We are going to cover the most important options:

Select content by id
Select content by CSS class
Select content by HTML tag
A combination of all of them

Select HTML content by id

This example is already covered in the tutorial but we include it here for completeness.

If we want to extract the content of an HTML element with id="lateral", we just have to include it in the filter option: 'filter' => array('#lateral'). So the PHPDocX code reads:

$docx->embedHTML('', array('isFile' => true, 'parseDivsAsPs' => false, 'filter' => array('#lateral'), 'downloadImages' => true));

The resulting document reads like this.

One may also choose more than one id at a time:

$docx->embedHTML('', array('isFile' => true, 'parseDivsAsPs' => false, 'filter' => array('#lateral','#capa_bg_bottom'), 'downloadImages' => true));

As you can see, in this case both ids are selected, although some of the formatting is lost. If you want to preserve the formatting you should use the replaceTemplateVariableByHTML method (we will not elaborate on that because it is beyond the scope of this blog entry).

Select HTML content by class

To illustrate this case we are going to extract all HTML content with the classes “rosa” and “naranja”. This time we need the following code:

$docx->embedHTML('', array('isFile' => true, 'parseDivsAsPs' => false, 'filter' => array('.rosa','.naranja'), 'downloadImages' => true));

The resulting Word document reads like this.
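The calls shown so far differ only in the 'filter' entry, so the shared options can be factored out. The following is a minimal sketch: buildEmbedOptions is a hypothetical helper name, not part of PHPDocX, and the option keys are simply the ones used in this post.

```php
<?php
// Hypothetical helper (not part of PHPDocX): builds the options array used
// throughout this post, varying only the 'filter' entry.
function buildEmbedOptions(array $filter): array
{
    return array(
        'isFile' => true,
        'parseDivsAsPs' => false,
        'filter' => $filter,
        'downloadImages' => true,
    );
}

// Same selectors as in the class example.
$options = buildEmbedOptions(array('.rosa', '.naranja'));

// The array can then be passed as the second argument of embedHTML,
// assuming $docx has been created elsewhere:
// $docx->embedHTML('', $options);
```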

Select HTML content by HTML tag

What do we do if we just want to extract the content within <p> tags?

Pretty simple:

$docx->embedHTML('', array('isFile' => true, 'parseDivsAsPs' => false, 'filter' => array('p'), 'downloadImages' => true));

Isn't it? (download document).

Mixed selection

If we now want to extract the content within <h2> tags and the element with id "entrada", we just need to insert this piece of code:

$docx->embedHTML('', array('isFile' => true, 'parseDivsAsPs' => false, 'filter' => array('#entrada','h2'), 'downloadImages' => true));

To get this resulting Word document.
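To check which parts of a page a given combination of selectors would pick up before generating a document, one can approximate the selection with PHP's DOM extension. This is only a sketch under stated assumptions: selectorToXPath and previewFilter are hypothetical helpers, the translation to XPath is our own illustration and not necessarily how PHPDocX implements the filter option, and the sample HTML is made up.

```php
<?php
// Hypothetical helper: translate a simple selector ('#id', '.class' or a tag
// name) into an XPath expression. Illustration only, not PHPDocX code.
function selectorToXPath(string $selector): string
{
    if ($selector[0] === '#') {
        return sprintf('//*[@id="%s"]', substr($selector, 1));
    }
    if ($selector[0] === '.') {
        // Match the class name as a whole word inside the class attribute.
        return sprintf(
            '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
            substr($selector, 1)
        );
    }
    return '//' . $selector;
}

// Hypothetical helper: return the trimmed text of every node matched by at
// least one of the selectors, in document order.
function previewFilter(string $html, array $selectors): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect real-world HTML
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // The XPath union operator combines the individual expressions.
    $query = implode(' | ', array_map('selectorToXPath', $selectors));

    $result = array();
    foreach ($xpath->query($query) as $node) {
        $result[] = trim($node->textContent);
    }
    return $result;
}

// Made-up sample markup; selects the #entrada div and the h2 element.
$html = '<div id="entrada"><p>Intro</p></div><h2>Title</h2><p class="rosa">Pink</p>';
print_r(previewFilter($html, array('#entrada', 'h2')));
```

Because the expressions are joined with a union, an element matched by several selectors is listed only once.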

We hope that at this point generalizing the procedure to more sophisticated cases is fairly straightforward (but have a look at: