Word to HTML with PHP

Introduction

phpdocx Advanced and Premium licenses can transform DOCX files to HTML using native PHP classes.

There are currently two ways to transform Word to HTML with phpdocx:

  • With the conversion plugin
  • With the TransformDocAdvHTML native PHP class

The conversion plugin runs LibreOffice or OpenOffice to perform the conversion. This method has two disadvantages: it is not native PHP, so it requires calling external programs, and the output can only be customized through PHP DOM modifications after the conversion.

The native PHP classes included in Advanced and Premium licenses transform DOCX to HTML using PHP exclusively. The main features of this functionality are the following:

  • Conversion of contents, styles and properties
  • Native PHP classes
  • Easily customizable
  • Transforms DOCX files created from scratch or from templates

How to use it

The transformation can be done using just three lines of code:

where document.docx can be a DOCX created with phpdocx or with another program (MS Word, LibreOffice, etc.). Premium licenses can also transform in-memory documents.

Supported OOXML tags and attributes

phpdocx parses contents, styles, properties and other XML contents.

The list of currently parsed contents and styles includes the following (OOXML content/style and its HTML/CSS transformation):

  • document (w:body) : <body>

    • background color (w:background) => w:color (background-color)
    • background image (v:background) => id (background-image)
    • border (w:pgBorders) => w:top (border-top), w:bottom (border-bottom), w:left (border-left), w:right (border-right): w:color (border-color: #HEX), w:sz (border-width), w:val (border-style: nil, none, dashed, dotted, double, solid), w:space (padding)
  • sections (w:sectPr) : <section>

    • size (w:pgSz) => w:w (max-width)
    • margin (w:pgMar) => w:top (margin-top), w:bottom (margin-bottom), w:left (margin-left), w:right (margin-right)
    • columns (w:cols) => w:num (columns)
  • title and metas (cp:coreProperties) : <title>, <meta>

    • title (dc:title) => <title>
    • author (dc:creator) => <meta> (author)
    • description (dc:description) => <meta> (description)
    • keywords (cp:keywords) => <meta> (keywords)
  • text strings (w:t) and text styles (w:rPr) : <span>

    • text (w:t) => <span>
    • bold (w:b) => w:val (font-weight: bold)
    • color (w:color) => w:val (color: #HEX)
    • double line through (w:dstrike) => w:val (text-decoration-style: double)
    • font family (w:rFonts) => w:ascii (font-family), w:cs (font-family)
    • font size (w:sz) => w:val (font-size)
    • highlight (w:highlight) => w:val (background-color)
    • italic (w:i) => w:val (font-style: italic)
    • line through (w:strike) => w:on (text-decoration: line-through)
    • small caps (w:smallCaps) => w:val (text-transform: uppercase; font-size: small)
    • text decoration (w:u) => w:val (text-decoration: none or underline; text-decoration-style: dashed, dotted, double, solid, wavy, none)
    • upper case (w:caps) => w:val (text-transform: uppercase)
    • vanish (w:vanish) => w:val (visibility: hidden; visibility: visible)
    • vertical align (w:vertAlign) => w:val (vertical-align: sub; vertical-align: super)
  • paragraphs (w:pPr) : <p>

    • background color (w:shd) => w:shd (background-color)
    • bold (w:b) => w:val (font-weight: bold)
    • border (w:pBdr) => w:top (border-top), w:bottom (border-bottom), w:left (border-left), w:right (border-right), w:color (border-color: #HEX), w:sz (border-width), w:val (border-style: nil, none, dashed, dotted, double, solid), w:space (padding)
    • color (w:color) => w:val (color: #HEX)
    • double line-through (w:dstrike) => w:val (text-decoration-style: double)
    • font family (w:rFonts) => w:ascii (font-family)
    • font size (w:sz) => w:val (font-size)
    • heading (w:outlineLvl) => w:val (h1, h2, h3, h4, h5, h6)
    • highlight (w:highlight) => w:val (background-color)
    • italic (w:i) => w:val (font-style: italic)
    • line height (w:spacing) => w:line (line-height)
    • line through (w:strike) => w:on (text-decoration: line-through)
    • small caps (w:smallCaps) => w:val (text-transform: lowercase)
    • margin (w:ind, w:spacing) => w:left (margin-left), w:start (margin-left), w:right (margin-right), w:end (margin-right), w:after (margin-bottom), w:before (margin-top)
    • padding (w:hanging) => w:hanging (padding-left, text-indent)
    • page break (w:pageBreakBefore) => w:val (page-break-before: always)
    • text align (w:jc) => w:val (text-align: left, justify, center, right)
    • text decoration (w:u) => w:val (text-decoration: none or underline; text-decoration-style: dashed, dotted, double, solid, wavy, none)
    • text indent (w:firstLine) => w:firstLine (text-indent)
    • text direction (w:textDirection) => w:val tbRl (direction: rtl; text-align: right;)
    • upper case (w:caps) => w:val (text-transform: uppercase)
    • vertical-align (w:vertAlign) => w:val (vertical-align: sub; vertical-align: super)
    • word wrap (w:wordWrap) => w:val (word-wrap: break-word)
  • lists (w:numPr) : <ul>, <ol>, <li>

    • type (w:numId) => w:val and w:ilvl (list-style-type: circle, disc, decimal, lower-alpha, lower-roman, upper-alpha, upper-roman)
    • see the paragraph elements for other styles
    • some styles, such as color or font size, can be inherited by the li content from the li symbol; in that case, the content must define its own style
  • links : <a>

    • bookmark (w:bookmarkStart, w:bookmarkEnd) => w:name (<a>)
    • cross-reference (w:instrText) => PAGEREF (<a>)
    • link (w:instrText) => HYPERLINK (<a>)
  • form elements

    • checkbox (w:instrText) => (<input> checkbox)
    • date (w:date) => (<input> date)
    • input (w:instrText) => (<input> text)
    • select (w:instrText, w:comboBox) => (<select>)
  • styles (see the elements on this page for the supported styles)

    • character/run (w:rPr)
    • paragraph (w:pPr)
    • list (w:pPr, w:numId, w:ilvl)
    • table (w:style, w:pPr, w:rPr)
    • styles file (w:styles) => character/run (w:rStyle), paragraph and list (w:pStyle), table
    • numbering file => list (w:abstractNum)
    • default styles (w:docDefaults, w:style w:default="1") => w:pPr, w:rPr
  • tables (w:tbl) : <table>

    • align (w:jc) => w:val (margin-left, margin-right)
    • border (w:tblBorders) => w:top, w:right, w:bottom, w:left (border-: width style [dashed, dotted, double, none, solid] color)
    • layout (w:tblLayout) => w:type fixed (table-layout)
    • margin (w:tblInd, w:tblpPr) => w:w (margin-left), w:bottomFromText (margin-bottom), w:topFromText (margin-top)
    • width (w:tblW) => w:type pct, dxa w:w (width)
    • first col style (w:tblStylePr) => w:type (w:rPr styles)
    • first row style (w:tblStylePr) => w:type (w:rPr and w:pPr styles)
    • last col style (w:tblStylePr) => w:type (w:rPr styles)
    • last row style (w:tblStylePr) => w:type (w:rPr and w:pPr styles)
    • band1Horz style (w:tblStylePr) => w:type (w:rPr and w:pPr styles)
    • band2Horz style (w:tblStylePr) => w:type (w:rPr and w:pPr styles)
    • row height (w:trPr) => w:trHeight (height)
    • rowspan (w:vMerge) => w:val restart, continue (rowspan)
    • cell background color (w:shd) => w:fill (background-color)
    • cell border (w:tcPr) => w:top, w:right, w:bottom, w:left (border-: width style [dashed, dotted, double, none, solid] color)
    • cell padding (w:tblCellMar) => w:top (padding-top), w:right (padding-right), w:bottom (padding-bottom), w:left (padding-left)
    • cell vertical align (w:vAlign) => top, bottom, center, both and default w:val (vertical-align)
    • cell width (w:tcW) => w:w (width)
    • colspan (w:gridSpan) => w:val (colspan)
    • text direction (w:textDirection) => w:val btLr, tbLrV, tbRl and tbRlV (writing-mode, transform, white-space)
  • images (w:drawing) : <img>

    • Supported image formats: PNG, JPG and other formats supported by web browsers. WMF is supported if ImageMagick is installed
    • border (a:ln, a:noFill) => w (width), a:prstDash (style: dashed, dotted, solid), a:srgbClr (color)
    • float (wp:positionH, wp:align) => right (float: right), left (float: left), center (display:block; margin-left: auto; margin-right: auto)
    • height (wp:extent) => cy (height)
    • link (a:hlinkClick) => r:id (href)
    • margin (wp:effectExtent, wp:positionH, wp:positionV) => t (margin-top), r (margin-right), b (margin-bottom), l (margin-left), wp:positionH wp:posOffset (margin-left), wp:positionV wp:posOffset (margin-top)
    • text wrapping (wp:inline, wp:anchor) => wp:inline (display: inline), wp:wrapSquare (float: left), wp:wrapNone behindDoc (position: absolute; z-index: -1)
    • width (wp:extent) => cx (width)
    • src (r:embed, r:link) => embedded and linked images
    • saved as files or as base64 (only for embedded images)
  • charts (w:drawing) : <div>

    • Supported charts: bar (group, stack and percent), column (group, stack and percent), pie, doughnut and line charts
    • The Plotly JS library (MIT license) [https://plotly.com/javascript/] is used as the default chart library
    • height (cy)
    • labels (c:cat)
    • legends (c:tx)
    • orientation (h, v)
    • values (c:val)
    • width (cx)
    • Plotly default colors are used
  • other elements

    • break (w:br) => (<br>)
    • comment (w:commentReference, w:comment) => added to the bottom of the page (<span>)
    • date (w:instrText) => TIME (<span>)
    • endnote (w:endnoteReference, w:endnote) => added to the bottom of the page (<span>)
    • external file (w:altChunk) => r:id (<a>)
    • footer (w:footerReference, w:ftr) => (<footer>) added to the bottom of its section
    • header (w:headerReference, w:hdr) => (<header>) added to the top of its section
    • footnote (w:footnoteReference) => added to the bottom of the page (<span>)
    • math equations => Office MathML
    • simple fields (w:fldSimple) => AUTHOR, COMMENTS, LASTSAVEDBY, TITLE
    • tabs (w:tab) => (<span>) margin-left default
    • textbox (v:textbox) => (<div>), style (min-height, float, width), fillcolor (background-color), margin-top (margin-top), strokecolor (border-color, border-style), strokeweight (border-width)
    • tracked contents (w:ins, w:del) => (<ins>, <del>)

    WARNING:

  • The fact that a tag is not parsed does not mean that its content disappears from the HTML output. It only means that its associated OOXML properties are not taken into account directly. Its children and text content are still parsed and rendered into the HTML output with their corresponding styles.
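
To make the run-level mappings listed above more concrete (w:b, w:i, w:caps, w:color, w:sz, w:vertAlign), here is a minimal sketch that reads a w:rPr fragment with PHP's DOM extension and builds an inline CSS string. It is not phpdocx code and does not reflect how TransformDocAdvHTML is implemented: runPropertiesToCss() is a hypothetical helper, the sample XML is made up for illustration, and the pt unit for font sizes is an assumption (w:sz is stored in half-points).

```php
<?php
// Illustrative only: NOT phpdocx code. runPropertiesToCss() is a hypothetical helper.

const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

/**
 * Map the children of a w:rPr element to an inline CSS declaration string.
 */
function runPropertiesToCss(DOMElement $rPr): string
{
    $css = [];
    foreach ($rPr->childNodes as $node) {
        if (!$node instanceof DOMElement || $node->namespaceURI !== W_NS) {
            continue; // skip text nodes and foreign elements
        }
        $val = $node->getAttributeNS(W_NS, 'val'); // '' when the attribute is absent
        // Toggle properties (w:b, w:i, w:caps) are on unless w:val switches them off.
        $isOn = !in_array($val, ['0', 'false', 'off'], true);

        switch ($node->localName) {
            case 'b':
                if ($isOn) { $css[] = 'font-weight: bold'; }
                break;
            case 'i':
                if ($isOn) { $css[] = 'font-style: italic'; }
                break;
            case 'caps':
                if ($isOn) { $css[] = 'text-transform: uppercase'; }
                break;
            case 'color':
                if ($val !== '' && $val !== 'auto') { $css[] = 'color: #' . $val; }
                break;
            case 'sz':
                // w:sz is expressed in half-points; pt output is an assumption.
                $css[] = 'font-size: ' . ((int) $val / 2) . 'pt';
                break;
            case 'vertAlign':
                if ($val === 'superscript') { $css[] = 'vertical-align: super'; }
                if ($val === 'subscript')   { $css[] = 'vertical-align: sub'; }
                break;
        }
    }
    return implode('; ', $css);
}

// Made-up sample fragment: bold, italic explicitly off, red, 12pt, superscript.
$xml = '<w:rPr xmlns:w="' . W_NS . '">'
     . '<w:b/><w:i w:val="0"/><w:color w:val="FF0000"/>'
     . '<w:sz w:val="24"/><w:vertAlign w:val="superscript"/>'
     . '</w:rPr>';

$doc = new DOMDocument();
$doc->loadXML($xml);

echo '<span style="' . runPropertiesToCss($doc->documentElement) . '">', PHP_EOL;
// prints: <span style="font-weight: bold; color: #FF0000; font-size: 12pt; vertical-align: super">
```

Properties that the switch does not handle are simply skipped, which mirrors the behaviour described in the warning: an unhandled property adds no CSS, but the text of the run is not affected.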

Examples

The transformation features included in phpdocx can handle complex DOCX documents generated from scratch or from templates. Let's take a look at some samples and their HTML output.

  • DOCX with an A4 section and paragraphs
  • DOCX with tables
  • DOCX with lists and text styles
  • DOCX with headers and footers
  • DOCX from a template
  • DOCX with charts

How to customize transformations

Nearly all the functionalities available for performing DOCX to HTML transformations can be customized.

The two main classes for transformations are TransformDocAdvHTML and TransformDocAdvHTMLPlugin.

TransformDocAdvHTML parses the DOCX structure and performs the transformation to HTML. Its constructor receives an object of the TransformDocAdvHTMLPlugin type that sets the export options. This class can be extended to customize the transformation of each element, e.g., transformW_BOOKMARKSTART for bookmarks or transformW_SECTPR for sections.

TransformDocAdvHTMLPlugin allows creating transformation plugins tailored to the project requirements, e.g., inserting images as base64, ignoring sections, customizing conversion factors, choosing the method used to set export sizes, and adding CSS, JavaScript and custom HTML. phpdocx includes TransformDocAdvHTMLDefaultPlugin, the default plugin for performing transformations.
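
To illustrate what a conversion factor does in this context, the sketch below converts the units OOXML uses for sizes into CSS units: twips for page sizes and margins (w:pgSz, w:pgMar), EMUs for image extents (wp:extent) and half-points for font sizes (w:sz). The unit definitions are standard OOXML facts; the function names, the choice of px and pt as targets, and the 96 px per inch density are assumptions made for this example, and none of it is taken from the plugin classes, whose actual options are described in the API documentation.

```php
<?php
// Illustrative only: these helpers are hypothetical and are NOT part of phpdocx.

const TWIPS_PER_INCH  = 1440;   // 1 twip = 1/20 pt, 72 pt per inch
const EMU_PER_INCH    = 914400; // English Metric Units, used by DrawingML extents
const CSS_PX_PER_INCH = 96;     // assumed CSS pixel density

/** Convert twips (w:pgSz, w:pgMar, w:ind values) to CSS pixels. */
function twipsToPx(int $twips): float
{
    return $twips * CSS_PX_PER_INCH / TWIPS_PER_INCH;
}

/** Convert EMUs (wp:extent cx and cy) to CSS pixels. */
function emuToPx(int $emu): float
{
    return $emu * CSS_PX_PER_INCH / EMU_PER_INCH;
}

/** Convert half-points (w:sz) to points. */
function halfPointsToPt(int $halfPoints): float
{
    return $halfPoints / 2;
}

// An A4 page is 11906 twips wide.
printf("max-width: %.1fpx\n", twipsToPx(11906));   // prints "max-width: 793.7px"
// An image extent of 1905000 EMU.
printf("width: %.0fpx\n", emuToPx(1905000));       // prints "width: 200px"
// A font size stored as w:sz="24".
printf("font-size: %.1fpt\n", halfPointsToPt(24)); // prints "font-size: 12.0pt"
```

Changing CSS_PX_PER_INCH rescales every pixel value at once, which is the kind of adjustment a custom conversion factor is meant to allow.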

All the available options are thoroughly explained in the API documentation page of the transformDocAdvHTML method.