Special utf-8 characters not displaying correctly, no matter what i do

Posted by reid  · 20-02-2017 - 15:29

I know this is an old topic, but none of the other posts have helped.  I have a UTF-8 database which I connect to using the utf8mb4 charset.  (Otherwise, not all special characters come over correctly; the difference is between 3-byte UTF-8 and 4-byte UTF-8.)
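
For anyone unsure about the 3-byte versus 4-byte distinction, here is a minimal sketch (the sample characters are my own choice, not taken from the data in question) that prints how many bytes a few characters occupy in UTF-8. Characters outside the Basic Multilingual Plane, such as emoji, need four bytes, which is why a 3-byte charset cannot store them.

```php
<?php
// Byte lengths of a few sample characters when encoded as UTF-8.
// The \u{...} escapes avoid depending on this file's own encoding.
$samples = [
    'pound sign' => "\u{00A3}",
    'em dash'    => "\u{2014}",
    'emoji'      => "\u{1F600}",
];

$byteLengths = [];
foreach ($samples as $name => $char) {
    // strlen() counts bytes, not characters.
    $byteLengths[$name] = strlen($char);
    printf("%-10s %d byte(s)\n", $name, $byteLengths[$name]);
}
```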

My website is using UTF-8 meta tags (<meta charset="utf-8">) which I am using to display data from the database.

I use another script to access my web page to grab the generated HTML so that I can create a Word document from it.  I remove the "<head></head>" portion of the code, but have also tested this with leaving it in, which only results in adding extra text to the top of the Word document due to text in the head.  If I view this web page directly, all of the UTF-8 characters show up as expected.  But in the Word document, they either show up with extra characters around them, or only show up as other characters.

Throughout the entire process, I check the encoding of the text with "mb_detect_encoding($str,'UTF-8',true);" and it always reports UTF-8.  I've checked the database field text by itself, and the HTML retrieved from the web page before phpdocx gets it.  It always comes back as UTF-8.  Even the files I use to write my code are encoded in UTF-8.
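
One caveat about this kind of check, shown as a small sketch: a string that has already been mis-decoded and re-encoded is still perfectly valid UTF-8, so a validity check alone cannot tell correct text from mangled text. The conversion step below only simulates such a mistake (assuming the bytes were wrongly read as ISO-8859-1); it is not a claim about where the mangling happens in the real pipeline.

```php
<?php
$original = "\u{00A3}"; // pound sign, UTF-8 bytes C2 A3

// Simulate a step that wrongly treats the UTF-8 bytes as ISO-8859-1
// and "converts" them to UTF-8 a second time.
$mangled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Both strings pass a strict UTF-8 validity check.
$originalIsValid = mb_check_encoding($original, 'UTF-8');
$mangledIsValid  = mb_check_encoding($mangled, 'UTF-8');

// Comparing the raw bytes is what exposes the difference.
echo bin2hex($original), "\n"; // c2a3
echo bin2hex($mangled), "\n";  // c382c2a3
```

Dumping the bytes with bin2hex() at each stage is therefore a more telling test than asking whether the string is UTF-8.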

I have attempted this using the configuration setting "encode_to_UTF8" set to "true" and also set to "false" just for testing purposes, but both have the same result as expected since it is only supposed to encode it to UTF-8 if it is not already UTF-8 (which it is).
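
To illustrate why the setting makes no difference here, this is a rough sketch of "encode only if not already UTF-8" logic. The function name ensureUtf8 and the ISO-8859-1 fallback are my own assumptions for illustration; this is not phpdocx's actual implementation.

```php
<?php
// Hypothetical helper: re-encode only when the input is not valid UTF-8.
function ensureUtf8(string $text, string $fallbackEncoding = 'ISO-8859-1'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8: returned unchanged
    }
    return mb_convert_encoding($text, 'UTF-8', $fallbackEncoding);
}

$utf8Pound   = "\u{00A3}"; // bytes C2 A3
$latin1Pound = "\xA3";     // single ISO-8859-1 byte, invalid as UTF-8

$fromUtf8   = ensureUtf8($utf8Pound);   // untouched
$fromLatin1 = ensureUtf8($latin1Pound); // converted to C2 A3
```

With input that is already UTF-8, a check like this is a no-op, which matches seeing identical output with the option on and off.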

Furthermore, I have also tried this without using my HTML at all, but with just using a single special character typed directly into my file - which again, is encoded in UTF-8.  But it results in the same thing; another character is added in front of the character that I wanted displayed.

I'm not sure what else to try at this point.  I'll post my code below.  Here's a free function for everyone.  ;P

function outputWord($print_url, $filename = 'export.docx', $header_url = null, $footer_url = null) {
        $filename = slugifyWithExt($filename);
        
        ini_set('memory_limit','4129M');
        
        $url  = isset($_SERVER['HTTPS']) ? 'https://' : 'http://';
        $url .= $_SERVER['SERVER_NAME'];
        
        $filepath = dirname(__DIR__).'/tmp/';
        
        $print_url = $url.trim(strip_tags($print_url));
        
        $html = htmlspecialchars_decode(file_get_contents($print_url));
        
        $remove = stristr($html, '<body', true);
        
        $html = str_replace($remove, '', $html);
        $html = '<html>'.$html;
        
        require_once(dirname(__DIR__).'/vendor/phpdocx/classes/CreateDocx.inc');
        
        $document = new CreateDocx();
        $document->modifyPageLayout('A4', array('marginTop' => 100, 'marginRight' => 100, 'marginBottom' => 100, 'marginLeft' => 100));
        
        $document->embedHTML($html);
        
        if ($header_url) {
                $header_url = $url.trim(strip_tags($header_url));
                $header_html = htmlspecialchars_decode(file_get_contents($header_url));
                
                $remove = stristr($header_html, '<body', true);
                
                $header_html = str_replace($remove, '', $header_html);
                $header_html = '<html>'.$header_html;
                
                $htmlHeader = new WordFragment($document, 'defaultHeader');
                $htmlHeader->embedHTML($header_html);
                $document->addHeader(array('default' => $htmlHeader));
        }
        
        if ($footer_url) {
                $footer_url = $url.trim(strip_tags($footer_url));
                $footer_html = htmlspecialchars_decode(file_get_contents($footer_url));
                
                $remove = stristr($footer_html, '<body', true);
                
                $footer_html = str_replace($remove, '', $footer_html);
                $footer_html = '<html>'.$footer_html;
                
                $htmlFooter = new WordFragment($document, 'defaultFooter');
                $htmlFooter->embedHTML($footer_html);
                $document->addFooter(array('default' => $htmlFooter));
        }
        
        // Saving the document as OOXML file...
        $document->createDocx($filepath.$filename);
        
        // Send the generated file to the browser as a download.
        header('Pragma: public');
        header("Expires: 0");
        header("Cache-Control: must-revalidate, post-check=0, pre-check=0");
        header("Cache-Control: private", false);
        header('Content-Description: File Transfer');
        // A .docx is a binary ZIP container, so a charset parameter does not apply to it.
        header("Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document");
        // Quote the filename and drop the stray trailing semicolon.
        header('Content-Disposition: attachment; filename="'.$filename.'"');
        header('Content-Transfer-Encoding: binary');
        
        // Stream the file instead of loading it into a string first.
        readfile($filepath.$filename);
        
        unlink($filepath.$filename);
        
        die();
}

Examples of Output

https://manage.fairwarningsoftware.com/files/downloads/web-example.jpg

https://manage.fairwarningsoftware.com/files/downloads/word-example.jpg

The two images above show what is seen in a web browser and what is seen in Microsoft Word.  Notice that in the dates, the em dash is garbled in the Word document output.

Likewise, when I used only '£' for the line "$document->embedHTML('£');", it printed "Â£" instead, and that was all I had in the document.
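
For reference, symptoms like these (an extra character in front of the pound sign, a garbled em dash) are what you typically get when UTF-8 bytes are decoded as a single-byte encoding somewhere along the way. The sketch below reproduces the effect for an em dash, assuming the wrong decoding is Windows-1252, and then reverses it. It is only a diagnostic aid; it does not identify which step in the pipeline is responsible.

```php
<?php
$emDash = "\u{2014}"; // UTF-8 bytes E2 80 94

// Simulate the bytes being misread as Windows-1252 and re-encoded as UTF-8.
// Each of the three bytes becomes its own visible character.
$mangled = mb_convert_encoding($emDash, 'UTF-8', 'Windows-1252');

// Reversing the mistake: map the three characters back to single bytes.
$repaired = mb_convert_encoding($mangled, 'Windows-1252', 'UTF-8');

echo bin2hex($emDash), "\n";      // e28094
echo mb_strlen($mangled, 'UTF-8'), " characters after mangling\n";
echo bin2hex($repaired), "\n";    // e28094
```

If the text in the Word document matches the mangled form, the bytes were correct UTF-8 going in and were misinterpreted later, rather than being wrong in the database or the source file.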