Special utf-8 characters not displaying correctly, no matter what i do
Topic closed:
Please note this is an old forum thread. Information in this post may be out of date and/or erroneous.
Every phpdocx version includes new features and improvements. Previously unsupported features may have been added to newer releases, or past issues may have been corrected.
We encourage you to download the current phpdocx version and check the Documentation available.

Posted by reid  · 20-02-2017 - 15:29

I know this is an old topic, but none of the other posts have helped...  I have a UTF-8 database which I am connecting to with the utf8mb4 charset.  (Otherwise, not all special characters come over correctly, as it's the difference between 3-byte UTF-8 and 4-byte UTF-8.)
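
To make the byte-width point concrete, here is a small sketch. The sample characters are chosen for illustration only and are not taken from the database in question. PHP's strlen() counts bytes rather than characters, so it shows which characters need the fourth byte that MySQL's older 3-byte utf8 charset cannot store.

```php
<?php
// strlen() counts bytes, not characters, so it reveals the UTF-8 width.
$samples = [
    'letter A' => 'A',          // U+0041, 1 byte
    'pound'    => "\u{00A3}",   // U+00A3, 2 bytes
    'em dash'  => "\u{2014}",   // U+2014, 3 bytes
    'emoji'    => "\u{1F600}",  // U+1F600, 4 bytes (needs utf8mb4 in MySQL)
];

$widths = [];
foreach ($samples as $name => $char) {
    $widths[$name] = strlen($char);
    echo $name, ': ', $widths[$name], " byte(s)\n";
}
```

Note that the pound sign and the em dash both fit within three bytes, so the choice between the two MySQL charsets would not by itself affect them.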

My website is using UTF-8 meta tags (<meta charset="utf-8">) which I am using to display data from the database.

I use another script to access my web page to grab the generated HTML so that I can create a Word document from it.  I remove the "<head></head>" portion of the code, but have also tested this with leaving it in, which only results in adding extra text to the top of the Word document due to text in the head.  If I view this web page directly, all of the UTF-8 characters show up as expected.  But in the Word document, they either show up with extra characters around them, or only show up as other characters.

During the entire process, I check the encoding of the text with "mb_detect_encoding($str,'UTF-8',true);" and it always reports UTF-8.  I've checked the database field text by itself, and the HTML retrieved from the web page before phpdocx gets it.  It always comes back as UTF-8.  Even the files I use to write my code are encoded in UTF-8.
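
One caveat about this kind of check, sketched below under some assumptions: the helper name isValidUtf8 is hypothetical, and the mbstring extension is assumed to be loaded. A validity check only says that the bytes form well-formed UTF-8. Text that was mis-decoded once and then re-encoded is also well-formed UTF-8, so it passes the check too.

```php
<?php
// Hypothetical helper: true when $str is well-formed UTF-8.
function isValidUtf8(string $str): bool
{
    return mb_check_encoding($str, 'UTF-8');
}

$good     = "\xC2\xA3";          // pound sign correctly encoded as UTF-8
$latin1   = "\xA3";              // pound sign in ISO-8859-1, not valid UTF-8
$mojibake = "\xC3\x82\xC2\xA3";  // the UTF-8 bytes above, encoded a second time

var_dump(isValidUtf8($good));
var_dump(isValidUtf8($latin1));
var_dump(isValidUtf8($mojibake));
```

In other words, a passing check before the text is handed to phpdocx does not rule out a conversion step further down the line.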

I have attempted this using the configuration setting "encode_to_UTF8" set to "true" and also set to "false" just for testing purposes, but both have the same result as expected since it is only supposed to encode it to UTF-8 if it is not already UTF-8 (which it is).

Furthermore, I have also tried this without using my HTML at all, but with just using a single special character typed directly into my file - which again, is encoded in UTF-8.  But it results in the same thing; another character is added in front of the character that I wanted displayed.

I'm not sure what else to try at this point.  I'll post my code below.  Here's a free function for everyone.  ;P

function outputWord($print_url, $filename = 'export.docx', $header_url = null, $footer_url = null) {
        $filename = slugifyWithExt($filename);
        
        ini_set('memory_limit','4129M');
        
        $url  = isset($_SERVER['HTTPS']) ? 'https://' : 'http://';
        $url .= $_SERVER['SERVER_NAME'];
        
        $filepath = dirname(__DIR__).'/tmp/';
        
        $print_url = $url.trim(strip_tags($print_url));
        
        $html = htmlspecialchars_decode(file_get_contents($print_url));
        
        $remove = stristr($html, '<body', true);
        
        $html = str_replace($remove, '', $html);
        $html = '<html>'.$html;
        
        require_once(dirname(__DIR__).'/vendor/phpdocx/classes/CreateDocx.inc');
        
        $document = new CreateDocx();
        $document->modifyPageLayout('A4', array('marginTop' => 100, 'marginRight' => 100, 'marginBottom' => 100, 'marginLeft' => 100));
        
        $document->embedHTML($html);
        
        if ($header_url) {
                $header_url = $url.trim(strip_tags($header_url));
                $header_html = htmlspecialchars_decode(file_get_contents($header_url));
                
                $remove = stristr($header_html, '<body', true);
                
                $header_html = str_replace($remove, '', $header_html);
                $header_html = '<html>'.$header_html;
                
                $htmlHeader = new WordFragment($document, 'defaultHeader');
                $htmlHeader->embedHTML($header_html);
                $document->addHeader(array('default' => $htmlHeader));
        }
        
        if ($footer_url) {
                $footer_url = $url.trim(strip_tags($footer_url));
                $footer_html = htmlspecialchars_decode(file_get_contents($footer_url));
                
                $remove = stristr($footer_html, '<body', true);
                
                $footer_html = str_replace($remove, '', $footer_html);
                $footer_html = '<html>'.$footer_html;
                
                $htmlFooter = new WordFragment($document, 'defaultFooter');
                $htmlFooter->embedHTML($footer_html);
                $document->addFooter(array('default' => $htmlFooter));
        }
        
        // Saving the document as OOXML file...
        $document->createDocx($filepath.$filename);
        
        //headers
        header('Pragma: public');
        header("Expires: 0");
        header("Cache-Control: must-revalidate, post-check=0, pre-check=0");
        header("Cache-Control: private", false);
        header('Content-Description: File Transfer');
        // A .docx file is binary, so no charset parameter; the filename is quoted and has no trailing ';'
        header('Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document');
        header('Content-Disposition: attachment; filename="'.$filename.'"');
        header('Content-Transfer-Encoding: binary');
        
        echo file_get_contents($filepath.$filename);
        
        unlink($filepath.$filename);
        
        die();
}

Examples of Output

https://manage.fairwarningsoftware.com/files/downloads/web-example.jpg

https://manage.fairwarningsoftware.com/files/downloads/word-example.jpg

So above is an example of what is seen on a web browser, and what is seen in Microsoft Word.  Notice in the dates, the m-dash is messed up in the Word document output.

Likewise, when I used only '£' for the line "$document->embedHTML('£');", it printed "Â£" instead, and that was all I had in the document.
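
For reference, a stray character in front of '£' (typically "Â") is the usual sign that UTF-8 bytes were read as a single-byte encoding such as ISO-8859-1 somewhere along the way. The sketch below reproduces the symptom. It only illustrates the mechanism, it is not a claim about where in the pipeline this happens, and it assumes the mbstring extension is available.

```php
<?php
$pound = "\u{00A3}";  // pound sign as UTF-8: bytes C2 A3

// Treat those two bytes as ISO-8859-1 text and convert them to UTF-8 again.
$garbled = mb_convert_encoding($pound, 'UTF-8', 'ISO-8859-1');

echo bin2hex($pound), "\n";    // c2a3
echo bin2hex($garbled), "\n";  // c382c2a3 (an extra character, then the pound sign)

// Reversing the conversion recovers the original bytes.
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
echo bin2hex($restored), "\n"; // c2a3
```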

Posted by admin  · 20-02-2017 - 16:04

Hello,

All characters are fully tested (Spanish, German, Hebrew, Arabic...) when using the embedHTML method. Please check that Tidy is installed and enabled on your server and try again; you need to use Tidy to import all characters correctly.
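
A quick way to check whether Tidy is available to the PHP process that runs the export is sketched below. The helper name tidyStatus is made up for this example; extension_loaded() is standard PHP. The command-line PHP and the web server PHP can load different extensions, so it is worth running this through the same entry point as the export.

```php
<?php
// Hypothetical helper: reports whether the Tidy extension is loaded
// in the PHP process that executes this script.
function tidyStatus(): string
{
    return extension_loaded('tidy')
        ? 'Tidy is loaded'
        : 'Tidy is NOT loaded; install and enable the tidy extension';
}

echo tidyStatus(), "\n";
```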

You can find a recipe in the cookbook about accents and other special characters:

http://www.phpdocx.com/documentation/cookbook/convert-html-to-word

If you still have issues after installing and enabling Tidy, please send the simplest script that illustrates your issue (without external connections or database access) to contact[at]phpdocx.com and the dev team will check it to find any mistake.

Regards.

P.S.: We have changed your username to hide the email address you entered as your username.

Posted by reid  · 20-02-2017 - 17:34

Thanks, I didn't even realize my username was just my email.  And also, thank you for the information about "Tidy".  This has resolved my issue.  I really wish I would have just posted sooner.  haha~