I need to extract hyperlinks and their text from a .docx
Topic closed:
Please note this is an old forum thread. Information in this post may be out of date and/or erroneous.
Every phpdocx version includes new features and improvements. Previously unsupported features may have been added to newer releases, or past issues may have been corrected.
We encourage you to download the current phpdocx version and check the Documentation available.

Posted by jawaidbazyar  · 13-07-2025 - 20:51

Hello,

I have a large number of existing documents that contain hyperlinks. From each document I want to extract:

  • the hyperlink URL
  • the hyperlink text

For instance, here is an XML snippet:

      <w:hyperlink r:id="rId3">
        <w:r>
          <w:rPr>
            <w:rStyle w:val="InternetLink"/>
            <w:b/>
            <w:lang w:val="en-US" w:eastAsia="en-US"/>
          </w:rPr>
          <w:t>Boeing Special Attention Requirements Bulletin 737-71-1911 RB, Revision 1</w:t>
        </w:r>
      </w:hyperlink>


I know that rId3 is a reference to an entry in another file, which contains the hyperlink URL itself.
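
For reference, that lookup can be sketched with PHP's built-in DOM classes alone, without phpdocx. This is a minimal sketch under stated assumptions: the two XML strings are hypothetical stand-ins for word/document.xml and word/_rels/document.xml.rels, the URL is a placeholder, and resolveHyperlinks is an illustrative helper name, not a phpdocx method.

```php
<?php
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';
const R_NS = 'http://schemas.openxmlformats.org/officeDocument/2006/relationships';

// Hypothetical stand-in for word/document.xml.
$documentXml = <<<'XML'
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <w:body>
    <w:p>
      <w:hyperlink r:id="rId3">
        <w:r><w:t>Boeing Special Attention Requirements Bulletin 737-71-1911 RB, Revision 1</w:t></w:r>
      </w:hyperlink>
    </w:p>
  </w:body>
</w:document>
XML;

// Hypothetical stand-in for word/_rels/document.xml.rels (placeholder URL).
$relsXml = <<<'XML'
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId3"
                Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
                Target="https://example.com/bulletin" TargetMode="External"/>
</Relationships>
XML;

// Illustrative helper: pairs each w:hyperlink's visible text with the
// Target of the relationship its r:id points to.
function resolveHyperlinks(string $documentXml, string $relsXml): array
{
    // Build an Id => Target map from the relationships part.
    $relsDom = new DOMDocument();
    $relsDom->loadXML($relsXml);
    $targets = [];
    foreach ($relsDom->getElementsByTagName('Relationship') as $rel) {
        $targets[$rel->getAttribute('Id')] = $rel->getAttribute('Target');
    }

    $docDom = new DOMDocument();
    $docDom->loadXML($documentXml);
    $xpath = new DOMXPath($docDom);
    $xpath->registerNamespace('w', W_NS);

    $links = [];
    foreach ($xpath->query('//w:hyperlink') as $hyperlink) {
        $id = $hyperlink->getAttributeNS(R_NS, 'id');
        if (!isset($targets[$id])) {
            continue; // e.g. internal bookmark links, which carry no r:id
        }
        // Concatenate only the w:t runs so layout whitespace is ignored.
        $text = '';
        foreach ($xpath->query('.//w:t', $hyperlink) as $t) {
            $text .= $t->textContent;
        }
        $links[] = ['text' => $text, 'target' => $targets[$id]];
    }
    return $links;
}

print_r(resolveHyperlinks($documentXml, $relsXml));
```

With a real file, both XML strings could be read from the .docx archive (for example with ZipArchive::getFromName) before being passed to the helper.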

I was hoping that there is a single query I can perform against the document to fetch both the text (in this example, "Boeing Special Attention...") and the hyperlink URL.

Right now the closest I have come is using two different APIs.

This gets me the hyperlink URL:

// Load the existing document
$indexer = new Indexer($fname);
$output = $indexer->getOutput();

And this gets me the hyperlink text:

// Assumption: $docx is the same file opened as a template
$docx = new CreateDocxFromTemplate($fname);

$referenceNode = array(
    'type' => 'link'
);

// Extract hyperlinks
$hyperlinks = $docx->getDocxPathQueryInfo($referenceNode);
foreach ($hyperlinks['elements'] as $element) {
    var_dump($element);
}

 

Is there an API call that returns both the URL and the text together?

Thank you.

 

Posted by admin  · 14-07-2025 - 07:04

Hello,

The current stable version of phpdocx doesn't include a direct method to extract link URLs together with their text content.
Please note that MS Word and other DOCX editors can generate links in two ways: hyperlink elements (w:hyperlink) and fields (w:instrText).
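
To illustrate the field-based form, here is a minimal sketch in plain PHP DOM (not a phpdocx API). It assumes a simple, non-nested complex field that stays inside one paragraph: a run with fldChar "begin", an instrText run holding HYPERLINK "url", a "separate" marker, the visible text runs, and an "end" marker. The sample paragraph, its URL, and the extractFieldLinks helper name are hypothetical; switches such as \l and w:fldSimple fields are not handled.

```php
<?php
// Hypothetical sample paragraph containing one field-based link.
$paragraphXml = <<<'XML'
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText xml:space="preserve"> HYPERLINK "https://example.com/page" </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:r><w:t>Example page</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>
XML;

// Illustrative helper: walks field markers in document order and pairs
// each HYPERLINK instruction with the text shown between "separate" and "end".
function extractFieldLinks(string $xml): array
{
    $wNs = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

    $dom = new DOMDocument();
    $dom->loadXML($xml);
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', $wNs);

    $links = [];
    $state = 'outside';   // outside | instruction | result
    $instruction = '';
    $text = '';

    // The union query returns the nodes in document order.
    foreach ($xpath->query('//w:fldChar | //w:instrText | //w:t') as $node) {
        if ($node->localName === 'fldChar') {
            $type = $node->getAttributeNS($wNs, 'fldCharType');
            if ($type === 'begin') {
                $state = 'instruction';
                $instruction = '';
                $text = '';
            } elseif ($type === 'separate') {
                $state = 'result';
            } elseif ($type === 'end') {
                // Keep only fields whose instruction is HYPERLINK "url".
                if (preg_match('/^\s*HYPERLINK\s+"([^"]+)"/', $instruction, $m)) {
                    $links[] = ['text' => $text, 'target' => $m[1]];
                }
                $state = 'outside';
            }
        } elseif ($node->localName === 'instrText' && $state === 'instruction') {
            $instruction .= $node->textContent;
        } elseif ($node->localName === 't' && $state === 'result') {
            $text .= $node->textContent;
        }
    }
    return $links;
}

print_r(extractFieldLinks($paragraphXml));
```

Unlike w:hyperlink, the URL here sits directly in the document XML, so no lookup in the relationships file is needed.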

If your DOCX only contains w:hyperlink elements, you can extract the needed information using custom code:

$docx = new CreateDocxFromTemplate('document.docx');

// Get every w:hyperlink element in the main document
$referenceNode = array(
    'customQuery' => '//w:hyperlink',
);
$hyperlinksInfo = $docx->getDocxPathQueryInfo($referenceNode);

// Load the relationships file, which maps each r:id to its target URL
$xmlRelsContent = $docx->getWordFiles('word/_rels/document.xml.rels');
$xmlUtilities = new XmlUtilities();
$contentRelsDOM = $xmlUtilities->generateDomDocument($xmlRelsContent);
$contentRelsXpath = new DOMXPath($contentRelsDOM);
$contentRelsXpath->registerNamespace('rel', 'http://schemas.openxmlformats.org/package/2006/relationships');

// Pair the text of each hyperlink with the Target of its relationship
$hyperlinks = array();
foreach ($hyperlinksInfo['elements'] as $hyperlinkInfo) {
    $hyperlinkEntries = $contentRelsXpath->query('//rel:Relationship[@Id="'.$hyperlinkInfo->getAttribute('r:id').'"]');
    if ($hyperlinkEntries->length > 0) {
        $hyperlinks[] = array(
            'textContent' => $hyperlinkInfo->textContent,
            'target' => $hyperlinkEntries->item(0)->getAttribute('Target'),
        );
    }
}

var_dump($hyperlinks);

We have opened a task for the dev team, and they have added support in the testing branch for extracting this information (from both hyperlinks and fields) using Indexer:

  • linksContents option to get URL and text content from hyperlinks.

Your phpdocx 15.5 license doesn't include LUS (https://www.phpdocx.com/support). If you upgrade to phpdocx 16 and include LUS, you can access these changes from the testing branch (https://www.phpdocx.com/support).
If you upgrade your license, please write to contact[at]phpdocx.com and tell us whether you are using the classic or namespaces package. We'll send you the updated Indexer class with a custom sample.

Regards.