I need to extract hyperlinks and their text from a .docx

Posted by jawaidbazyar  · 13-07-2025 - 20:51

Hello,

I have a large number of existing documents that contain hyperlinks. From each document I want to extract:

the hyperlink URL

the hyperlink text

For instance, here is an XML snippet:

      <w:hyperlink r:id="rId3">
        <w:r>
          <w:rPr>
            <w:rStyle w:val="InternetLink"/>
            <w:b/>
            <w:lang w:val="en-US" w:eastAsia="en-US"/>
          </w:rPr>
          <w:t>Boeing Special Attention Requirements Bulletin 737-71-1911 RB, Revision 1</w:t>
        </w:r>
      </w:hyperlink>


I know the rId3 is a reference to an entry in another file containing the hyperlink itself.

I was hoping that there is a single query I can perform against the document to fetch both the text (in this example, "Boeing Special Attention...") and the hyperlink URL.

Right now the closest I have come is using two different APIs.

This gets me the hyperlink URL:

    // Load the existing document
    $indexer = new Indexer($fname);
    $output = $indexer->getOutput();

And this gets me the hyperlink text:

    // $docx is assumed to be the same document, loaded with CreateDocxFromTemplate
    $referenceNode = array(
        'type' => 'link',
    );

    // Extract hyperlinks
    $hyperlinks = $docx->getDocxPathQueryInfo($referenceNode);
    foreach ($hyperlinks['elements'] as $element) {
        var_dump($element);
    }

Is there an API call where I can get both the URL and the text together in one call?

Thank you.

 

Posted by admin  · 14-07-2025 - 07:04

Hello,

The current stable version of phpdocx doesn't include a direct method to extract URL links with their related text contents.
Please note that MS Word and other DOCX editors can generate links using two tags: hyperlinks (w:hyperlink) and fields (w:instrText).
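
To illustrate the field-based form: the instruction stored in w:instrText normally reads HYPERLINK "url", optionally followed by switches such as \o for a tooltip. The following is only a minimal sketch in plain PHP, not phpdocx code. parseHyperlinkInstruction is a hypothetical helper name, and it assumes the URL is quoted and that the text of all w:instrText runs belonging to one field has already been concatenated.

```php
<?php
// Hypothetical helper (not part of phpdocx): returns the URL found in a
// HYPERLINK field instruction, or null when the instruction has no quoted URL.
function parseHyperlinkInstruction(string $instruction): ?string
{
    // Expected shape: HYPERLINK "https://example.com" \o "tooltip"
    // An internal link such as HYPERLINK \l "bookmark" does not match,
    // because the switch comes before the first quoted argument.
    if (preg_match('/^\s*HYPERLINK\s+"([^"]*)"/i', $instruction, $matches)) {
        return $matches[1];
    }

    return null;
}
```

The visible link text of a field is not in the instruction itself; it sits in the runs between the w:fldChar "separate" and "end" markers, so it has to be collected separately.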

If your DOCX only contains w:hyperlink tags, you can extract the needed information using custom code:

$docx = new CreateDocxFromTemplate('document.docx');

// get hyperlinks
$referenceNode = array(
    'customQuery' => '//w:hyperlink',
);
$hyperlinksInfo = $docx->getDocxPathQueryInfo($referenceNode);

// get rels content
$xmlRelsContent = $docx->getWordFiles('word/_rels/document.xml.rels');
$xmlUtilities = new XmlUtilities();
$contentRelsDOM = $xmlUtilities->generateDomDocument($xmlRelsContent);
$contentRelsXpath = new DOMXPath($contentRelsDOM);
$contentRelsXpath->registerNamespace('rel', 'http://schemas.openxmlformats.org/package/2006/relationships');

// get hyperlinks information
$hyperlinks = array();
foreach ($hyperlinksInfo['elements'] as $hyperlinkInfo) {
    $hyperlinkEntries = $contentRelsXpath->query('//rel:Relationship[@Id="'.$hyperlinkInfo->getAttribute('r:id').'"]');
    if ($hyperlinkEntries->length > 0) {
        $hyperlinks[] = array(
            'textContent' => $hyperlinkInfo->textContent,
            'target' => $hyperlinkEntries->item(0)->getAttribute('Target'),
        );
    }
}

var_dump($hyperlinks);
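
The same join between w:hyperlink and the relationships file can also be sketched with PHP's built-in DOM classes only, which may help when checking the result outside phpdocx. This is a sketch under assumptions, not the phpdocx implementation: extractHyperlinks is a hypothetical function name, it receives the contents of word/document.xml and word/_rels/document.xml.rels as strings, and it ignores field-based links as well as headers, footers and other parts that have their own rels files.

```php
<?php
// Hypothetical helper (not part of phpdocx): pairs each w:hyperlink in
// document.xml with its Target from document.xml.rels.
function extractHyperlinks(string $documentXml, string $relsXml): array
{
    $wNs = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';
    $rNs = 'http://schemas.openxmlformats.org/officeDocument/2006/relationships';
    $relNs = 'http://schemas.openxmlformats.org/package/2006/relationships';

    // Build an Id => Target map from the relationships part.
    $relsDom = new DOMDocument();
    $relsDom->loadXML($relsXml);
    $targets = array();
    foreach ($relsDom->getElementsByTagNameNS($relNs, 'Relationship') as $rel) {
        $targets[$rel->getAttribute('Id')] = $rel->getAttribute('Target');
    }

    $docDom = new DOMDocument();
    $docDom->loadXML($documentXml);
    $xpath = new DOMXPath($docDom);
    $xpath->registerNamespace('w', $wNs);

    $links = array();
    foreach ($xpath->query('//w:hyperlink') as $hyperlink) {
        $id = $hyperlink->getAttributeNS($rNs, 'id');
        if ($id === '' || !isset($targets[$id])) {
            // Internal anchor link (w:anchor) or a missing relationship.
            continue;
        }

        // The visible text can be split over several runs; join all w:t nodes.
        $text = '';
        foreach ($xpath->query('.//w:t', $hyperlink) as $textNode) {
            $text .= $textNode->textContent;
        }

        $links[] = array('textContent' => $text, 'target' => $targets[$id]);
    }

    return $links;
}
```

The two input strings can be read from the DOCX with ZipArchive::getFromName(), passing 'word/document.xml' and 'word/_rels/document.xml.rels'. Building the Id => Target map once avoids running one XPath query against the rels file per hyperlink.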

We have opened a task with the dev team, and they have added support in the testing branch to extract this information (from hyperlinks and fields) using Indexer:

  • linksContents option to get URL and text content from hyperlinks.

Your phpdocx 15.5 license doesn't include LUS (https://www.phpdocx.com/support). If you upgrade to phpdocx 16 and include LUS, you can access these changes from the testing branch (https://www.phpdocx.com/support).
If you upgrade your license, please email contact[at]phpdocx.com and tell us whether you are using the classic or namespaces package. We'll send you the updated Indexer class with a custom sample.

Regards.