Topic: I need to extract hyperlinks and their text from a .docx

Posted by jawaidbazyar · 13-07-2025 - 20:51

Hello,

I have a large number of existing documents that contain hyperlinks. From each document I want to extract:

the hyperlink URL

the hyperlink text

For instance, here is an XML snippet:

<w:hyperlink r:id="rId3">
  <w:r>
    <w:rPr>
      <w:rStyle w:val="InternetLink"/>
      <w:b/>
      <w:lang w:val="en-US" w:eastAsia="en-US"/>
    </w:rPr>
    <w:t>Boeing Special Attention Requirements Bulletin 737-71-1911 RB, Revision 1</w:t>
  </w:r>
</w:hyperlink>

I know the rId3 is a reference to an entry in another file containing the hyperlink itself.
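
For illustration, that lookup can be sketched with PHP's built-in DOM extension alone. The relationships XML below is a made-up sample (the Target URL is a placeholder), and resolveTarget() is a hypothetical helper, not a phpdocx API:

```php
<?php
// Made-up sample of word/_rels/document.xml.rels; the Target URL is a placeholder.
$relsXml = <<<'XML'
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId3"
      Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
      Target="https://example.com/bulletin" TargetMode="External"/>
</Relationships>
XML;

// Hypothetical helper: returns the Target of the relationship with the given id,
// or null when no such relationship exists.
function resolveTarget(string $relsXml, string $rId): ?string
{
    $dom = new DOMDocument();
    $dom->loadXML($relsXml);

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('rel', 'http://schemas.openxmlformats.org/package/2006/relationships');

    // Compare ids in PHP rather than concatenating them into the XPath expression.
    foreach ($xpath->query('//rel:Relationship') as $relationship) {
        if ($relationship->getAttribute('Id') === $rId) {
            return $relationship->getAttribute('Target');
        }
    }

    return null;
}

echo resolveTarget($relsXml, 'rId3'), PHP_EOL;
```

The r:id value on w:hyperlink is only a key; the URL itself lives in the Target attribute of the matching Relationship entry.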

I was hoping that there is a single query I can perform against the document to fetch both the text (in this example, "Boeing Special Attention...") and the hyperlink URL.

Right now the closest I have come is using two different APIs:

This gets me the hyperlink URL:

// Load the existing document
$indexer = new Indexer($fname);
$output = $indexer->getOutput();

and this gets me the hyperlink text:

$referenceNode = array(
    'type' => 'link',
);

// Extract hyperlinks
$hyperlinks = $docx->getDocxPathQueryInfo($referenceNode);
foreach ($hyperlinks['elements'] as $element) {
    var_dump($element);
}

Is there an API call that returns both the URL and the text together?

Thank you.

Posted by admin · 14-07-2025 - 07:04

Hello,

The current stable version of phpdocx doesn't include a direct method to extract link URLs together with their text content.
Please note that MS Word and other DOCX editors can generate links in two ways: hyperlink elements (w:hyperlink) and fields (w:instrText).

If your DOCX only contains w:hyperlink elements, you can extract the needed information using custom code:

$docx = new CreateDocxFromTemplate('document.docx');

// get hyperlinks
$referenceNode = array(
    'customQuery' => '//w:hyperlink',
);
$hyperlinksInfo = $docx->getDocxPathQueryInfo($referenceNode);

// get rels content
$xmlRelsContent = $docx->getWordFiles('word/_rels/document.xml.rels');
$xmlUtilities = new XmlUtilities();
$contentRelsDOM = $xmlUtilities->generateDomDocument($xmlRelsContent);
$contentRelsXpath = new DOMXPath($contentRelsDOM);
$contentRelsXpath->registerNamespace('rel', 'http://schemas.openxmlformats.org/package/2006/relationships');

// get hyperlinks information
$hyperlinks = array();
foreach ($hyperlinksInfo['elements'] as $hyperlinkInfo) {
    $hyperlinkEntries = $contentRelsXpath->query('//rel:Relationship[@Id="'.$hyperlinkInfo->getAttribute('r:id').'"]');
    if ($hyperlinkEntries->length > 0) {
        $hyperlinks[] = array(
            'textContent' => $hyperlinkInfo->textContent,
            'target' => $hyperlinkEntries->item(0)->getAttribute('Target'),
        );
    }
}

var_dump($hyperlinks);
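
The code above covers w:hyperlink elements only. For field-based links, a minimal sketch with plain PHP DOM could look like the following. It is not a phpdocx API: extractFieldLinks() is a hypothetical helper, the sample XML is made up, and it assumes fields are not nested and that the instruction has the form HYPERLINK "url" with a quoted URL:

```php
<?php
// Made-up paragraph containing one field-based link; URL and text are placeholders.
$documentXml = <<<'XML'
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r><w:fldChar w:fldCharType="begin"/></w:r>
      <w:r><w:instrText xml:space="preserve"> HYPERLINK "https://example.com/page" </w:instrText></w:r>
      <w:r><w:fldChar w:fldCharType="separate"/></w:r>
      <w:r><w:t>Example page</w:t></w:r>
      <w:r><w:fldChar w:fldCharType="end"/></w:r>
    </w:p>
  </w:body>
</w:document>
XML;

// Hypothetical helper: collects URL and text from HYPERLINK fields.
function extractFieldLinks(string $documentXml): array
{
    $dom = new DOMDocument();
    $dom->loadXML($documentXml);

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $links = array();
    $state = 'outside';   // outside | instruction | result
    $instruction = '';
    $text = '';

    // Field markers, instructions and text runs are visited in document order.
    foreach ($xpath->query('//w:fldChar | //w:instrText | //w:t') as $node) {
        if ($node->localName === 'fldChar') {
            $type = $node->getAttribute('w:fldCharType');
            if ($type === 'begin') {
                $state = 'instruction';
                $instruction = '';
                $text = '';
            } elseif ($type === 'separate') {
                $state = 'result';
            } elseif ($type === 'end') {
                // Keep the field only if its instruction is a quoted HYPERLINK URL.
                if (preg_match('/^\s*HYPERLINK\s+"([^"]*)"/', $instruction, $matches)) {
                    $links[] = array(
                        'textContent' => $text,
                        'target' => $matches[1],
                    );
                }
                $state = 'outside';
            }
        } elseif ($node->localName === 'instrText' && $state === 'instruction') {
            // The instruction may be split across several runs, so concatenate.
            $instruction .= $node->textContent;
        } elseif ($node->localName === 't' && $state === 'result') {
            $text .= $node->textContent;
        }
    }

    return $links;
}

var_dump(extractFieldLinks($documentXml));
```

The returned entries use the same textContent and target keys as the w:hyperlink code, so both result arrays can be merged. Fields whose instruction points to an internal bookmark (the \l switch) are skipped by the regular expression.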

We have opened a task with the dev team, and they have added support in the testing branch for extracting this information (from both hyperlinks and fields) using Indexer:

linksContents option to get URL and text content from hyperlinks.

Your phpdocx 15.5 license doesn't include LUS (https://www.phpdocx.com/support). If you upgrade to phpdocx 16 and include LUS, you can access these changes from the testing branch (https://www.phpdocx.com/support).
If you upgrade your license, please write to contact[at]phpdocx.com and tell us whether you are using the classic or the namespaces package. We'll send you the updated Indexer class with a custom sample.

Regards.
