
Normalization of Absolute Paths During Parsing in PHP

When parsing websites, we usually need to collect link addresses as well as paths to images, CSS files, and JavaScript files.

For our purposes, it would be convenient if these addresses contained the full URL along with the domain name, something like this:

<a href="http://targ.loc/dir/sub/page.html">text</a>

In practice, however, paths on a site usually omit the domain name and start with a slash, like this:

<a href="/dir/sub/page.html">text</a>

Let's normalize such a path. Suppose the domain we are accessing is stored in a variable:

<?php $domain = 'http://targ.loc'; ?>

Let's also assume we extracted the path from the link's href:

<?php $href = '/dir/sub/page.html'; ?>

To normalize it, we prepend the domain to the path:

<?php $norm = $domain . $href; ?>
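The concatenation above can be wrapped in a small helper. This is only a minimal sketch: the function name `normalizePath` is hypothetical, and the `rtrim` call assumes that the domain might occasionally be stored with a trailing slash, which would otherwise produce a double slash.

```php
<?php
// Hypothetical helper: prepend the domain to a path that starts with a slash.
function normalizePath(string $domain, string $path): string
{
    // rtrim() removes a trailing slash from the domain, if there is one,
    // so the result never contains "//" between domain and path.
    return rtrim($domain, '/') . $path;
}

$domain = 'http://targ.loc';
$href   = '/dir/sub/page.html';

$norm = normalizePath($domain, $href);
echo $norm; // http://targ.loc/dir/sub/page.html
```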

Get all href attributes on the following page and normalize them:

<a href="/page.html">text</a>
<a href="/dir/page.html">text</a>
<a href="/dir/sub/page.html">text</a>

Get all href attributes on the following page and normalize them where necessary:

<a href="/page.html">text</a>
<a href="/dir/page.html">text</a>
<a href="/dir/sub/page.html">text</a>
<a href="http://targ.loc/dir/sub/page.html">text</a>
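One possible way to approach a task like this is sketched below. It assumes that the page is already available as a string in `$html` and that attributes are always written in double quotes. A regular expression is used for brevity, and a real page may need a proper HTML parser. Only paths that start with a single slash get the domain prepended, and full URLs are left unchanged.

```php
<?php
$domain = 'http://targ.loc';

// Sample input, assumed to have been downloaded earlier.
$html = <<<'HTML'
<a href="/page.html">text</a>
<a href="/dir/sub/page.html">text</a>
<a href="http://targ.loc/dir/sub/page.html">text</a>
HTML;

// Collect the href values of all <a> tags (double-quoted attributes only).
preg_match_all('#<a\s[^>]*href="([^"]*)"#i', $html, $matches);

$norm = [];
foreach ($matches[1] as $href) {
    // A single leading slash marks a path without a domain.
    // "//host/path" is skipped, because it already names a host.
    if (preg_match('#^/(?!/)#', $href)) {
        $href = $domain . $href;
    }
    $norm[] = $href;
}

print_r($norm);
```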

Get all src attributes on the following page and normalize them where necessary:

<img src="/img.png">
<img src="/images/img.png">
<img src="http://targ.loc/images/img.png">

Get all paths to CSS and JavaScript files on the following page and normalize them:

<link rel="stylesheet" href="/styles.css">
<script src="/scripts.js"></script>
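When paths sit in different attributes of different tags, a single pattern can cover both href and src. The sketch below is one way to do this, again assuming double-quoted attributes and a page held in a string `$html`. The variable names are illustrative only.

```php
<?php
$domain = 'http://targ.loc';

// Sample input, assumed to have been downloaded earlier.
$html = <<<'HTML'
<link rel="stylesheet" href="/styles.css">
<script src="/scripts.js"></script>
HTML;

// Match href="..." or src="..." on any tag; group 2 holds the path.
preg_match_all('#\s(?:href|src)="([^"]*)"#i', $html, $matches);

// Prepend the domain to paths that start with a single slash,
// and keep every other value unchanged.
$norm = array_map(
    function (string $path) use ($domain): string {
        return preg_match('#^/(?!/)#', $path) ? $domain . $path : $path;
    },
    $matches[1]
);

print_r($norm);
```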

Get all paths in the following CSS and normalize them:

div { background: url('/images/img.png'); }

Get all paths in the following CSS and normalize them where necessary:

div {
	background: url(/img.png);
	background: url('/images/img.png');
	background: url("/images/img.png");
	background: url("http://targ.loc/images/img.png");
}
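In CSS the path inside url() may be unquoted, single-quoted, or double-quoted, so the pattern has to allow all three forms. The following is a minimal sketch under the assumption that the paths contain no spaces, quotes, or parentheses. The stylesheet is held in a string `$css` for illustration.

```php
<?php
$domain = 'http://targ.loc';

// Sample input, assumed to have been downloaded earlier.
$css = <<<'CSS'
div {
    background: url(/img.png);
    background: url('/images/img.png');
    background: url("/images/img.png");
    background: url("http://targ.loc/images/img.png");
}
CSS;

// url( + optional quote + path + optional quote + );
// group 1 holds the path without the quotes.
preg_match_all('#url\(\s*[\'"]?([^\'")\s]+)[\'"]?\s*\)#i', $css, $matches);

$norm = [];
foreach ($matches[1] as $path) {
    // Only paths that start with a single slash need the domain.
    if (preg_match('#^/(?!/)#', $path)) {
        $path = $domain . $path;
    }
    $norm[] = $path;
}

print_r($norm);
```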