Parsing Based on Sitemap.xml via PHP

Many websites have a sitemap file, sitemap.xml. This file lists links to the site's pages so that search engines can index them more easily (indexing is essentially Yandex and Google parsing the site). You can read about the structure of this file on Wikipedia.

The presence of such a file saves us from having to collect the site's addresses by more elaborate means. All we need to do is get the contents of the file and separate the URLs of the target pages from the non-target ones.

To check whether a website has this file, type /sitemap.xml after the domain name in the browser's address bar and press Enter. If an XML document opens, this method can be used; if not, the sitemap is either missing or stored elsewhere. Sometimes the path to the sitemap is not standard; in that case it is often given in the site's robots.txt file, in a line starting with Sitemap:.
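
For the robots.txt case mentioned above, a minimal sketch of pulling the sitemap address out of the file might look like this. The helper name findSitemaps and the sample path /maps/sitemap.xml are made up for illustration; on a real site the text would come from file_get_contents() on the site's robots.txt.

```php
<?php
	// Hypothetical helper: extracts sitemap URLs from the text of robots.txt
	function findSitemaps(string $robots): array
	{
		// One "Sitemap: <url>" directive per line, name matched case-insensitively
		preg_match_all('/^[ \t]*sitemap:[ \t]*(\S+)/mi', $robots, $matches);
		return $matches[1];
	}

	// Sample content (assumed); in practice:
	// $robots = file_get_contents('http://targ.loc/robots.txt');
	$robots = "User-agent: *\nDisallow: /admin/\nSitemap: http://targ.loc/maps/sitemap.xml\n";

	print_r(findSitemaps($robots));
?>
```

If the returned array is empty, robots.txt does not declare a sitemap, and the standard /sitemap.xml path is the only thing left to try.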

If such a file exists, we can load it with the simplexml_load_file function:

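<?php
	// simplexml_load_file() returns false if the file cannot be loaded or parsed
	$xml = simplexml_load_file('http://targ.loc/sitemap.xml');
	if ($xml === false) exit('Failed to load sitemap.xml');
?>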

After that, we iterate over the records in a loop and separate the URLs of the target pages from the non-target ones.
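
The loop described above might look roughly like the sketch below. To keep it self-contained, the sitemap is given as an inline string with made-up addresses, and it assumes that the target pages are the ones under /articles/; on a real site both the source and the filtering rule will differ.

```php
<?php
	// Sample sitemap content (assumed); in practice it comes from simplexml_load_file()
	$sitemap = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<url><loc>http://targ.loc/</loc></url>
	<url><loc>http://targ.loc/about/</loc></url>
	<url><loc>http://targ.loc/articles/first/</loc></url>
	<url><loc>http://targ.loc/articles/second/</loc></url>
</urlset>
XML;

	$xml = simplexml_load_string($sitemap);

	$targets = [];
	foreach ($xml->url as $url) {
		// Each <url> record stores the page address in its <loc> element
		$loc = trim((string) $url->loc);

		// Assumption: target pages live under /articles/
		if (preg_match('#^http://targ\.loc/articles/.+#', $loc)) {
			$targets[] = $loc;
		}
	}

	print_r($targets);
?>
```

Here $targets ends up holding only the two article addresses. A regular expression is used as the filter because target pages usually share a recognizable path pattern, but any other rule for telling the addresses apart would work just as well.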

Explore various websites on the internet. Determine if they have a sitemap.

Take a website that has a sitemap and use it to parse the site's content pages.