Parsing Based on sitemap.xml in PHP
Often, a website has a sitemap file, sitemap.xml. This file contains links to all the pages of the site so that search engines can index it more easily (indexing is essentially the parsing of the site by Yandex and Google). You can read about the structure of this file on Wikipedia.
The presence of such a file saves us from having to collect all the site's addresses by some tricky methods. We only need to get the contents of the file and separate the URLs of the target pages from the non-target ones.
To check whether a website has this file, type /sitemap.xml in the browser's address bar after the domain name and press Enter. If something opens, you can try this method. If nothing opens, look in the robots.txt file: sometimes the sitemap is stored at a non-standard path, which is specified there. If the sitemap is not listed there either, this method is not applicable to the site.
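The robots.txt check can be sketched in code. This is only a minimal sketch: the function name findSitemapUrls and the sample robots.txt text are made up for illustration, and we assume the site announces its sitemap with the usual "Sitemap:" line. On a real site the text could be obtained with file_get_contents from the site's robots.txt address.

```php
<?php
// Hypothetical helper: extracts sitemap URLs from the text of robots.txt.
function findSitemapUrls(string $robotsTxt): array
{
	$urls = [];

	// Split the text into lines, whatever the line ending style is
	foreach (preg_split('/\R/', $robotsTxt) as $line) {
		// The "Sitemap:" line may appear several times and in any letter case
		if (preg_match('/^\s*sitemap:\s*(\S+)/i', $line, $match)) {
			$urls[] = $match[1];
		}
	}

	return $urls;
}

// Sample robots.txt text (an assumption, used instead of a real download)
$robots = "User-agent: *\nDisallow: /admin/\nSitemap: http://targ.loc/maps/sitemap.xml\n";

print_r(findSitemapUrls($robots));
```

If the returned array is empty, robots.txt does not mention a sitemap.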
If the sitemap exists, we can easily load it as follows:
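<?php
	$xml = simplexml_load_file('http://targ.loc/sitemap.xml');
	if ($xml === false) { // the file could not be loaded or parsed
		exit('Failed to load the sitemap');
	}
?>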
Then we iterate over the records in a loop and separate the URLs of the target pages from the non-target ones.
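The iteration and filtering might look roughly like this. It is a sketch under assumptions: the function name filterTargetUrls is made up, the sample XML stands in for a real sitemap, and we assume the target pages are the ones whose path begins with /articles/. On a real site the rule for telling target pages apart will be different.

```php
<?php
// Hypothetical helper: returns the sitemap URLs whose path starts with $prefix.
function filterTargetUrls(SimpleXMLElement $xml, string $prefix): array
{
	$targets = [];

	// Each <url> record of the sitemap stores the page address in <loc>
	foreach ($xml->url as $record) {
		$url  = trim((string) $record->loc);
		$path = parse_url($url, PHP_URL_PATH);

		// Keep the URL only if its path begins with the given prefix
		if (is_string($path) && strpos($path, $prefix) === 0) {
			$targets[] = $url;
		}
	}

	return $targets;
}

// Sample sitemap (an assumption, used instead of a real download)
$sample = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<url><loc>http://targ.loc/</loc></url>
	<url><loc>http://targ.loc/articles/first.html</loc></url>
	<url><loc>http://targ.loc/contacts.html</loc></url>
	<url><loc>http://targ.loc/articles/second.html</loc></url>
</urlset>
XML;

$xml = simplexml_load_string($sample);

print_r(filterTargetUrls($xml, '/articles/'));
```

The same function can be given the result of simplexml_load_file, after which each of the selected URLs can be fetched and parsed in turn.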
Explore various websites on the internet. Determine if they have a sitemap.
Take a website that has a sitemap and parse its content pages.