Step-by-Step Method for Website Parsing in PHP
On large websites, the main page often links to category pages, and the category pages in turn link to the target pages. Such websites are parsed in several stages. Let's see how this is done.
Suppose on the main page we have links to category pages:
<nav>
<a href="http://targ.loc/1/">1</a>
<a href="http://targ.loc/2/">2</a>
<a href="http://targ.loc/3/">3</a>
</nav>
Suppose each category page has links to article pages, whose text we want to parse:
<nav>
<a href="http://targ.loc/1/1.html">1</a>
<a href="http://targ.loc/1/2.html">2</a>
<a href="http://targ.loc/1/3.html">3</a>
</nav>
Let's perform multi-stage parsing:
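<?php
// getPage() fetches a page's HTML by URL and getHrefs() extracts link
// addresses from it; both are assumed to be defined elsewhere.

// Stage 1: get the category links from the main page
$href0 = 'http://targ.loc/';
$text0 = getPage($href0);
$hrefs1 = getHrefs($text0);

foreach ($hrefs1 as $href1) {
    // Stage 2: get the article links from each category page
    $text1 = getPage($href1);
    $hrefs2 = getHrefs($text1);

    foreach ($hrefs2 as $href2) {
        // Stage 3: get the text of each article page
        $text2 = getPage($href2);
        var_dump($text2);
    }
}
?>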
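The code above calls getPage and getHrefs without showing them. Below is a minimal sketch of what such helpers might look like. Both bodies are assumptions made for illustration, not the lesson's own implementation. The sketch assumes that allow_url_fopen is enabled and that PHP's DOM extension is available.

```php
<?php
// Hypothetical helper: fetch a page by URL and return its HTML as a string.
// Assumes allow_url_fopen is enabled; returns an empty string on failure.
function getPage(string $url): string
{
    $html = @file_get_contents($url);
    return $html === false ? '' : $html;
}

// Hypothetical helper: return the href values of all <a> elements in the HTML.
function getHrefs(string $html): array
{
    // DOMDocument::loadHTML() rejects an empty string, so handle it up front.
    if ($html === '') {
        return [];
    }

    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them
    // (tags such as <nav> and imperfect real-world markup trigger them).
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $hrefs[] = $href;
        }
    }
    return $hrefs;
}

// Usage: extract the category links from a fragment like the one shown earlier.
$html = '<nav><a href="http://targ.loc/1/">1</a><a href="http://targ.loc/2/">2</a></nav>';
print_r(getHrefs($html));
```

Note that this getHrefs collects every link on the page, not only those inside nav. That is enough for the sample markup, but on a real website the selection would likely need to be narrowed to the relevant block.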
Download the website from the link targ1.zip and deploy it locally. Write a parser that parses all the final pages and gets the contents of the title and main tags.
Modify your solution to the previous task so that the titles and texts of the pages are saved to a database.