Spider Method on an Array for Parsing Websites in PHP
The step-by-step method described in the previous lesson is quite inconvenient, and the inconvenience grows with the number of subcategories. There is an alternative approach called the spider method (also known as a crawler).
The essence of this method is as follows. The parser goes to the main page of the site, collects all the links from it, and saves them. It then takes the first saved link, visits it, saves the page text, extracts all links from that page, and adds to its list any links that are not there yet. This repeats until every saved link has been visited.
Thus, we get a universal parser that can parse any site, regardless of its category structure.
Let's implement the described algorithm:
<?php
	$paths = [
		'http://targ.loc/',
	];

	$i = 0;
	while ($i < count($paths)) {
		$path = $paths[$i];

		$text = getPage($path);
		$hrefs = getHrefs($text);

		foreach ($hrefs as $href) {
			if (!in_array($href, $paths)) {
				$paths[] = $href;
			}
		}

		var_dump($text); // do something with the text

		$i++;
	}

	var_dump($paths);
?>
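The code above relies on the getPage and getHrefs helpers from the previous lessons. As a reminder, here is a minimal sketch of what they might look like; the exact implementations may differ, and the regular expression assumes the site uses absolute links in double-quoted href attributes:

<?php
	// Fetch the HTML text of a page by its URL.
	function getPage($path)
	{
		return file_get_contents($path);
	}

	// Extract all href values from the page text.
	// This sketch assumes absolute links in double-quoted attributes;
	// relative links would have to be resolved against the base URL.
	function getHrefs($text)
	{
		preg_match_all('/href="([^"]+)"/', $text, $matches);
		return array_unique($matches[1]);
	}
?>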
Download the site from the link targ1.zip and deploy it locally. Write a parser that will parse all pages of the site and get the contents of the title and main tags.
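A possible starting point for the extraction step, shown here for a single page; in the exercise it would go inside the spider loop in place of the var_dump($text) call. The local domain http://targ1.loc/ is an assumption about how you deployed the downloaded site:

<?php
	$text = getPage('http://targ1.loc/');

	// The #s modifier lets the dot match newlines inside the tags;
	// <main.*?> allows the tag to carry attributes.
	preg_match('#<title>(.+?)</title>#s', $text, $title);
	preg_match('#<main.*?>(.+?)</main>#s', $text, $main);

	var_dump($title[1] ?? null, $main[1] ?? null);
?>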
Modify the previous task so that the titles and texts of the pages are saved to the database.
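A hedged sketch of the saving step with PDO; the DSN, the credentials, and a pages table with title and content columns are assumptions, not something given in the lesson:

<?php
	// Connect to the database (the DSN and credentials are assumptions).
	$pdo = new PDO('mysql:host=localhost;dbname=site;charset=utf8', 'root', '');

	// Insert one parsed page; $title[1] and $main[1] come from the
	// extraction step of the previous exercise.
	$query = $pdo->prepare('INSERT INTO pages (title, content) VALUES (:title, :content)');
	$query->execute([
		'title'   => $title[1] ?? '',
		'content' => $main[1] ?? '',
	]);
?>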
Tell us what advantages and disadvantages you see in the step-by-step method and in the spider method.