Spider Method with Filtering in PHP
When parsing a website using the spider method, our parser will crawl all pages of the site. However, not all pages are target ones. We need to filter the pages, keeping only those we actually want to parse and save to our database.
Let's recall the place where we do something with the page text:
<?php
var_dump($text); // do something with the text
?>
This is where we will filter out unnecessary pages. Note that, as a rule, we still collect links from all pages (otherwise we might miss some parts of the website), but we save to the database only the target pages that we need. So, let's see how we can filter pages.
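To make this distinction concrete, here is a minimal sketch of a spider loop with the filtering point marked. It is not the exact code from the previous lesson: the starting URL is just an example, and isTargetPage is a hypothetical helper standing in for one of the checks described below.
<?php
$queue   = ['http://site.local/']; // example starting URL, hypothetical
$visited = [];

while ($queue) {
    $url = array_shift($queue);

    if (isset($visited[$url])) {
        continue; // do not crawl the same page twice
    }
    $visited[$url] = true;

    $text = file_get_contents($url);

    // Collect links from every page so we do not miss parts of the site:
    preg_match_all('#<a[^>]+href="([^"]+)"#i', $text, $matches);
    foreach ($matches[1] as $link) {
        $queue[] = $link; // in real code, relative links should be normalized first
    }

    // ...but only the target pages reach the processing step:
    if (isTargetPage($url, $text)) { // hypothetical helper: one of the checks below
        var_dump($text); // do something with the text
    }
}
?>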
The first and most reliable method is applicable when the URLs of target pages differ from those of non-target ones. For example, on the website from the previous lesson, the target pages had the .html extension, while non-target pages did not. In this case, we can base our check on this:
<?php
if (preg_match('#\.html$#', $path)) {
var_dump($text);
}
?>
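The $path variable here is assumed to hold the path part of the page URL, as in the previous lesson's spider. If only the full URL is available, the path can be extracted with parse_url; a minimal sketch (the URL is just an example):
<?php
$url  = 'http://site.local/articles/1.html'; // example URL, hypothetical
$path = parse_url($url, PHP_URL_PATH);       // gives '/articles/1.html'

if (preg_match('#\.html$#', $path)) {
    var_dump($text); // the page is a target one, do something with its text
}
?>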
The second method is applicable if the target pages have a distinctive feature in their layout. This could be an id, a specific CSS class, or the presence of a certain tag. Example:
<?php
if (preg_match('#<main id="content">#', $text)) {
var_dump($text);
}
?>
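A regular expression like the one above depends on the exact markup: a different attribute order, quote style, or an extra attribute will break it. If the layout varies, a DOM-based check is more tolerant. This is only an alternative sketch, not part of the original lesson; it looks for the same <main id="content"> element:
<?php
$doc = new DOMDocument();
@$doc->loadHTML($text); // @ suppresses warnings about imperfect real-world HTML

$xpath = new DOMXPath($doc);

// The page is a target one if it contains a <main> element with id="content":
if ($xpath->query('//main[@id="content"]')->length > 0) {
    var_dump($text); // do something with the text
}
?>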
The third method is applicable when the target pages contain some characteristic fragment of text. For example:
<?php
if (preg_match('#<title>page \d+</title>#', $text)) {
var_dump($text);
}
?>
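When no single feature is reliable on its own, the checks can also be combined. Here is a minimal sketch that reuses the example patterns from above (they are illustrations, not rules for any particular site):
<?php
$isTarget = preg_match('#\.html$#', $path)              // URL feature
         && preg_match('#<main id="content">#', $text); // layout feature

if ($isTarget) {
    var_dump($text); // do something with the text
}
?>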
Study the website code.mu. Explain which features could be used to separate article pages from table-of-contents pages.
Study various websites on the internet. List five websites where content pages can be separated from other pages, each based on a different feature.