
Spider Method with Filtering in PHP

When parsing a website using the spider method, our parser crawls all pages of the site. However, not all pages are target ones. We need to filter the pages, keeping only those that we want to parse and save to our database.

Let's recall the place where we do something with the page text:

<?php
    var_dump($text); // do something with the text
?>

This is where we will filter out the unnecessary pages. Note that, as a rule, we still collect links from all pages (otherwise we might miss some parts of the website), but we save to the database only the target pages that we need. So, let's see how we can filter pages.

The first and most reliable method is applicable if the URLs of target pages differ from those of non-target ones. For example, on the website from the previous lesson, the target pages had the extension .html, while non-target ones did not. In this case, we can base our check on this:

<?php
    if (preg_match('#\.html$#', $path)) {
        var_dump($text);
    }
?>
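In the spider itself, the path is usually taken from the link pulled off the crawl queue, and the check is applied to that path rather than to the page text. Here is a minimal sketch, assuming the crawler works with full URLs and uses parse_url to get the path part; the example URL and variable names are illustrative, not taken from the previous lesson's code:

<?php
    // Illustrative values: in a real spider these come from the crawl queue
    // and from the page that has just been downloaded.
    $url  = 'http://site.ru/articles/lesson-5.html';
    $text = '<html>...page text...</html>';

    // parse_url extracts the path part: '/articles/lesson-5.html'.
    $path = parse_url($url, PHP_URL_PATH);

    if (preg_match('#\.html$#', $path)) {
        var_dump($text); // target page: process and save it
    } else {
        // non-target page: collect its links, but do not save it
    }
?>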

The second method is applicable if the target pages have a distinctive feature in their layout. This could be an id, a specific CSS class, or the presence of a certain tag. Example:

<?php
    if (preg_match('#<main id="content">#', $text)) {
        var_dump($text);
    }
?>
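Keep in mind that an exact string match like the one above is fragile: if the tag gains another attribute or the whitespace changes, the check stops working. Below is a slightly more tolerant sketch, assuming (purely for illustration) that the target pages wrap their content in a tag with the class article-body:

<?php
    // Assumed markup for the example; real pages will differ.
    $text = '<div class="wrap"><article id="post-1" class="article-body">...</article></div>';

    // The pattern tolerates extra attributes and other classes in the opening tag.
    if (preg_match('#<article[^>]*class="[^"]*\barticle-body\b#', $text)) {
        var_dump($text); // target page
    }
?>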

The third method is applicable when the target pages contain some characteristic text feature, for example:

<?php
    if (preg_match('#<title>page \d+</title>#', $text)) {
        var_dump($text);
    }
?>
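Putting the pieces together, here is a minimal sketch of the whole spider loop with filtering, in the spirit of the previous lesson but not a copy of its code. The site address, the link-collecting regex (only root-relative links are handled) and the saveToDatabase function are assumptions made for illustration; in a real parser the saving step would be an INSERT into your database:

<?php
    $site    = 'http://site.ru'; // assumed site address
    $queue   = ['/'];            // paths waiting to be crawled
    $visited = [];               // paths already crawled

    while (!empty($queue)) {
        $path = array_shift($queue);

        if (isset($visited[$path])) {
            continue;
        }
        $visited[$path] = true;

        $text = file_get_contents($site . $path);

        // Collect links from EVERY page, target or not,
        // so that no part of the website is missed.
        preg_match_all('#href="(/[^"]*)"#', $text, $matches);
        foreach ($matches[1] as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }

        // Save only the target pages (here the first method, filtering by URL).
        if (preg_match('#\.html$#', $path)) {
            saveToDatabase($path, $text);
        }
    }

    // Stub for illustration; a real implementation would write to the database.
    function saveToDatabase($path, $text)
    {
        var_dump($path);
    }
?>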

Study the website code.mu. Explain what features can be used to distinguish article pages from table-of-contents pages.

Study various websites on the internet. List five websites where content pages can be distinguished from other pages by different features.
