
Step-by-Step Method for Website Parsing in PHP

On large websites, the main page often links to category pages, and the category pages in turn link to the target pages. Such websites are parsed in several stages. Let's see how this is done.

Suppose on the main page we have links to category pages:

<nav>
	<a href="http://targ.loc/1/">1</a>
	<a href="http://targ.loc/2/">2</a>
	<a href="http://targ.loc/3/">3</a>
</nav>

Suppose each category page contains links to article pages whose text we want to parse:

<nav>
	<a href="http://targ.loc/1/1.html">1</a>
	<a href="http://targ.loc/1/2.html">2</a>
	<a href="http://targ.loc/1/3.html">3</a>
</nav>

Let's perform multi-stage parsing:

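<?php
	// Stage 1: load the main page and collect the category links
	$href0  = 'http://targ.loc/';
	$text0  = getPage($href0);
	$hrefs1 = getHrefs($text0);
	
	foreach ($hrefs1 as $href1) {
		// Stage 2: load each category page and collect the article links
		$text1  = getPage($href1);
		$hrefs2 = getHrefs($text1);
		
		foreach ($hrefs2 as $href2) {
			// Stage 3: load each article page and output its HTML
			$text2 = getPage($href2);
			var_dump($text2);
		}
	}
?>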
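The loop above calls getPage and getHrefs, which are not defined in this section. Below is a minimal sketch of what they might look like. It assumes that pages can be fetched with file_get_contents (allow_url_fopen enabled) and that links are extracted with DOMDocument. Both function bodies are illustrative assumptions, not necessarily the implementation this lesson has in mind.

```php
<?php
// Hypothetical helper: download a page and return its HTML as a string.
// Assumes allow_url_fopen is enabled; returns '' if the request fails.
function getPage(string $url): string
{
	$html = @file_get_contents($url);
	return $html === false ? '' : $html;
}

// Hypothetical helper: collect the href of every <a> element in the HTML.
function getHrefs(string $html): array
{
	if ($html === '') {
		return [];
	}
	
	// Real-world markup is rarely valid, so parser errors are collected
	// internally instead of being printed as warnings.
	libxml_use_internal_errors(true);
	$dom = new DOMDocument();
	$dom->loadHTML($html);
	libxml_clear_errors();
	
	$hrefs = [];
	foreach ($dom->getElementsByTagName('a') as $a) {
		$href = $a->getAttribute('href');
		if ($href !== '') {
			$hrefs[] = $href;
		}
	}
	
	return $hrefs;
}
```

The sample pages use absolute URLs, so the hrefs can be passed straight back to getPage. On a site with relative links, each href would first have to be resolved against the address of the page it was found on.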

Download the website from the link targ1.zip and deploy it locally. Write a parser that visits all the final pages and extracts the content of the title and main tags.
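As a possible starting point for this task, here is a sketch of pulling the title and main tags out of a page that has already been downloaded. The function name getTitleAndMain is hypothetical, and the sketch assumes that the plain text of each tag (rather than its inner HTML) is what is needed.

```php
<?php
// Hypothetical helper: return the text of the first <title> and <main>
// elements found in an HTML string (empty strings if they are missing).
function getTitleAndMain(string $html): array
{
	$result = ['title' => '', 'main' => ''];
	if ($html === '') {
		return $result;
	}
	
	// Collect parser errors internally; HTML5 tags such as <main>
	// are reported as unknown by libxml but are still parsed.
	libxml_use_internal_errors(true);
	$dom = new DOMDocument();
	$dom->loadHTML($html);
	libxml_clear_errors();
	
	foreach (array_keys($result) as $tag) {
		$node = $dom->getElementsByTagName($tag)->item(0);
		if ($node !== null) {
			$result[$tag] = trim($node->textContent);
		}
	}
	
	return $result;
}
```

In the multi-stage loop, this function would be applied to the HTML of each final page in place of var_dump. Note that DOMDocument may misread non-ASCII text if the page does not declare its encoding.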

Modify the solution to the previous task so that the titles and texts of the pages are saved to a database.
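One way to approach the saving step is sketched below with PDO and a prepared statement. Everything specific here is an assumption made for illustration: the pages table, its columns, the savePage function, and the in-memory SQLite database, which is used only so the sketch runs without any setup. With MySQL or another database, the DSN and the CREATE TABLE syntax would differ.

```php
<?php
// Hypothetical helper: insert one parsed page into the pages table.
function savePage(PDO $pdo, string $url, string $title, string $content): void
{
	$stmt = $pdo->prepare(
		'INSERT INTO pages (url, title, content) VALUES (:url, :title, :content)'
	);
	$stmt->execute([
		':url'     => $url,
		':title'   => $title,
		':content' => $content,
	]);
}

// Assumption: an in-memory SQLite database keeps the sketch self-contained.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Assumed table layout: one row per parsed page.
$pdo->exec('CREATE TABLE IF NOT EXISTS pages (
	id INTEGER PRIMARY KEY AUTOINCREMENT,
	url TEXT NOT NULL,
	title TEXT NOT NULL,
	content TEXT NOT NULL
)');

// Sample values; in the parser they would come from each final page.
savePage($pdo, 'http://targ.loc/1/1.html', 'Sample title', 'Sample text');
```

The prepared statement keeps quotes and other special characters in the parsed text from breaking the query.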
