Preliminary Text Cleaning When Parsing with Regular Expressions in PHP
The text of a page you parse can contain all sorts of garbage. Get rid of this garbage before extracting anything with regular expressions, so that it cannot interfere with your patterns.
For example, the following text contains HTML comments:
<!-- comment -->
<p>
text1
</p>
<!-- comment -->
<p>
text2
</p>
Let's get rid of them:
<?php
$str = preg_replace('#<!--.*?-->#su', '', $str);
?>
Let's check that the string no longer contains comments:
<?php
var_dump($str);
?>
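The steps above can be put together into one self-contained sketch. The sample string and the variable name $clean are assumptions made for illustration. The pattern shown is one reasonable way to match HTML comments, not the only one.

```php
<?php
// Hypothetical sample input: two paragraphs, a one-line comment
// and a comment that spans two lines.
$str = "<!-- header -->\n<p>\ntext1\n</p>\n<!-- multi\nline -->\n<p>\ntext2\n</p>";

// s: the dot also matches line breaks, so multi-line comments are caught.
// u: the pattern and the subject are treated as UTF-8.
// .*? is lazy, so each match ends at the nearest closing -->.
$clean = preg_replace('#<!--.*?-->#su', '', $str);

var_dump($clean);
```

The comments are gone, but the line breaks that surrounded them stay in the string, so empty lines still have to be removed as a separate step.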
Remove the style tags from the text.
Remove the script tags from the text.
Remove the CSS comments from the text.
Remove the empty lines from the text.
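One possible approach to these four tasks is sketched below. The function names and the sample strings are hypothetical, and the patterns assume reasonably well-formed markup. For instance, the CSS comment pattern would also remove a /* ... */ sequence that happens to sit inside a quoted string.

```php
<?php
// Hypothetical helpers. Each falls back to the input string
// if PCRE reports an error (for example, invalid UTF-8 with the u modifier).

// Remove <style>...</style> blocks together with their contents.
function removeStyleTags(string $str): string
{
    return preg_replace('#<style\b[^>]*>.*?</style>#isu', '', $str) ?? $str;
}

// Remove <script>...</script> blocks together with their contents.
function removeScriptTags(string $str): string
{
    return preg_replace('#<script\b[^>]*>.*?</script>#isu', '', $str) ?? $str;
}

// Remove CSS comments of the form /* ... */.
function removeCssComments(string $str): string
{
    return preg_replace('#/\*.*?\*/#su', '', $str) ?? $str;
}

// Remove lines that are empty or hold only spaces and tabs
// and are followed by a line break.
function removeEmptyLines(string $str): string
{
    return preg_replace('#^\h*\R#mu', '', $str) ?? $str;
}

// Assumed sample inputs for illustration.
$html = "<style>\np { color: red; }\n</style>\n<script>\nalert(1);\n</script>\n"
      . "<p>\ntext1\n</p>\n\n<p>\ntext2\n</p>";
$css  = "/* base */\np { color: red; } /* inline */";

$cleanHtml = removeEmptyLines(removeScriptTags(removeStyleTags($html)));
$cleanCss  = removeEmptyLines(removeCssComments($css));

var_dump($cleanHtml, $cleanCss);
```

Empty lines are removed last because each of the other steps can leave empty lines behind.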