PHP Limitations in Parsing
PHP is not the best language for parsing. This is due to a fundamental feature of PHP scripts: they are designed to start, execute quickly, and then terminate.
Typically, PHP scripts run for no more than a couple of seconds. And in the PHP settings, there is even a limit set that forcibly terminates a script if it runs for more than 30-60 seconds. Parsing, however, usually requires much more time, from several minutes to hours and days.
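To make the time limit concrete, here is a minimal sketch that reads the configured limit and checks how much of it is left. The setting name max_execution_time is PHP's own; the helper hasTimeLeft(), the 5-second safety margin, and the example.com URLs are assumptions made up for this illustration. Note also that the helper measures wall-clock time, which is only an approximation of how PHP itself counts the limit.

```php
<?php
// Read the configured limit in seconds; 0 means "no limit"
// (this is the usual default when PHP is run from the command line).
$limit = (int) ini_get('max_execution_time');

// Hypothetical helper: true while the script still has time budget left.
// Uses wall-clock time as a rough approximation of PHP's own accounting.
function hasTimeLeft(float $startedAt, int $limitSeconds, float $safetyMargin = 5.0): bool
{
    if ($limitSeconds === 0) {
        return true; // no limit configured
    }
    return (microtime(true) - $startedAt) < ($limitSeconds - $safetyMargin);
}

$startedAt = microtime(true);
$urls = ['https://example.com/page1', 'https://example.com/page2']; // placeholder URLs
$processed = [];

foreach ($urls as $url) {
    if (!hasTimeLeft($startedAt, $limit)) {
        break; // stop cleanly before PHP terminates the script
    }
    // A real parser would download and parse $url here.
    $processed[] = $url;
}

echo count($processed), ' of ', count($urls), " URLs processed\n";
```

With only two placeholder URLs the loop finishes immediately; with thousands of real pages, the check would start failing long before the work is done, which is exactly the problem described above.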
Furthermore, the amount of RAM a script can occupy is capped by the memory_limit setting, which is usually measured in megabytes. If a script attempts to use more, it will be forcibly terminated. It is easy to exceed this limit when running a parser if, for example, you try to build an array containing the text of hundreds of HTML pages.
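The following sketch shows how quickly such an array grows. It does not download anything: fakePage() is a made-up stand-in that returns roughly 100 KB of text, and the figure of 50 pages is an arbitrary assumption chosen to stay well under a typical limit.

```php
<?php
// Hypothetical stand-in for a downloaded page: about 100 KB of text.
function fakePage(int $i): string
{
    return "<html><body>page $i " . str_repeat('x', 100 * 1024) . '</body></html>';
}

// Show the limit currently configured for this script.
echo 'memory_limit: ', ini_get('memory_limit'), "\n";

$before = memory_get_usage();

$pages = [];
for ($i = 0; $i < 50; $i++) {
    $pages[] = fakePage($i); // every page stays in memory
}

// Memory held by the array of pages, in bytes.
$heldBytes = memory_get_usage() - $before;
echo 'Holding ', count($pages), ' pages uses about ',
    round($heldBytes / 1048576, 1), " MB\n";
```

Real pages are often larger than 100 KB, so keeping hundreds of them in one array can approach the limit much faster than this example suggests.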
Long-running PHP scripts are also prone to memory leaks: the script gradually occupies more and more RAM over time. Eventually its memory usage grows so large that it hits the limit and the script is forcibly terminated.
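One common way such growth appears is a cache that is filled but never cleared. The sketch below simulates this under stated assumptions: parsePage() is a hypothetical function with a deliberate bug, and the page size and sampling interval are arbitrary. It uses memory_get_usage() to sample memory as the loop runs.

```php
<?php
// Hypothetical parser step with a deliberate bug:
// it stores every page in a static cache that is never cleared.
function parsePage(int $i): int
{
    static $cache = [];
    $html = "page $i " . str_repeat('x', 50 * 1024); // stand-in for downloaded HTML
    $cache[$i] = $html; // the page is never released
    return strlen($html);
}

$samples = [];
for ($i = 1; $i <= 100; $i++) {
    parsePage($i);
    if ($i % 25 === 0) {
        // Sample memory usage every 25 pages.
        $samples[] = memory_get_usage();
        echo "after $i pages: ", round(end($samples) / 1024), " KB\n";
    }
}
```

Each sample is larger than the previous one. In a script that finishes within a second this goes unnoticed, but in a parser that runs for hours the same pattern steadily eats into the memory limit.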
Due to these limitations, programmers have to resort to various tricks to make PHP scripts work in a way they were not designed for.
Unlike PHP scripts, programs written in Python or Node.js are typically not terminated after each task; instead, they run as a process that stays loaded in memory.
Why do people choose PHP for parsing then? The fact is that most websites run on PHP, so a parser added to such a site is usually written in PHP as well, especially since the programmer often already knows PHP and does not want to learn another language just for the sake of a parser.
In the following lessons, we will consider approaches that allow us to bypass these limitations.