
Explain how lazy and greedy quantifiers work in Perl regular expressions. How do they affect string parsing? Provide examples of subtle issues and unexpected behavior.

Answer.

In Perl, the regular expression quantifiers (*, +, ?, {n,m}) are greedy by default: each one matches as many characters as possible while still allowing the rest of the pattern to match.

Adding ? after a quantifier (*?, +?, ??, {n,m}?) makes it lazy (or non-greedy): it matches as few characters as possible while still allowing the entire regex to match.
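
As a minimal sketch of the two modes side by side, the snippet below applies greedy and lazy forms of the same quantifiers to one string. The variable names and the sample string are illustrative only.

```perl
use strict;
use warnings;

my $digits = '12345';

# Greedy forms take as much as the rest of the pattern allows.
my ($greedy_plus)  = $digits =~ /(\d+)/;        # '12345'
my ($greedy_range) = $digits =~ /(\d{2,4})/;    # '1234'

# Lazy forms take as little as the rest of the pattern allows.
my ($lazy_plus)    = $digits =~ /(\d+?)/;       # '1'
my ($lazy_range)   = $digits =~ /(\d{2,4}?)/;   # '12'
my ($lazy_star)    = $digits =~ /(\d*?)/;       # '' (zero characters is enough)

print "greedy +     : '$greedy_plus'\n";
print "greedy {2,4} : '$greedy_range'\n";
print "lazy +?      : '$lazy_plus'\n";
print "lazy {2,4}?  : '$lazy_range'\n";
print "lazy *?      : '$lazy_star'\n";
```

Because nothing follows the quantifier in these patterns, the lazy forms stop at their minimum count; once more pattern follows, they expand only as far as needed for it to match.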

Example of greedy matching:

my $str = 'foo <bar> baz <quux>';
$str =~ /<.*>/;    # Matches '<bar> baz <quux>'

Example of lazy matching:

my $str = 'foo <bar> baz <quux>';
$str =~ /<.*?>/;   # Matches '<bar>'
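
One subtle point worth illustrating: a lazy quantifier does not find the shortest match in the string. The regex engine still starts at the leftmost position where a match is possible and only then expands as little as it can. The sketch below uses a made-up input with a stray '<' to show the difference from a negated character class.

```perl
use strict;
use warnings;

my $text = 'x <a <b> y';

# The match starts at the FIRST '<' and expands until a '>' is found,
# so the stray '<a ' is swallowed even though the quantifier is lazy.
my ($lazy) = $text =~ /(<.*?>)/;         # '<a <b>'

# A negated character class cannot step over another '<' or '>',
# so the attempt at the first '<' fails and the engine moves on.
my ($strict) = $text =~ /(<[^<>]*>)/;    # '<b>'

print "lazy   : '$lazy'\n";
print "strict : '$strict'\n";
```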

Caveat:

A greedy quantifier can "eat" far more than you expect when parsing HTML or other constructs with repeated or nested delimiters!
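
To see how much a greedy quantifier can swallow, the following sketch collects every match with the /g modifier in list context. The array names are illustrative; the sample string is the one used in the examples above.

```perl
use strict;
use warnings;

my $str = 'foo <bar> baz <quux>';

# In list context with /g, a match returns every captured group.
my @greedy = $str =~ /<(.*)>/g;       # one element: 'bar> baz <quux'
my @lazy   = $str =~ /<(.*?)>/g;      # two elements: 'bar', 'quux'
my @class  = $str =~ /<([^>]*)>/g;    # two elements: 'bar', 'quux'

print "greedy : ", join(' | ', @greedy), "\n";
print "lazy   : ", join(' | ', @lazy),   "\n";
print "class  : ", join(' | ', @class),  "\n";
```

The greedy version finds a single oversized match, while the lazy and negated-class versions each find both tags.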


Trick question.

What are the differences between the following two regexes when parsing the string <a><b><c>: /<(.*)>/ and /<(.*?)>/?

Answer:

  • /<(.*)>/ (greedy) matches the longest possible block: the whole match is <a><b><c>, and $1 is a><b><c
  • /<(.*?)>/ (lazy) matches only the first tag: the whole match is <a>, and $1 is a

Example:

my $s = '<a><b><c>';
$s =~ /<(.*)>/;    # $1: 'a><b><c'
$s =~ /<(.*?)>/;   # $1: 'a'
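
The difference between the whole match and the captured group is easy to mix up, so here is a small sketch that exposes both by wrapping each pattern in an extra outer group. The outer group and the variable names are additions for illustration, not part of the original patterns.

```perl
use strict;
use warnings;

my $s = '<a><b><c>';

# Outer group = whole match, inner group = what the original $1 would hold.
my ($greedy_match, $greedy_inner) = $s =~ /(<(.*)>)/;
my ($lazy_match,   $lazy_inner)   = $s =~ /(<(.*?)>)/;

print "greedy: match '$greedy_match', capture '$greedy_inner'\n";
print "lazy  : match '$lazy_match', capture '$lazy_inner'\n";
```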

Examples of real errors caused by overlooking these subtleties.


Story

In a news headline import application, the programmer wanted to extract the tag name from the string <title>News</title> using /\<(.*)\>/. Instead of title, the regex captured everything between the first < and the last >, that is, title>News</title. The error was discovered only when input with several tags on one line appeared.
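
A minimal reproduction of this story, assuming the input line looks exactly like the one quoted above. The variable names are hypothetical, and the lazy and negated-class patterns are two possible fixes rather than the one the original team necessarily chose.

```perl
use strict;
use warnings;

my $line = '<title>News</title>';

# Greedy: runs to the LAST '>' on the line.
my ($greedy_name) = $line =~ /<(.*)>/;       # 'title>News</title'

# Lazy: stops at the first '>' that lets the match succeed.
my ($lazy_name)   = $line =~ /<(.*?)>/;      # 'title'

# Negated class: can never cross a '>' at all.
my ($class_name)  = $line =~ /<([^>]*)>/;    # 'title'

print "greedy : '$greedy_name'\n";
print "lazy   : '$lazy_name'\n";
print "class  : '$class_name'\n";
```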


Story

In a parser that extracted quoted strings, the pattern /"(.*)"/ unexpectedly captured everything between the first and the last quote on the line. As a result, the text was split incorrectly until the pattern was replaced with /"(.*?)"/.
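
The behavior described here can be sketched as follows. The sample line and variable names are invented for illustration.

```perl
use strict;
use warnings;

my $line = 'say "hello" and "goodbye" now';

# Greedy: one capture spanning from the first quote to the last.
my @greedy = $line =~ /"(.*)"/g;     # one element: 'hello" and "goodbye'

# Lazy: each quoted string is captured separately.
my @lazy   = $line =~ /"(.*?)"/g;    # two elements: 'hello', 'goodbye'

print "greedy : ", join(' | ', @greedy), "\n";
print "lazy   : ", join(' | ', @lazy),   "\n";
```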


Story

In an automated CSV parser that supported quoted fields, a greedy pattern caused multiple columns to merge into one. The limitation of the parser surfaced only on large data sets; switching the pattern to a lazy quantifier solved the problem.
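
One plausible reason such a bug stays hidden is that greedy and lazy patterns agree as long as a row contains a single quoted field. The sketch below uses two made-up rows to show this; it assumes fields never contain escaped quotes, which a simple lazy pattern would not handle.

```perl
use strict;
use warnings;

# Hypothetical sample rows; only the second has two quoted fields.
my $one_quoted = '1,"Smith, John",42';
my $two_quoted = '2,"Doe, Jane","New York, NY",37';

# With a single quoted field, greedy and lazy give the same result.
my @greedy_one = $one_quoted =~ /"(.*)"/g;     # 'Smith, John'
my @lazy_one   = $one_quoted =~ /"(.*?)"/g;    # 'Smith, John'

# With two quoted fields, greedy merges them into one column.
my @greedy_two = $two_quoted =~ /"(.*)"/g;     # 'Doe, Jane","New York, NY'
my @lazy_two   = $two_quoted =~ /"(.*?)"/g;    # 'Doe, Jane', 'New York, NY'

print "one quoted field, greedy : ", join(' | ', @greedy_one), "\n";
print "one quoted field, lazy   : ", join(' | ', @lazy_one),   "\n";
print "two quoted fields, greedy: ", join(' | ', @greedy_two), "\n";
print "two quoted fields, lazy  : ", join(' | ', @lazy_two),   "\n";
```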