In Perl, the quantifiers in regular expressions (*, +, ?, {n,m}) are greedy by default: each one matches as many characters as possible while still allowing the rest of the pattern to match.
Adding ? after a quantifier (*?, +?, ??, {n,m}?) makes it lazy (non-greedy): it matches as few characters as possible while still allowing the entire regex to match.
my $str = 'foo <bar> baz <quux>';
$str =~ /<.*>/;   # greedy: matches '<bar> baz <quux>'
$str =~ /<.*?>/;  # lazy: matches '<bar>'
A greedy expression can "eat" more than you expect when parsing HTML or other nested constructs!
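The same rule applies to every quantifier, not only to *. The following is a minimal sketch that compares the greedy and lazy form of +, {n,m} and ? on one test string; the string and the variable names are chosen here purely for illustration.

```perl
use strict;
use warnings;

my $str = 'aaaa';

# Greedy forms take as much as the pattern allows.
my ($plus_greedy)  = $str =~ /(a+)/;        # 'aaaa'
my ($range_greedy) = $str =~ /(a{2,3})/;    # 'aaa'
my ($opt_greedy)   = $str =~ /(a?)/;        # 'a'

# Lazy forms (quantifier followed by ?) take as little as possible.
my ($plus_lazy)  = $str =~ /(a+?)/;         # 'a'
my ($range_lazy) = $str =~ /(a{2,3}?)/;     # 'aa'
my ($opt_lazy)   = $str =~ /(a??)/;         # '' (empty string)

print "+     : '$plus_greedy' vs '$plus_lazy'\n";
print "{2,3} : '$range_greedy' vs '$range_lazy'\n";
print "?     : '$opt_greedy' vs '$opt_lazy'\n";
```

In each pair the lazy form returns the shortest string that still satisfies the quantifier's lower bound, because nothing after the group forces it to take more.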
What is the difference between the following two regexes when they are applied to the string <a><b><c>: /<(.*)>/ and /<(.*?)>/?
Answer:
/<(.*)>/ (greedy) captures the largest possible block: the match is <a><b><c>, so $1 is a><b><c.
/<(.*?)>/ (lazy) stops at the first closing bracket: the match is <a>, so $1 is a.
Example:
my $s = '<a><b><c>';
$s =~ /<(.*)>/;   # $1: 'a><b><c'
$s =~ /<(.*?)>/;  # $1: 'a'
Story
In a news headline import application, the programmer wanted to extract the tag name from the string <title>News</title> using /\<(.*)\>/. As a result, the regex captured everything between the first < and the last >, rather than the desired tag name. The error was found when nested tags appeared.
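The behaviour in this story can be reproduced with a short sketch. The variable names are hypothetical, and the backslashes before < and > are dropped because neither character needs escaping in a Perl regex.

```perl
use strict;
use warnings;

my $html = '<title>News</title>';

# Greedy: .* runs to the end of the string and backs off to the last '>'.
my ($greedy_name) = $html =~ /<(.*)>/;

# Lazy: .*? stops at the first '>' that lets the match succeed.
my ($lazy_name) = $html =~ /<(.*?)>/;

print "greedy: $greedy_name\n";   # greedy: title>News</title
print "lazy:   $lazy_name\n";     # lazy:   title
```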
Story
In a logical parser for extracting quoted strings, the pattern /"(.*)"/ unexpectedly captured everything between the first and the last quote. As a result, the markup was split incorrectly until the pattern was replaced with /"(.*?)"/.
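A small sketch of the quoted-string case, assuming an input line with two quoted strings (the sample text and variable names are made up for illustration). With the /g modifier in list context, the lazy pattern returns every quoted string separately.

```perl
use strict;
use warnings;

my $text = 'say "hello" and "world"';

# Greedy: a single capture spanning from the first quote to the last one.
my ($greedy) = $text =~ /"(.*)"/;

# Lazy with /g in list context: one capture per quoted string.
my @lazy = $text =~ /"(.*?)"/g;

print "greedy: $greedy\n";              # greedy: hello" and "world
print "lazy:   ", join(' | ', @lazy), "\n";   # lazy:   hello | world
```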
Story
In an automatic CSV parser with support for quoted fields, a greedy pattern caused multiple columns to merge into one. The flaw surfaced only on large data sets; switching the pattern to a lazy quantifier solved the problem.
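The column merge can be sketched as follows. The story does not show the original pattern, so the regex and the sample line below are assumptions. This simplified version handles only fully quoted fields and ignores escaped quotes and unquoted fields.

```perl
use strict;
use warnings;

# Hypothetical CSV line in which every field is quoted.
my $line = '"Smith, John","42","NY"';

# Greedy: the first match swallows the whole line, so all columns merge.
my @greedy_fields = $line =~ /"(.*)"/g;

# Lazy: each match ends at the nearest closing quote.
my @lazy_fields = $line =~ /"(.*?)"/g;

printf "greedy: %d field(s)\n", scalar @greedy_fields;   # greedy: 1 field(s)
printf "lazy:   %d field(s)\n", scalar @lazy_fields;     # lazy:   3 field(s)
```

A line with only one quoted field behaves identically under both patterns, which may be why a bug like this can stay hidden until more varied data arrives.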