A common class of problems that people try to solve with regular expressions is finding all occurrences of a certain pattern, but only within the occurrences of another pattern. This example illustrates the problem with a block of HTML that contains div tags, with paragraph tags both inside and outside those div tags. We want to match all the paragraphs within the div tags, but not those outside them.
A problem like this is best solved using two regular expressions. Use one regular expression to match the div tags. Use a second regular expression to match the paragraph tags within the matches of the first regular expression. A bit of procedural code glues everything together.
RegexMagic can generate only one regular expression at a time. So we’ll tackle this problem in two steps. We’ll create the two regexes separately and have RegexMagic generate a source code snippet for each. We’ll combine the two code snippets ourselves.
<h1>Matching Something Within Something Else</h1>
<p>Introduction</p>
<div>
<p>We want <i>this</i> paragraph.</p>
<table><tr><td>We</td><td>don't</td><td>want</td><td>tables</td></tr></table>
<p>We want this one too.</p>
</div>
<p>We don't want this one.</p>
<p>Nor this one.</p>
<div>
<p>Another one we want.</p>
<p>:-)</p>
</div>
<p>The end.</p>
<div>\p{Any}*?</div>
Required options: Case insensitive.
Unused options: Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Greedy quantifiers.
preg_match_all('%<div>\p{Any}*?</div>%ui', $subject, $div, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($div[0]); $i++) {
    # Matched text = $div[0][$i];
}
<p>\p{Any}*?</p>
Required options: Case insensitive.
Unused options: Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Greedy quantifiers.
preg_match_all('%<p>\p{Any}*?</p>%ui', $div[0][$i], $p, PREG_PATTERN_ORDER);
$p = $p[0];
preg_match_all('%<div>\p{Any}*?</div>%ui', $subject, $div, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($div[0]); $i++) {
    preg_match_all('%<p>\p{Any}*?</p>%ui', $div[0][$i], $p, PREG_PATTERN_ORDER);
    $p = $p[0];
}
$pwithindiv = array();
preg_match_all('%<div>\p{Any}*?</div>%ui', $subject, $div, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($div[0]); $i++) {
    preg_match_all('%<p>\p{Any}*?</p>%ui', $div[0][$i], $p, PREG_PATTERN_ORDER);
    $pwithindiv = array_merge($pwithindiv, $p[0]);
}
Though this example requires a lot of steps, it’s all very straightforward. You simply generate two regexes independently, one to match the outer text, and one to match the inner text. In your source code you combine the two regexes to make the second regex search through only text matched by the first regex.
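As a rough illustration of how the pieces fit together, the following sketch wraps the two-regex approach in a small helper and runs it on the sample HTML from this example. The function name match_within and its parameters are hypothetical, chosen only for this illustration; they are not something RegexMagic generates. The sketch assumes PHP 7 or later with PCRE Unicode support, which the u modifier requires.

```php
<?php
// Hypothetical helper (not RegexMagic output): returns every match of
// $inner that lies inside a match of $outer.
function match_within(string $outer, string $inner, string $subject): array
{
    $found = [];
    // Step 1: collect the outer sections (here: the div blocks).
    preg_match_all($outer, $subject, $sections, PREG_PATTERN_ORDER);
    foreach ($sections[0] as $section) {
        // Step 2: search only the text matched by the outer regex.
        preg_match_all($inner, $section, $matches, PREG_PATTERN_ORDER);
        $found = array_merge($found, $matches[0]);
    }
    return $found;
}

// The sample text used throughout this example.
$subject = <<<'HTML'
<h1>Matching Something Within Something Else</h1>
<p>Introduction</p>
<div>
<p>We want <i>this</i> paragraph.</p>
<table><tr><td>We</td><td>don't</td><td>want</td><td>tables</td></tr></table>
<p>We want this one too.</p>
</div>
<p>We don't want this one.</p>
<p>Nor this one.</p>
<div>
<p>Another one we want.</p>
<p>:-)</p>
</div>
<p>The end.</p>
HTML;

$pwithindiv = match_within(
    '%<div>\p{Any}*?</div>%ui',  // outer regex: the div tags
    '%<p>\p{Any}*?</p>%ui',      // inner regex: the paragraph tags
    $subject
);

// Print one matched paragraph per line.
echo implode("\n", $pwithindiv), "\n";
```

On the sample text this should yield the four paragraphs inside the two div blocks and none of the paragraphs outside them. Keeping the outer and inner regexes as separate arguments means the same helper could be reused for other "something within something else" searches.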
If you’re not developing software, you can apply the same method with an advanced grep tool such as PowerGREP that lets you use more than one regular expression in a search. In PowerGREP, set the “action type” to “search”. Then set “file sectioning” to “search for sections”. Paste the regular expression that matches the outer text (in this example the one for the div tags) into the “section search” box. Then paste the regex for the inner text (the one for p tags) into the main part of the action in PowerGREP. When you execute this action, PowerGREP uses the file sectioning regex to find all the div tags, and then uses the main regex to find the paragraphs within the div tags only. This is no different from what our PHP code does, except that it requires no programming.