As the heading says, our task is to replace relative URLs in an HTML string with absolute URLs. We use Java and regular expressions to solve it.
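First we create a regular expression to find the URLs we want to update:
(href=)("|')([^http].*?)("|')
What does it do? It looks for "href" attributes and splits each match into four groups: the attribute name with the equals sign ("href="), the opening quote (single or double), the attribute value, and the closing quote. The intent is to match only values that do not start with "http". Note, however, that [^http] is a character class: it rejects any value whose first character is h, t or p, so a relative link such as "page.html" is skipped too. A negative lookahead expresses the intent exactly:
(href=)("|')((?!http).*?)("|')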
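To make the grouping concrete, here is a small sketch that prints group 3 (the attribute value) for a few sample tags. It compares the pattern from the text with a negative-lookahead variant; the class name, the helper firstValue and the sample tags are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefPatternDemo {
    // Pattern from the text: [^http] excludes values whose FIRST character is h, t or p.
    static final Pattern CHAR_CLASS =
        Pattern.compile("(href=)(\"|')([^http].*?)(\"|')");
    // Assumed alternative: (?!http) excludes only values that begin with "http".
    static final Pattern LOOKAHEAD =
        Pattern.compile("(href=)(\"|')((?!http).*?)(\"|')");

    /** Returns group 3 (the attribute value) of the first match, or null if nothing matches. */
    static String firstValue(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "<a href=\"about.html\">",         // relative, starts with 'a'
            "<a href='page.html'>",            // relative, starts with 'p'
            "<a href=\"http://example.com/\">" // already absolute
        };
        for (String s : samples) {
            System.out.println(s + " -> " + firstValue(CHAR_CLASS, s)
                + " | " + firstValue(LOOKAHEAD, s));
        }
    }
}
```

Both patterns capture about.html and both skip the absolute link, but only the lookahead variant captures page.html.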
Next we need Java code that applies the regular expression.
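import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String replaceRelativeLinksWithAbsolute(String html, String url) {
    // (?!http) is a negative lookahead: unlike the character class [^http],
    // it rejects only values that actually begin with "http".
    Pattern p = Pattern.compile("(href=)(\"|')((?!http).*?)(\"|')");
    Matcher m = p.matcher(html);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
        // quoteReplacement keeps '$' and '\' in the URL or path from being
        // interpreted as group references by appendReplacement
        m.appendReplacement(sb, Matcher.quoteReplacement(createReplacement(url, m)));
    }
    m.appendTail(sb); // copy whatever follows the last match
    return sb.toString();
}

String createReplacement(String url, MatchResult m) {
    // The pattern always defines four groups, so no groupCount() check is needed.
    StringBuilder r = new StringBuilder();
    r.append(m.group(1)); // href=
    r.append(m.group(2)); // opening quote
    r.append(url);        // absolute prefix (domain), expected to end with "/"
    r.append(m.group(3)); // existing relative path
    r.append(m.group(4)); // closing quote
    return r.toString();
}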
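For readers on Java 9 or later, the find/append loop can probably be shortened with Matcher.replaceAll, which accepts a function from MatchResult to the replacement string. The sketch below is one possible compact variant, not the code above verbatim: the class name RelativeLinks and the sample HTML are made up, and it assumes the lookahead pattern (?!http) and a base URL that ends with a slash.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RelativeLinks {
    // Compiled once and reused; (?!http) skips values that already start with "http".
    private static final Pattern HREF =
        Pattern.compile("(href=)(\"|')((?!http).*?)(\"|')");

    static String replaceRelativeLinksWithAbsolute(String html, String url) {
        // The returned string is still parsed for '$' and '\', hence quoteReplacement.
        return HREF.matcher(html).replaceAll(m ->
            Matcher.quoteReplacement(
                m.group(1) + m.group(2) + url + m.group(3) + m.group(4)));
    }

    public static void main(String[] args) {
        String html = "<a href=\"docs/index.html\">Docs</a> "
            + "<a href='http://other.org/'>Other</a>";
        System.out.println(replaceRelativeLinksWithAbsolute(html, "http://example.com/"));
        // prints: <a href="http://example.com/docs/index.html">Docs</a> <a href='http://other.org/'>Other</a>
    }
}
```

The base URL is simply prepended, so paths such as "../x.html" or "/x.html" are not normalized; resolving them properly would need something like java.net.URI.resolve.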
TODO: the regular expression still needs to be updated so that it leaves anchors ("href" values starting with "#") and javascript: links unchanged.
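One way the open point could be handled is to widen the exclusion so that the pattern also skips values starting with "#", with "//", or with any URL scheme (http:, https:, javascript:, mailto: and so on). This is only a sketch under those assumptions; the class name SkipSpecialLinks, the scheme sub-pattern and the sample HTML are illustrative choices.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SkipSpecialLinks {
    // The lookahead rejects values that start with "#", with "//",
    // or with a scheme such as "http:", "javascript:" or "mailto:".
    static final Pattern HREF = Pattern.compile(
        "(href=)(\"|')((?!#|//|[a-zA-Z][a-zA-Z0-9+.-]*:).*?)(\"|')");

    static String replaceRelativeLinksWithAbsolute(String html, String url) {
        Matcher m = HREF.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement =
                m.group(1) + m.group(2) + url + m.group(3) + m.group(4);
            // Quote the replacement so '$' and '\' are inserted literally.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"#top\">Top</a>"
            + "<a href=\"javascript:void(0)\">JS</a>"
            + "<a href=\"img/logo.png\">Logo</a>";
        System.out.println(replaceRelativeLinksWithAbsolute(html, "http://example.com/"));
        // prints: <a href="#top">Top</a><a href="javascript:void(0)">JS</a><a href="http://example.com/img/logo.png">Logo</a>
    }
}
```

For anything beyond simple pages, an HTML parser is likely a safer tool than a regular expression, since attributes can contain unquoted values or extra whitespace around the equals sign.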