RegEx

From Seobility Wiki
Revision as of 11:35, 11 June 2021 by Ralph.ebnet (talk | contribs)


Definition

RegEx stands for 'regular expression' and is a method used by programmers to define search patterns. Regex is useful for extracting information from large blocks of data. Data can take many forms, whether that be plain text, files, or code. A regex search pattern is much more powerful and flexible than simple string searches, such as the search queries typically used with search engines.

For example, software can store a password policy as a regular expression that specifies which character combinations a password must contain. For such a password rule, the expression could look as follows:

(?=^.{8,}$)((?=.*\d)|(?=.*\W+))(?![.\n])(?=.*[A-Z])(?=.*[a-z]).*$

This rule combines several requirements, such as a minimum length of 8 characters and the use of both upper and lower case letters. For example, the expression .{8,} means that any character (symbolized by the dot) must occur eight times or more ({8,}), while (?=.*[A-Z]) requires at least one uppercase letter somewhere in the password.
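To see how such a rule behaves, the expression shown above can be tried out in a few lines of Python. This is only a sketch: the function name and the sample passwords are made up for illustration, and other regex implementations may treat individual parts of the expression slightly differently.

```python
import re

# The password rule from the text, compiled with Python's built-in re module.
PASSWORD_RULE = re.compile(
    r"(?=^.{8,}$)((?=.*\d)|(?=.*\W+))(?![.\n])(?=.*[A-Z])(?=.*[a-z]).*$"
)


def is_valid_password(candidate: str) -> bool:
    """Return True if the candidate satisfies every part of the rule."""
    return PASSWORD_RULE.match(candidate) is not None


# Hypothetical sample passwords, each chosen to test one requirement.
for candidate in ["Passw0rd!", "Sh0rt", "alllowercase1", "NoDigitsHere"]:
    print(candidate, is_valid_password(candidate))
```

Only the first sample passes: the second is too short, the third has no uppercase letter, and the fourth contains neither a digit nor a special character.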

Components of regular expressions

Regular expressions are available in many programming languages, but their exact implementation can differ, so some characters behave differently from one implementation to another. Many components, however, work the same way almost everywhere. Below are some common regular expression components.

Anchors

Anchors specify where in a string a match must occur. Regex was originally developed for line-based tools, so many of its features are built around searching within lines. To find the character "A" at a specific position in a line, you can use the following anchors:

  • ^A - Match at the beginning of a line.
  • A$ - Match at the end of a line.
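The two anchors can be tried out with a short Python sketch. The sample lines are invented for illustration. Note that in Python, ^ and $ only match at every line of a multi-line string when the re.MULTILINE flag is set; other implementations may have a different default.

```python
import re

# Hypothetical sample lines.
lines = ["delta", "Alpha", "BETA", "Gamma A"]

# Check each line on its own: ^A needs "A" at the start, A$ needs "A" at the end.
starts_with_a = [line for line in lines if re.search(r"^A", line)]
ends_with_a = [line for line in lines if re.search(r"A$", line)]

print(starts_with_a)  # ['Alpha']
print(ends_with_a)    # ['BETA', 'Gamma A']

# In one multi-line string, ^ matches only at the very start of the text
# unless re.MULTILINE is given.
text = "\n".join(lines)
print(re.findall(r"^A", text))                      # []
print(re.findall(r"^A", text, flags=re.MULTILINE))  # ['A']
```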

Character sets

Character sets let you define which characters are allowed at a given position in the pattern. For example, [0-9] matches any single digit. Sets work for letters as well, which makes them useful for matching letter ranges or alternate spellings. For example, gr[ae]y will match both 'gray' and 'grey'.

  • [0-9] - Match a range of numbers from 0-9.
  • [a-z] - Match lowercase letters from a-z.
  • [A-Z] - Match uppercase letters from A-Z.
  • . - Match any character except line break characters. This applies to the bare dot only: inside brackets, [.] matches a literal dot.
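A minimal Python sketch of the bracketed sets, using invented sample strings:

```python
import re

# gr[ae]y accepts either "a" or "e" in the third position.
spellings = re.findall(r"gr[ae]y", "gray, grey, groy")
print(spellings)  # ['gray', 'grey']

# [0-9]+ matches runs of one or more digits.
numbers = re.findall(r"[0-9]+", "Order 66 shipped in 2021")
print(numbers)  # ['66', '2021']

# An uppercase letter followed by one or more lowercase letters.
capitalized = re.findall(r"[A-Z][a-z]+", "Visit Berlin or Munich")
print(capitalized)  # ['Visit', 'Berlin', 'Munich']
```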

Modifiers

Modifiers can be used to alter the behavior of regex strings. They are typically wrapped in parentheses and start with a question mark. Many modifiers are implementation-dependent, but below are some common examples.

  • (?i) - Turns off case sensitivity.
  • (?s) - Makes the dot character match line break characters as well.
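Because modifiers differ between implementations, the following sketch shows the spelling used by Python's re module, where the case-insensitivity modifier is written (?i) and (?s) lets the dot match line breaks. The sample strings are invented.

```python
import re

# Without a modifier, matching is case-sensitive.
case_sensitive = re.search(r"regex", "RegEx")
# (?i) at the start of the pattern turns case sensitivity off.
case_insensitive = re.search(r"(?i)regex", "RegEx")

# By default the dot does not match a line break ...
dot_default = re.search(r"a.b", "a\nb")
# ... but with (?s) it does.
dot_all = re.search(r"(?s)a.b", "a\nb")

print(case_sensitive)            # None
print(case_insensitive.group())  # RegEx
print(dot_default)               # None
print(dot_all is not None)       # True
```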

Chaining regular expressions

Regular expressions can be chained together as alternatives using the pipe character (|), a technique known as alternation. This allows a single regex string to accept multiple search options. For example, the regex string '(string1|string2|string3)' will match 'string1', 'string2', or 'string3' within the same query, rather than requiring 3 separate queries. Alternatives can be combined with any other regex component, and there is virtually no limit on how many can be used.
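The pipe example from the text can be run as a single query in Python. The sample sentence is invented for illustration.

```python
import re

# One pattern that accepts any of the three alternatives.
pattern = re.compile(r"(string1|string2|string3)")

text = "string2 comes first, then string1, but string4 is ignored"
matches = pattern.findall(text)

print(matches)  # ['string2', 'string1']
```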

Quantifiers in regular expressions

Quantifiers allow you to specify how many times a particular part of a regex string must match. Quantifiers usually come in two variants: greedy and lazy. By default, regex matching is greedy and will match as much as possible, which is not always the desired behavior. Lazy quantifiers, which most implementations write by appending a question mark (for example *? or +?), match as little as possible instead. The number of repetitions itself is specified with characters such as:

  • * - Match 0 or more times.
  • + - Match 1 or more times.
  • {n} - Match exactly n times.
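The difference between greedy and lazy matching, and between the individual quantifiers, is easiest to see on a small example. The following Python sketch uses an invented HTML snippet; the lazy variant is written by appending a question mark to the quantifier.

```python
import re

html = "<b>bold</b>"

# Greedy: .+ grabs as much as possible, up to the last ">".
greedy = re.search(r"<.+>", html).group()
# Lazy: .+? stops at the first ">" it can.
lazy = re.search(r"<.+?>", html).group()

print(greedy)  # <b>bold</b>
print(lazy)    # <b>

# * allows zero occurrences, + requires at least one.
print(re.fullmatch(r"ab*", "a") is not None)  # True
print(re.fullmatch(r"ab+", "a") is not None)  # False

# {3} requires exactly three repetitions.
print(re.fullmatch(r"a{3}", "aaa") is not None)   # True
print(re.fullmatch(r"a{3}", "aaaa") is not None)  # False
```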

Advanced regular expressions

Depending on the implementation, regular expressions also support advanced concepts, such as recursion, backreferencing, grouping, subroutines, conditionals, and more. These features allow you to find very specific information within large data sets, and you can even create regex strings to find results within results.
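Two of these concepts, grouping and backreferencing, can be sketched in Python as follows. The sample strings are invented. Recursion and subroutines are left out because Python's built-in re module does not provide them.

```python
import re

# Grouping: named groups pull the individual parts out of a date.
date_match = re.search(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})",
    "Revised on 2021-06-11",
)
print(date_match.group("year"), date_match.group("month"))  # 2021 06

# Backreferencing: \1 must repeat whatever the first group matched,
# which finds accidentally doubled words.
repeated = re.search(r"\b(\w+) \1\b", "this is is a test")
print(repeated.group(1))  # is
```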

Example of a regular expression in web server administration

Regex can be very useful for web server administrators as a tool to facilitate routing and searching for information. Log files follow typical patterns, so regex can be an easy way to help find specific messages in files that can be very large, like access logs.
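As a rough sketch of this kind of log search, the following Python code filters failed requests out of a few log lines. The log lines are invented and assume the widely used Common Log Format; a real server may be configured to log in a different format, in which case the pattern would need to be adjusted.

```python
import re

# Hypothetical access log lines (Common Log Format assumed).
LOG_LINES = [
    '203.0.113.7 - - [11/Jun/2021:11:35:00 +0000] "GET /index.php?p=123 HTTP/1.1" 200 5120',
    '203.0.113.9 - - [11/Jun/2021:11:36:12 +0000] "GET /missing-page HTTP/1.1" 404 312',
    '198.51.100.4 - - [11/Jun/2021:11:37:45 +0000] "POST /login HTTP/1.1" 500 0',
]

# One named group per field of interest.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+$'
)


def failed_requests(lines):
    """Return (status, path) for every line with a 4xx or 5xx status code."""
    results = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and match.group("status")[0] in "45":
            results.append((int(match.group("status")), match.group("path")))
    return results


print(failed_requests(LOG_LINES))  # [(404, '/missing-page'), (500, '/login')]
```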

Regex is also used by server software, such as Apache. Apache uses .htaccess files for rewrite rules and rewrite conditions, which dictate how the server should respond to requests. Regex can be used in .htaccess files to interpret incoming URL access requests and re-route or reject them as needed.

For example, a typical line in a .htaccess file that uses regex to aid in routing requests looks like this:

RewriteRule ^index\.php$ - [L]

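This rule uses regex to match requests whose path is exactly index.php: ^ and $ anchor the pattern to the start and end of the path, and \. matches a literal dot. In a .htaccess file the pattern is compared against the request path without the query string, so a request for index.php?p=123 still matches. The dash means the URL is left unchanged, and [L] stops further rules from being processed. Without mod_rewrite, a typical URL might look like: www.example.com/index.php?p=123. Combined with further rewrite rules, regex and mod_rewrite allow the same page to be accessed using a URL more like: www.example.com/my-blog-post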
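The pattern from this rule can be tried out in isolation with Python. This sketch only shows how the regex itself behaves on a path string; it does not emulate Apache, which removes the query string and, in .htaccess files, the directory prefix before matching. The sample paths are invented.

```python
import re

# The pattern part of the RewriteRule shown in the text.
PATTERN = re.compile(r"^index\.php$")

# Hypothetical request paths, already stripped of any query string.
paths = ["index.php", "index.phpx", "blog/index.php", "indexXphp"]

results = {path: PATTERN.search(path) is not None for path in paths}
print(results)
# {'index.php': True, 'index.phpx': False, 'blog/index.php': False, 'indexXphp': False}
```

The last sample fails because the escaped dot (\.) only matches a literal dot, not an arbitrary character.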

Importance for SEO

URLs that contain a page's main keywords are very beneficial for SEO. Search engines, like Google, consider keywords in URLs when ranking pages. Using keywords in URLs helps both search engines and regular users understand what sort of content to expect when they access a page.

Regex is used in .htaccess files to help achieve this. Popular content management systems (CMS), such as WordPress, use this approach, which lets admins create and edit custom URL slugs for pages without having to edit any files. For example, when searching for results on an e-commerce site, behind the scenes the parameters of a search query might be included in the URL, generating a URL such as: www.example.com/results.php?category=461&color=green&size=l

Rewrite rules in .htaccess use regex to identify the individual components of a URL string, such as color=green and category=461, so that the page can be offered under a much better URL for users and search engines, like: www.example.com/search-results/toys/green/large. Behind the scenes, everything will work the same, but this URL is much better for SEO: it is cleaner and makes the content of the page clearer.
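The idea of mapping a clean URL back to the internal query parameters can be sketched in Python. This illustrates the regex step only and is not how Apache is configured: the function name and the two lookup tables (category name to ID, size name to code) are assumptions invented for this example, and in practice such lookups would usually be handled by the application.

```python
import re
from typing import Optional

# Hypothetical lookup tables, invented for illustration.
CATEGORY_IDS = {"toys": 461}
SIZE_CODES = {"small": "s", "medium": "m", "large": "l"}

# One named group per component of the clean URL.
CLEAN_URL = re.compile(
    r"^search-results/(?P<category>[a-z-]+)/(?P<color>[a-z]+)/(?P<size>[a-z]+)/?$"
)


def to_internal_url(path: str) -> Optional[str]:
    """Translate a clean path into the query-string form, or return None."""
    match = CLEAN_URL.match(path)
    if match is None:
        return None
    category_id = CATEGORY_IDS.get(match.group("category"))
    size_code = SIZE_CODES.get(match.group("size"))
    if category_id is None or size_code is None:
        return None
    color = match.group("color")
    return f"results.php?category={category_id}&color={color}&size={size_code}"


print(to_internal_url("search-results/toys/green/large"))
# results.php?category=461&color=green&size=l
```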
