LPI E - Data Manipulation
3.2 Searching and Extracting Data from Files
Regular Expressions (regex)
Part 2 of 2
Part 1: Data Manip Tools
Regular expressions (regex) are patterns that describe a set of strings, and they are a powerful tool for searching and manipulating text in Linux. They are commonly used in commands like grep, sed, awk, and others.
They can be used in combination with Linux terminal commands to filter and manipulate text output.
Here are some basic examples of how to use regular expressions with Linux terminal commands, often together with I/O redirection (a way to redirect the input or output of a command to a file or another command). Note that grep uses basic regular expressions by default; the +, ?, |, ( ) and { } operators require extended regular expressions, which are enabled with grep -E.
The dot (.) character
The dot matches any single character. For example, the regular expression "c.t" matches any string that contains a "c" followed by any single character and then a "t".
This would match "cat", "cot", "cut", and so on.
Matching any character:
$ ls | grep '.*\....$'
In this command, the dot (.) character matches any single character, and the asterisk (*) character matches zero or more of the preceding element (in this case, any character).
The backslash (\) character is used to escape the dot and match it literally. The three dots after the escaped dot match the three-character extension of the file, and the dollar sign ($) anchors the match to the end of the file name.
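As a rough illustration of how the escaped dot and the plain dots interact, the sketch below feeds a few made-up file names to grep on standard input (standing in for the output of ls). The file names and variable names are assumptions for this example only.

```shell
# Hypothetical file names standing in for the output of ls.
names='notes.txt
archive.tar.gz
script.sh
README
photo.jpeg'

# \. is a literal dot; each following . is one arbitrary character.
# The trailing $ pins the match to the end of the name, so only names
# ending in a dot plus exactly three characters are selected.
three_char_ext=$(printf '%s\n' "$names" | grep '\....$')

# Without the $ anchor and with only two dots, any name that has a dot
# followed by at least two more characters is selected (-c counts lines).
dot_plus_two=$(printf '%s\n' "$names" | grep -c '\...')

printf 'three-character extension: %s\n' "$three_char_ext"
printf 'names with a dot and two more characters: %s\n' "$dot_plus_two"
```

Only notes.txt ends in a dot followed by exactly three characters; four of the five names contain a dot followed by at least two characters.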
Matching a specific character after a pattern:
$ grep 'color.' file.txt
In this command, the dot (.) character matches any single character, so the regular expression color. matches
- "colorful"
- "color2"
- "color_"
- ... and so on.
Matching a sequence of characters with a variable character in the middle
$ grep 'app.*store' file.txt
In this command, the dot (.) character matches any single character, and the asterisk (*) character matches zero or more of the previous character (in this case, any character). The regular expression app.*store matches
- "app store"
- "app-store"
- "app:store"
- ... and so on.
Matching a specific pattern with a variable number of characters
$ grep 'p.*t' file.txt
In this command, the dot (.) character matches any single character, and the asterisk (*) character matches zero or more of the previous character (in this case, any character). The regular expression p.*t matches
- "put"
- "pit"
- "port"
- "part"
- ... and so on.
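Because .* takes as many characters as it can, the matched text can be longer than a single word. The sketch below uses grep -o (print only the matched part, available in GNU and BSD grep) on a made-up sentence; p[^ ]*t is an assumed alternative that cannot cross a space.

```shell
# Hypothetical sample sentence.
sentence='put the pot on'

# .* is greedy: the match runs from the first "p" to the last "t".
greedy=$(printf '%s\n' "$sentence" | grep -o 'p.*t')

# [^ ]* cannot cross a space, so each word is matched separately.
per_word=$(printf '%s\n' "$sentence" | grep -o 'p[^ ]*t')

printf 'greedy match: %s\n' "$greedy"
printf 'per-word matches:\n%s\n' "$per_word"
```

The greedy pattern returns "put the pot" as one match, while the space-limited pattern returns "put" and "pot" separately.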
Matching a specific pattern with a fixed number of characters:
$ grep 'c..t' file.txt
In this command, each of the two dot (.) characters matches any single character, so the regular expression c..t matches
- "coat"
- "cart"
- "cost"
- ... and so on
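A small check of the fixed-width pattern, using a made-up word list. The -x option (match the whole line) is a standard grep option added here for comparison; it is not part of the original command.

```shell
# Hypothetical word list, one word per line.
words='coat
cart
city
cat
cattle'

# Unanchored: any line containing "c", two characters, then "t".
contains=$(printf '%s\n' "$words" | grep 'c..t')

# -x requires the pattern to match the whole line, so the line must be
# exactly four characters long.
whole_line=$(printf '%s\n' "$words" | grep -x 'c..t')

printf 'lines containing a match:\n%s\n' "$contains"
printf 'lines matching entirely:\n%s\n' "$whole_line"
```

Here "city" is not selected because its fourth character is "y", and "cat" is too short. "cattle" is selected by the unanchored pattern because it contains "catt".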
The asterisk (*) character
The asterisk matches zero or more occurrences of the preceding element. For example, the regular expression "ab*" matches "a", "ab", "abb", "abbb", and so on.
Matching any number of characters
$ ls | grep 'file.*'
In this command, the dot (.) character matches any single character, and the asterisk (*) character matches zero or more of the previous character (in this case, any character). The regular expression file.* matches
- "file.txt"
- "file.doc"
- "file.html"
- ... and so on.
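Since grep selects any line that contains a match, file.* also selects names that merely contain "file" somewhere. The sketch below compares it with an anchored pattern on made-up names; the ^ anchor is an addition for comparison.

```shell
# Hypothetical file names standing in for the output of ls.
names='file.txt
myfile
profile.doc
notes.md'

# Unanchored: every name containing "file" is counted.
unanchored=$(printf '%s\n' "$names" | grep -c 'file.*')

# Anchored with ^: only names that start with "file" are selected.
anchored=$(printf '%s\n' "$names" | grep '^file')

printf 'names containing "file": %s\n' "$unanchored"
printf 'names starting with "file": %s\n' "$anchored"
```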
Matching any string of characters
$ grep 'color.*' file.txt
In this command, the dot (.) character matches any single character, and the asterisk (*) character matches zero or more of the previous character (in this case, any character). The regular expression color.* matches
- "colorful"
- "color2"
- "color_"
- "colorless"
- "colorfulness"
- ... and so on.
Matching zero or more occurrences of a character
$ grep 'red.*blue' file.txt
In this command, the dot (.) character matches any single character, and the asterisk (*) character matches zero or more of the previous character (in this case, any character). The regular expression red.*blue matches
- "redblue"
- "red:blue"
- "red and blue"
- "red-green-blue"
- ... and so on.
Matching any number of digits
$ grep '[0-9]*' file.txt
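In this command, the square brackets ([ ]) enclose a character class that matches any digit (0-9), and the asterisk (*) character matches zero or more of the preceding element (in this case, any digit). Because zero digits also counts as a match, grep '[0-9]*' selects every line, even lines with no digits; use [0-9][0-9]* to require at least one digit. The regular expression [0-9]* matches digit runs such as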
- "123"
- "0"
- "45678"
- ... and so on.
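The effect of "zero or more" on line selection can be checked with a few made-up lines. This is a minimal sketch; -c (count matching lines) is standard, and -o (print only the matched text) is available in GNU and BSD grep.

```shell
# Hypothetical input: two lines with digits, one without.
lines='order 123
no digits here
0'

# [0-9]* can match the empty string, so every line is selected.
all_lines=$(printf '%s\n' "$lines" | grep -c '[0-9]*')

# Requiring one digit before the starred class selects only lines with digits.
with_digit=$(printf '%s\n' "$lines" | grep -c '[0-9][0-9]*')

# -o prints just the digit runs that were matched.
digit_runs=$(printf '%s\n' "$lines" | grep -o '[0-9][0-9]*')

printf 'selected by [0-9]*: %s lines\n' "$all_lines"
printf 'selected by [0-9][0-9]*: %s lines\n' "$with_digit"
printf 'digit runs:\n%s\n' "$digit_runs"
```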
Matching any number of non-whitespace characters
$ grep '\S*' file.txt
In this command, \S is a GNU grep shorthand that matches any non-whitespace character (the backslash gives the uppercase S this special meaning). The asterisk (*) character matches zero or more of the preceding element (in this case, any non-whitespace character). The regular expression \S* matches
- The quick brown fox jumps over the lazy dog.
- 123 Main St.
- This line has a few spaces
- Special Characters !@#$%^&*()
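Note that because \S* can also match the empty string, this command returns every line, including lines that contain only whitespace characters (spaces, tabs, etc.). To skip such lines, require at least one non-whitespace character with the pattern \S.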
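The behavior with whitespace-only lines can be sketched as follows. To stay portable, this example uses the POSIX class [^[:space:]] as an assumed stand-in for the GNU shorthand \S; the sample function and its contents are made up.

```shell
# Hypothetical input: the second line holds only spaces, the third only a tab.
sample() {
    printf 'text\n   \n\t\nmore text\n'
}

# Zero or more non-whitespace characters: the empty match succeeds on
# every line, so all four lines are counted.
star_count=$(sample | grep -c '[^[:space:]]*')

# At least one non-whitespace character: whitespace-only lines are skipped.
nonblank_count=$(sample | grep -c '[^[:space:]]')

printf 'lines selected with *: %s\n' "$star_count"
printf 'lines with visible text: %s\n' "$nonblank_count"
```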
The plus (+) character
The plus matches one or more occurrences of the preceding element. For example, the regular expression "ab+" matches "ab", "abb", "abbb", and so on.
Matching one or more occurrences of a character
$ grep -E 'hello!+' file.txt
In this command, the -E option enables extended regular expressions, which grep needs in order to treat the plus (+) as an operator. The exclamation mark (!) matches the literal character, and the plus (+) character matches one or more of the previous character (in this case, the exclamation mark). The regular expression hello!+ matches
- "hello!"
- "hello!!"
- "hello!!!"
- ... and so on.
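The difference between basic and extended regular expressions for the plus sign can be seen with a few made-up lines. This is a minimal sketch; the input lines are assumptions.

```shell
# Hypothetical input lines.
lines='hello!
hello!!!
hello!+
hello'

# Basic regular expressions (plain grep): + is an ordinary character,
# so only the line containing a literal "+" is selected.
basic=$(printf '%s\n' "$lines" | grep 'hello!+')

# Extended regular expressions (grep -E): + means "one or more",
# so every line with at least one "!" after "hello" is selected.
extended=$(printf '%s\n' "$lines" | grep -E 'hello!+')

printf 'basic:\n%s\n' "$basic"
printf 'extended:\n%s\n' "$extended"
```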
Matching one or more digits
$ grep -E '[0-9]+' file.txt
In this command, the square brackets ([ ]) enclose a character class that matches any digit (0-9), and the plus (+) character matches one or more of the previous character (in this case, any digit). The regular expression [0-9]+ matches
- "123"
- "45678"
- "0"
- ... and so on.
Matching one or more non-whitespace characters
$ grep -E '\S+' file.txt
In this command, the backslash (\) character gives the uppercase S a special meaning: \S is a GNU grep shorthand that matches any non-whitespace character. The plus (+) character matches one or more of the preceding element (in this case, any non-whitespace character). The regular expression \S+ matches
- "The quick brown fox jumps over the lazy dog."
- "123 Main Street"
- "Email me at john@example.com"
- "The password is: 8#dkJ7f!2"
- "The total cost of the project is $10,000.00"
Matching one or more occurrences of a character class
$ grep -E 'red[0-9]+blue' file.txt
In this command, the square brackets ([ ]) enclose a character class that matches any digit (0-9), and the plus (+) character matches one or more of the previous character (in this case, any digit). The regular expression red[0-9]+blue matches
- "red1blue"
- "red123blue"
- "red0blue"
- ... and so on.
Matching one or more occurrences of a group of characters
$ grep -E '(hello|hi)+[[:space:]]+world' file.txt
In this command, the parentheses () enclose a group of characters (in this case, "hello" or "hi"), and the pipe (|) character separates alternative patterns within the group. The plus (+) character matches one or more occurrences of the previous pattern (in this case, the group of characters). The character class [[:space:]] matches any whitespace character, so the regular expression matches "hello world", "hi world", "hihi  world", and so on.
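A short check of the grouped pattern on made-up lines, assuming grep -E for extended regular expressions:

```shell
# Hypothetical input lines.
lines='hello world
hi world
hihi   world
helloworld
hey world'

# One or more repetitions of "hello" or "hi", then at least one
# whitespace character, then "world".
matched=$(printf '%s\n' "$lines" | grep -E '(hello|hi)+[[:space:]]+world')

printf '%s\n' "$matched"
```

"helloworld" is not selected because [[:space:]]+ needs at least one whitespace character, and "hey world" is not selected because "hey" is neither alternative.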
The question mark (?) character
The question mark matches zero or one occurrence of the preceding element. For example, the regular expression "ab?" matches "a" and "ab".
Matching zero or one occurrence of a character
$ grep -E 'colou?r' file.txt
In this command, the question mark (?) character matches zero or one occurrence of the previous character (in this case, the letter "u"). The regular expression colou?r matches both spellings, as in
- "The colors of the rainbow are beautiful."
- "The colours of the flag represent our country."
- "I like to paint with watercolors."
- "The color of the sky changes throughout the day."
- "The colour of the leaves in autumn is stunning."
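The optional "u" can be checked against a few made-up spellings, assuming grep -E:

```shell
# Hypothetical spellings, one per line.
lines='color
colour
colouur
colr'

# u? allows zero or one "u" between "colo" and "r".
matched=$(printf '%s\n' "$lines" | grep -E 'colou?r')

printf '%s\n' "$matched"
```

"colouur" has two "u" characters and "colr" lacks the second "o", so neither is selected.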
Matching a specific number of digits
$ grep -E '[0-9]{3}[0-9]?' file.txt
In this command, the curly braces ({ }) enclose the number of occurrences of the previous character or group of characters (in this case, exactly three of any digit [0-9]). The question mark (?) character makes the fourth digit optional, so the expression matches both three-digit and four-digit numbers. The same thing can be written as [0-9]{3,4}. The regular expression [0-9]{3}[0-9]? matches
- "123"
- "4567"
- "0001"
- ... and so on.
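One way to select three-digit and four-digit numbers is sketched below on made-up values, assuming grep -E. The -x option (match the whole line) is an addition for comparison, and [0-9]{3,4} is an assumed equivalent spelling.

```shell
# Hypothetical numbers, one per line.
numbers='12
123
4567
89012'

# Three digits plus an optional fourth; -x makes the whole line match.
optional_fourth=$(printf '%s\n' "$numbers" | grep -E -x '[0-9]{3}[0-9]?')

# The interval {3,4} expresses the same range directly.
interval=$(printf '%s\n' "$numbers" | grep -E -x '[0-9]{3,4}')

# Without -x, a longer number is also selected because it contains a match.
unanchored=$(printf '%s\n' "$numbers" | grep -c -E '[0-9]{3}[0-9]?')

printf 'whole-line matches:\n%s\n' "$optional_fourth"
printf 'lines containing a match: %s\n' "$unanchored"
```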
Matching a range of characters
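$ grep -E '[aeiou]?' file.txt
In this command, the square brackets ([ ]) enclose a character class that matches any of the enclosed characters (in this case, any vowel [aeiou]). The question mark (?) character makes the previous expression optional, so it matches with or without a vowel. Because an optional element can match nothing at all, this pattern on its own selects every line; it is mainly useful as part of a larger pattern.
The regular expression [aeiou]? matches
- "a"
- "e"
- "i"
- "o"
- "u"
- the empty string, so lines such as "word" and "vowel" are selected as well.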
Matching a pattern with or without a prefix
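$ grep -E '(pie)?apple' file.txt
In this command, the parentheses group "pie" so that the question mark (?) character makes the whole prefix optional, so the regular expression matches
- both "apple" and "pieapple".
Without the parentheses, pie?apple makes only the letter "e" optional and matches "piapple" and "pieapple".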
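The scope of the question mark can be compared directly. The sketch below assumes grep -E and uses -o (print only the matched text, available in GNU and BSD grep) on made-up words.

```shell
# Hypothetical words, one per line.
words='apple
pieapple
piapple'

# Grouped: the whole prefix "pie" is optional.
grouped=$(printf '%s\n' "$words" | grep -o -E '(pie)?apple')

# Ungrouped: only the "e" is optional, so "pi" is always required.
ungrouped=$(printf '%s\n' "$words" | grep -o -E 'pie?apple')

printf 'grouped:\n%s\n' "$grouped"
printf 'ungrouped:\n%s\n' "$ungrouped"
```

With the group, the third line still yields "apple", because the match can start after "pi". Without the group, the plain word "apple" is not matched at all.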
Matching a pattern with or without a suffix
$ grep -E 'exampled?' file.txt
In this command, the question mark (?) character makes the previous character (in this case, the letter "d") optional, so the regular expression matches
- both "example" and "exampled".
The square brackets ([ ]) character set
For example, the regular expression "[abc]" matches any single character that is either "a", "b", or "c".
Matching a character set
$ grep '[aeiou]' file.txt
The regular expression [aeiou] matches
- "apple"
- "pie"
- "computer"
- ... and so on.
$ grep -E '^[ab][a-z]{2}$' file.txt
In this command, the regular expression ^[ab][a-z]{2}$
matches any line that starts with "a" or "b", followed by exactly two lowercase letters and nothing else, such as "bat" or "ace". The square brackets enclose a range of characters (in this case, any lowercase letter [a-z]).
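A quick check of the anchored three-letter pattern on made-up words, assuming grep -E so that {2} is treated as a repetition count:

```shell
# Hypothetical words, one per line.
words='bat
ace
cat
bath
ba'

# ^ and $ tie the pattern to the whole line: "a" or "b", then exactly
# two lowercase letters, then the end of the line.
matched=$(printf '%s\n' "$words" | grep -E '^[ab][a-z]{2}$')

printf '%s\n' "$matched"
```

"cat" starts with the wrong letter, "bath" is one letter too long, and "ba" is one letter too short.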
$ grep '[^aeiou]' file.txt
In this command, the caret (^) inside the square brackets negates the character set, so it matches any character that is not a vowel. grep selects every line that contains at least one such character.
The regular expression [^aeiou]
matches
- "sky"
- "fly"
- "computer"
- ... and so on.
$ grep '[0-9a-zA-Z]' file.txt
In this command, the square brackets enclose a union of character sets (in this case, digits [0-9] and lowercase and uppercase letters [a-zA-Z]).
The regular expression [0-9a-zA-Z]
matches
- "word1"
- "4ever"
- "TeSt"
- ... and so on.
$ grep '[e.]' file.txt
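In this command, the dot (.) character sits inside square brackets, where it loses its special meaning and stands for a literal dot. The regular expression [e.]
matches any line containing an "e" or a ".", such as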
- "example"
- "eggs"
- "emacs"
- ... and so on.
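The difference between a dot inside and outside square brackets can be checked on a few made-up lines:

```shell
# Hypothetical input lines.
lines='abc
a.c
egg
e'

# Inside brackets the dot is literal: select lines containing "e" or ".".
in_brackets=$(printf '%s\n' "$lines" | grep '[e.]')

# Outside brackets the dot is a wildcard: "e" followed by any character.
wildcard=$(printf '%s\n' "$lines" | grep 'e.')

printf 'bracketed [e.]:\n%s\n' "$in_brackets"
printf 'unbracketed e.:\n%s\n' "$wildcard"
```

The single-character line "e" is selected by [e.] but not by e., which needs one more character after the "e".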
The caret (^) character
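The caret has two roles: outside square brackets it anchors a match to the beginning of a line, and as the first character inside square brackets it negates the set. For example, the regular expression "[^abc]" matches any single character that is not "a", "b", or "c".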
Matching the beginning of a line
$ grep '^The' file.txt
In this command, the caret (^) character matches the beginning of a line. The regular expression ^The matches
- "The quick brown fox"
- "The lazy dog"
- ... and so on.
Negating a character set
$ grep '^[^e]*$' file.txt
In this command, the regular expression ^[^e]*$ matches any line that does not contain the letter "e". The caret (^) character inside the square brackets negates the character set, so it matches any character that is not "e". The asterisk (*) character matches zero or more occurrences of the negated character set.
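A small sketch of the anchored negated set on made-up lines. The comparison with grep -v (invert the selection) is an addition, shown as another way to get the same lines.

```shell
# Hypothetical input; the third line is empty.
sample() {
    printf 'sky\ntree\n\nhello\nduck\n'
}

# Every character from start to end must be something other than "e".
# An empty line qualifies too, because * allows zero characters.
no_e=$(sample | grep -c '^[^e]*$')

# Inverting a search for "e" selects the same lines.
inverted=$(sample | grep -v -c 'e')

printf 'lines without "e": %s\n' "$no_e"
printf 'lines selected by grep -v: %s\n' "$inverted"
```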
Matching a string at the beginning of a file name
$ ls | grep '^README'
In this command, the caret (^) character matches the beginning of a line. When its output is piped, ls prints one file name per line, so the regular expression ^README matches all files whose names start with "README".
Matching a string at the end of a line
$ grep 'the end$' file.txt
In this command, the dollar sign ($) character matches the end of a line. The regular expression the end$ matches
- "This is the end"
- "It's not the end"
- ... and so on.
Matching a negated string
$ grep '^[^T]' file.txt
In this command, the caret (^) character outside the square brackets matches the beginning of a line. The regular expression ^[^T] matches any line whose first character is not "T"; empty lines are not matched because they have no first character. The negated character set inside the square brackets matches any character that is not "T".
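The treatment of empty lines can be seen by comparing the pattern with grep -v '^T', an assumed alternative that inverts the selection. The sample lines are made up.

```shell
# Hypothetical input; the third line is empty.
sample() {
    printf 'The fox\nA dog\n\ntail\n'
}

# ^[^T] needs a first character that is not "T", so the empty line is skipped.
first_not_T=$(sample | grep '^[^T]')

# grep -v '^T' keeps every line that does not start with "T",
# including the empty line.
not_starting_T=$(sample | grep -v -c '^T')

printf 'first character is not T:\n%s\n' "$first_not_T"
printf 'lines not starting with T: %s\n' "$not_starting_T"
```

The match is case-sensitive, so "tail" is selected by both commands.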