[Repost] Regex For Noobs (like me!) - An Illustrated Guide

https://www.janmeppe.com/blog/regex-for-noobs/

This blog post is an illustrated guide to regex. It aims to provide a gentle introduction for people who have never fiddled with regex, want to, but are kind of intimidated by the whole thing.

In other words, welcome to Regex for Noobs!

For most people without a formal CS education, regular expressions (regex) can come off as something that only the most hardcore Unix programmers would dare to touch.

A good regex can seem like magic, but remember this: any sufficiently advanced technology is indistinguishable from magic. So, let’s peel away the magic of regex and see what’s underneath!

If you understand regex it suddenly becomes this super fast and powerful tool … but you first need to understand it, and honestly I find it a bit intimidating for newcomers!

Let’s start with the basics. What are regular expressions (regex) and what are they used for?

Regex for noobs

At its core, a regular expression is a sequence of characters defining a search pattern.

Regex is often used in tools like grep to find patterns in longer strings of text.

Consider a file cat.txt with the following contents:

cat
cat2
dog

If we search this file with the regex cat, we find the following matches:

cat
cat2
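
To try this yourself, here is a minimal sketch that recreates cat.txt and searches it with grep. The temporary directory and the use of printf to write the file are my own setup choices, not something the original post prescribes.

```shell
# Work in a throwaway directory so nothing in the current one is touched.
workdir=$(mktemp -d)

# Recreate the three-line file from the example above.
printf 'cat\ncat2\ndog\n' > "$workdir/cat.txt"

# Print every line that contains the pattern cat.
matches=$(grep 'cat' "$workdir/cat.txt")
echo "$matches"

# Clean up the throwaway directory.
rm -r "$workdir"
```

This prints cat and cat2, each on its own line. The dog line is left out because it does not contain the pattern.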

(Note for the hardcore power users: in this post I’m going to conflate regex itself with the tools that use it, such as grep. This is technically wrong, and I am aware of it.)

Regex works on characters, not words

One important thing that cannot be emphasized enough is this: regex works on characters, not words. Concatenation is implied.

If we search using regex for the pattern cat, we are not looking for the word “cat”, but we are looking for c followed by a followed by t.
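
A quick way to see this is to search text where the letters c, a, t sit inside longer words. The sample words below are made up for illustration.

```shell
# Only the first line is the word "cat" itself, yet the pattern also matches
# wherever c, a, t appear in a row inside a longer word.
matches=$(printf 'cat\nconcatenate\neducation\ncar\n' | grep 'cat')
echo "$matches"
```

The lines cat, concatenate, and education are printed. The line car is not, because it has no c followed by a followed by t.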

The dot and the asterisk

The most basic characters are the single characters, like a, b, c, etc. Now let’s introduce two special guests.

The . (dot) character matches any single character. For example, c.t matches cat, c0t, and cAt: a single c, followed by any one character, followed by a single t.

The * (asterisk) character is a bit more difficult. It modifies the character preceding it so that the pattern matches zero or more repetitions of that character. Yes, read that again: zero or more. For example, cat* would match cat, catt, and cattttt, but also ca.
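
Both special characters can be tried out with grep. This is a small sketch; the sample lines are invented for the demonstration.

```shell
# The dot stands for exactly one character, so "ct" (nothing in between)
# and "coat" (two characters in between) are not matched by c.t
dot_matches=$(printf 'cat\nc0t\ncAt\nct\ncoat\n' | grep 'c.t')
echo "$dot_matches"

# t* means "zero or more t", so the pattern cat* only requires "ca".
star_matches=$(printf 'ca\ncat\ncattttt\nct\n' | grep 'cat*')
echo "$star_matches"
```

The first search prints cat, c0t, and cAt. The second prints ca, cat, and cattttt, but not ct, since the a is not optional.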

The cat ate my homework

Imagine we read in a file line by line and the first line is as follows.

The cat ate my homework.

Let’s look at how we would match the pattern cat in this line.

We start with matching the first character of the pattern to the first character in the sentence.

If we don’t find a match we skip to the next character in the line and start from the first character of the pattern.

If we do find a match we go to the next character in both the pattern and the line and repeat this process. When we find a match for the whole pattern we return the line in which we find a match.
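
The walkthrough above can be sketched as a small shell function. It handles literal patterns only (no special characters). It illustrates the idea and is not how grep is actually implemented. The names char_at and find_literal are hypothetical helpers made up for this sketch.

```shell
# char_at STRING N: print the N-th character of STRING (1-based).
char_at() {
  printf '%s' "$1" | cut -c"$2"
}

# find_literal LINE PATTERN: print LINE if PATTERN occurs somewhere in it.
find_literal() {
  line=$1
  pattern=$2
  start=1
  # Try every position in the line where the whole pattern could still fit.
  while [ $((start + ${#pattern} - 1)) -le "${#line}" ]; do
    offset=0
    # Walk through pattern and line together while the characters agree.
    while [ "$offset" -lt "${#pattern}" ] &&
          [ "$(char_at "$line" $((start + offset)))" = "$(char_at "$pattern" $((offset + 1)))" ]; do
      offset=$((offset + 1))
    done
    if [ "$offset" -eq "${#pattern}" ]; then
      # Every pattern character matched: report the line.
      printf '%s\n' "$line"
      return 0
    fi
    # Mismatch: move one character further along the line and start over.
    start=$((start + 1))
  done
  return 1
}

result=$(find_literal 'The cat ate my homework.' 'cat')
echo "$result"
```

Here the first four starting positions (T, h, e, and the space) fail on the first pattern character. The fifth position matches c, a, t in turn, so the whole line is printed.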

That’s it! That’s what regex is most often used for at its most basic level, to find a smaller search pattern in a larger string.

So far we’ve gone over what regex is and two of the special characters, the . (dot) and the * (asterisk), but wait, there’s more.

The regex trifecta

Zooming out a little bit, a regex can be built from three kinds of components:

  1. Anchors
  2. Character sets
  3. Modifiers

These three make up the … regex trifecta!

Let’s start with the first part of the trifecta: anchors!

Anchors

Anchors specify the position of the pattern with respect to the line. These are the two most important anchors:

  • The ^ (caret) fixes your pattern to the beginning of the line. For example, the pattern ^1 matches any line starting with a 1.
  • The $ (dollar) fixes your pattern to the end of the line. For example, 9$ matches any line ending with a 9.

Note that in both cases the anchor has to sit in the right place: ^ must come first in the pattern and $ must come last. In grep’s default (basic) syntax, ^1 matches a 1 at the start of a line but 1^ matches a 1 followed by a literal ^. Similarly, 1$ matches lines ending with a 1 but $1 matches a dollar sign followed by a 1 anywhere on the line.
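
The two anchors can be checked with a few made-up lines of digits. This is a minimal sketch, and the sample data is invented for the demonstration.

```shell
# Four sample lines: only some begin with 1 or end with 9.
sample=$(printf '123\n219\n901\n19\n')

# ^1 keeps lines whose first character is 1.
starts_with_1=$(printf '%s\n' "$sample" | grep '^1')
echo "$starts_with_1"

# 9$ keeps lines whose last character is 9.
ends_with_9=$(printf '%s\n' "$sample" | grep '9$')
echo "$ends_with_9"

# Using both anchors pins the pattern to the whole line.
exactly_19=$(printf '%s\n' "$sample" | grep '^19$')
echo "$exactly_19"
```

The first search prints 123 and 19, the second prints 219 and 19, and the third prints only 19.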

On to the second part of the trifecta: character sets!

Character sets

Character sets are the bread and butter of regex. A single character, say a, is the most atomic character set (a set of one element). But we can do more interesting things with regex, like [0-9], which matches any single digit. And if you recall what * does, we can build the pattern [0-9][0-9]* (what this pattern matches is left as an exercise to the reader).

Some other important character sets:

  • [0-9] matches any single digit from 0 to 9
  • [a-z] matches any single lowercase letter
  • [A-Z] matches any single uppercase letter

We can also combine multiple sets:

  • [A-Za-z0-9] matches any single uppercase letter, lowercase letter, or digit.
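
Here is a small sketch of these sets in action with grep. The sample lines are made up. Setting LC_ALL=C is my own precaution: it makes ranges such as A-Z follow plain ASCII order, since in some locales a range can behave differently.

```shell
# Four sample lines: lowercase only, mixed case, digits only, punctuation only.
sample=$(printf 'apple\nBanana\n42\n...\n')

# [0-9] keeps lines containing at least one digit.
with_digit=$(printf '%s\n' "$sample" | LC_ALL=C grep '[0-9]')
echo "$with_digit"

# [A-Z] keeps lines containing at least one uppercase letter.
with_upper=$(printf '%s\n' "$sample" | LC_ALL=C grep '[A-Z]')
echo "$with_upper"

# [A-Za-z0-9] keeps lines containing any letter or digit, which drops "...".
with_alnum=$(printf '%s\n' "$sample" | LC_ALL=C grep '[A-Za-z0-9]')
echo "$with_alnum"
```

Note that a set still matches just one character. A line is printed as soon as any single character in it belongs to the set.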

Finally, modifiers.

Modifiers

I don’t want to go into too much depth here, but we already came across a modifier! The * (asterisk) is a modifier. A modifier changes the meaning of the character preceding it. There are many other modifiers, but * is a good place to start.
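
Because * modifies whatever single element comes before it, it can also follow the . (dot) from earlier. A small sketch with made-up sample lines:

```shell
# .* means "zero or more of any character", so c.*t accepts a c and a later t
# with anything (or nothing) in between.
matches=$(printf 'ct\ncat\ncoconut\ntic\n' | grep 'c.*t')
echo "$matches"
```

This prints ct, cat, and coconut. The line tic is skipped because its only c has no t after it.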

An actual example

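Let’s quickly dump some text in a file. (printf is used here because not every shell’s echo turns \n into a line break, and > is used so that rerunning the command does not append duplicates.)

$ printf 'The cat jumps long time\nThen we also have the fact that these are words.\n1234 this is a test post please ignore.\n' > grep.txt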

This is what’s in the file now:

$ cat grep.txt
The cat jumps long time
Then we also have the fact that these are words.
1234 this is a test post please ignore.

Let’s look for cat:

$ grep "cat" grep.txt
The cat jumps long time

Let’s look for any line starting with a digit, using the pattern ^[0-9]:

$ grep "^[0-9]" grep.txt
1234 this is a test post please ignore.
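
The same file also shows the characters-not-words point combined with an anchor. The sketch below recreates the file in a temporary directory (my own setup, using printf) and searches for lines that start with T, h, e.

```shell
# Recreate the example file in a throwaway directory.
workdir=$(mktemp -d)
printf '%s\n' \
  'The cat jumps long time' \
  'Then we also have the fact that these are words.' \
  '1234 this is a test post please ignore.' > "$workdir/grep.txt"

# ^The matches "The cat ..." and also "Then we ...", because the pattern
# is the characters T, h, e at the start of a line, not the word "The".
matches=$(grep '^The' "$workdir/grep.txt")
echo "$matches"

# Clean up the throwaway directory.
rm -r "$workdir"
```

Both of the first two lines are printed. The third starts with a digit, so it is not.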

That’s it! You just used regular expressions. Awesome.

Summary

In this blog post we went over:

  • Basic functionality of regex
  • The three main components of regex: anchors, character sets, and modifiers.
  • The . (dot), * (asterisk), ^ (caret), and $ (dollar sign).
  • Some character sets [0-9], [a-z], [A-Z], and combinations.

The goal of this blog post was to make regex a bit more approachable by means of an illustrated introduction.

If you peel away the technical difficulties, what you end up with is a relatively simple but super powerful tool that will prove invaluable in any data scientist’s toolbelt.
