English Word Counting Tool V1.0
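I had originally planned to dig into web crawlers properly, but while reading English this morning it occurred to me: what if I wrote a simple program that collects the unfamiliar words I run into while reading onto a single sheet, so that I can memorize them? Wouldn't that be great? It also happens to be good practice for the regular-expression skills that writing crawlers requires.
Regular expressions are something I have wanted to learn for a long time anyway, but I never found the opportunity.
OK, let's get started. Today we will first write a fairly simple program that counts English words, and then modify it to see whether we can do finer-grained processing and make the input handling more robust.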
First, we need a txt file containing English text; then we run the program on it.
As usual for a small program: code first, explanation after.
# -*- coding: utf-8 -*-
# Usage: save the text with ANSI encoding
# as input.txt in the root of drive C:, which keeps the paths simple
import re

# Output file
output_file = open("C:\\result.txt", "w")
# Input text file
input_file = open("C:\\input.txt", "r")
strs = input_file.read()
# Use a regular expression to extract the words, all converted to lowercase
s = re.findall(r"\w+", strs.lower())
# findall returns a list
# Remove duplicates from the list and sort it
l = sorted(set(s))
for i in l:
    m = re.search(r"\d+", i)
    n = re.search(r"\W+", i)
    # Keep a word only if it contains no digits, contains no characters
    # other than letters/digits, and is longer than 4 characters
    if not m and not n and len(i) > 4:
        output_file.write(i + " : " + str(s.count(i)) + "\n")
input_file.close()
output_file.close()
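As a side note, the same counting idea can be sketched more compactly with `collections.Counter`. This is only a minimal sketch and not the program above: the function name `count_words`, the `min_length` parameter, and the in-memory `sample` string are assumptions made for illustration. It also matches `[a-z]+` instead of `\w+`, so a token such as `hello123` contributes `hello` rather than being dropped entirely, which differs slightly from the original filter.

```python
import re
from collections import Counter


def count_words(text, min_length=5):
    """Return {word: count} for alphabetic words with at least min_length letters.

    Hypothetical helper for illustration; min_length=5 mirrors len(i) > 4.
    """
    # [a-z]+ keeps only runs of ASCII letters, so digits and underscores
    # never end up inside a token and no second filtering pass is needed.
    words = re.findall(r"[a-z]+", text.lower())
    # Counter tallies every word in a single pass over the list.
    counts = Counter(words)
    # Sort alphabetically and drop the short words.
    return {w: c for w, c in sorted(counts.items()) if len(w) >= min_length}


# Illustrative input only; the real program reads C:\input.txt instead.
sample = "Characters can be listed individually, or a range of characters can be indicated."
for word, count in count_words(sample).items():
    print(word + " : " + str(count))
```

Counting with `Counter` walks the word list once, whereas calling `s.count(i)` for every distinct word rescans the whole list each time; for a short article the difference hardly matters, but it adds up on longer texts.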
Now copy the following text and save it in the root of drive C: under the file name input.txt
first metacharacters we’ll look at are [ and ]. They’re used for specifying a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'. For example, [abc] will match any of the characters a, b, or c; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your RE would be [a-z].
Metacharacters are not active inside classes. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; '$' is usually a metacharacter, but inside a character class it’s stripped of its special nature.
You can match the characters not listed within the class by complementing the set. This is indicated by including a '^' as the first character of the class; '^' outside a character class will simply match the '^' character. For example, [^5] will match any character except '5'.
Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
Some of the special sequences beginning with '\' represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn’t whitespace.
Run the program
The results are as follows