Python Regular Expression Tutorial: Common Text Processing Techniques

Published: 2019-11-05 19:53:39  Category: Optimization  Source: 数据大视界

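Overview: Regular expressions are used to identify whether a pattern exists in a given sequence of characters (a string). They help in manipulating text data, which is often a prerequisite for data science projects that involve text mining. You have almost certainly come across applications of regular expressions already: on the server side, they are used to validate email addresses during registration.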

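If zero or more characters at the beginning of the string match the pattern, match() returns the corresponding match object. If the string does not match the given pattern, it returns None.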

import re

pattern = "C"
sequence1 = "IceCream"
# No match since "C" is not at the start of "IceCream"; the call returns None
re.match(pattern, sequence1)
sequence2 = "Cake"
# "Cake" starts with "C", so a match object is returned
re.match(pattern, sequence2).group()
'C'

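search() versus match()

By default, the match() function checks for a match only at the beginning of the string, whereas the search() function checks for a match anywhere in the string.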
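
To make the difference concrete, here is a minimal sketch that runs both functions on the "IceCream" string used earlier. The variable names at_start and anywhere are illustrative only.

```python
import re

text = "IceCream"

# match() only tries the pattern at position 0 of the string
at_start = re.match(r"C", text)

# search() scans forward and stops at the first position that matches
anywhere = re.search(r"C", text)

print(at_start)          # None
print(anywhere.group())  # C
print(anywhere.start())  # 3
```

Because match() can return None, it is worth checking the result before calling .group() on it.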

findall(pattern, string, flags=0)

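Finds all non-overlapping matches of the pattern in the whole sequence and returns them as a list of strings. Each returned string represents one match.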

import re

email_address = "Please contact us at: support@datacamp.com, xyz@datacamp.com"
# 'addresses' is a list that stores all the matches
addresses = re.findall(r'[\w.-]+@[\w.-]+', email_address)
for address in addresses:
    print(address)
support@datacamp.com
xyz@datacamp.com
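
One detail worth knowing: when the pattern contains capture groups, findall() returns the groups rather than the whole match. The sketch below reuses the same sentence and wraps the user name and the domain in parentheses. The variable name parts is illustrative.

```python
import re

email_address = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

# With two capture groups, each list element is a (user, domain) tuple
parts = re.findall(r"([\w.-]+)@([\w.-]+)", email_address)

for user, domain in parts:
    print(user, domain)
# support datacamp.com
# xyz datacamp.com
```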

sub(pattern, repl, string, count=0, flags=0)

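This is the substitute function. It returns the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string with the replacement repl. If the pattern is not found, the string is returned unchanged.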

import re

email_address = "Please contact us at: xyz@datacamp.com"
# Replace any email address in the sentence with a fixed one
new_email_address = re.sub(r'([\w.-]+)@([\w.-]+)', r'support@datacamp.com', email_address)
print(new_email_address)
Please contact us at: support@datacamp.com
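
The signature of sub() also has a count parameter, and repl may refer back to capture groups. The following is a small sketch of both features; the sample sentence, the example.org domain, and the "<hidden>" placeholder are made up for illustration.

```python
import re

text = "Contact support@datacamp.com or xyz@datacamp.com"

# \1 in the replacement refers to the first capture group (the user name),
# so only the domain part changes
moved = re.sub(r"([\w.-]+)@([\w.-]+)", r"\1@example.org", text)
print(moved)       # Contact support@example.org or xyz@example.org

# count=1 replaces only the leftmost match and leaves the rest untouched
first_only = re.sub(r"[\w.-]+@[\w.-]+", "<hidden>", text, count=1)
print(first_only)  # Contact <hidden> or xyz@datacamp.com
```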

compile(pattern, flags=0)

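Compiles a regular expression pattern into a regular expression object. When you need to use an expression several times in a single program, saving the object returned by compile() and reusing it is more efficient. Note that the re module also caches the compiled versions of the most recent patterns passed to compile() and to the module-level matching functions, so programs that use only a few regular expressions at a time gain little from compiling explicitly.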

import re

pattern = re.compile(r"cookie")
sequence = "Cake and cookie"
# Call search() as a method of the compiled pattern object
pattern.search(sequence).group()
'cookie'
# This is equivalent to the module-level call:
re.search(r"cookie", sequence).group()
'cookie'

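Tip: you can modify the behavior of an expression by specifying a flags value. The functions shown in this tutorial all accept flags as an extra argument. Some commonly used flags are IGNORECASE, DOTALL, MULTILINE, and VERBOSE.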
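
As a rough illustration of how flags change matching, the sketch below applies three of them to a short multi-line string. The sample text and variable names are invented for this example.

```python
import re

text = "Cake\ncookie\nCOOKIE"

# IGNORECASE: letters match regardless of case
any_case = re.findall(r"cookie", text, flags=re.IGNORECASE)
print(any_case)     # ['cookie', 'COOKIE']

# MULTILINE: ^ matches at the start of every line, not only the string start.
# Flags are combined with the | operator.
line_starts = re.findall(r"^c\w+", text, flags=re.IGNORECASE | re.MULTILINE)
print(line_starts)  # ['Cake', 'cookie', 'COOKIE']

# DOTALL: "." also matches a newline, which it does not do by default
without_dotall = re.search(r"Cake.cookie", text)
with_dotall = re.search(r"Cake.cookie", text, flags=re.DOTALL)
print(without_dotall)            # None
print(with_dotall is not None)   # True
```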

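Case study: working with regular expressions

Now that the examples above have shown how regular expressions work in Python, it is time to get your hands dirty. In this case study, you will put that knowledge to use.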

import re
import requests

the_idiot_url = 'https://www.gutenberg.org/files/2638/2638-0.txt'

def get_book(url):
    # Sends an HTTP request to get the text from Project Gutenberg
    raw = requests.get(url).text
    # Discards the metadata from the beginning of the book.
    # re.search() returns None if the header wording differs, so .end() would fail.
    start = re.search(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK .*\*\*\*", raw).end()
    # Discards everything from the first occurrence of "II" onward
    stop = re.search(r"II", raw).start()
    # Keeps the relevant text
    text = raw[start:stop]
    return text

def preprocess(sentence):
    # Replaces every run of characters other than letters, digits, and "." with a space
    return re.sub('[^A-Za-z0-9.]+', ' ', sentence).lower()

book = get_book(the_idiot_url)
processed_book = preprocess(book)
print(processed_book)

Count how many times the word "the" occurs in the corpus. Hint: use the len() function.

len(re.findall(r'the', processed_book)) 
302
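
Note that the bare pattern r'the' also matches inside longer words such as "other" or "there". If only the standalone word should be counted, a word boundary \b on each side is one option. The sketch below uses a made-up sample sentence rather than the book text, so its counts say nothing about the figure above.

```python
import re

sample = "the other day they met the author there"

# Matches every occurrence of the three letters, even inside other words
loose = re.findall(r"the", sample)

# \b anchors the pattern to word boundaries, so only the whole word matches
strict = re.findall(r"\bthe\b", sample)

print(len(loose))   # 5
print(len(strict))  # 2
```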

(Editor: 江门站长网)
