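9. Regular Expressions – 堂-DayDayUP

Regular Expressions

What a regular expression is: a rule expression, that is, a pattern written in a special syntax that describes a set of strings.
What regular expressions are used for:
- Validating data (searching for text that fits a pattern)
- Replacing text
- Extracting substrings from a string (the idea behind web scrapers)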

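Matching a single character

. matches any single character (except \n)
[] a character set: matches any one of the characters listed inside []
- [ab] matches a or b
- [a-z] matches any lowercase letter
- [A-Z] matches any uppercase letter
- [0-9] matches any digit
- [a-zA-Z] matches any lowercase or uppercase letter
\d matches any digit; equivalent to [0-9] for ASCII text
\D matches any non-digit
\s matches whitespace (space, \t, \n, and so on)
\S matches non-whitespace
\w matches letters, digits, and the underscore
- equivalent to [a-zA-Z0-9_] for ASCII text; with Python 3 str patterns it also matches other Unicode word characters, such as Chinese characters, unless re.ASCII is set
\W matches anything that is not a letter, digit, or underscore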
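
To make the character classes above concrete, here is a small sketch that runs each of them through re.findall. The sample strings are made up for illustration.

```python
import re

sample = "Tom_42 ok\n"

any_char = re.findall(r".", "a\nb")       # . skips the newline
lowercase = re.findall(r"[a-z]", sample)
uppercase = re.findall(r"[A-Z]", sample)
digits = re.findall(r"\d", sample)
spaces = re.findall(r"\s", sample)
word_chars = re.findall(r"\w", sample)
non_word = re.findall(r"\W", sample)

print(any_char)    # ['a', 'b']
print(lowercase)   # ['o', 'm', 'o', 'k']
print(uppercase)   # ['T']
print(digits)      # ['4', '2']
print(spaces)      # [' ', '\n']
print(word_chars)  # ['T', 'o', 'm', '_', '4', '2', 'o', 'k']
print(non_word)    # [' ', '\n']

# In Python 3, \w on a str pattern is Unicode-aware unless re.ASCII is passed
unicode_words = re.findall(r"\w", "中文")
ascii_words = re.findall(r"\w", "中文", re.ASCII)
print(unicode_words, ascii_words)  # ['中', '文'] []
```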

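Matching multiple characters

* the preceding element occurs 0 or more times
+ the preceding element occurs 1 or more times
? the preceding element occurs 0 times or 1 time (it is optional)
{m} the preceding element occurs exactly m times in a row
{m,n} the preceding element occurs at least m and at most n times in a row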
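
A quick way to see what each quantifier allows is to test whole strings against it. The sketch below uses re.fullmatch so that the entire string has to fit the pattern; the helper name fits is only for this example.

```python
import re

def fits(pattern, text):
    """Return True if the whole text is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None

print(fits(r"ab*", "a"))          # True: b occurs 0 times
print(fits(r"ab*", "abbb"))       # True: b occurs 3 times
print(fits(r"ab+", "a"))          # False: b must occur at least once
print(fits(r"ab?", "ab"))         # True: b occurs once
print(fits(r"ab?", "abb"))        # False: b may occur at most once
print(fits(r"\d{3}", "123"))      # True: exactly 3 digits
print(fits(r"\d{3}", "1234"))     # False: 4 digits is too many
print(fits(r"\d{2,4}", "12"))     # True: between 2 and 4 digits
print(fits(r"\d{2,4}", "12345"))  # False: 5 digits is too many
```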

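Matching the start and end

^ matches at the start of the string, so the pattern that follows it must begin there
- ^ has two uses. The first is anchoring a pattern to the start:
- ^[a-zA-Z_]+\w # must start with a lowercase letter, an uppercase letter, or an underscore
- The second is negation inside []: [^he] matches any single character that is neither h nor e
$ matches at the end of the string, so the pattern before it must end there: \d$ means the string ends with a digit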
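
The two roles of ^ and the end anchor $ can be tried out as follows. This is a minimal sketch: the helper names and the identifier pattern ^[a-zA-Z_]\w*$ are assumptions made for the example, a slight variant of the pattern in the notes that also accepts one-character names.

```python
import re

def is_identifier(text):
    """True if text starts with a letter or underscore and has only word characters."""
    return re.match(r"^[a-zA-Z_]\w*$", text) is not None

def ends_with_digit(text):
    """True if the last character of text is a digit."""
    return re.search(r"\d$", text) is not None

print(is_identifier("_tmp1"))     # True
print(is_identifier("9lives"))    # False: starts with a digit
print(ends_with_digit("room42"))  # True
print(ends_with_digit("42room"))  # False

# Inside [], ^ negates the set: keep every character that is neither h nor e
not_h_or_e = re.findall(r"[^he]", "hello")
print(not_h_or_e)                 # ['l', 'l', 'o']
```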

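The re module

What the re module is for: it is the module Python provides for working with regular expressions.

Steps for using the re module:

Import the module: import re
Use the match() method to test a string

Check whether the match succeeded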

result = re.match(r"\w{4,20}@163\.com", "hello@163.com")
if result:
    print("Match succeeded")
else:
    print("Match failed")

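Get the details of the match

result.group()    the matched text: hello@163.com
result.start()    index where the match starts: 0
result.end()      index just past the end of the match: 13
result.span()     (start, end) as a tuple: (0, 13)

The r prefix before a pattern makes it a raw string: Python leaves every \ in the literal as it is, so the backslash reaches the regex engine unchanged. The prefix affects only \.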
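
The effect of the r prefix is easiest to see by comparing string lengths. The sketch below uses \b as an illustration because Python and the regex engine read it differently; \b (word boundary) is not covered in these notes and appears here only to show the difference.

```python
import re

# "\\d" and r"\d" are the same two-character string: a backslash followed by d
same_pattern = "\\d" == r"\d"
print(same_pattern)              # True

# Without r, Python turns "\b" into one backspace character.
# With r, the backslash and the b both survive and reach the regex engine.
plain_len = len("\b")
raw_len = len(r"\b")
print(plain_len, raw_len)        # 1 2

# The regex engine reads \b as a word boundary, so only the raw pattern finds whole words
raw_hits = re.findall(r"\bcat\b", "cat concat cat")
plain_hits = re.findall("\bcat\b", "cat concat cat")
print(raw_hits)                  # ['cat', 'cat']
print(plain_hits)                # []
```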

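import re

s = '11python itheima python itheima python itheima'

# Test a pattern with match(): re.match("pattern", "string to test")
# match() returns a match object on success and None on failure
# The r prefix stops Python from treating \ in the pattern as an escape character,
# which avoids the warning "SyntaxWarning: invalid escape sequence"
result1 = re.match(r"^\d.*", s)

# \. matches a literal dot; an unescaped . would match any character
result = re.match(r"\w{4,20}@163\.com", "hello@163.com")

if result:
    print("Match succeeded", result.group())
else:
    print("Match failed")

Match succeeded hello@163.com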

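Grouping with "|"

What "|" does: alternation (an "or" relationship). The string matches if any one of the alternative patterns matches.

^[0-9]?[0-9]$|^100$   # matches 0-100: either ^[0-9]?[0-9]$ or ^100$ may match, and whichever does produces a single match result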

import re

result = re.match(r"^[0-9]?[0-9]$|^100$", "99")

if result:
    print("Match succeeded", result.group())
else:
    print("Match failed")

Match succeeded 99
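
To check where the 0-100 pattern draws its boundaries, it helps to run it over a few edge cases. The helper name is_0_to_100 is made up for this sketch.

```python
import re

def is_0_to_100(text):
    """True if text is accepted by the 0-100 pattern."""
    return re.match(r"^[0-9]?[0-9]$|^100$", text) is not None

for value in ["0", "7", "99", "100", "101", "1000", "07"]:
    print(value, is_0_to_100(value))
```

The first alternative handles one-digit and two-digit strings and the second handles exactly 100, so "101" and "1000" are rejected. Note that the pattern also accepts strings with a leading zero such as "07"; whether that is acceptable depends on the data being validated.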

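Grouping with "()"

Grouping: treat part of the pattern as one unit
- result = re.match(r"\w{4,20}@(163|126|qq|sina)\.com$", "hello@126.com")
- The alternatives between @ and ".com" are matched as a single unit

Extracting substrings

result = re.match(r"(\d{3,4})-(\d{7,8})", "010-12345678")
result.group()   returns the whole match
result.group(1)  returns the text matched by the first pair of parentheses
result.group(2)  returns the text matched by the second pair of parentheses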

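import re

result = re.match(r"(\w{4,20})@(163|126|qq|sina)\.com$", "hello@126.com")

if result:
    print("Match succeeded", result.group())
    print("Text from the first pair of parentheses:", result.group(1))
    print("Text from the second pair of parentheses:", result.group(2))
else:
    print("Match failed")

result = re.match(r"(\d{3,4})-(\d{7,8})", "010-12345678")

if result:
    print("Match succeeded", result.group())
    print("Text from the first pair of parentheses:", result.group(1))
    print("Text from the second pair of parentheses:", result.group(2))
else:
    print("Match failed")

Match succeeded hello@126.com
Text from the first pair of parentheses: hello
Text from the second pair of parentheses: 126
Match succeeded 010-12345678
Text from the first pair of parentheses: 010
Text from the second pair of parentheses: 12345678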
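
When every group is needed at once, the match object's groups() method returns them as a tuple, which can be simpler than calling group(1), group(2), and so on. A minimal sketch, where the helper name split_phone and the added $ anchor are choices made for this example:

```python
import re

def split_phone(text):
    """Return (area_code, number), or None if text is not in area-number form."""
    result = re.match(r"(\d{3,4})-(\d{7,8})$", text)
    if result is None:
        return None
    # groups() collects the text of every pair of parentheses, in order
    return result.groups()

print(split_phone("010-12345678"))  # ('010', '12345678')
print(split_phone("12345678"))      # None

# The tuple can be unpacked straight into variables
area_code, number = split_phone("0755-1234567")
print(area_code, number)            # 0755 1234567
```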

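Grouping with "\" and named groups

Backreferences

\1    refers back to group 1
result = re.match("<([a-zA-Z0-9]+)>.*</\\1>", "<html>asdbasldfj</html>")
result2 = re.match("<([a-zA-Z0-9]+)><([a-zA-Z0-9]+)>.*</\\2></\\1>", "<html><h1>asdbj</h1></html>")
\\1 refers back to group 1; in a non-raw string \\ is an escape sequence that stands for a single \
\\2 refers back to group 2

Naming groups
- Define a name: (?P<name>...)
- Refer to a named group: (?P=name)
- Full example:
```
result = re.match("<(?P<name1>[a-zA-Z0-9]+)><(?P<name2>[a-zA-Z0-9]+)>.*</(?P=name2)></(?P=name1)>", "<html><h1>asdbj</h1></html>")
```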

import re

# With the r prefix, \ is not treated as an escape, so \1 reaches the regex engine as a reference to group 1
result = re.match(r"<([a-zA-Z0-9]+)>.*</\1>", "<html>asdbasldfj</html>")
# Same pattern without r: a second \ is needed so that the string contains a literal \
result = re.match("<([a-zA-Z0-9]+)>.*</\\1>", "<html>asdbasldfj</html>")
if result:
    print("Match succeeded", result.group())
else:
    print("Match failed")

result2 = re.match("<([a-zA-Z0-9]+)><([a-zA-Z0-9]+)>.*</\\2></\\1>", "<html><h1>asdbj</h1></html>")
if result2:
    print("Match succeeded", result2.group())
else:
    print("Match failed")

result3 = re.match("<(?P<name1>[a-zA-Z0-9]+)><(?P<name2>[a-zA-Z0-9]+)>.*</(?P=name2)></(?P=name1)>", "<html><h1>asdbj</h1></html>")
if result3:
    print("Match succeeded", result3.group())
    print("Match succeeded", result3.group(1), result3.group(2))
else:
    print("Match failed")

Match succeeded <html>asdbasldfj</html>
Match succeeded <html><h1>asdbj</h1></html>
Match succeeded <html><h1>asdbj</h1></html>
Match succeeded html h1
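
Named groups can also be read back by name instead of by number, which keeps the code readable as a pattern grows. A small sketch; the group names outer, inner, and text are chosen for this example only.

```python
import re

pattern = (r"<(?P<outer>[a-zA-Z0-9]+)><(?P<inner>[a-zA-Z0-9]+)>"
           r"(?P<text>.*)</(?P=inner)></(?P=outer)>")

result = re.match(pattern, "<html><h1>asdbj</h1></html>")
print(result.group("outer"))  # html
print(result.group("inner"))  # h1
print(result.groupdict())     # {'outer': 'html', 'inner': 'h1', 'text': 'asdbj'}

# The closing tags must repeat the captured names, so a mismatched tag fails
mismatched = re.match(pattern, "<html><h1>asdbj</h2></html>")
print(mismatched)             # None
```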

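Advanced use of the re module

search() scans the string for the first place where the pattern matches
- Difference between match and search
- match tries to match only at the beginning of the string and returns None if that fails
- search looks through the whole string for text that fits the pattern and returns a match object if it finds any, otherwise None
findall() finds every non-overlapping match of the pattern in the string and returns them as a list
sub("pattern", "replacement", "string to modify") replaces text: it finds what the pattern matches and substitutes the replacement; the return value is the new string
split("pattern", "string to split") splits the string wherever the pattern matches and returns a list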

import re
s = 'Views:9999'
result = re.search(r"\d+", s)
if result:
    print("Match succeeded", result.group())
else:
    print("Match failed")

result = re.findall(r"\d+", "Views:9999,Shares:6666,Comments:38")
print(type(result), result)

result = re.sub(r"\d+", "10000", "Views:9999,Shares:6666,Comments:38")
print(type(result), result)

result = re.split(r":| |@|\.", "info:hello@163.com zhangsan lisi")
print(type(result), result)

Match succeeded 9999
<class 'list'> ['9999', '6666', '38']
<class 'str'> Views:10000,Shares:10000,Comments:10000
<class 'list'> ['info', 'hello', '163', 'com', 'zhangsan', 'lisi']
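
A few details of these functions are easy to miss: match() fails when the pattern fits only later in the string, findall() returns tuples when the pattern has more than one group, and sub() takes an optional count argument. The sketch below illustrates all three on a made-up sample string.

```python
import re

text = "Views:9999,Shares:6666,Comments:38"

# match() only looks at the start of the string; search() scans all of it
at_start = re.match(r"\d+", text)
anywhere = re.search(r"\d+", text)
print(at_start)          # None
print(anywhere.group())  # 9999

# With two groups in the pattern, findall() returns a list of tuples
pairs = re.findall(r"(\w+):(\d+)", text)
counts = {name: int(number) for name, number in pairs}
print(counts)            # {'Views': 9999, 'Shares': 6666, 'Comments': 38}

# count=1 limits sub() to the first match
first_only = re.sub(r"\d+", "0", text, count=1)
print(first_only)        # Views:0,Shares:6666,Comments:38
```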

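Greedy and non-greedy matching

Greedy: the default. A quantifier takes as much text as it can while the overall pattern still matches.
Non-greedy: a quantifier takes as little text as it can while the overall pattern still matches.
To make a greedy quantifier non-greedy, add a ? right after *, ?, +, or {}.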

import re
result = re.match(r"aaa(\d+?)", "aaa123456")
print(result.group())

aaa1
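
The difference matters most when a pattern has a delimiter that can appear more than once. The following sketch compares the two modes on a made-up HTML fragment and on the aaa example.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last > in the string
greedy = re.findall(r"<.*>", html)
# Non-greedy: .*? stops at the first > it can
lazy = re.findall(r"<.*?>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

greedy_digits = re.match(r"aaa(\d+)", "aaa123456").group(1)
lazy_digits = re.match(r"aaa(\d+?)", "aaa123456").group(1)
# A non-greedy quantifier still takes more when the rest of the pattern needs it
lazy_to_end = re.match(r"aaa(\d+?)$", "aaa123456").group(1)
print(greedy_digits, lazy_digits, lazy_to_end)  # 123456 1 123456
```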

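Simple scraper - batch-fetching movie download links

I. Define a function get_movie_links() that reads the list page and follows each detail-page address
1. Define the list page address http://www.ygdy8.net/html/gndy/dyzz/list_23_1.html
2. Open the URL and fetch the data
3. Decode the fetched data
4. Use a regex to get the addresses of all movie detail pages
     4.1 Loop over the results and take each detail-page address
     4.2 Build the full detail-page URL
     4.3 Open the detail-page URL
     4.4 Fetch and read the data
     4.5 Decode the detail-page data to get the page's HTML text
     4.6 Use a regex to get the download link
     4.7 Save the movie title and download link in a dictionary
     4.8 Return the dictionary
II. Main function main
1. Call get_movie_links() to get the dictionary
2. Loop over the dictionary and display the download information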

import urllib.request
import re

film_dict = {}

def get_movie_links():
    # Collect movie information from the list page
    # Address of the list page
    film_list_url = "https://www.ygdy8.net/html/gndy/dyzz/list_23_1.html"
    # Open the URL and fetch the data
    response_list = urllib.request.urlopen(film_list_url)
    # Read the response body with read()
    response_list_data = response_list.read()
    # Decode the data as GBK
    response_list_txt = response_list_data.decode("GBK")
    # Use a regex to get every detail-page address and movie title
    url_list = re.findall(r"<a href=\"(.*)\" class=\"ulink\">\w+《(.*)》", response_list_txt)
    # Loop over the (address, title) pairs
    for content_url, film_name in url_list:
        content_url = "https://www.ygdy8.net" + content_url
        # print(f"Movie title: <<{film_name}>> -> detail page: {content_url}")
        response_content = urllib.request.urlopen(content_url)
        response_content_txt = response_content.read().decode("GBK")
        # print(response_content_txt)
        response_content_bt = re.search(r"<a target=\"_blank\" href=\"(.*)\"><strong>", response_content_txt)
        # search() returns None when the page has no download link; skip such pages
        if response_content_bt is None:
            continue
        # Save the title and magnet link in the dictionary
        film_dict[film_name] = response_content_bt.group(1)
        print(film_dict)
        # print(f"Movie title: <<{film_name}>> -> magnet link: {response_content_bt.group(1)}")
        # Save the title and magnet link to a file
        # with open("./电影下载.txt", "a", encoding="UTF-8") as f:
        #     f.write(f"Movie title: <<{film_name}>> -> magnet link: {response_content_bt.group(1)}\n")

def main():
    get_movie_links()

if __name__ == '__main__':
    main()
    print(film_dict)
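
The list-page step can be tried without touching the network by running the same kind of pattern over a small HTML sample. In this sketch the sample HTML, the titles, and the helper name parse_list_page are all invented for illustration; the pattern uses non-greedy groups, and urllib.parse.urljoin builds the full address.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.ygdy8.net"

# Hypothetical list-page fragment shaped like the links the scraper looks for
sample_list_html = """
<a href="/html/gndy/dyzz/20230101/1.html" class="ulink">Drama2023《Example One》BD</a>
<a href="/html/gndy/dyzz/20230102/2.html" class="ulink">Action2023《Example Two》HD</a>
"""

def parse_list_page(html_text):
    """Return a list of (full_url, title) pairs found in html_text."""
    found = re.findall(r'<a href="(.*?)" class="ulink">\w+《(.*?)》', html_text)
    return [(urljoin(BASE_URL, path), title) for path, title in found]

links = parse_list_page(sample_list_html)
for url, title in links:
    print(title, "->", url)
```

Because the pattern has two groups, findall() returns (path, title) tuples, which is why the loop can unpack two values per item.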

# Using findall and search
import re
s = '<a target="_blank" href="magnet:?xt=urn:btih:3f17d9e82018c39f8385092a6598020540032e36&dn=%e9%98%b3%e5%85%89%e7%94%b5%e5%bd%b1dygod.org.%e6%83%8a%e5%a5%87%e9%98%9f%e9%95%bf2.2023.BD.1080P.%e5%9b%bd%e8%8b%b1%e5%8f%8c%e8%af%ad%e4%b8%ad%e5%ad%97.mkv&tr=udp%3a%2f%2ftracker.opentrackr.org%3a1337%2fannounce&tr=udp%3a%2f%2fexodus.desync.com%3a6969%2fannounce"><strong>'

result = re.findall(r"<a target=\"_blank\" href=\"(.*)\"><strong>", s)
print(result[0])

result = re.search(r"<a target=\"_blank\" href=\"(.*)\"><strong>", s)
print(result.group(1))
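
The greedy (.*) in this pattern works when a line holds only one link, but it can swallow too much if two links share a line. The sketch below shows the difference on a made-up sample and wraps the non-greedy version in a helper that returns None instead of raising when nothing is found. The names LINK_PATTERN and extract_download_link and the shortened magnet links are assumptions for this example.

```python
import re

# Non-greedy group: stop at the first closing quote followed by ><strong>
LINK_PATTERN = re.compile(r'<a target="_blank" href="(.*?)"><strong>')

def extract_download_link(html_text):
    """Return the first download link in html_text, or None if there is none."""
    found = LINK_PATTERN.search(html_text)
    return found.group(1) if found else None

# Two links on a single line (hypothetical, shortened magnet links)
sample = ('<a target="_blank" href="magnet:?xt=urn:btih:AAA"><strong>one</strong></a>'
          '<a target="_blank" href="magnet:?xt=urn:btih:BBB"><strong>two</strong></a>')

first_link = extract_download_link(sample)
missing = extract_download_link("<p>no link here</p>")
print(first_link)  # magnet:?xt=urn:btih:AAA
print(missing)     # None

# The greedy version runs on to the last "><strong> on the line
greedy_link = re.search(r'<a target="_blank" href="(.*)"><strong>', sample).group(1)
print(greedy_link == first_link)  # False
```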

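Simple scraper - batch-fetching movie download links - multithreaded

"""
Define a function get_movie_link that collects movie download addresses
1. Set the movie list page to crawl
2. Open the list page, fetch the data, and decode it to get the page's HTML text
3. Use a regex to get every movie title on the page and its detail-page link
4. Open each detail page in a loop and get the download address
5. Save the movie title and address in a dictionary

Define the main function main, which calls get_movie_link to get the addresses
"""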
import re
import urllib.request
import threading

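class Spider(object):

    def __init__(self):
        # Dictionary that stores the collected movie information
        self.film_dict = {}
        self.i = 1
        # Lock for guarding film_dict when several threads write to it
        self.lock1 = threading.Lock()

    def start(self):

        # Start one thread per list page to collect download links
        threads = []
        for page in range(1, 2):
            t1 = threading.Thread(target=self.get_movie_link, args=(page,))
            t1.start()
            threads.append(t1)

        # Wait for every thread to finish before reading the results;
        # joining inside the loop above would run the pages one at a time
        for t in threads:
            t.join()

        # Get the (title, download URL) pairs from the dictionary
        list1 = self.film_dict.items()

        # Loop over the dictionary and print each title and download URL
        for film_name, film_download_url in list1:
            print(film_name, "|", film_download_url)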

    def get_movie_link(self, page):
        # Collect movie information from one list page
        # Address of the list page
        film_list_url = "https://www.ygdy8.net/html/gndy/dyzz/list_23_%d.html" % page
        # Open the URL and fetch the data
        response_list = urllib.request.urlopen(film_list_url)
        # Read the response body with read()
        response_list_data = response_list.read()
        # Decode the data as GBK
        response_list_txt = response_list_data.decode("GBK")
        # Use a regex to get every detail-page address and movie title
        url_list = re.findall(r"<a href=\"(.*)\" class=\"ulink\">\w+《(.*)》", response_list_txt)
        # Loop over the (address, title) pairs
        for content_url, film_name in url_list:
            content_url = "https://www.ygdy8.net" + content_url
            # print(f"Movie title: <<{film_name}>> -> detail page: {content_url}")
            response_content = urllib.request.urlopen(content_url)
            response_content_txt = response_content.read().decode("GBK")
            # print(response_content_txt)
            response_content_bt = re.search(r"<a target=\"_blank\" href=\"(.*)\"><strong>", response_content_txt)
            # search() returns None when the page has no download link; skip such pages
            if response_content_bt is None:
                continue
            # Save the title and magnet link in the shared dictionary, holding the lock while writing
            with self.lock1:
                self.film_dict[film_name] = response_content_bt.group(1)
            # print(film_dict)
            # print(f"Movie title: <<{film_name}>> -> magnet link: {response_content_bt.group(1)}")
            # Save the title and magnet link to a file
            # with open("./电影下载.txt", "a", encoding="UTF-8") as f:
            #     f.write(f"Movie title: <<{film_name}>> -> magnet link: {response_content_bt.group(1)}\n")

def main():
    film_spider = Spider()
    film_spider.start()

if __name__ == '__main__':
    main()
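
As an alternative to creating threading.Thread objects by hand, the standard library's concurrent.futures.ThreadPoolExecutor can manage the worker threads. The sketch below is not the code above rewritten; it replaces the network step with a hypothetical stand-in, fetch_page_links, so that it runs offline, and the function name collect is also invented for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def fetch_page_links(page):
    """Hypothetical stand-in for downloading and parsing one list page."""
    return {f"film-{page}-{n}": f"magnet:?xt=urn:btih:{page}{n}" for n in range(3)}

def collect(pages, max_workers=4):
    """Fetch every page in a worker thread and merge the results into one dict."""
    film_dict = {}
    lock = threading.Lock()

    def worker(page):
        links = fetch_page_links(page)
        # Hold the lock while updating the shared dictionary
        with lock:
            film_dict.update(links)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() consumes the results so that an exception raised in a worker surfaces here
        list(pool.map(worker, pages))
    return film_dict

all_links = collect(range(1, 4))
print(len(all_links))  # 9
```

Leaving the with block waits for all submitted work to finish, so no explicit join() calls are needed.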