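First, we need iQiyi's TV series ranking page, which lives at http://v.iqiyi.com/index/dianshiju/index.html

The page lists the TV series in ranked order.

The first step is to fetch the page's HTML source: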

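import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
url = 'http://v.iqiyi.com/index/dianshiju/index.html'
# Named `response` rather than `re` so the regex module `re` is not shadowed later.
response = requests.get(url, headers=headers)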

With the source in hand, we analyze its structure. Here is a small part of it, the markup for one ranking row:

<tr data-ranklist-elem="item" >
  <td class="col-title">
    <div class="rank_list">
      <i class="icon_top icon_top-3" data-ranklist-elem="rank">1</i>
      <a href="http://so.iqiyi.com/so/q_亲爱的,热爱的" class="item_name" rseat="819k1_dianshiju" target='_blank'>亲爱的,热爱的</a>
    </div>
  </td>
  <td>
    <span class="item_usrInfo">
      <a href="http://so.iqiyi.com/so/q_杨紫" rseat="819s1_dianshiju" target='_blank'>杨紫</a>/
      <a href="http://so.iqiyi.com/so/q_李现" rseat="819s1_dianshiju" target='_blank'>李现</a>/
      <a href="http://so.iqiyi.com/so/q_胡一天" rseat="819s1_dianshiju" target='_blank'>胡一天</a>
    </span>
  </td>
  <td class="col-num">
    <div class="rank_list" data-ranklist-yes="13059697">
      <span class="item_num">13,059,697</span>
      <i class="tend tend_line" title="平稳"></i>
    </div>
  </td>
  <td class="col-num">
    <div class="rank_list" data-ranklist-week="76916679">
      <span class="item_num">76,916,679</span>
      <i class="tend tend_line" title="平稳"></i>
    </div>
  </td>
  <td class="col-num">
    <div class="rank_list" data-ranklist-mon="103566713">
      <span class="item_num">103,566,713</span>
    </div>
  </td>
</tr>

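From this markup we work out a regular expression that matches one row and captures the rank, the title, three actors, the daily count and its trend, the weekly count and its trend, and the monthly count: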

<tr data-ranklist-elem="item" .*?<i class=".*?" data-ranklist-elem="rank">(.*?)</i>.*? rseat=".*?" target=.*?>(.*?)</a>.*? <a href=".*?" rseat=".*?" target=.*?>(.*?)</a>.*?<a href=".*?" rseat=".*?" target=.*?>(.*?)</a>.*?<a href=".*?" rseat=".*?" target=.*?>(.*?)</a>.*?<div class="rank_list" data-ranklist-yes=".*?">.*?<span class="item_num">(.*?)</span>.*?<i class=".*?" title="(.*?)">.*?<div class="rank_list" data-ranklist-week=".*?">.*? <span class="item_num">(.*?)</span>.*?<i class=".*?" title="(.*?)">.*?<div class=".*?" data-ranklist-mon=".*?">.*?<span class="item_num">(.*?)</span>.*?
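The pattern above relies on two things: the lazy quantifier `.*?`, which stops at the first place the rest of the pattern can match, and the `re.S` flag, which lets `.` cross line breaks. The sketch below illustrates both on a trimmed copy of the sample row. The shortened pattern is an assumption made for illustration; it captures only the rank and the title and is not the full pattern.

```python
import re

# A trimmed copy of one ranking row from the sample above.
row = '''<tr data-ranklist-elem="item" >
  <td class="col-title">
    <div class="rank_list">
      <i class="icon_top icon_top-3" data-ranklist-elem="rank">1</i>
      <a href="http://so.iqiyi.com/so/q_亲爱的,热爱的" class="item_name" rseat="819k1_dianshiju" target='_blank'>亲爱的,热爱的</a>
    </div>
  </td>
</tr>'''

# Hypothetical shortened pattern: rank, then title.
short_pattern = r'data-ranklist-elem="rank">(.*?)</i>.*?class="item_name".*?>(.*?)</a>'

# With re.S, '.' also matches newlines, so the match can span several lines.
with_dotall = re.findall(short_pattern, row, re.S)

# Without re.S, '.' stops at the line break between </i> and <a ...>.
without_dotall = re.findall(short_pattern, row)

print(with_dotall)
print(without_dotall)
```

Without `re.S` the second call finds nothing, because the rank and the title sit on different lines of the source.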

Background: regular expression syntax

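A regular expression describes a pattern for matching strings. It can be used to check whether a string contains a certain substring, to replace the substrings that match, or to extract the substrings that satisfy some condition.

For example:

  • runoo+b matches runoob, runooob, runoooooob, and so on. The + means the preceding character must appear at least once (one or more times).
  • runoo*b matches runob, runoob, runoooooob, and so on. The * means the preceding character may be absent or may appear once or more (zero or more times).
  • colou?r matches color or colour. The ? means the preceding character may appear at most once (zero or one time).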
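The three quantifiers above can be checked with Python's `re` module. This is a minimal sketch; the helper name `full_matches` and the word lists are made up for illustration.

```python
import re

def full_matches(pattern, words):
    """Return the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

words = ['runob', 'runoob', 'runooob']

plus = full_matches(r'runoo+b', words)      # '+': one or more of the preceding 'o'
star = full_matches(r'runoo*b', words)      # '*': zero or more of the preceding 'o'
optional = full_matches(r'colou?r', ['color', 'colour', 'colouur'])  # '?': zero or one 'u'

print(plus)      # ['runoob', 'runooob']
print(star)      # ['runob', 'runoob', 'runooob']
print(optional)  # ['color', 'colour']
```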

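Regular expressions are built the same way as mathematical expressions: metacharacters and operators combine small expressions into larger ones. The components of a regular expression can be single characters, character sets, character ranges, alternatives between characters, or any combination of these.

A regular expression is a text pattern made up of ordinary characters (for example, the letters a to z) and special characters, called "metacharacters". The pattern describes one or more strings to match when searching text. It acts as a template that is matched against the string being searched.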
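To make the components listed above concrete, here is a short sketch using Python's `re` module. The sample text and variable names are invented for the example.

```python
import re

text = 'cat, cot, cut, dog; v1.0 and v2x0'

# Character set: 'c', then either 'a' or 'o', then 't'.
char_set = re.findall(r'c[ao]t', text)

# Character range: any single digit from 0 to 9.
digits = re.findall(r'[0-9]', text)

# Alternation: either of two whole words.
either = re.findall(r'cat|dog', text)

# '.' is a metacharacter that matches any character;
# escaping it as '\.' turns it into an ordinary, literal dot.
any_char = re.findall(r'v[0-9].[0-9]', text)
literal_dot = re.findall(r'v[0-9]\.[0-9]', text)

print(char_set)     # ['cat', 'cot']
print(digits)       # ['1', '0', '2', '0']
print(either)       # ['cat', 'dog']
print(any_char)     # ['v1.0', 'v2x0']
print(literal_dot)  # ['v1.0']
```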

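Complete code:

import json
import re

import requests
from requests.exceptions import RequestException


def get_one_page(url):
    """Fetch the page and return its HTML, or None if the request fails."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
    }
    try:
        # Named `response` rather than `re` so the regex module is not shadowed.
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    """Yield one dict per ranking row matched by the pattern."""
    # re.S lets '.' match newlines, because each row spans many lines.
    # The pattern expects exactly three actor links in every row.
    pattern = re.compile(r'<tr data-ranklist-elem="item" .*?<i class=".*?" data-ranklist-elem="rank">(.*?)</i>.*? rseat=".*?" target=.*?>(.*?)</a>.*? <a href=".*?" rseat=".*?" target=.*?>(.*?)</a>.*?<a href=".*?" rseat=".*?" target=.*?>(.*?)</a>.*?<a href=".*?" rseat=".*?" target=.*?>(.*?)</a>.*?<div class="rank_list" data-ranklist-yes=".*?">.*?<span class="item_num">(.*?)</span>.*?<i class=".*?" title="(.*?)">.*?<div class="rank_list" data-ranklist-week=".*?">.*? <span class="item_num">(.*?)</span>.*?<i class=".*?" title="(.*?)">.*?<div class=".*?" data-ranklist-mon=".*?">.*?<span class="item_num">(.*?)</span>.*?', re.S)
    for item in pattern.findall(html):
        yield {
            'index': item[0],
            'title': item[1],
            'actor': list(item[2:5]),
            'data-ranklist-yes': item[5],    # yesterday's count
            'tend_line': item[6],            # trend next to the daily count
            'data-ranklist-week': item[7],   # this week's count
            'tend_line1': item[8],           # trend next to the weekly count
            # 'data-ranklist-mon': item[9],  # monthly count: captured but not saved
        }


def save_one_page(content):
    """Append one record to re.txt as a single line of JSON."""
    with open('re.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main():
    url = 'http://v.iqiyi.com/index/dianshiju/index.html'
    html = get_one_page(url)
    if html is None:
        # The request failed or returned a non-200 status; nothing to parse.
        print('Failed to fetch the page')
        return
    for item in parse_one_page(html):
        save_one_page(item)


if __name__ == '__main__':
    main()

Running the script appends one JSON object per line to re.txt, as shown below. The trend fields take the values 平稳 (steady), 上升 (rising), and 下降 (falling). Note that the entry with index 31 lists the title 新大头儿子和小头爸爸 among its actors and that there is no entry with index 32. The most likely cause is that the row for 亮剑 has fewer than three actor links, so the lazy pattern runs on into the next row and swallows it.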
{"index": "1", "title": "亲爱的,热爱的", "actor": ["杨紫", "李现", "胡一天"], "data-ranklist-yes": "13,059,697", "tend_line": "平稳", "data-ranklist-week": "76,916,679", "tend_line1": "平稳"}
{"index": "2", "title": "请赐我一双翅膀", "actor": ["鞠婧祎", "炎亚纶", "韩栋"], "data-ranklist-yes": "2,401,901", "tend_line": "平稳", "data-ranklist-week": "15,533,356", "tend_line1": "平稳"}
{"index": "3", "title": "少年派", "actor": ["张嘉译", "闫妮", "赵今麦"], "data-ranklist-yes": "1,801,400", "tend_line": "平稳", "data-ranklist-week": "15,471,047", "tend_line1": "平稳"}
{"index": "4", "title": "神犬小七3", "actor": ["姜潮", "宋妍霏", "徐可"], "data-ranklist-yes": "1,720,858", "tend_line": "上升", "data-ranklist-week": "10,596,919", "tend_line1": "上升"}
{"index": "5", "title": "流淌的美好时光", "actor": ["马天宇", "郑爽", "柴碧云"], "data-ranklist-yes": "1,454,898", "tend_line": "下降", "data-ranklist-week": "10,499,779", "tend_line1": "下降"}
{"index": "6", "title": "爱来的刚好", "actor": ["韩栋", "江铠同", "李威"], "data-ranklist-yes": "833,886", "tend_line": "平稳", "data-ranklist-week": "7,124,209", "tend_line1": "平稳"}
{"index": "7", "title": "带着爸爸去留学", "actor": ["孙红雷", "辛芷蕾", "曾舜晞"], "data-ranklist-yes": "830,028", "tend_line": "平稳", "data-ranklist-week": "7,673,427", "tend_line1": "平稳"}
{"index": "8", "title": "追球", "actor": ["范世錡", "卜冠今", "李艺彤"], "data-ranklist-yes": "807,514", "tend_line": "平稳", "data-ranklist-week": "8,054,590", "tend_line1": "平稳"}
{"index": "9", "title": "破冰行动", "actor": ["黄景瑜", "吴刚", "王劲松"], "data-ranklist-yes": "748,406", "tend_line": "平稳", "data-ranklist-week": "5,908,357", "tend_line1": "平稳"}
{"index": "10", "title": "爱情公寓4", "actor": ["娄艺潇", "陈赫", "邓家佳"], "data-ranklist-yes": "706,075", "tend_line": "上升", "data-ranklist-week": "4,692,537", "tend_line1": "上升"}
{"index": "11", "title": "陈情令", "actor": ["肖战", "王一博", "孟子义"], "data-ranklist-yes": "700,983", "tend_line": "平稳", "data-ranklist-week": "4,699,911", "tend_line1": "平稳"}
{"index": "12", "title": "时间都知道", "actor": ["唐嫣", "窦骁", "杨烁"], "data-ranklist-yes": "699,655", "tend_line": "平稳", "data-ranklist-week": "3,460,056", "tend_line1": "平稳"}
{"index": "13", "title": "归还世界给你", "actor": ["杨烁", "古力娜扎", "徐正溪"], "data-ranklist-yes": "647,017", "tend_line": "上升", "data-ranklist-week": "1,658,061", "tend_line1": "上升"}
{"index": "14", "title": "宸汐缘", "actor": ["张震", "倪妮", "李东学"], "data-ranklist-yes": "640,628", "tend_line": "平稳", "data-ranklist-week": "4,787,852", "tend_line1": "平稳"}
{"index": "15", "title": "河神", "actor": ["李现", "张铭恩", "王紫璇CiCi"], "data-ranklist-yes": "578,966", "tend_line": "下降", "data-ranklist-week": "4,259,929", "tend_line1": "下降"}
{"index": "16", "title": "长安十二时辰", "actor": ["雷佳音", "易烊千玺", "周一围"], "data-ranklist-yes": "546,752", "tend_line": "下降", "data-ranklist-week": "4,397,608", "tend_line1": "下降"}
{"index": "17", "title": "七月与安生", "actor": ["沈月", "陈都灵", "熊梓淇"], "data-ranklist-yes": "497,474", "tend_line": "上升", "data-ranklist-week": "788,476", "tend_line1": "上升"}
{"index": "18", "title": "大宋少年志", "actor": ["张新成", "周雨彤", "郑伟"], "data-ranklist-yes": "474,354", "tend_line": "下降", "data-ranklist-week": "5,339,238", "tend_line1": "下降"}
{"index": "19", "title": "香蜜沉沉烬如霜", "actor": ["杨紫", "邓伦", "陈钰琪"], "data-ranklist-yes": "464,225", "tend_line": "下降", "data-ranklist-week": "3,242,294", "tend_line1": "下降"}
{"index": "20", "title": "李三枪", "actor": ["刘恩佑", "战菁一", "高叶"], "data-ranklist-yes": "443,685", "tend_line": "上升", "data-ranklist-week": "1,569,801", "tend_line1": "上升"}
{"index": "21", "title": "九州缥缈录", "actor": ["刘昊然", "宋祖儿", "陈若轩"], "data-ranklist-yes": "441,532", "tend_line": "下降", "data-ranklist-week": "3,274,041", "tend_line1": "下降"}
{"index": "22", "title": "老九门", "actor": ["陈伟霆", "张艺兴", "赵丽颖"], "data-ranklist-yes": "418,361", "tend_line": "下降", "data-ranklist-week": "3,374,370", "tend_line1": "下降"}
{"index": "23", "title": "我的前半生", "actor": ["靳东", "马伊琍", "袁泉"], "data-ranklist-yes": "417,136", "tend_line": "下降", "data-ranklist-week": "2,765,214", "tend_line1": "下降"}
{"index": "24", "title": "天雷一部之春花秋月", "actor": ["李宏毅", "赵露思", "吴俊余"], "data-ranklist-yes": "403,066", "tend_line": "上升", "data-ranklist-week": "2,209,010", "tend_line1": "上升"}
{"index": "25", "title": "三生三世十里桃花", "actor": ["杨幂", "赵又廷", "张智尧"], "data-ranklist-yes": "384,577", "tend_line": "下降", "data-ranklist-week": "2,621,556", "tend_line1": "下降"}
{"index": "26", "title": "我们的少年时代", "actor": ["王俊凯", "王源", "易烊千玺"], "data-ranklist-yes": "373,314", "tend_line": "下降", "data-ranklist-week": "2,571,446", "tend_line1": "下降"}
{"index": "27", "title": "我们不能是朋友", "actor": ["刘以豪", "郭雪芙", "夏若妍"], "data-ranklist-yes": "372,439", "tend_line": "下降", "data-ranklist-week": "3,307,221", "tend_line1": "下降"}
{"index": "28", "title": "我要和你在一起", "actor": ["柴碧云", "孙绍龙", "万思维"], "data-ranklist-yes": "310,126", "tend_line": "上升", "data-ranklist-week": "2,046,496", "tend_line1": "上升"}
{"index": "29", "title": "灵魂摆渡3", "actor": ["于毅", "刘智扬", "肖茵"], "data-ranklist-yes": "304,648", "tend_line": "上升", "data-ranklist-week": "2,053,201", "tend_line1": "上升"}
{"index": "30", "title": "白发", "actor": ["张雪迎", "李治廷", "经超"], "data-ranklist-yes": "303,315", "tend_line": "下降", "data-ranklist-week": "3,506,497", "tend_line1": "下降"}
{"index": "31", "title": "亮剑", "actor": ["新大头儿子和小头爸爸", "王浩宇(童星)", "陈创"], "data-ranklist-yes": "287,988", "tend_line": "下降", "data-ranklist-week": "1,873,257", "tend_line1": "下降"}
{"index": "33", "title": "娘道", "actor": ["岳丽娜", "于毅", "张少华"], "data-ranklist-yes": "285,508", "tend_line": "下降", "data-ranklist-week": "1,938,314", "tend_line1": "下降"}
{"index": "34", "title": "欢乐颂2", "actor": ["刘涛", "蒋欣", "王子文"], "data-ranklist-yes": "281,111", "tend_line": "上升", "data-ranklist-week": "1,698,263", "tend_line1": "上升"}
{"index": "35", "title": "花千骨", "actor": ["霍建华", "赵丽颖", "蒋欣"], "data-ranklist-yes": "276,161", "tend_line": "下降", "data-ranklist-week": "1,997,459", "tend_line1": "下降"}
{"index": "36", "title": "武林外传", "actor": ["闫妮", "沙溢", "姚晨"], "data-ranklist-yes": "271,923", "tend_line": "下降", "data-ranklist-week": "1,901,301", "tend_line1": "下降"}
{"index": "37", "title": "神探柯晨", "actor": ["黄志忠", "吴刚", "李倩"], "data-ranklist-yes": "245,894", "tend_line": "下降", "data-ranklist-week": "2,323,772", "tend_line1": "下降"}
{"index": "38", "title": "我是特种兵之利刃出鞘", "actor": ["吴京", "徐佳", "赵荀"], "data-ranklist-yes": "244,222", "tend_line": "下降", "data-ranklist-week": "1,831,220", "tend_line1": "下降"}
{"index": "39", "title": "芸汐传", "actor": ["鞠婧祎", "张哲瀚", "米热"], "data-ranklist-yes": "243,418", "tend_line": "下降", "data-ranklist-week": "1,755,279", "tend_line1": "下降"}
{"index": "40", "title": "哥哥姐姐的花样年华", "actor": ["王雅捷", "王挺", "周扬"], "data-ranklist-yes": "221,093", "tend_line": "下降", "data-ranklist-week": "3,032,053", "tend_line1": "下降"}
{"index": "41", "title": "奋斗吧,少年!", "actor": ["彭昱畅", "董力", "张逸杰"], "data-ranklist-yes": "220,023", "tend_line": "上升", "data-ranklist-week": "381,625", "tend_line1": "上升"}
{"index": "42", "title": "微微一笑很倾城", "actor": ["郑爽", "杨洋", "毛晓彤"], "data-ranklist-yes": "219,655", "tend_line": "下降", "data-ranklist-week": "1,605,033", "tend_line1": "下降"}
{"index": "43", "title": "三国演义", "actor": ["鲍国安", "唐国强", "孙彦军"], "data-ranklist-yes": "219,400", "tend_line": "下降", "data-ranklist-week": "1,560,102", "tend_line1": "下降"}
{"index": "44", "title": "杉杉来了", "actor": ["张翰", "赵丽颖", "黄宥明"], "data-ranklist-yes": "217,775", "tend_line": "下降", "data-ranklist-week": "1,569,643", "tend_line1": "下降"}
{"index": "45", "title": "动物管理局", "actor": ["陈赫", "王子文", "唐晓天"], "data-ranklist-yes": "213,471", "tend_line": "下降", "data-ranklist-week": "2,242,207", "tend_line1": "下降"}
{"index": "46", "title": "鸡毛飞上天", "actor": ["张译", "殷桃", "高姝瑶"], "data-ranklist-yes": "203,069", "tend_line": "下降", "data-ranklist-week": "1,744,331", "tend_line1": "下降"}
{"index": "47", "title": "都挺好", "actor": ["姚晨", "倪大红", "郭京飞"], "data-ranklist-yes": "201,766", "tend_line": "下降", "data-ranklist-week": "1,469,292", "tend_line1": "下降"}
{"index": "48", "title": "甄嬛传", "actor": ["孙俪", "陈建斌", "蔡少芬"], "data-ranklist-yes": "197,187", "tend_line": "下降", "data-ranklist-week": "1,510,209", "tend_line1": "下降"}
{"index": "49", "title": "火蓝刀锋", "actor": ["杨志刚", "郑凯", "赫子铭"], "data-ranklist-yes": "186,544", "tend_line": "平稳", "data-ranklist-week": "1,118,312", "tend_line1": "平稳"}
{"index": "50", "title": "夜空中最闪亮的星", "actor": ["黄子韬", "吴倩", "牛骏峰"], "data-ranklist-yes": "184,997", "tend_line": "平稳", "data-ranklist-week": "1,267,903", "tend_line1": "平稳"}
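The counts are saved as strings with thousands separators, so they need converting before any numeric comparison. The following is a minimal sketch of reading the saved lines back, assuming the one-JSON-object-per-line format shown above. The helper `to_int` and the in-memory `sample` (two lines copied from the output, standing in for the contents of re.txt) are introduced here for illustration.

```python
import json

# Two lines copied from the output above, standing in for re.txt.
sample = '''{"index": "1", "title": "亲爱的,热爱的", "actor": ["杨紫", "李现", "胡一天"], "data-ranklist-yes": "13,059,697", "tend_line": "平稳", "data-ranklist-week": "76,916,679", "tend_line1": "平稳"}
{"index": "2", "title": "请赐我一双翅膀", "actor": ["鞠婧祎", "炎亚纶", "韩栋"], "data-ranklist-yes": "2,401,901", "tend_line": "平稳", "data-ranklist-week": "15,533,356", "tend_line1": "平稳"}'''

def to_int(count):
    """Convert a string such as '13,059,697' to the integer 13059697."""
    return int(count.replace(',', ''))

# Parse each non-empty line as one JSON record.
records = [json.loads(line) for line in sample.splitlines() if line.strip()]

# Map each title to its daily count as an integer.
daily = {r['title']: to_int(r['data-ranklist-yes']) for r in records}

print(daily)
```

To read the real file instead, the same list comprehension can iterate over `open('re.txt', encoding='utf-8')`.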