Python and Tonghuashun: using Python with Tonghuashun (同花顺)

Reposted

mob64ca1404ed65 2024-02-04 22:49:50


Note: all content here is for learning purposes only.

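I. Function description

1. Get the names and trading information of all stocks listed on the Shanghai Stock Exchange (SSE).

2. Save the results to a file.

3. Technical approach: requests + bs4 + re.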

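II. Choosing the candidate websites

1. Selection criteria: the stock information must be present statically in the HTML page rather than generated by JavaScript, and the site must not restrict crawling through its robots protocol.

2. How to check: view the page source. For the site used in this example, Tonghuashun, right-click the page and choose "View page source", then search for a keyword copied from the page, such as 中国平安 (Ping An). If the keyword appears in the source, the data is written directly into the HTML.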
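Both criteria can be checked quickly before writing the scraper. The following is a minimal sketch using only the standard library. The function names `appears_in_source` and `allowed_by_robots`, the sample HTML, the sample robots.txt text, and the example.com URLs are all illustrative assumptions and are not part of the original program.

```python
from urllib.robotparser import RobotFileParser

def appears_in_source(html, keyword):
    """True if the keyword is present in the raw HTML text."""
    return keyword in html

def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical samples, for illustration only
sample_html = "<html><body><a href='/601318/'>中国平安</a></body></html>"
sample_robots = "User-agent: *\nDisallow: /private/"

print(appears_in_source(sample_html, "中国平安"))
print(allowed_by_robots(sample_robots, "http://example.com/data/stock/601318"))
print(allowed_by_robots(sample_robots, "http://example.com/private/page"))
```

In practice the HTML would come from the raw response body and the robots text from the site's own /robots.txt, so the check reflects what the crawler actually receives rather than what the browser renders.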


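Candidate sites: Tonghuashun (同花顺), used to get the codes of the Shanghai-listed stocks, and Jisilu (集思录), used to get the detailed information for each stock.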

The target information for each stock is the block of labelled fields on its detail page, for example 现价 (current price) and 股息率 (dividend yield).

III. Program structure

1. Get the stock list from Tonghuashun.

2. For each stock in the list, fetch its details from Jisilu.

3. Store the results in a file.


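Source analysis: it is enough to find every a tag and pull the stock code out of its link, using a regular expression built around \d{6}.

The code for this step: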
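To make the matching rule concrete, here is a small sketch that applies the same pattern to a few links and keeps each code once. The href values are hypothetical examples, and `extract_code` is a helper name introduced only for this illustration.

```python
import re

# Six digits enclosed by slashes, e.g. ".../601318/"
CODE_PATTERN = re.compile(r'(?<=/)\d{6}(?=/)')

def extract_code(href):
    """Return the first six-digit code found in a link, or None."""
    match = CODE_PATTERN.search(href)
    return match.group(0) if match else None

# Hypothetical href values, for illustration only
hrefs = [
    "http://stockpage.10jqka.com.cn/601318/",
    "http://stockpage.10jqka.com.cn/600519/",
    "http://stockpage.10jqka.com.cn/601318/",  # duplicate link
    "/about/",                                  # no stock code
]

codes = []
for href in hrefs:
    code = extract_code(href)
    if code and code not in codes:
        codes.append(code)

print(codes)  # ['601318', '600519']
```

The lookbehind and lookahead require a slash on both sides of the six digits, so longer digit runs and links without a code are skipped instead of raising an error.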

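def getStockList(lst, stockURL):
    # Collect the six-digit stock codes from the Tonghuashun list page
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'lxml')
    for a in soup.find_all('a'):
        # Some a tags have no href; treat them as an empty link
        href = a.attrs.get('href', '')
        # The code is the six digits enclosed by slashes in the link
        codes = re.findall(r'(?<=/)\d{6}(?=/)', href)
        if codes and codes[0] not in lst:
            lst.append(codes[0])
    print(lst)  # the collected stock codes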

Next, build the detail-page URL from each stock code, for example https://www.jisilu.cn/data/stock/601318, where the trailing number is the stock code.
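As a small sketch of this step, the URL can be built by appending the code to the Jisilu base URL. The helper name `build_stock_url` and the six-digit validation are assumptions added for illustration; the original program simply concatenates the two strings.

```python
import re

JISILU_BASE = "https://www.jisilu.cn/data/stock/"  # base URL used in this article

def build_stock_url(code, base=JISILU_BASE):
    """Return the detail-page URL for a six-digit stock code."""
    if not re.fullmatch(r'\d{6}', code):
        raise ValueError(f"not a six-digit stock code: {code!r}")
    return base + code

print(build_stock_url("601318"))  # https://www.jisilu.cn/data/stock/601318
```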

Then fetch the corresponding HTML fragment:

def getStockInfo(lst, stockURL, fpath):
    for stock in lst:
        # Detail-page URL = Jisilu base URL + stock code taken from Tonghuashun
        url = stockURL + stock
        html = getHTMLText(url)
        if html == "":
            continue
        try:
            infoDict = {}
            soup = BeautifulSoup(html, 'lxml')
            divList = soup.find('div', attrs={'class': 'grid data_content'})
            # Keep only the first 8 children of the container: that is the part
            # of the page holding the target fields. The 8 was found by trial,
            # printing more and more children until the block was complete.
            mystr = ""
            for count, div in enumerate(divList, start=1):
                mystr += str(div)
                if count >= 8:
                    break
            # Drop the empty spacer cells
            mystr = mystr.replace('<td> </td>', '')
            mystr = mystr.replace('<td style="width:  20px;"> </td>', '')
            myhtml = BeautifulSoup(mystr, 'html.parser')

            name = myhtml.find_all('li', attrs={'class': 'active'})[0]
            # 股票名称 = stock name, stored as one whitespace-normalized string
            infoDict['股票名称'] = ' '.join(name.text.split())

            # Pull the label out of cells such as
            #   <td title="5年平均股息率：-">股息率<sup>TTM</sup><span class="dc">15.066%</span>
            #   <td style="width: 150px;">现价 <span class="dc">5.310</span></td>
            # giving the keys 股息率 (dividend yield) and 现价 (current price)
            keykeylist = []
            for key in myhtml.find_all('td'):
                strkey = str(key).replace('\n', '')
                match = (re.search(r'(?<=>).*?(?=<span)', strkey)
                         or re.search(r'(?<=>).*?(?=<sup)', strkey))
                if match:
                    # Strip spaces and the <sup>TTM</sup> marker from the label
                    keykeylist.append(match.group(0).replace(' ', '').replace('<sup>TTM</sup>', ''))
            valuelist = myhtml.find_all('span', attrs={'class': 'dc'})

            # zip stops at the shorter list, so a missing label cannot raise IndexError
            for key, val in zip(keykeylist, valuelist):
                infoDict[key] = val.text

            # Append one record per line
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except Exception:
            traceback.print_exc()
            continue
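The label and value extraction can be tried in isolation on the two sample cells quoted in the code comments. This is a sketch using only `re`; the helper name `extract_pair` is hypothetical, and it reads the value with a regular expression instead of BeautifulSoup so that the example runs without third-party packages.

```python
import re

def extract_pair(td_html):
    """Return (label, value) for one table cell, or None if it has no value span."""
    key_match = re.search(r'(?<=>).*?(?=<span)', td_html)
    val_match = re.search(r'<span class="dc">(.*?)</span>', td_html)
    if not (key_match and val_match):
        return None
    # Remove the TTM marker and any spaces from the label
    key = key_match.group(0).replace('<sup>TTM</sup>', '').replace(' ', '')
    return key, val_match.group(1)

# The two sample cells from the comments in getStockInfo
cells = [
    '<td title="5年平均股息率：-">股息率<sup>TTM</sup><span class="dc">15.066%</span></td>',
    '<td style="width: 150px;">现价 <span class="dc">5.310</span></td>',
]

for cell in cells:
    print(extract_pair(cell))
# ('股息率', '15.066%')
# ('现价', '5.310')
```

Pairing the label and the value from the same cell avoids relying on two separate lists staying aligned by index.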

The code that fetches the HTML text:

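def getHTMLText(url):
    # Return the page text, or "" if the request fails
    try:
        # Replace both placeholders with values from your own browser session
        headers = {'User-Agent': 'Your-Agent',
                   'Cookie': 'Your-cookie'}
        r = requests.get(url, headers=headers, timeout=30)
        r.raise_for_status()  # raise on 4xx/5xx responses
        r.encoding = r.apparent_encoding  # guess the encoding from the body
        return r.text
    except requests.RequestException:
        return ""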

For where the values in headers come from and why they are needed, see the link.

Finally, the main part of the code:

import requests
from bs4 import BeautifulSoup
import traceback
import re

def main():
    stock_list_url_hg = "http://data.10jqka.com.cn/hgt/hgtb/"  # Shanghai stocks
    stock_info_url = "https://www.jisilu.cn/data/stock/"
    output_file = "D:\\self_taught\\python\\WebCrawler\\BaiduStockInfo.txt"
    slist = []
    getStockList(slist, stock_list_url_hg)
    getStockInfo(slist, stock_info_url, output_file)

if __name__ == "__main__":
    main()
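The program writes each record with str(infoDict), which is easy to read but awkward to load back. One possible variant, sketched here as an assumption and not as part of the original program, writes one JSON object per line instead. The helper names `append_record` and `load_records` and the temporary demo file are hypothetical; the sample values are the ones quoted in the code comments.

```python
import json
import os
import tempfile

def append_record(fpath, record):
    """Append one dict to the file as a single JSON line."""
    with open(fpath, 'a', encoding='utf-8') as f:
        # ensure_ascii=False keeps the Chinese labels readable in the file
        f.write(json.dumps(record, ensure_ascii=False) + '\n')

def load_records(fpath):
    """Read the file back into a list of dicts."""
    with open(fpath, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo in a temporary directory, using sample values from the code comments
demo_path = os.path.join(tempfile.mkdtemp(), "stocks.jsonl")
append_record(demo_path, {"现价": "5.310", "股息率": "15.066%"})
print(load_records(demo_path))
```

In getStockInfo this would replace the f.write(str(infoDict)+'\n') line with a call to append_record(fpath, infoDict).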