对绝想日志网的说说进行爬取,并存到mysql数据库
使用python3配合re、time、pyquery、mysql等库,对绝想日志网的说说进行爬取,并存到mysql数据库
Python Web爬虫
共0Star
详细介绍
get_juexiang.com_python3
使用python3配合re、time、pyquery、mysql等库,对绝想日志网的说说进行爬取,并存到mysql数据库
###过滤html标签 def stripTag(x): return re.sub(r'<(.*?)>','',str(x))
###转换时间戳 def timeStamp(x): return time.mktime(time.strptime(x,'%Y-%m-%d %H:%M'))
###获取网页局部源码 d = pq(url='http://www.juexiang.com/list/1017') d = pq(d('.left').html()) x = d('div.arttitle')
###匹配时间格式 pattern = re.compile(r"[0-9]{4}(.*)[0-9]{2}")
###采集网页信息 def get_content(x): a = pq(pq(x).html()) title = stripTag(pq(a('a').eq(0).text())) author = stripTag(pq(a('a').eq(1).text())) time1 = str(pq(a('span').eq(2).text())) time1 = timeStamp((pattern.search(time1)).group()) return title,author,time1
###连接mysql数据库 def connection(): config = { 'user': '', 'password': '', 'host': '', "port": 3306, 'database': '' } try: c = connector.connect(**config) return c except: print("connection error") exit(1)
cn = connection()
cur = cn.cursor()
###for循环获取标题、作者、时间 for i in x: title,author,time1 = get_content(i) data = (title,author,time1,3,1) print(data)