def start_requests(self):
"""
In the scrapy doc there are two ways to tell scrapy where to begin to
crawl from. One is start_requests, the other is start_urls which is
shortcut to the start_requestt.
Based on my experience, it is better to use start_requests instead of
start_urls bacause in this methods you can know how the request object
are created and how request is yield. You should keep it simple and
try not to use some magic or it might confuse you.
In this project, you have no need to change code in this method, just
modify code in parse_entry_page
If you fully understatnd how scrapy work, then you are free to choose
between start_requests and start_urls.
"""
prefix = self.settings["WEB_APP_PREFIX"]
result = parse.urlparse(prefix)
base_url = parse.urlunparse(
(result.scheme, result.netloc, "", "", "", "")
)
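# base_url keeps only the scheme and netloc of WEB_APP_PREFIX; for example, if
# the prefix were 'http://115.28.36.253:8000/app/' (an illustrative value),
# base_url would be 'http://115.28.36.253:8000'.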
# Generate the start url from config and self.entry; when you paste this code
# into another spider you only need to change self.entry and self.taskid
url = parse.urljoin(base_url, self.entry)
print(url)
request = Request(url=url, callback=self.parse_entry_page)
request.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'
request.headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
request.headers['Accept-Encoding'] = 'gzip, deflate, sdch'
request.headers['Accept-Language'] = 'zh-CN,zh;q=0.8,zh-TW;q=0.6'
request.headers['Connection'] = 'keep-alive'
request.headers['Host'] = '115.28.36.253:8000'
request.headers['DNT'] = '1'
yield request
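# Note: the same headers could instead be passed to the Request constructor,
# e.g. Request(url=url, callback=self.parse_entry_page, headers={...});
# setting them one by one here just keeps each header easy to tweak.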