python - Scrapy Pipeline with an unknown number of results
I have a Scrapy spider that gets its start_urls from a MySQL database. When it scrapes each page, it comes back with an unknown number of links, meaning it could have 0 links or 10 links from each page it scrapes. Because that number is unknown, I don't know how best to have the pipeline update the database with all the possible scraped links, so instead I have it dumping the start_url and the scraped link into a new database. If I am using a new database, I would like to bring over the searchterm column value for each start_url into that new database.
Whether I grab the searchterm column for each start_url and pipe it into the new database, or someone has a different idea on how to update the original database with an unknown quantity of scraped links, either would work well.
Here is my spider.py. I have commented out the offending lines:
```python
import scrapy
import MySQLdb
import MySQLdb.cursors
from scrapy.http.request import Request

from youtubephase2.items import YoutubePhase2Item


class YoutubePhase2(scrapy.Spider):
    name = 'youtubephase2'

    def start_requests(self):
        conn = MySQLdb.connect(user='uname', passwd='password',
                               db='youtubescrape', host='localhost',
                               charset="utf8", use_unicode=True)
        cursor = conn.cursor()
        cursor.execute('SELECT * FROM SearchResults;')
        rows = cursor.fetchall()
        for row in rows:
            if row:
                #yield Request(row[0], self.parse, meta=dict(searchterm=row[0]))
                yield Request(row[1], self.parse, meta=dict(start_url=row[1]))
        cursor.close()

    def parse(self, response):
        for sel in response.xpath('//a[contains(@class, "yt-uix-servicelink")]'):
            item = YoutubePhase2Item()
            #item['searchterm'] = response.meta['searchterm']
            item['start_url'] = response.meta['start_url']
            item['affiliateurl'] = sel.xpath('@href').extract_first()
            yield item
```
I'm not sure if I understand you correctly, but you can carry several values in meta.
Assuming you have a table:
```
# table1
id | urls    | name | address
0  | foo.com | foo  | foo 1
1  | bar.com | bar  | bar 1
```
You can yield a Request for every row, and in parse yield as many items as you want into a new table:
```python
def start_requests(self):
    rows = ...
    for row in rows:
        url = row[1]
        yield Request(url, meta={'row': row})

def parse(self, response):
    links = ...
    row = response.meta['row']
    for link in links:
        item = dict()
        item['urls'] = row[1]
        item['name'] = row[2]
        item['address'] = row[3]
        # add more stuff...
        item['link'] = link
        yield item
```
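The fan-out in parse can be seen in isolation, without Scrapy: each row from table1 is combined with however many links the page happens to yield, zero or many. A minimal sketch with hypothetical sample data (`build_items` and the row values are made up for illustration):

```python
def build_items(row, links):
    """Combine one database row with every link scraped from its page."""
    for link in links:
        yield {'urls': row[1], 'name': row[2], 'address': row[3], 'link': link}

# hypothetical row from table1: (id, urls, name, address)
row = (0, 'foo.com', 'foo', 'foo 1')
items = list(build_items(row, ['link1', 'link2', 'link3']))
# three links on the page -> three items, all carrying the same row data;
# a page with zero links simply produces zero items
```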
Then save those items to the database and you'll end up with:
```
# table2
id | urls    | name | address | link
0  | foo.com | foo  | foo 1   | link1
1  | foo.com | foo  | foo 1   | link2
2  | foo.com | foo  | foo 1   | link3
3  | bar.com | bar  | bar 1   | link1
4  | bar.com | bar  | bar 1   | link2
5  | bar.com | bar  | bar 1   | link3
```
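To persist those items, a pipeline can simply INSERT one row per item; because each item already carries the row data plus its link, an unknown number of links per page is not a problem. A minimal sketch using sqlite3 in place of MySQLdb (the `open_spider`/`process_item`/`close_spider` hooks are Scrapy's real pipeline API; the class and table names are just examples):

```python
import sqlite3


class SaveLinksPipeline:
    """Insert one row into table2 for every item the spider yields."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect('results.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS table2 '
            '(id INTEGER PRIMARY KEY, urls TEXT, name TEXT, address TEXT, link TEXT)')

    def process_item(self, item, spider):
        self.conn.execute(
            'INSERT INTO table2 (urls, name, address, link) VALUES (?, ?, ?, ?)',
            (item['urls'], item['name'], item['address'], item['link']))
        return item

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()
```

Remember that the pipeline only runs if it is enabled in `ITEM_PIPELINES` in your project settings.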