python - Scrapy Pipeline with an unknown number of results


I have a Scrapy spider that gets its start_urls from a MySQL database. When it scrapes each page, an unknown number of links comes back, meaning each scraped page could yield 0 links or 10 links. Because that number is unknown, I don't know how best to have the pipeline update the original database with all of the possible scraped links, so instead I have it dumping each start_url and scraped link into a new database. If I do use a new database, I want to bring over the searchterm column value for each start_url into the new database.

Whether I grab the searchterm column for each start_url and pipe it into the new database, or someone has a different idea on how to update the original database with an unknown quantity of scraped links, either would work well.

Here is my spider.py. I have commented out the offending lines:

import scrapy
import MySQLdb
import MySQLdb.cursors
from scrapy.http.request import Request

from youtubephase2.items import Youtubephase2Item


class youtubephase2(scrapy.Spider):
    name = 'youtubephase2'

    def start_requests(self):
        conn = MySQLdb.connect(user='uname', passwd='password', db='youtubescrape', host='localhost', charset="utf8", use_unicode=True)
        cursor = conn.cursor()
        cursor.execute('SELECT * FROM searchresults;')
        rows = cursor.fetchall()

        for row in rows:
            if row:
                #yield Request(row[0], self.parse, meta=dict(searchterm=row[0]))
                yield Request(row[1], self.parse, meta=dict(start_url=row[1]))
        cursor.close()

    def parse(self, response):
        for sel in response.xpath('//a[contains(@class, "yt-uix-servicelink")]'):
            item = Youtubephase2Item()
            #item['searchterm'] = response.meta['searchterm']
            item['start_url'] = response.meta['start_url']
            item['affiliateurl'] = sel.xpath('@href').extract_first()
            yield item
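For reference, the kind of pipeline I'm after would look roughly like this; a minimal sketch, where the newresults table, its columns, and the SaveLinksPipeline class name are placeholders, not a real schema:

import MySQLdb

class SaveLinksPipeline(object):
    # Hypothetical pipeline: one INSERT per scraped item, so 0 links or
    # 10 links per page both work without knowing the count up front.

    def open_spider(self, spider):
        self.conn = MySQLdb.connect(user='uname', passwd='password',
                                    db='youtubescrape', host='localhost',
                                    charset='utf8', use_unicode=True)
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.conn.commit()
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        # Assumed table 'newresults' with searchterm, start_url and
        # affiliateurl columns.
        self.cursor.execute(
            'INSERT INTO newresults (searchterm, start_url, affiliateurl) '
            'VALUES (%s, %s, %s)',
            (item.get('searchterm'), item['start_url'], item['affiliateurl']))
        return item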

I'm not sure if I understand you correctly, but you can carry several items in meta.

Assuming you have a table:

# table1
id | urls    | name | address |
0  | foo.com | foo  | foo 1   |
1  | bar.com | bar  | bar 1   |

You can yield a Request for every row and, in parse, yield as many items as you want for the new table:

def start_requests(self):
    rows = ...
    for row in rows:
        url = row[1]
        yield Request(url, meta={'row': row})

def parse(self, response):
    links = ...
    row = response.meta['row']
    for link in links:
        item = dict()
        item['urls'] = row[1]
        item['name'] = row[2]
        item['address'] = row[3]
        # add stuff...
        item['link'] = link
        yield item

Then save the items to the database, and you'll end up with:

# table2
id | urls    | name | address | link  |
0  | foo.com | foo  | foo 1   | link1 |
1  | foo.com | foo  | foo 1   | link2 |
2  | foo.com | foo  | foo 1   | link3 |
3  | bar.com | bar  | bar 1   | link1 |
4  | bar.com | bar  | bar 1   | link2 |
5  | bar.com | bar  | bar 1   | link3 |
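For that save step, a pipeline along the lines of the sketch in the question can run one INSERT per item into table2; the remaining piece is enabling the pipeline in settings.py. A minimal sketch, where the module path and class name are assumptions:

# settings.py -- module path and class name are hypothetical
ITEM_PIPELINES = {
    'youtubephase2.pipelines.SaveLinksPipeline': 300,
}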
