The differences between get(), getall(), extract(), and extract_first()

2022-08-16 itknight

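To state the point up front: get() and getall() are the newer methods; extract() and extract_first() are the older ones.

The newer pair is easier to use: get() always returns a single string, or None when nothing matches, and getall() always returns a list. Note that extract_first() also returns None when nothing matches; the error usually associated with the old style comes from the idiom extract()[0], which raises IndexError on an empty result.

The new methods are recommended, and the official documentation has switched to them as well.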

Conclusions:

For a scrapy.selector.unified.SelectorList object: getall() == extract() and get() == extract_first().

For a scrapy.selector.unified.Selector object: get() == extract(), getall() returns a one-element list, and extract_first() does not exist.

Testing with scrapy shell

scrapy shell https://www.nxcto.com/2022/08/10/%e6%8a%a5%e9%94%99%ef%bc%9aunknown-custom-element-%ef%bc%9c%e7%bb%84%e4%bb%b6%e5%90%8d%ef%bc%9e-did-you-register-the-component-correctly%e7%9a%84%e5%8e%9f%e5%9b%a0/
Once the shell is open, inspect the objects with type().

# Look at the first 200 characters of the HTML to confirm the response is sane
response.text[:200]
# Check the type returned by the selectors
type(response.xpath('//*[@id="post-43"]/div/p/img/@alt'))
type(response.xpath('//*[@id="post-43"]/div/p/img'))
# Both calls return <class 'scrapy.selector.unified.SelectorList'>

The calls below are all routine:

>>> response.xpath('//*[@id="post-43"]/div/p/img/@alt').get()
'问题还原'
>>> response.xpath('//*[@id="post-43"]/div/p/img/@alt')
[<Selector xpath='//*[@id="post-43"]/div/p/img/@alt' data='问题还原'>]
>>> type(response.xpath('//*[@id="post-43"]/div/p/img/@alt'))
<class 'scrapy.selector.unified.SelectorList'>
>>> response.xpath('//*[@id="post-43"]/div/p/img/@alt').get()
'问题还原'
>>> response.xpath('//*[@id="post-43"]/div/p/img/@alt').getall()
['问题还原']
>>> response.xpath('//*[@id="post-43"]/div/p/img/@alt').extract()
['问题还原']
>>> response.xpath('//*[@id="post-43"]/div/p/img/@alt').extract_first()
'问题还原'

This shows that calling xpath() on the response returns a SelectorList instance.

So the get(), getall(), extract(), and extract_first() calls above are all methods of the SelectorList instance.

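To summarize:

For a scrapy.selector.unified.SelectorList object

get() == extract_first()

Both return a single string: the first match in the list.

getall() == extract()

Both return a list containing every matched string; if there is only one match, the result is still a list, such as ['I am alone'].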

Where extract_first() and get() differ: Selector objects

>>> response.xpath('//*[@id="post-43"]/div/p/img/@alt')[0]
<Selector xpath='//*[@id="post-43"]/div/p/img/@alt' data='问题还原'>
>>> type(response.xpath('//*[@id="post-43"]/div/p/img/@alt')[0])
<class 'scrapy.selector.unified.Selector'>
>>> response.xpath('//*[@id="post-43"]/div/p/img/@alt')[0].get()
'问题还原'
>>> response.xpath('//*[@id="post-43"]/div/p/img/@alt')[0].getall()
['问题还原']
>>> response.xpath('//*[@id="post-43"]/div/p/img/@alt')[0].extract()
'问题还原'
>>> response.xpath('//*[@id="post-43"]/div/p/img/@alt')[0].extract_first()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'Selector' object has no attribute 'extract_first'

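Finding: a Selector object has no extract_first() method, whereas get() works on it. Note also that on a Selector, extract() returns a string (the same as get()), while getall() returns a one-element list.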

