It is possible to work with vanilla Python alone, but libraries can make tasks much easier and quicker to write and complete.









For example, if I have a complete crawl of my website and want to extract only those pages that are indexable, I will use a built-in Pandas function to include only those URLs in my DataFrame.

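    import pandas as pd

    # Load the crawl export into a DataFrame and preview the first rows.
    df = pd.read_csv('/Users/rutheverett/Documents/Folder/file_name.csv')
    df.head()

    # Keep only the rows where the indexable column is True.
    indexable = df[(df.indexable == True)]
    indexable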
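To try the same filter without a crawl export on disk, here is a minimal sketch that builds a small DataFrame in memory. The column names `url` and `indexable` and the example.com URLs are assumptions for illustration; a real crawl export may label them differently.

```python
import pandas as pd

# Hypothetical crawl data; a real export would come from pd.read_csv().
crawl = pd.DataFrame({
    "url": [
        "https://example.com/",
        "https://example.com/blog/",
        "https://example.com/private/",
    ],
    "indexable": [True, True, False],
})

# Boolean indexing keeps only the rows where the mask is True.
indexable_pages = crawl[crawl["indexable"]]

print(indexable_pages["url"].tolist())
```

Because the column already holds booleans, it can be used directly as the mask, which is equivalent to comparing it with `True`.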


The next library is called Requests and is used to make HTTP requests in Python.

Requests uses different request methods such as GET and POST to make a request, with the results being stored in Python.

One example of this in action is a simple GET request to a URL, which will print out the status code of a page:

    import requests

    response = requests.get('https://www.deepcrawl.com')
    print(response)

You can then use this result to create a decision-making function, where a 200 status code means the page is available but a 404 means the page is not found.

    if response.status_code == 200:
        print('Success!')
    elif response.status_code == 404:
        print('Not Found.')
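The same decision logic can be wrapped in a reusable function, so it can be applied to many responses. This is only a sketch: the function name `describe_status` and the extra redirect and server-error branches are my own additions, based on the standard 3xx and 5xx status code ranges.

```python
def describe_status(status_code: int) -> str:
    """Return a short human-readable label for an HTTP status code."""
    if status_code == 200:
        return "Success!"
    if status_code == 404:
        return "Not Found."
    if 300 <= status_code < 400:
        return "Redirect."
    if status_code >= 500:
        return "Server error."
    return "Other."


# Try the function on a few sample codes without making any requests.
for code in (200, 301, 404, 503):
    print(code, describe_status(code))
```

With a live response, you would call it as `describe_status(response.status_code)`.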

You can also inspect other parts of the response, such as the headers, which display useful information about the page like the content type or how long it took to cache the response.

    headers = response.headers
    print(headers)
    response.headers['Content-Type']

There is also the ability to simulate a specific user agent, such as Googlebot, in order to extract the response this specific bot will see when crawling the page.

    headers = {'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
    ua_response = requests.get('https://www.deepcrawl.com/', headers=headers)
    print(ua_response)

Beautiful Soup

Beautiful Soup is a library used to extract data from HTML and XML files.

Fun fact: The BeautifulSoup library was actually named after the poem from Alice’s Adventures in Wonderland by Lewis Carroll.

As a library, BeautifulSoup is used to make sense of web files and is most often used for web scraping, as it can transform an HTML document into different Python objects.

For example, you can take a URL and use Beautiful Soup together with the Requests library to extract the title of the page.

    from bs4 import BeautifulSoup
    import requests

    url = 'https://www.deepcrawl.com'
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")
    title = soup.title
    print(title)

Additionally, using the find_all method, BeautifulSoup enables you to extract certain elements from a page, such as all a href links on the page:

    url = 'https://www.deepcrawl.com/knowledge/technical-seo-library/'
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")
    for link in soup.find_all('a'):
        print(link.get('href'))
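Links on a real page are often relative, and some anchor tags have no href at all. The sketch below shows one way to handle both cases, using an inline HTML string instead of a live request. The HTML snippet and the `base_url` value are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page content and base URL, used instead of a live request.
html = """
<html><body>
  <a href="/about/">About</a>
  <a href="https://example.com/blog/">Blog</a>
  <a>No href here</a>
</body></html>
"""
base_url = "https://example.com"

soup = BeautifulSoup(html, "html.parser")

# href=True skips <a> tags that have no href attribute;
# urljoin turns relative paths into absolute URLs.
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

print(links)
```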

Putting Them Together

These three libraries can also be used together: Requests makes the HTTP request to the page, and BeautifulSoup extracts the information we want from it.

We can then transform that raw data into a Pandas DataFrame to perform further analysis.

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.deepcrawl.com/blog/'
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")
    links = soup.find_all('a')
    df = pd.DataFrame({'links': links})
    df
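For further analysis it can be easier to store plain strings rather than whole tag objects. Here is a minimal sketch of that idea, again using an inline HTML string so that it runs without a network connection. The HTML and the column names `href` and `anchor_text` are assumptions for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a fetched page.
html = """
<ul>
  <li><a href="/blog/post-1/">First post</a></li>
  <li><a href="/blog/post-2/">Second post</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# One dictionary per link, holding only the strings we care about.
rows = [
    {"href": a.get("href"), "anchor_text": a.get_text(strip=True)}
    for a in soup.find_all("a")
]
df = pd.DataFrame(rows)

print(df)
```

With string columns in place, the usual Pandas filters and counts can be applied to the links.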

Matplotlib and Seaborn

Matplotlib and Seaborn are two Python libraries used for creating visualizations.

Matplotlib allows you to create a number of different data visualizations such as bar charts, line graphs, histograms, and even heatmaps.

For example, if I wanted to take some Google Trends data to display the queries with the most popularity over a period of 30 days, I could create a bar chart in Matplotlib to visualize all of these.
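A bar chart like this could be sketched as follows. The queries and interest scores are invented purely for illustration and are not real Google Trends data, and the output file name is arbitrary.

```python
import matplotlib

matplotlib.use("Agg")  # Render off-screen so no display is required.
import matplotlib.pyplot as plt

# Made-up interest scores for illustration only.
queries = ["python seo", "seo automation", "log file analysis"]
interest = [72, 55, 31]

fig, ax = plt.subplots()
bars = ax.bar(queries, interest)
ax.set_xlabel("Query")
ax.set_ylabel("Interest over 30 days")
ax.set_title("Query popularity (illustrative data)")
fig.tight_layout()

# Write the chart to an image file in the current directory.
fig.savefig("query_popularity.png")
```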

Seaborn, which is built upon Matplotlib, provides even more visualization patterns such as scatterplots, box plots, and violin plots in addition to line and bar graphs.

It differs slightly from Matplotlib as it uses less syntax and has built-in default themes.

One way I’ve used Seaborn is to create line graphs in order to visualize log file hits to certain segments of a website over time.

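    import matplotlib.pyplot as plt
    import seaborn as sns

    # pivot_status is the pivot table of log file hits created with Pandas.
    sns.lineplot(x="month", y="log_requests_total", hue='category', data=pivot_status)
    plt.show()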

This particular example takes data from a pivot table, which I was able to create in Python using the Pandas library, and is another way these libraries work together to create an easy-to-understand picture from the data.
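The pivot table itself is not shown here, so the following is only a guess at how data of that shape might be prepared. The raw `hits` values are invented, and the column names are chosen to match the ones used in the plot call.

```python
import pandas as pd

# Hypothetical log file hits, one row per batch of requests.
hits = pd.DataFrame({
    "month": ["2021-01", "2021-01", "2021-01", "2021-02", "2021-02"],
    "category": ["blog", "blog", "product", "blog", "product"],
    "requests": [120, 80, 40, 150, 60],
})

# pivot_table sums the requests for each month and category pair.
pivot = hits.pivot_table(
    index="month", columns="category", values="requests", aggfunc="sum"
)

# Passing column names to x, y and hue needs long-form data, so melt the
# pivot back into one row per month and category.
pivot_status = pivot.reset_index().melt(
    id_vars="month", var_name="category", value_name="log_requests_total"
)

print(pivot_status)
```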


Advertools is a library created by Elias Dabbas that can be used to help manage, understand, and make decisions based on the data we have as SEO professionals and digital marketers.

Sitemap Analysis

This library allows you to perform a number of different tasks such as downloading, parsing, and analyzing XML Sitemaps to extract patterns or analyze how often content is added or changed.
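To give a feel for what sitemap parsing involves, here is a minimal sketch that uses only Python's standard library rather than Advertools itself. The sitemap content is a made-up example, and counting `lastmod` dates per month is just one simple way to look at how often content changes.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A tiny hypothetical sitemap; a real one would be downloaded from a site.
sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2021-03-01</lastmod></url>
  <url><loc>https://example.com/blog/</loc><lastmod>2021-03-15</lastmod></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>
"""

# Sitemap elements live in this XML namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)

entries = []
for url in root.findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)
    lastmod = url.findtext("sm:lastmod", namespaces=NS)  # None if absent
    entries.append({"loc": loc, "lastmod": lastmod})

# Count how many URLs were last modified in each month (YYYY-MM).
updates_per_month = Counter(e["lastmod"][:7] for e in entries if e["lastmod"])

print(entries)
print(updates_per_month)
```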

Robots.txt Analysis

Another interesting thing you can do with this library is to use a function to extract a website’s robots.txt into a DataFrame, in order to easily understand and analyze the rules set.

You can also run a test within the library in order to check whether a particular user-agent is able to fetch certain URLs or folder paths.
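The idea behind such a test can be illustrated with Python's built-in `urllib.robotparser` module. This is a standard-library stand-in and not the Advertools function; the robots.txt rules and URLs below are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, supplied as text instead of fetched.
robots_txt = """
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /no-google/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

checks = [
    ("Googlebot", "https://example.com/no-google/page"),
    ("Googlebot", "https://example.com/private/page"),
    ("OtherBot", "https://example.com/private/page"),
]

# can_fetch returns True when the user agent may request the URL.
results = [parser.can_fetch(agent, url) for agent, url in checks]

for (agent, url), allowed in zip(checks, results):
    print(agent, url, allowed)
```

Note that a user agent with its own group, like Googlebot here, follows only that group's rules, so the general `/private/` rule does not apply to it.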









Each browser has its own WebDriver; Chrome has ChromeDriver and Firefox has GeckoDriver, for example.








Scrapy can be used to extract all of the links on a certain page and store them in an output file, for example.

    from scrapy.spiders import CrawlSpider

    class SuperSpider(CrawlSpider):
        name = 'extractor'
        allowed_domains = ['www.deepcrawl.com']
        start_urls = ['https://www.deepcrawl.com/knowledge/technical-seo-library/']
        base_url = 'https://www.deepcrawl.com'

        def parse(self, response):
            # Yield one item per link found inside a paragraph.
            for link in response.xpath('//div/p/a'):
                yield {
                    "link": self.base_url + link.xpath('.//@href').get()
                }

You can take this one step further and follow the links found on a webpage to extract information from all the pages which are being linked to from the start URL, kind of like a small-scale replication of Google finding and following links on a page.

    from scrapy.spiders import CrawlSpider, Rule

    class SuperSpider(CrawlSpider):
        name = 'follower'
        allowed_domains = ['en.wikipedia.org']
        start_urls = ['https://en.wikipedia.org/wiki/Web_scraping']
        base_url = 'https://en.wikipedia.org'

        custom_settings = {
            'DEPTH_LIMIT': 1
        }

        def parse(self, response):
            for next_page in response.xpath('.//div/p/a'):
                yield response.follow(next_page, self.parse)

            for quote in response.xpath('.//h1/text()'):
                yield {'quote': quote.extract()}

Learn more about these projects, among other example projects, here.

Final Thoughts

As Hamlet Batista always said, “the best way to learn is by doing.”

I hope that discovering some of the libraries available has inspired you to get started with learning Python, or to deepen your knowledge.

Python Contributions from the SEO Industry

Hamlet also loved sharing resources and projects from those in the Python SEO community. To honor his passion for encouraging others, I wanted to share some of the amazing things I have seen from the community.

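As a wonderful tribute to Hamlet and the SEO Python community he helped to cultivate, Charly Wargnier has created SEO Pythonistas to collect contributions of the amazing Python projects created by those in the SEO community.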


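Moshe Ma-Yafit created a super cool script for log file analysis, and in this post explains how the script works. The visualizations it can display include Googlebot hits by device, daily hits by response code, response code % of total, and more.

Koray Tuğberk Gübür is currently working on a Sitemap Health Checker. He also hosted a RankSense webinar with Elias Dabbas, where he shared a script that records SERPs and analyzes algorithms.


John Mcalpin wrote an article detailing how you can use Python and Data Studio to spy on your competitors.

JC Chouinard wrote a complete guide to using the Reddit API. With this, you can perform things such as extracting data from Reddit and posting to a Subreddit.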

Rob May is working on a new GSC analysis tool and building a few new domain/real sites in Wix to measure against its higher-end WordPress competitor while documenting it.

Masaki Okazawa also shared a script that analyzes Google Search Console Data with Python.

More Resources:

How to Automate the URL Inspection Tool with Python & JavaScript

6 SEO Tasks to Automate with Python

Advanced Technical SEO: A Complete Guide

Image Credits

All screenshots taken by author, March 2021
