
How to Use Python to Analyze SEO Data: A Reference Guide

Do you find yourself performing the same repetitive SEO tasks every day, or facing challenges where no existing tool can help you?

If so, it might be time for you to learn Python.

The initial investment of time and effort will pay off in significantly increased productivity.

While I am writing this primarily for SEO professionals who are new to programming, I hope it will also be useful to those who already have a software or Python background and are looking for an easy-to-scan reference for data analysis projects.

Table of Contents

Python Basics

Data Extraction

Performing Basic Analysis

Saving & Exporting Results

Additional Resources to Learn More

Python Basics

Python is easy to learn, and I recommend spending an afternoon working through the official tutorial. I will focus on practical applications for SEO.

When writing Python programs, you can choose between Python 2 and Python 3. It is best to write new programs in Python 3, but your system may well come with Python 2 already installed, particularly if you use a Mac. Please install Python 3 as well to be able to follow along with this guide.

You can check your Python version with:

$ python --version

Using Virtual Environments

When your work is complete, it is important to make sure other people in the community can reproduce your results. They will need to be able to install the same third-party libraries you used, often with exactly the same versions.

Python encourages creating virtual environments for this.

If your system comes with Python 2, download and install Python 3 using the Anaconda distribution, then run the following steps on the command line.

$ sudo easy_install pip

$ sudo pip install virtualenv

$ mkdir seowork

$ virtualenv -p python3 seowork

If you are already using Python 3, run these alternative steps on the command line instead:

$ mkdir seowork

$ python3 -m venv seowork

The steps that follow work in either Python version and let you use the virtual environment.

$ cd seowork

$ source bin/activate

(seowork) $ deactivate

Once you deactivate the environment, you are back at your regular command line, and the libraries you installed inside the environment will no longer be available.

Useful Libraries for Data Analysis

Whenever I start a data analysis project, I like to have at least the following libraries installed:

Requests.

Matplotlib.

Requests-html.

Pandas.

Most of them are included in the Anaconda distribution. Let's add them to our virtual environment.

(seowork) $ pip3 install requests

(seowork) $ pip3 install matplotlib

(seowork) $ pip3 install requests-html

(seowork) $ pip3 install pandas

You can import them at the beginning of your code like this:

import requests

from requests_html import HTMLSession

import pandas as pd

As your programs come to rely on more third-party libraries, you need an easy way to keep track of them and help others set up your scripts without friction.

You can export all the libraries installed in the virtual environment (together with their version numbers) using this command:

(seowork) $ pip3 freeze > requirements.txt

When you share this file with your project, anyone else in the community can install all the required libraries in their own virtual environment with this simple command:

(peer-seowork)$ pip3 install -r requirements.txt

Using Jupyter Notebooks

When doing data analysis, I prefer to work in Jupyter notebooks, as they provide a more convenient environment than the command line. You can inspect the data you are working with and write your programs in an exploratory way.

(seowork) $ pip3 install jupyter

Then you can run the notebook with:

(seowork) $ jupyter notebook

You will get a URL to open in your browser.

Alternatively, you can use Google Colaboratory, which is part of Google Docs and requires no setup.

String Formatting

You will spend a lot of time in your programs preparing strings to feed into different functions. Sometimes you need to combine data from different sources, or convert from one format to another.

Say you want to programmatically fetch Google Analytics data. You can build an API URL using Google Analytics Query Explorer, and replace the parameter values to the API with placeholders in brackets. For example:

api_uri = "https://www.googleapis.com/****ytics/v3/data/ga?ids={gaid}&"\

"start-date={start}&end-date={end}&metrics={metrics}&"\

"dimensions={dimensions}&segment={segment}&access_token={token}&"\

"max-results={max_results}"

{gaid} is the Google Analytics account, i.e., “ga:12345678”

{start} is the start date, i.e., “2017-06-01”

{end} is the end date, i.e., “2018-06-30”

{metrics} is for the list of numeric parameters, i.e., “ga:users”, “ga:newUsers”

{dimensions} is the list of categorical parameters, i.e., “ga:landingPagePath”, “ga:date”

{segment} is the marketing segment. For SEO we want Organic Search, which is “gaid::-5”

{token} is the security access token you get from Google Analytics Query Explorer. It expires after an hour, and you need to run the query again (while authenticated) to get a new one.

{max_results} is the maximum number of results to get back, up to 10,000 rows.

You can define Python variables to hold all these parameters. For example:

gaid = "ga:12345678"

start = "2017-06-01"

end = "2018-06-30"

This allows you to fetch data from multiple websites or date ranges fairly easily.
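The remaining placeholders can be filled in the same way. Here is a minimal sketch using the illustrative values from the parameter list above; the token value is just a placeholder for the one you copy from Google Analytics Query Explorer:

metrics = "ga:users,ga:newUsers"          # numeric parameters
dimensions = "ga:landingPagePath,ga:date" # categorical parameters
segment = "gaid::-5"                      # Organic Search
token = "YOUR_ACCESS_TOKEN"               # placeholder; paste the real token here
max_results = 10000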

Finally, you can combine the parameters with the API URL to produce a valid API request to call.

api_uri = api_uri.format(

gaid=gaid,

start=start,

end=end,

metrics=metrics,

dimensions=dimensions,

segment=segment,

token=token,

max_results=max_results

)

Python will replace each placeholder with its corresponding value from the variables we are passing.

String Encoding

Encoding is another common string manipulation technique. Many APIs require strings formatted in a certain way.

For example, if one of your parameters is an absolute URL, you need to encode it before you insert it into the API string with placeholders.

from urllib import parse

url="https://www.searchenginejournal.com/"

parse.quote(url)

The output will look like this:

'https%3A//www.searchenginejournal.com/'

which would be safe to pass to an API request.

Another example: say you want to generate title tags that include an ampersand (&) or angle brackets (<, >). Those need to be escaped to avoid confusing HTML parsers.

import html

title= "SEO "

html.escape(title)

The output will look like this:

'SEO &lt;News &amp; Tutorials&gt;'

Similarly, if you read data that is encoded, you can revert it back.

escaped_title = html.escape(title)

html.unescape(escaped_title)

The output will read again like the original.

Date Formatting

It is very common to ****yze time series data, and the date and time stamp values can come in many different formats. Python supports converting from dates to strings and back.

For example, after we get results from the 站群 Analytics API, we might want to parse the dates into datetime objects. This will make it easy to sort them or convert them from one string format to another.

from datetime import datetime

dt = datetime.strptime('Jan 5 2018 6:33PM', '%b %d %Y %I:%M%p')

Here %b, %d, etc. are directives supported by strptime (used when reading dates) and strftime (used when writing them). You can find the full reference here.
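Going in the other direction is just as simple; a quick sketch that writes the parsed datetime back out as a differently formatted string:

from datetime import datetime

dt = datetime.strptime('Jan 5 2018 6:33PM', '%b %d %Y %I:%M%p')

# strftime writes the datetime object back out using the same directives
print(dt.strftime('%Y-%m-%d %H:%M'))  # 2018-01-05 18:33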

Making API Requests

Now that we know how to format strings and build correct API requests, let's see how we actually perform such requests.

r = requests.get(api_uri)

We can check the response to make sure we have valid data.

print(r.status_code)

print(r.headers['content-type'])

You should see a 200 status code. The content type of most APIs is generally JSON.

When you are checking redirect chains, you can inspect the response's redirect history to see the full chain.

print(r.history)

In order to get the final URL, use:

print(r.url)
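For example, here is a small sketch that prints every hop in a redirect chain; the URL below is just a hypothetical example of a redirecting address:

import requests

r = requests.get('http://searchenginejournal.com/')  # hypothetical redirecting URL

# r.history holds the intermediate responses, oldest first
for hop in r.history:
    print(hop.status_code, hop.url)

# the final response after all redirects
print(r.status_code, r.url)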

Data Extraction

A big part of your work is procuring the data you need to perform your analysis. The data will be available from different sources and formats. Let's explore the most common.

Reading from JSON

Most APIs will return results in JSON format. We need to parse the data in this format into Python dictionaries. You can use the standard JSON library to do this.

import json

json_response = '{"website_name": "Search Engine Journal", "website_url":"https://www.searchenginejournal.com/"}'

parsed_response = json.loads(json_response)

Now you can easily access any data you need. For example:

print(parsed_response["website_name"])

The output would be:

"Search Engine Journal"

When you use the requests library to perform API calls, you don't need to do this manually. The response object provides a convenient method for it.

parsed_response=r.json()
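If you want to move the Google Analytics results into pandas for analysis, here is a minimal sketch; it assumes the response follows the Core Reporting API v3 shape, with columnHeaders and rows fields:

import pandas as pd

# assumes the v3 Core Reporting API response fields 'columnHeaders' and 'rows'
columns = [header["name"] for header in parsed_response["columnHeaders"]]
rows = parsed_response.get("rows", [])  # 'rows' is missing when there is no data

ga_df = pd.DataFrame(rows, columns=columns)
print(ga_df.head())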

Reading from HTML Pages

Most of the data we need for SEO is going to be on client websites. While there is no shortage of awesome SEO crawlers, it is important to learn how to crawl yourself to do fancy stuff like automatically grouping pages by page types.

from requests_html import HTMLSession

session = HTMLSession()

r = session.get('https://www.searchenginejournal.com/')

You can get all absolute links using this:

print(r.html.absolute_links)

The partial output would look like this:

{'http://jobs.searchenginejournal.com/', 'https://www.searchenginejournal.com/what-i-learned-about-seo-this-year/281301/', …}

Next, let’s fetch some common SEO tags using XPATHs:

Page Title

r.html.xpath('//title/text()')

The output is:

['Search Engine Journal - SEO, Search Marketing News and Tutorials']

Meta Description

r.html.xpath("//meta[@name='description']/@content")

Please note that I changed the style of quotes from single to double; otherwise I would get a coding error.

The output is:

['Search Engine Journal is dedicated to producing the latest search news, the best guides and how-tos for the SEO and marketer community.']

Canonical

r.html.xpath("//link[@rel='canonical']/@href")

The output is:

['https://www.searchenginejournal.com/']

AMP URL

r.html.xpath("//link[@rel='amphtml']/@href")

Search Engine Journal doesn’t have an AMP URL.

Meta Robots

r.html.xpath("//meta[@name='ROBOTS']/@content")

The output is:

['NOODP']

H1s

r.html.xpath("//h1")

The Search Engine Journal home page doesn’t have h1s.

HREFLANG Attribute Values

r.html.xpath("//link[@rel='alternate']/@hreflang")

Search Engine Journal doesn’t have hreflang attributes.

Google Site Verification

r.html.xpath("//meta[@name='google-site-verification']/@content")

The output is:

['NcZlh5TFoRGYNheLXgxcx9gbVcKCsHDFnrRyEUkQswY',

'd0L0giSu_RtW_hg8i6GRzu68N3d4e7nmPlZNA9sCc5s',

'S-Orml3wOAaAplw**19igpEZzRibTtnctYrg46pGTzA']

JavaScript Rendering

If the page you are analyzing needs JavaScript rendering, you only need to add an extra line of code to support this.

from requests_html import HTMLSession

session = HTMLSession()

r = session.get('https://www.searchenginejournal.com/')

r.html.render()

The first time you run render(), it will take a while because Chromium will be downloaded. Rendering JavaScript is much slower than fetching the HTML alone.

Reading from XHR requests

As rendering JavaScript is slow and time consuming, you can use this alternative approach for websites that load JavaScript content using AJAX requests.

[Screenshot: checking the request headers of a JSON file in Chrome Developer Tools, with the path of the JSON file and the x-requested-with header highlighted.]

ajax_request='https://www.searchenginejournal.com/location.json'

r = requests.get(ajax_request)

results=r.json()

You will get the data you need faster as there is no JavaScript rendering or even HTML parsing involved.

Reading from Server Logs

Google Analytics is powerful but doesn't record or present visits from most search engine crawlers. We can get that information directly from server log files.

Let's see how we can analyze server log files using regular expressions in Python. You can check the regex that I'm using here.

import re

log_line='66.249.66.1 - - [06/Jan/2019:14:04:19 +0200] "GET / HTTP/1.1" 200 - "" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'

regex=r'([(\d\.)]+) - - \[(.*?)\] \"(.*?)\" (\d+) - \"(.*?)\" \"(.*?)\"'

groups=re.match(regex, log_line).groups()

print(groups)

The output breaks up each element of the log entry nicely:

('66.249.66.1', '06/Jan/2019:14:04:19 +0200', 'GET / HTTP/1.1', '200', '', 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)')

The user agent string is the sixth group, but lists in Python start at zero, so its index is five.

print(groups[5])

The output is:

'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

You can learn about regular expressions in Python here. Make sure to check the section about greedy vs. non-greedy expressions. I'm using non-greedy when creating the groups.
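Putting this together, here is a sketch that loops over a full log file and counts the requests whose user agent claims to be Googlebot. The file name access.log is hypothetical, and every line is assumed to follow the same format as the example above:

import re

regex = r'([(\d\.)]+) - - \[(.*?)\] \"(.*?)\" (\d+) - \"(.*?)\" \"(.*?)\"'
googlebot_hits = 0

with open('access.log') as log_file:  # hypothetical file name
    for line in log_file:
        match = re.match(regex, line)
        # group index 5 is the user agent string, as shown above
        if match and 'Googlebot' in match.groups()[5]:
            googlebot_hits += 1

print(googlebot_hits)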

Verifying Googlebot

When performing log analysis to understand search bot behavior, it is important to exclude any fake requests, as anyone can pretend to be Googlebot by changing the user agent string.

Google provides a simple approach to do this here. Let's see how to automate it with Python.

import socket

bot_ip = "66.249.66.1"

host = socket.gethostbyaddr(bot_ip)

print(host[0])

You will get

crawl-66-249-66-1.googlebot.com

ip = socket.gethostbyname(host[0])

You get

'66.249.66.1'

which shows that we have a real Googlebot IP, as it matches the original IP we extracted from the server log.
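Here is one way to wrap both lookups into a reusable helper. This is a sketch of the reverse-then-forward DNS check Google describes, with only minimal error handling:

import socket

def is_verified_googlebot(ip):
    # reverse DNS lookup, then confirm the domain and resolve it back to the IP
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not hostname.endswith(('.googlebot.com', '.google.com')):
        return False
    return socket.gethostbyname(hostname) == ip

print(is_verified_googlebot("66.249.66.1"))  # True for a genuine Googlebot IP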

Reading from URLs

An often-overlooked source of information is the actual webpage URLs. Most websites and content management systems include rich information in URLs. Let’s see how we can extract that.

It is possible to break URLs into their components using regular expressions, but it is much simpler and more robust to use the standard urllib library for this.

from urllib.parse import urlparse, parse_qs

url="https://www.searchenginejournal.com/?s=google&search-orderby=relevance&searchfilter=0&search-date-from=January+1%2C+2016&search-date-to=January+7%2C+2019"

parsed_url=urlparse(url)

print(parsed_url)

The output is:

ParseResult(scheme='https', netloc='www.searchenginejournal.com', path='/', params='', query='s=google&search-orderby=relevance&searchfilter=0&search-date-from=January+1%2C+2016&search-date-to=January+7%2C+2019', fragment='')

For example, you can easily get the domain name and directory path using:

print(parsed_url.netloc)

print(parsed_url.path)

This would output what you would expect.

We can further break down the query string to get the URL parameters and their values.

parsed_query=parse_qs(parsed_url.query)

print(parsed_query)

You get a Python dictionary as output.

{'s': ['google'],

'search-date-from': ['January 1, 2016'],

'search-date-to': ['January 7, 2019'],

'search-orderby': ['relevance'],

'searchfilter': ['0']}

We can continue and parse the date strings into Python datetime objects, which would allow you to perform date operations like calculating the number of days in the range. I will leave that as an exercise for you.
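In case you want to check your work on that exercise, here is one possible sketch using the query parameters we just parsed:

from datetime import datetime

# parse_qs returns a list for each parameter, so take the first value
date_from = datetime.strptime(parsed_query['search-date-from'][0], '%B %d, %Y')
date_to = datetime.strptime(parsed_query['search-date-to'][0], '%B %d, %Y')

print((date_to - date_from).days)  # 1102 days in this range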

Another common technique to use in your ****ysis is to break the path portion of the URL by ‘/’ to get the parts. This is simple to do with the split function.

url="https://www.searchenginejournal.com/category/digital-experience/"

parsed_url=urlparse(url)

parsed_url.path.split("/")

The output would be:

['', 'category', 'digital-experience', '']

When you split URL paths this way, you can group a large set of URLs by their top-level directories.

For example, you can find all products and all categories on an ecommerce website when the URL structure allows for this.
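As a minimal sketch, here is how you could count URLs by their first path segment; the URLs below are just illustrative:

from collections import Counter
from urllib.parse import urlparse

# illustrative URLs only
urls = [
    "https://www.searchenginejournal.com/category/digital-experience/",
    "https://www.searchenginejournal.com/category/seo/",
    "https://www.searchenginejournal.com/author/some-author/",
]

top_dirs = Counter(urlparse(url).path.split("/")[1] for url in urls)
print(top_dirs)  # Counter({'category': 2, 'author': 1})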

Performing Basic Analysis

You will spend most of your time getting the data into the right format for analysis. The analysis part is relatively straightforward, provided you know the right questions to ask.

Let’s start by loading a Screaming Frog crawl into a pandas dataframe.

import pandas as pd

df = pd.DataFrame(pd.read_csv('internal_all.csv', header=1, parse_dates=['Last Modified']))

print(df.dtypes)

The output shows all the columns available in the Screaming Frog file, and their Python types. I asked pandas to parse the Last Modified column into a Python datetime object.

Let's perform some example analyses.

Grouping by Top Level Directory

First, let’s create a new column with the type of pages by splitting the path of the URLs and extracting the first directory name.

df['Page Type']=df['Address'].apply(lambda x: urlparse(x).path.split("/")[1])

aggregated_df=df[['Page Type','Word Count']].groupby(['Page Type']).agg('sum')

print(aggregated_df)

After we create the Page Type column, we group all pages by type and total the number of words. The output, in part, looks like this:

seo
seo-guide                              736
seo-internal-links-best-practices     2012
seo-keyword-audit                     2104
seo-risks                             2435
seo-tools                              588
seo-trends                            3448
seo-trends-2019                       2676
seo-value                             1528

Grouping by Status Code

status_code_df=df[['Status Code', 'Address']].groupby(['Status Code']).agg('count')

print(status_code_df)

200    2183
301       6
302       5

Listing Temporary Redirects

temp_redirects_df=df[df['Status Code'] == 302]['Address']

print(temp_redirects_df)

50     https://www.searchenginejournal.com/wp-content...
116    https://www.searchenginejournal.com/wp-content...
128    https://www.searchenginejournal.com/feed
154    https://www.searchenginejournal.com/wp-content...
206    https://www.searchenginejournal.com/wp-content...

Listing Pages with No Content

no_content_df=df[(df['Status Code'] == 200) & (df['Word Count'] == 0 ) ][['Address','Word Count']]

7     https://www.searchenginejournal.com/author/cor...    0
9     https://www.searchenginejournal.com/author/vic...    0
66    https://www.searchenginejournal.com/author/ada...    0
70    https://www.searchenginejournal.com/author/ron...    0

Publishing Activity

Let’s see at what times of the day the most articles are published in SEJ.

lastmod = pd.DatetimeIndex(df['Last Modified'])

writing_activity_df=df.groupby([lastmod.hour])['Address'].count()

0.0     19
1.0     55
2.0     29
10.0    10
11.0    54
18.0     7
19.0    31
20.0     9
21.0     1
22.0     3

It is interesting to see that there are not many changes during regular working hours.

We can plot this directly from pandas.

writing_activity_df.plot.bar()

[Bar plot: number of articles published on Search Engine Journal by hour of day, generated from the pandas dataframe.]

Saving & Exporting Results

Now we get to the easy part – saving the results of our hard work.

Saving to Excel

writer = pd.ExcelWriter('no_content.xlsx')

no_content_df.to_excel(writer,'Results')

writer.save()

Saving to CSV

temp_redirects_df.to_csv('temporary_redirects.csv')

Additional Resources to Learn More

We barely scratched the surface of what is possible when you add Python scripting to your day-to-day SEO work. Here are some links to explore further.

The official Python tutorial

Python for data science cheat sheet

An SEO guide to XPATH

10 minutes to pandas

Introduction to pandas for Excel super users

Pandas cheat sheet

More Resources:

How to Use Regex for SEO & Website Data Extraction

Understanding JavaScript Fundamentals: Your Cheat Sheet

A Complete Guide to SEO: What You Need to Know in 2019

Image Credits

Screenshot taken by author, January 2019

Bar plot generated by author, January 2019
