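Task description
Task topic: paper count statistics, i.e. counting the number of computer science papers in each subfield for 2019;
Dataset: https://www.kaggle.com/Cornell-University/arxiv
1. Environment setup: Google Colab + Kaggle dataset
Run the following commands in Colab to download the arXiv dataset (upload your Kaggle API token kaggle.json to /content first)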
!pip install kaggle
!mkdir -p ~/.kaggle
!cp /content/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle config set -n path -v /content
!kaggle datasets download -d Cornell-University/arxiv
2. Paper statistics
(1) Unzip the file
import zipfile
datapath = '/content/datasets/Cornell-University/arxiv/arxiv.zip'
datazip = zipfile.ZipFile(datapath)
print(datazip.namelist())
print(datazip.filename)
datazip.extractall()
(2) Import packages
import seaborn as sns
from bs4 import BeautifulSoup
import re
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt
(3) Read the JSON data
# read data: the file holds one JSON object per line
data = []
with open('/content/arxiv-metadata-oai-snapshot.json', 'r') as f:
    for line in f:
        data.append(json.loads(line))
data = pd.DataFrame(data)
data.shape
(1796911, 14)
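Loading all 1.8 million records with every field can be heavy on Colab memory. As a minimal sketch, assuming only a few fields are needed later, the reader below keeps selected keys and can stop early. The function name `read_json_lines` and its `columns` and `limit` parameters are hypothetical helpers, and the sample rows are made up for illustration.

```python
import json
import os
import tempfile


def read_json_lines(path, columns=None, limit=None):
    """Read a JSON Lines file (one JSON object per line) into a list of dicts.

    `columns` and `limit` are hypothetical convenience parameters:
    keep only the listed keys, and stop after `limit` records.
    """
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            record = json.loads(line)
            if columns is not None:
                record = {key: record.get(key) for key in columns}
            records.append(record)
    return records


# Tiny stand-in for arxiv-metadata-oai-snapshot.json (made-up rows).
sample_rows = [
    {"id": "0001", "categories": "cs.AI cs.LG", "update_date": "2019-05-01", "title": "A"},
    {"id": "0002", "categories": "hep-ph", "update_date": "2008-11-26", "title": "B"},
    {"id": "0003", "categories": "math.CO", "update_date": "2020-01-15", "title": "C"},
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    for row in sample_rows:
        tmp.write(json.dumps(row) + "\n")
    sample_path = tmp.name

subset = read_json_lines(sample_path, columns=["id", "categories", "update_date"], limit=2)
os.remove(sample_path)
print(subset)
```

The returned list can be passed to `pd.DataFrame(...)` in the same way as `data` above.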
# Inspect the data
data.head(1)
id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | update_date | authors_parsed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0704.0001 | Pavel Nadolsky | C. Bal'azs, E. L. Berger, P. M. Nadolsky, C.-... | Calculation of prompt diphoton production cros... | 37 pages, 15 figures; published version | Phys.Rev.D76:013009,2007 | 10.1103/PhysRevD.76.013009 | ANL-HEP-PR-07-12 | hep-ph | None | A fully differential calculation in perturba... | [{'version': 'v1', 'created': 'Mon, 2 Apr 2007... | 2008-11-26 | [[Balázs, C., ], [Berger, E. L., ], [Nadolsky,... |
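Column descriptions
No. | Column | Description |
---|---|---|
0 | id | arXiv ID, which can be used to access the paper; |
1 | submitter | who submitted the paper; |
2 | authors | authors of the paper; |
3 | title | title of the paper; |
4 | comments | additional information such as page and figure counts; |
5 | journal-ref | information about the journal the paper was published in; |
6 | doi | Digital Object Identifier, https://www.doi.org; |
7 | report-no | report number; |
8 | categories | categories or tags of the paper in the arXiv system; |
9 | license | license of the paper; |
10 | abstract | abstract of the paper; |
11 | versions | version history of the paper; |
12 | update_date | date the record was last updated; |
13 | authors_parsed | parsed author information; |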
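(4) Data preprocessing
'''
count: number of non-null elements in the column
unique: number of distinct values in the column
top: the most frequent value in the column
freq: number of times the most frequent value occurs
'''
# Inspect categories
data['categories'].describe()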
'''
output:
count 1796911
unique 62055
top astro-ph
freq 86914
Name: categories, dtype: object
'''
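There are 1,796,911 records and 62,055 distinct category strings (each string may combine several categories); the most frequent one is astro-ph, which occurs 86,914 times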
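To make the four `describe()` fields concrete, here is a small sketch on a made-up Series. It illustrates that `unique` counts whole strings, so a multi-category string such as "cs.AI cs.LG" counts as one value of its own.

```python
import pandas as pd

# Made-up values; each entry stands for one paper's full `categories` string.
s = pd.Series(["astro-ph", "cs.AI cs.LG", "astro-ph", "hep-ph", "astro-ph"])

summary = s.describe()
print(summary)  # shows count, unique, top and freq for the column
```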
# How many distinct categories appear in the dataset
unique_categories = set([i for l in [x.split(' ') for x in data["categories"]] for i in l])
len(unique_categories)
unique_categories
{'ao-sci',
'astro-ph', 'astro-ph.CO', 'astro-ph.EP', 'astro-ph.GA', 'astro-ph.HE', 'astro-ph.IM', 'astro-ph.SR',
'atom-ph',
'bayes-an',
'chao-dyn',
'chem-ph',
'cmp-lg',
'comp-gas',
'cond-mat', 'cond-mat.dis-nn', 'cond-mat.mes-hall', 'cond-mat.mtrl-sci', 'cond-mat.other', 'cond-mat.quant-gas',
'cond-mat.soft', 'cond-mat.stat-mech', 'cond-mat.str-el', 'cond-mat.supr-con',
'cs.AI', 'cs.AR','cs.CC', 'cs.CE', 'cs.CG', 'cs.CL','cs.CR', 'cs.CV', 'cs.CY', 'cs.DB','cs.DC', 'cs.DL','cs.DM','cs.DS','cs.ET', 'cs.FL', 'cs.GL','cs.GR','cs.GT','cs.HC', 'cs.IR', 'cs.IT', 'cs.LG', 'cs.LO', 'cs.MA', 'cs.MM', 'cs.MS', 'cs.NA', 'cs.NE', 'cs.NI', 'cs.OH', 'cs.OS','cs.PF', 'cs.PL','cs.RO','cs.SC','cs.SD','cs.SE','cs.SI', 'cs.SY',
'dg-ga',
'econ.EM','econ.GN', 'econ.TH',
'eess.AS', 'eess.IV', 'eess.SP', 'eess.SY',
'funct-an',
'gr-qc',
'hep-ex',
'hep-lat',
'hep-ph',
'hep-th',
'math-ph',
'math.AC', 'math.AG', 'math.AP', 'math.AT', 'math.CA', 'math.CO', 'math.CT', 'math.CV', 'math.DG','math.DS', 'math.FA', 'math.GM', 'math.GN', 'math.GR', 'math.GT', 'math.HO', 'math.IT', 'math.KT', 'math.LO', 'math.MG', 'math.MP', 'math.NA', 'math.NT', 'math.OA', 'math.OC', 'math.PR', 'math.QA', 'math.RA', 'math.RT', 'math.SG', 'math.SP', 'math.ST',
'mtrl-th',
'nlin.AO','nlin.CD', 'nlin.CG', 'nlin.PS', 'nlin.SI',
'nucl-ex','nucl-th',
'patt-sol',
'physics.acc-ph', 'physics.ao-ph', 'physics.app-ph', 'physics.atm-clus', 'physics.atom-ph', 'physics.bio-ph', 'physics.chem-ph', 'physics.class-ph', 'physics.comp-ph', 'physics.data-an', 'physics.ed-ph', 'physics.flu-dyn', 'physics.gen-ph', 'physics.geo-ph', 'physics.hist-ph', 'physics.ins-det', 'physics.med-ph', 'physics.optics', 'physics.plasm-ph', 'physics.pop-ph', 'physics.soc-ph','physics.space-ph',
'plasm-ph',
'q-alg',
'q-bio','q-bio.BM','q-bio.CB', 'q-bio.GN','q-bio.MN', 'q-bio.NC','q-bio.OT','q-bio.PE', 'q-bio.QM','q-bio.SC', 'q-bio.TO',
'q-fin.CP','q-fin.EC','q-fin.GN','q-fin.MF','q-fin.PM', 'q-fin.PR', 'q-fin.RM', 'q-fin.ST', 'q-fin.TR',
'quant-ph',
'solv-int',
'stat.AP', 'stat.CO','stat.ME','stat.ML','stat.OT', 'stat.TH',
'supr-con'}
print(len(unique_categories))
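The nested list comprehension above can also be written as a set comprehension, and a `Counter` additionally shows how often each individual category occurs. This is a sketch on made-up category strings, not output from the real dataset.

```python
from collections import Counter

# Made-up stand-ins for data["categories"]; categories are space-separated.
category_strings = ["cs.AI cs.LG", "astro-ph", "cs.LG stat.ML", "astro-ph"]

# Distinct individual categories across all papers.
unique_categories = {code for s in category_strings for code in s.split(" ")}

# Number of papers that list each individual category.
category_counts = Counter(code for s in category_strings for code in s.split(" "))

print(sorted(unique_categories))
print(category_counts)
```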
# Restrict the analysis to papers from 2019 onward
data['year'] = pd.to_datetime(data["update_date"]).dt.year  # convert update_date from str to datetime and extract the year
del data["update_date"]
data = data[data["year"] >= 2019]
data.reset_index(drop = True, inplace = True)  # renumber the rows
data
395123 rows × 14 columns
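The filter keeps every year from 2019 onward, while the task statement targets 2019. As a small sketch on made-up dates, the same pattern can select either range. Note that the year comes from `update_date`, the date the record was last updated.

```python
import pandas as pd

# Made-up rows standing in for the arXiv metadata.
df = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "update_date": ["2008-11-26", "2019-03-02", "2020-07-19", "2018-12-31"],
})

# Parse the date strings and extract the calendar year.
df["year"] = pd.to_datetime(df["update_date"]).dt.year

# 2019 onward, with a fresh 0..n-1 index.
recent = df[df["year"] >= 2019].reset_index(drop=True)

# Exactly 2019.
only_2019 = df[df["year"] == 2019].reset_index(drop=True)

print(recent)
print(only_2019)
```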
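# Scrape the arXiv category taxonomy, used below to pick out computer science papers from 2019 onward
website_url = requests.get('https://arxiv.org/category_taxonomy').text  # fetch the page text
soup = BeautifulSoup(website_url, 'lxml')  # parse the scraped page with the faster lxml parser
print(website_url)
root = soup.find('div',{'id':'category_taxonomy_list'})
tags = root.find_all(["h2","h3","h4","p"],recursive = True)  # collect the heading and paragraph tags
print(tags)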
# Initialize the str and list variables
level_1_name = ""
level_2_name = ""
level_2_code = ""
level_1_names = []
level_2_codes = []
level_2_names = []
level_3_codes = []
level_3_names = []
level_3_notes = []
# Walk the tags in page order and record the hierarchy
for t in tags:
    if t.name == "h2":
        level_1_name = t.text
        level_2_code = t.text
        level_2_name = t.text
    elif t.name == "h3":
        raw = t.text
        # Regex notes: '.' matches any single character; '*' repeats the preceding element zero or more times.
        # "\(" matches a literal "(".
        # The first (.*) captures the text before the parentheses; \((.*)\) captures the text inside them.
        level_2_code = re.sub(r"(.*)\((.*)\)",r"\2",raw)  # text inside the parentheses
        level_2_name = re.sub(r"(.*)\((.*)\)",r"\1",raw)  # text before the parentheses
    elif t.name == "h4":
        raw = t.text
        level_3_code = re.sub(r"(.*) \((.*)\)",r"\1", raw)
        level_3_name = re.sub(r"(.*) \((.*)\)",r"\2", raw)
    elif t.name == "p":
        notes = t.text
        level_1_names.append(level_1_name)
        level_2_names.append(level_2_name)
        level_2_codes.append(level_2_code)
        level_3_names.append(level_3_name)
        level_3_codes.append(level_3_code)
        level_3_notes.append(notes)
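The two `re.sub` patterns can be checked in isolation. The sketch below applies them to sample heading strings; "Astrophysics (astro-ph)" is an assumed example of an h3 heading, and the helper names `split_h3` and `split_h4` are hypothetical. When a heading has no parentheses, the pattern does not match and `re.sub` returns the text unchanged.

```python
import re


def split_h3(raw):
    """Split an archive heading such as 'Name (code)' into (name, code)."""
    code = re.sub(r"(.*)\((.*)\)", r"\2", raw)          # text inside the parentheses
    name = re.sub(r"(.*)\((.*)\)", r"\1", raw).strip()  # text before them, trailing space removed
    return name, code


def split_h4(raw):
    """Split a category heading such as 'code (Name)' into (name, code)."""
    code = re.sub(r"(.*) \((.*)\)", r"\1", raw)  # text before the parentheses
    name = re.sub(r"(.*) \((.*)\)", r"\2", raw)  # text inside the parentheses
    return name, code


print(split_h3("Astrophysics (astro-ph)"))
print(split_h4("cs.AI (Artificial Intelligence)"))
print(split_h3("Computer Science"))  # no parentheses, so both parts equal the input
```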
Build a DataFrame from the information above
df_taxonomy = pd.DataFrame({
    'group_name':level_1_names,
    'archive_name':level_2_names,
    'archive_id':level_2_codes,
    'category_name':level_3_names,
    'categories':level_3_codes,
    'category_description':level_3_notes
})
df_taxonomy.groupby(["group_name", "archive_name"])
df_taxonomy
No. | group_name | archive_name | archive_id | category_name | categories | category_description |
---|---|---|---|---|---|---|
0 | Computer Science | Computer Science | Computer Science | Artificial Intelligence | cs.AI | Covers all areas of AI except Vision, Robotics... |
1 | Computer Science | Computer Science | Computer Science | Hardware Architecture | cs.AR | Covers systems organization and hardware archi... |
2 | Computer Science | Computer Science | Computer Science | Computational Complexity | cs.CC | Covers models of computation, complexity class... |
3 | Computer Science | Computer Science | Computer Science | Computational Engineering, Finance, and Science | cs.CE | Covers applications of computer science to the... |
4 | Computer Science | Computer Science | Computer Science | Computational Geometry | cs.CG | Roughly includes material in ACM Subject Class... |
... | ... | ... | ... | ... | ... | ... |
153 | Statistics | Statistics | Statistics | Other Statistics | stat.OT | Work in statistics that does not fit into the ... |
154 | Statistics | Statistics | Statistics | Statistics Theory | stat.TH | stat.TH is an alias for math.ST. Asymptotics, ... |
155 rows × 6 columns
Data visualization
_df = (
    data.merge(df_taxonomy, on="categories", how="left")
    .drop_duplicates(["id", "group_name"])
    .groupby("group_name")
    .agg({"id": "count"})
    .sort_values(by="id", ascending=False)
    .reset_index()
)
_df
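The merge joins on the full `categories` string, so a paper tagged with several categories (for example "cs.AI cs.LG") finds no matching taxonomy row and drops out of the group counts. The sketch below, on made-up rows, shows the effect with `indicator=True` and one possible alternative: joining on the first listed category. Treating the first category as the primary one is an assumption, and `primary_category` is a hypothetical column name; this is not the approach used in the tutorial.

```python
import pandas as pd

# Made-up papers and a made-up slice of the taxonomy table.
papers = pd.DataFrame({
    "id": ["p1", "p2", "p3"],
    "categories": ["cs.AI", "cs.AI cs.LG", "math.CO"],
})
taxonomy = pd.DataFrame({
    "categories": ["cs.AI", "cs.LG", "math.CO"],
    "group_name": ["Computer Science", "Computer Science", "Mathematics"],
})

# Exact-string join: the multi-category paper has no match.
exact = papers.merge(taxonomy, on="categories", how="left", indicator=True)
unmatched = exact.loc[exact["_merge"] == "left_only", "id"].tolist()

# Alternative (assumption): join on the first listed category only.
papers["primary_category"] = papers["categories"].str.split(" ").str[0]
by_primary = papers.merge(
    taxonomy.rename(columns={"categories": "primary_category"}),
    on="primary_category",
    how="left",
)
group_counts = by_primary.groupby("group_name")["id"].count().sort_values(ascending=False)

print(unmatched)
print(group_counts)
```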
# Visualize the result as a pie chart
fig = plt.figure(figsize = (15,12))
# explode: offset of each wedge from the center, one value per wedge
explode = (0,0,0,0.2,0.3,0.3,0.2,0.1)
plt.pie(_df["id"], labels = _df["group_name"], autopct="%1.2f%%", startangle = 160, explode=explode)
plt.tight_layout()
plt.show()
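`plt.pie` expects `explode` to have exactly one entry per wedge, so a hard-coded tuple of eight values breaks if the number of groups changes. A minimal sketch of deriving it from the counts instead is shown below; the helper name `make_explode` and its `threshold` and `offset` parameters are hypothetical, and the counts are made up.

```python
def make_explode(values, threshold=0.05, offset=0.2):
    """Return one explode offset per value: small slices are pulled outward.

    A slice whose share of the total is below `threshold` gets `offset`;
    all other slices stay at the center (0.0).
    """
    values = list(values)
    total = sum(values)
    if total <= 0:
        raise ValueError("values must sum to a positive number")
    return tuple(offset if v / total < threshold else 0.0 for v in values)


# Made-up group counts, largest first.
counts = [500, 300, 150, 30, 20]
explode = make_explode(counts)
print(explode)
```

With the frame above, the result could be passed as `explode=make_explode(_df["id"])` in the `plt.pie` call.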
Paper counts for 2019 and 2020
group_name="Computer Science"
cats = data.merge(df_taxonomy, on="categories").query("group_name == @group_name")
cats.groupby(["year","category_name"]).count().reset_index().pivot(index="category_name", columns="year",values="id")
category_name | 2019 | 2020 |
---|---|---|
Artificial Intelligence | 558 | 757 |
Computation and Language | 2153 | 2906 |
Computational Complexity | 131 | 188 |
Computational Engineering, Finance, and Science | 108 | 205 |
Computational Geometry | 199 | 216 |
Computer Science and Game Theory | 281 | 323 |
Computer Vision and Pattern Recognition | 5559 | 6517 |
Computers and Society | 346 | 564 |
Cryptography and Security | 1067 | 1238 |
Data Structures and Algorithms | 711 | 902 |
Databases | 282 | 342 |
Digital Libraries | 125 | 157 |
Discrete Mathematics | 84 | 81 |
Distributed, Parallel, and Cluster Computing | 715 | 774 |
Emerging Technologies | 101 | 84 |
Formal Languages and Automata Theory | 152 | 137 |
General Literature | 5 | 5 |
Graphics | 116 | 151 |
Hardware Architecture | 95 | 159 |
Human-Computer Interaction | 420 | 580 |
Information Retrieval | 245 | 331 |
Logic in Computer Science | 470 | 504 |
Machine Learning | 177 | 538 |
Mathematical Software | 27 | 45 |
Multiagent Systems | 85 | 90 |
Multimedia | 76 | 66 |
Networking and Internet Architecture | 864 | 783 |
Neural and Evolutionary Computing | 235 | 279 |
Numerical Analysis | 40 | 11 |
Operating Systems | 36 | 33 |
Other Computer Science | 67 | 69 |
Performance | 45 | 51 |
Programming Languages | 268 | 294 |
Robotics | 917 | 1298 |
Social and Information Networks | 202 | 325 |
Software Engineering | 659 | 804 |
Sound | 7 | 4 |
Symbolic Computation | 44 | 36 |
Systems and Control | 415 | 133 |
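The group, count and pivot steps behind this table can be tried on a toy frame. The sketch below uses made-up rows; it selects the `id` column before counting, and adds a hypothetical `growth` column for the year-over-year difference.

```python
import pandas as pd

# Made-up papers, already joined with their category names.
cats = pd.DataFrame({
    "id": ["p1", "p2", "p3", "p4", "p5"],
    "year": [2019, 2019, 2020, 2020, 2020],
    "category_name": ["Robotics", "Databases", "Robotics", "Robotics", "Databases"],
})

# Count papers per (year, category), then spread the years into columns.
counts = (
    cats.groupby(["year", "category_name"])["id"]
    .count()
    .reset_index()
    .pivot(index="category_name", columns="year", values="id")
)

# Hypothetical extra column: change from 2019 to 2020.
counts["growth"] = counts[2020] - counts[2019]
print(counts)
```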