#2.1.1 Getting Started With NumPy

1. Introducing NumPy

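In the previous two courses, we used nested Python lists to represent datasets. Python lists offer a few advantages when representing data:

Lists can contain mixed types
Lists can shrink and grow dynamically

Using Python lists to represent and work with data also has a few key disadvantages:

To support their flexibility, lists tend to consume a lot of memory
They struggle to work with medium and larger sized datasets

While there are many different ways to classify programming languages, an important one for performance is the distinction between low-level and high-level languages. Python is a high-level programming language that lets us quickly write, prototype, and test our logic. The C programming language, on the other hand, is a low-level language that is highly performant but has a much slower workflow.

NumPy is a library that combines the flexibility and ease of use of Python with the speed of C. In this mission, we'll start by getting familiar with NumPy's core data structure, then build up to using NumPy to work with the dataset world_alcohol.csv, which contains data on per-capita alcohol consumption for each country.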

2. Creating arrays

Learn

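The core data structure in NumPy is the ndarray object, which stands for N-dimensional array. An array is a collection of values, similar to a list. N-dimensional refers to the number of indices needed to select individual values from the object.

A 1-dimensional array is often referred to as a vector, while a 2-dimensional array is often referred to as a matrix. Both terms are borrowed from a branch of mathematics called linear algebra. They're also used frequently in data science literature, so we'll use them throughout this course.
To use NumPy, we first need to import it into our environment. NumPy is commonly imported using the alias np: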

import numpy as np

We can construct arrays directly from lists using the numpy.array() function. To construct a vector, we pass in a single list (with no nesting):

vector = np.array([5, 10, 15, 20])

The numpy.array() function also accepts a list of lists (a nested list), which we use to create a matrix:

matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])
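As a quick check of the "number of indices" idea, the sketch below rebuilds the same vector and matrix and inspects the ndim attribute, which reports how many indices are needed to reach a single value. It is only an illustration of the definition above, not part of the exercise.

```python
import numpy as np

# A flat list becomes a 1-dimensional array (a vector).
vector = np.array([5, 10, 15, 20])

# A list of lists becomes a 2-dimensional array (a matrix).
matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])

# ndim is the number of indices needed to select one value.
print(vector.ndim)   # 1
print(matrix.ndim)   # 2

# One index for the vector, two for the matrix.
print(vector[1])     # 10
print(matrix[1, 2])  # 30
```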

Instructions

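Create a vector from the list [10, 20, 30].
- Assign the result to the variable vector.
Create a matrix from the list of lists [[5, 10, 15], [20, 25, 30], [35, 40, 45]].
- Assign the result to the variable matrix.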

import numpy as np
vector = np.array([10, 20, 30])
matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])
print(matrix)
[[ 5 10 15]
 [20 25 30]
 [35 40 45]]

3. Array shape

Learn

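Arrays have a certain number of elements. The array below has 5 elements:

[Image: a 1-dimensional array with 5 elements]

Matrices use rows and columns instead, which matches how we've thought about datasets in the previous courses. The matrix below has 3 rows and 5 columns, and is commonly referred to as a 3x5 matrix:

[Image: a matrix with 3 rows and 5 columns]

It's often useful to know how many elements an array contains. We can use the ndarray.shape property to find the length of an array along each of its dimensions.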

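vector = np.array([1, 2, 3, 4])
print(vector.shape)

The code above would result in the tuple (4,). This tuple indicates that the array vector has one dimension, with length 4, which matches our intuition that vector has 4 elements.
For matrices, the shape property contains a tuple with 2 elements.

matrix = np.array([[5, 10, 15], [20, 25, 30]])
print(matrix.shape)

The code above would result in the tuple (2, 3), indicating that matrix has 2 rows and 3 columns.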
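To connect shape back to the number of elements, here is a minimal sketch that also prints the size attribute (the total element count) and unpacks a matrix shape into rows and columns. The names n_rows and n_cols are hypothetical and chosen only for this example.

```python
import numpy as np

vector = np.array([1, 2, 3, 4])
matrix = np.array([[5, 10, 15], [20, 25, 30]])

# shape has one entry per dimension.
print(vector.shape)  # (4,)
print(matrix.shape)  # (2, 3)

# size is the total number of elements: the product of the shape entries.
print(vector.size)   # 4
print(matrix.size)   # 6

# Unpack the shape of a matrix into rows and columns.
n_rows, n_cols = matrix.shape
print(n_rows, n_cols)  # 2 3
```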

Instructions

Import numpy under the alias np.
Assign the shape of vector to vector_shape.
Assign the shape of matrix to matrix_shape.
Use the print() function to display vector_shape and matrix_shape.

import numpy as np
vector = np.array([10, 20, 30])
matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])
vector_shape = vector.shape
matrix_shape = matrix.shape
print(vector_shape, matrix_shape)
(3,) (3, 3)

4. Using NumPy

Learn

We can read in datasets using the numpy.genfromtxt() function. Our dataset, world_alcohol.csv, is a comma-separated values file. We can specify the delimiter using the delimiter parameter:

import numpy
nfl = numpy.genfromtxt("data.csv", delimiter=",")

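The code above reads the file named data.csv into a NumPy array. NumPy arrays are represented using the numpy.ndarray class. We'll refer to ndarray objects as NumPy arrays throughout our material.
Here are the first few rows of the dataset we'll be working with:

[Image: the first few rows of world_alcohol.csv]

Each row specifies how many liters of alcohol, of a given beverage type, the average citizen of a country drank in a given year. The first row shows how many liters of wine the average person in Viet Nam drank in 1986.
Here's what each column represents:

Year - The year the data in the row is for.
WHO Region - The region in which the country is located.
Country - The country the data is for.
Beverage Types - The type of beverage the data is for.
Display Value - The average number of liters of the beverage type that a citizen of the country drank in the given year.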

Instructions

Use the numpy.genfromtxt() function to read "world_alcohol.csv" into a NumPy array named world_alcohol.
Use the type() and print() functions to display the type of world_alcohol.

world_alcohol = np.genfromtxt('world_alcohol.csv', delimiter=',')
print(type(world_alcohol))
print(world_alcohol)
<class 'numpy.ndarray'>
[[             nan              nan              nan              nan
               nan]
 [  1.98600000e+03              nan              nan              nan
    0.00000000e+00]
 [  1.98600000e+03              nan              nan              nan
    5.00000000e-01]
 ..., 
 [  1.98600000e+03              nan              nan              nan
    2.54000000e+00]
 [  1.98700000e+03              nan              nan              nan
    0.00000000e+00]
 [  1.98600000e+03              nan              nan              nan
    5.15000000e+00]]

5. Data types

Learn

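Each value in a NumPy array has to have the same data type. NumPy data types are similar to Python data types, but have slight differences. The NumPy documentation has a full list of data types. Here are some of the common ones: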

bool: Boolean.
- Can be True or False.
int: Integer values.
- Can be int16, int32, or int64. The suffix 16, 32, or 64 indicates the number of bits.
float: Floating point values.
- Can be float16, float32, or float64. The suffix 16, 32, or 64 indicates the number of bits used to store each value; more bits give a wider range and higher precision.
string: String values.
- Can be string or unicode, which are two different ways a computer can store text.

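NumPy will automatically figure out an appropriate data type when reading in data or converting lists to arrays. You can check the data type of a NumPy array using the dtype property.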

numbers = np.array([1, 2, 3, 4])
numbers.dtype

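Because numbers only contains integers, its data type is int64 on most platforms (the default integer width can differ by platform and NumPy version).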
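The sketch below shows how this automatic choice plays out for a few small lists. The array names (ints, mixed, flags, words) are hypothetical, and the integer check uses dtype.kind rather than a specific width because the default integer size is platform-dependent.

```python
import numpy as np

# All integers: NumPy picks an integer dtype.
ints = np.array([1, 2, 3, 4])

# A single float is enough to make the whole array floating point.
mixed = np.array([1, 2, 3.5])

# Booleans and strings get their own dtypes.
flags = np.array([True, False])
words = np.array(["wine", "beer"])

print(ints.dtype.kind)  # i (a signed integer type)
print(mixed.dtype)      # float64
print(flags.dtype)      # bool
print(words.dtype)      # <U4 on little-endian machines: strings of up to 4 characters
```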

Instructions

Assign the data type of world_alcohol to the variable world_alcohol_dtype.
Use the print() function to display world_alcohol_dtype.

world_alcohol_dtype = world_alcohol.dtype
print(world_alcohol_dtype)
float64

6. Inspecting the data

Here's how NumPy represents the first few rows of the dataset:

array([[             nan,              nan,              nan,              nan,              nan],
       [  1.98600000e+03,              nan,              nan,              nan,   0.00000000e+00],
       [  1.98600000e+03,              nan,              nan,              nan,   5.00000000e-01]])

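There are a few concepts we haven't introduced yet that we'll dig into:

Many items in world_alcohol are nan, including the entire first row. nan stands for "not a number" and is a special floating-point value used to represent missing values;
Some of the numbers are written like 1.98600000e+03.

The data type of world_alcohol is float. Because all of the values in a NumPy array have to have the same data type, NumPy attempted to convert every column to floats when reading the file. By default, numpy.genfromtxt() reads every value as a float; it only tries to guess a data type for each column when we pass dtype=None.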

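In this case, WHO Region, Country, and Beverage Types are actually strings, and can't be converted to floats. When NumPy can't convert a value to a numeric data type like float or integer, it uses a special nan value that stands for "not a number". Some tools distinguish this from an na ("not available") value for data that is absent altogether; in a float array read by genfromtxt(), an empty field also shows up as nan by default. Both are kinds of missing data. We'll learn more about how to deal with missing data in later missions.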

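The entire first row of world_alcohol.csv is a header row that contains the name of each column. It isn't actually part of the data, and consists entirely of strings. Because the strings can't be converted to floats, NumPy uses nan values to represent them.

If you haven't seen scientific notation before, you might not recognize numbers like 1.98600000e+03. Scientific notation is a way to condense how very large or very precise numbers are displayed. We can represent 100 in scientific notation as 1e+02.

In this case, 1.98600000e+03 is actually longer than 1986, but NumPy displays the values in scientific notation here because the array mixes numbers of very different magnitudes; the format also accommodates larger or more precise numbers.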
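The behavior described above can be reproduced without the real file. The sketch below reads a tiny in-memory stand-in for world_alcohol.csv (the header plus the first two rows shown elsewhere in this post), passed through io.StringIO; the names csv_text and as_floats are hypothetical.

```python
import io
import numpy as np

# A small in-memory stand-in for world_alcohol.csv (header plus two rows).
csv_text = (
    "Year,WHO Region,Country,Beverage Types,Display Value\n"
    "1986,Western Pacific,Viet Nam,Wine,0\n"
    "1986,Americas,Uruguay,Other,0.5\n"
)

# With the default float dtype, anything that is not a number becomes nan.
as_floats = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

print(as_floats.dtype)            # float64
print(np.isnan(as_floats[0]))     # the header row: every entry is nan
print(as_floats[1, 0])            # 1986.0
print(np.isnan(as_floats[1, 1]))  # True, because "Western Pacific" is not a number

# Scientific notation is only a display format: 1.986e+03 is 1986.
print(1.986e+03 == 1986)          # True
```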

7. Reading in the data correctly

Learn

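When reading in data using the numpy.genfromtxt() function, we can use parameters to customize how we want the data to be read in. While we're at it, we can also specify that we want to skip the header row of world_alcohol.csv.

To specify the data type for the entire NumPy array, we use the keyword argument dtype and set it to "U75". This specifies that we want to read in each value as a unicode string of up to 75 characters. We'll learn more about unicode and bytes later on, but for now it's enough to know that this will read in our data properly.
To skip the header when reading in the data, we use the skip_header parameter. The skip_header parameter accepts an integer value, specifying the number of lines from the top of the file we want NumPy to ignore.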
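As a minimal sketch of these two parameters, the example below reads a small in-memory stand-in for world_alcohol.csv through io.StringIO instead of the real file. The names csv_text and rows are hypothetical, and the two data rows are copied from the output shown in this post.

```python
import io
import numpy as np

# A small in-memory stand-in for world_alcohol.csv (header plus two rows).
csv_text = (
    "Year,WHO Region,Country,Beverage Types,Display Value\n"
    "1986,Western Pacific,Viet Nam,Wine,0\n"
    "1986,Americas,Uruguay,Other,0.5\n"
)

# dtype="U75" keeps every field as text; skip_header=1 drops the header line.
rows = np.genfromtxt(
    io.StringIO(csv_text), delimiter=",", dtype="U75", skip_header=1
)

print(rows.shape)  # (2, 5)
print(rows[0, 2])  # Viet Nam
print(rows[1, 4])  # 0.5 (still a string, not a float)
```

Every value, including the numeric columns, is now a string, so the years and amounts would need converting before doing arithmetic on them.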

Instructions

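When reading in world_alcohol.csv using numpy.genfromtxt():
- Use the "U75" data type
- Skip the first line in the dataset
- Use the comma delimiter.
Assign the result to world_alcohol.
- Use the print() function to display world_alcohol.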

world_alcohol = np.genfromtxt('world_alcohol.csv', delimiter=',', dtype='U75', skip_header=1)
print(type(world_alcohol))
print(world_alcohol)
<class 'numpy.ndarray'>
[['1986' 'Western Pacific' 'Viet Nam' 'Wine' '0']
 ['1986' 'Americas' 'Uruguay' 'Other' '0.5']
 ['1985' 'Africa' "Cte d'Ivoire" 'Wine' '1.62']
 ..., 
 ['1986' 'Europe' 'Switzerland' 'Spirits' '2.54']
 ['1987' 'Western Pacific' 'Papua New Guinea' 'Other' '0']
 ['1986' 'Africa' 'Swaziland' 'Other' '5.15']]

8. Indexing arrays

Learn

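Now that the data is in the right format, let's learn how to explore it. We can index NumPy arrays much like we index regular Python lists. Here's how we'd index a NumPy vector: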

vector = np.array([5, 10, 15, 20])
print(vector[0])

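The code above would print the first element of vector, 5.
Indexing matrices is similar to indexing lists of lists. Here's a refresher on indexing lists of lists:

list_of_lists = [[5, 10, 15], [20, 25, 30]]
first_item = list_of_lists[0]
first_item[2]

We can also condense the notation like this: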

list_of_lists[0][2]

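We can index matrices in a similar way, but we place both indices inside the square brackets. The first index specifies which row the data comes from, and the second index specifies which column the data comes from: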

>> matrix = np.array([
                        [5, 10, 15], 
                        [20, 25, 30]
                     ])
>> matrix[1,2]
30

In the code above, we pass two indices into the square brackets when we index matrix.
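For a side-by-side comparison of the two notations, the sketch below indexes the same data once as a list of lists and once as a NumPy matrix. It is only illustrative; the values match the examples above.

```python
import numpy as np

list_of_lists = [[5, 10, 15], [20, 25, 30]]
matrix = np.array(list_of_lists)

# Lists need one pair of brackets per level of nesting.
print(list_of_lists[1][2])  # 30

# Arrays accept both indices, row then column, in a single pair of brackets.
print(matrix[1, 2])         # 30

# Chained indexing also works on arrays, but it first selects the whole row.
print(matrix[1][2])         # 30

# The comma form is specific to arrays; a plain list rejects it.
try:
    list_of_lists[1, 2]
except TypeError as error:
    print("lists do not support [row, column]:", error)
```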

Instructions

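Assign the amount of alcohol Uruguayans drank in other beverages per capita in 1986 to uruguay_other_1986. This is the second row and fifth column.
Assign the country in the third row to third_country. Country is the third column.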

uruguay_other_1986 = world_alcohol[1,4]
third_country = world_alcohol[2,2]
print(uruguay_other_1986)
print(third_country)
0.5
Cte d'Ivoire

9. Slicing arrays

Learn

We can use slices to select subsets of arrays, just like we can with lists:

>> vector = np.array([5, 10, 15, 20])
>> vector[0:3]
array([ 5, 10, 15])

Like lists, vector slicing goes from the first index up to but not including the second index. Matrix slicing is a bit more complex, and has four forms:

matrix[:,1] (all elements of the 2nd column)
matrix[0:3,1] (the elements of the 2nd column in rows 1-3)
matrix[0:4,0:3] (the elements in rows 1-4 and columns 1-3)
matrix[2, 3] (the element in row 3, column 4)

We'll dive into the first form in this screen. When we want to select one whole dimension, and an element from the other, we can do this:

>> matrix = np.array([
                    [5, 10, 15], 
                    [20, 25, 30],
                    [35, 40, 45]
                 ])
>> matrix[:,1]
array([10, 25, 40])

This will select all of the rows, but only the column with index 1. The colon by itself : specifies that the entirety of a single dimension should be selected. Think of the colon as selecting from the first element in a dimension up to and including the last element.
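A short sketch of what the bare colon stands for: it behaves like a slice from index 0 to the length of that dimension. The names second_column, same_column, and second_row are hypothetical.

```python
import numpy as np

matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])

# All rows, column at index 1: the result is a 1-dimensional array.
second_column = matrix[:, 1]
print(second_column)        # [10 25 40]
print(second_column.shape)  # (3,)

# A bare colon is the same as slicing from 0 to the length of that dimension.
same_column = matrix[0:matrix.shape[0], 1]
print(np.array_equal(second_column, same_column))  # True

# Swapping the positions selects a whole row instead.
second_row = matrix[1, :]
print(second_row)           # [20 25 30]
```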

Instructions

Assign the entire third column of world_alcohol to the variable countries.
Assign the fifth column of world_alcohol to the variable alcohol_consumption.

countries = world_alcohol[:, 2]
alcohol_consumption = world_alcohol[:, 4]
print(countries)
print(alcohol_consumption)
['Viet Nam' 'Uruguay' "Cte d'Ivoire" ..., 'Switzerland' 'Papua New Guinea'
 'Swaziland']
['0' '0.5' '1.62' ..., '2.54' '0' '5.15']

10. Slicing one dimension

Learn

When we want to select one whole dimension, and a slice of the other, we need to use special notation:

>> matrix = np.array([
                    [5, 10, 15], 
                    [20, 25, 30],
                    [35, 40, 45]
                 ])
>> matrix[:,0:2]
array([[ 5, 10],
       [20, 25],
       [35, 40]])

We can select rows by specifying a colon in the columns area. The code below selects rows 1 and 2, and all of the columns.

>> matrix[1:3,:]
array([[20, 25, 30],
       [35, 40, 45]])

We can also combine a slice along one dimension with a single index along the other. The code below selects the rows with index 1 and 2 and the column with index 1:

>> matrix[1:3,1]
array([25, 40])
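One detail worth seeing in these three forms is how the result's shape changes: a slice keeps its dimension, while a single index drops it. The sketch below checks the shapes for the same matrix.

```python
import numpy as np

matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])

# A slice keeps its dimension, a single index drops it.
print(matrix[:, 0:2].shape)  # (3, 2): all rows, first two columns
print(matrix[1:3, :].shape)  # (2, 3): rows 1 and 2, all columns
print(matrix[1:3, 1].shape)  # (2,): rows 1 and 2 of column 1, as a vector

# Slicing with a one-element range keeps the result 2-dimensional.
column_block = matrix[1:3, 1:2]
print(column_block.shape)    # (2, 1)
```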

Instructions

Assign all the rows and the first 2 columns of world_alcohol to first_two_columns.
Assign the first 10 rows and the first column of world_alcohol to first_ten_years.
Assign the first 10 rows and all of the columns of world_alcohol to first_ten_rows.

first_two_columns = world_alcohol[:, 0:2]
first_ten_years   = world_alcohol[0:10, 0]
first_ten_rows    = world_alcohol[0:10, :] 
print(first_two_columns)
print(first_ten_years)
print(first_ten_rows)
[['1986' 'Western Pacific']
 ['1986' 'Americas']
 ['1985' 'Africa']
 ..., 
 ['1986' 'Europe']
 ['1987' 'Western Pacific']
 ['1986' 'Africa']]
['1986' '1986' '1985' '1986' '1987' '1987' '1987' '1985' '1986' '1984']
[['1986' 'Western Pacific' 'Viet Nam' 'Wine' '0']
 ['1986' 'Americas' 'Uruguay' 'Other' '0.5']
 ['1985' 'Africa' "Cte d'Ivoire" 'Wine' '1.62']
 ['1986' 'Americas' 'Colombia' 'Beer' '4.27']
 ['1987' 'Americas' 'Saint Kitts and Nevis' 'Beer' '1.98']
 ['1987' 'Americas' 'Guatemala' 'Other' '0']
 ['1987' 'Africa' 'Mauritius' 'Wine' '0.13']
 ['1985' 'Africa' 'Angola' 'Spirits' '0.39']
 ['1986' 'Americas' 'Antigua and Barbuda' 'Spirits' '1.55']
 ['1984' 'Africa' 'Nigeria' 'Other' '6.1']]

11. Slicing arrays

Learn

We can also slice along both dimensions simultaneously. The following code selects rows with index 1 and 2, and columns with index 0 and 1:

>> matrix = np.array([
                    [5, 10, 15], 
                    [20, 25, 30],
                    [35, 40, 45]
                 ])
>> matrix[1:3,0:2]
array([[20, 25],
       [35, 40]])
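As with Python lists, either bound of a slice can be left out. The sketch below, with the hypothetical names block and same_block, shows that omitting bounds selects the same region as the explicit form above.

```python
import numpy as np

matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])

# Explicit bounds: rows 1-2, columns 0-1.
block = matrix[1:3, 0:2]
print(block.shape)  # (2, 2)

# Omitted bounds default to the start or end of the dimension,
# so this selects the same block.
same_block = matrix[1:, :2]
print(np.array_equal(block, same_block))  # True
```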

Instructions

Assign the first 20 rows of the columns at index 1 and 2 of world_alcohol to first_twenty_regions.

first_twenty_regions = world_alcohol[0:20, 1:3]
print(first_twenty_regions)
[['Western Pacific' 'Viet Nam']
 ['Americas' 'Uruguay']
 ['Africa' "Cte d'Ivoire"]
 ['Americas' 'Colombia']
 ['Americas' 'Saint Kitts and Nevis']
 ['Americas' 'Guatemala']
 ['Africa' 'Mauritius']
 ['Africa' 'Angola']
 ['Americas' 'Antigua and Barbuda']
 ['Africa' 'Nigeria']
 ['Africa' 'Botswana']
 ['Americas' 'Guatemala']
 ['Western Pacific' "Lao People's Democratic Republic"]
 ['Eastern Mediterranean' 'Afghanistan']
 ['Western Pacific' 'Viet Nam']
 ['Africa' 'Guinea-Bissau']
 ['Americas' 'Costa Rica']
 ['Africa' 'Seychelles']
 ['Europe' 'Norway']
 ['Africa' 'Kenya']]

12. Next steps

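We've learned some of the basics of the NumPy library and how to work with NumPy arrays. In the next mission, we'll build on this to figure out which country consumes the most alcohol.

Last edited: 2017.12.10 11:51:58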
