关联分析用于发现用户购买不同的商品之间存在关联和相关联系,比如A商品和B商品存在很强的相关性,常用于实体商店或在线电商的推荐系统,例如某一客户购买A商品,那么他很有可能会购买B商品,通过大量销售数据找到经常在一起购买的商品组合,可以了解用户的购买行为,根据销售的商品推荐关联商品从而给出购买建议,寻找销售新的增长点。
一、数据来源及说明
https://tianchi.aliyun.com/dataset/dataDetail?dataId=46&userId=1
本文从数据集中选取包含了2014年11月18日至2014年12月18日之间,10000名随机用户共12256906条行为数据,数据集的每一行表示一条用户行为,共6列。
列字段包含以下:
user_id:用户身份
item_id:商品ID
behavior_type:用户行为类型(包含点击、收藏、加购物车、购买四种行为,分别用数字1、2、3、4表示)
user_geohash:地理位置(有空值)
item_category:品类ID(商品所属的品类)
time:用户行为发生的时间
数据结构如下:在本次分析中只选用user_id和item_id这两个字段
二、提出问题
1、 用户购买哪种商品次数最多
2、 用户购买的商品中,哪些商品组合关联度高
三、数据清洗和构建模型
关联分析算法常用Apriori算法和FP-growth算法
(一) Apriori算法
1、Apriori算法基本原理
Apriori算法是经典的挖掘频繁项集和关联规则的数据挖掘算法,可以从大规模数据集中寻找物品间的隐含关系。其核心思想是通过连接产生候选项及其支持度,然后通过剪枝生成频繁项集。
项集:包含0个或者多个项的集合称为项集。在购物蓝事务中,每一样商品就是一个项,一次购买行为包含了多个项,把其中的项组合起来就构成了项集
支持度计数:项集在事务中出现的次数
频繁项集:经常出现在一块的物品的集合
关联规则:暗示两种物品之间可能存在很强的关系
Support(支持度):表示同时包含 A 和 B 的事务占所有事务的比例。如果用 P(A) 表示包含 A 的事务的比例,那么 Support = P(A & B)
Confidence(可信度/置信度):表示包含 A 的事务中同时包含 B 的事务的比例,即同时包含 A 和 B 的事务占包含 A 的事务的比例。公式表达:Confidence = P(A & B)/ P(A)
Apriori算法两个重要的定律:
定律1:如果一个集合是频繁项集,则它的所有子集都是频繁项集
定律2:如果一个集合不是频繁项集,则它的所有超集都不是频繁项集
2、Apriori算法实现基本过程如下:
1) 创建初始候选集,筛选候选项1项集
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
## 方法一:
def apriori(data_set):
"""创建初始候选集,候选项1项集"""
print('创建初始候选项1项集')
c1 = set()
for items in data_set:
for item in items:
# frozenset()返回一个冻结的集合,冻结后集合不能再添加或删除任何元素
item_set = frozenset([item])
c1.add(item_set)
2) 从候选项集中选出满足最小支持度的频繁项集并计算支持度
def generate_freq_supports(data_set, item_set, min_support):
"""从候选项集中选出频繁项集并计算支持度"""
print('筛选频繁项集并计算支持度')
freq_set = set() # 保存频繁项集元素
item_count = {} # 保存元素频次,用于计算支持度
supports = {} # 保存支持度
# 如果项集中元素在数据集中则计数
for record in data_set:
for item in item_set:
# issubset()方法用于判断集合的所有元素是否都包含在指定集合中
if item.issubset(record):
if item not in item_count:
item_count[item] = 1
else:
item_count[item] += 1
data_len = float(len(data_set))
# 计算项集支持度
for item in item_count:
if (item_count[item] / data_len) >= min_support:
freq_set.add(item)
supports[item] = item_count[item] / data_len
return freq_set, supports
3) 根据频繁项集,生成新的候选项1项集
def generate_new_combinations(freq_set, k):
"""
根据频繁项集,生成新的候选项1项集
参数:频繁项集列表 freq_set 与项集元素个数 k
"""
print('生成新组合')
new_combinations = set() # 保存新组合
sets_len = len(freq_set) # 集合含有元素个数,用于遍历求得组合
freq_set_list = list(freq_set) # 集合转为列表用于索引
for i in range(sets_len):
for j in range(i + 1, sets_len):
l1 = list(freq_set_list[i])
l2 = list(freq_set_list[j])
l1.sort()
l2.sort()
# 若两个集合的前k-2个项相同时,则将两个集合合并
if l1[0:k-2] == l2[0:k-2]:
freq_item = freq_set_list[i] | freq_set_list[j]
new_combinations.add(freq_item)
return new_combinations
4) 循环生成候选集并计算其支持度
def apriori(data_set, min_support, max_len=None):
"""循环生成候选集并计算其支持度"""
print('循环生成候选集')
max_items = 2 # 初始项集元素个数
freq_sets = [] # 保存所有频繁项集
supports = {} # 保存所有支持度
# 候选项1项集
c1 = set()
for items in data_set:
for item in items:
item_set = frozenset([item])
c1.add(item_set)
# 频繁项1项集及其支持度
l1, support1 = generate_freq_supports(data_set, c1, min_support)
freq_sets.append(l1)
supports.update(support1)
if max_len is None:
max_len = float('inf')
while max_items and max_items <= max_len:
# 生成候选集
ci = generate_new_combinations(freq_sets[-1], max_items)
# 生成频繁项集和支持度
li, support = generate_freq_supports(data_set, ci, min_support)
# 如果有频繁项集则进入下个循环
if li:
freq_sets.append(li)
supports.update(support)
max_items += 1
else:
max_items = 0
return freq_sets, supports
5) 从频繁项集中挖掘关联规则,筛选满足最小可信度的关联规则
def association_rules(freq_sets, supports, min_conf):
"""生成关联规则"""
print('生成关联规则')
rules = []
max_len = len(freq_sets)
# 筛选符合规则的频繁集计算置信度,满足最小置信度的关联规则添加到列表
for k in range(max_len - 1):
for freq_set in freq_sets[k]:
for sub_set in freq_sets[k + 1]:
if freq_set.issubset(sub_set):
frq = supports[sub_set]
conf = supports[sub_set] / supports[freq_set]
rule = (freq_set, sub_set - freq_set, frq, conf)
if conf >= min_conf:
print(freq_set,"-->",sub_set - freq_set,'frq:',frq,'conf:',conf)
rules.append(rule)
return rules
这里先用测试数据测试算法是否可行,设置最小支持度为0.5,最小置信度为0.7:
if __name__ == '__main__':
# 创建测试数据
dic = {'user_id':[111,111,
112,112,112,112,
113,113,113,113,
114,114,114,114,
115,115,115,115],
'item_id':['豆奶','莴苣',
'莴苣','尿布','葡萄酒','甜菜',
'豆奶','尿布','葡萄酒','橙汁',
'莴苣','豆奶','尿布','葡萄酒',
'莴苣','豆奶','尿布','橙汁']}
data = pd.DataFrame(dic)
# 关联规则中不考虑多次购买同一件物品,删除重复数据
data = data.drop_duplicates()
# 初始化列表
data_set = []
# 分组聚合,同一用户购买多种商品的合并为一条数据,只有1件商品的没有意义,需要进行过滤
groups = data.groupby(by='user_id')
for group in groups:
if len(group[1]) >= 2:
data_set.append(group[1]['item_id'].tolist())
# L为频繁项集,support_data为支持度
L, support_data = apriori(data_set, min_support=0.5)
association_rules = association_rules(L, support_data, min_conf=0.7)
# print('关联规则:\n',association_rules)
结果如下:
这里还可以直接调包Mlxtend实现:
# 方法二:Mlxtend实现
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
# 创建测试数据
dic = {'user_id':[111,111,
112,112,112,112,
113,113,113,113,
114,114,114,114,
115,115,115,115],
'item_id':['豆奶','莴苣',
'莴苣','尿布','葡萄酒','甜菜',
'豆奶','尿布','葡萄酒','橙汁',
'莴苣','豆奶','尿布','葡萄酒',
'莴苣','豆奶','尿布','橙汁']}
data = pd.DataFrame(dic)
# 关联规则中不考虑多次购买同一件物品,删除重复数据
data = data.drop_duplicates()
# 初始化列表
data_set = []
# 分组聚合,同一用户购买多种商品的合并为一条数据,只有1件商品的没有意义,需要进行过滤
groups = data.groupby(by='user_id')
for group in groups:
if len(group[1]) >= 2:
data_set.append(group[1]['item_id'].tolist())
te = TransactionEncoder()
te_ary = te.fit(data_set).transform(data_set)
df = pd.DataFrame(te_ary, columns=te.columns_)
frequent_itemsets = apriori(df, min_support=0.1, use_colnames=True)
rules = association_rules(frequent_itemsets, min_threshold=0.3)
print('关联规则:\n',rules)
通过测试数据可知,Apriori算法能正常使用,接下来直接读取淘宝用户行为数据,分析用户购买的商品组合关联度,设置最小支持度为0.05,最小置信度为0.3:
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
## 方法一:
def apriori(data_set):
"""创建初始候选集,候选项1项集"""
print('创建初始候选项1项集')
c1 = set()
for items in data_set:
for item in items:
# frozenset()返回一个冻结的集合,冻结后集合不能再添加或删除任何元素
item_set = frozenset([item])
c1.add(item_set)
def generate_freq_supports(data_set, item_set, min_support):
"""从候选项集中选出频繁项集并计算支持度"""
print('筛选频繁项集并计算支持度')
freq_set = set() # 保存频繁项集元素
item_count = {} # 保存元素频次,用于计算支持度
supports = {} # 保存支持度
# 如果项集中元素在数据集中则计数
for record in data_set:
for item in item_set:
# issubset()方法用于判断集合的所有元素是否都包含在指定集合中
if item.issubset(record):
if item not in item_count:
item_count[item] = 1
else:
item_count[item] += 1
data_len = float(len(data_set))
# 计算项集支持度
for item in item_count:
if (item_count[item] / data_len) >= min_support:
freq_set.add(item)
supports[item] = item_count[item] / data_len
return freq_set, supports
def generate_new_combinations(freq_set, k):
"""
根据频繁项集,生成新的候选项1项集
参数:频繁项集列表 freq_set 与项集元素个数 k
"""
print('生成新组合')
new_combinations = set() # 保存新组合
sets_len = len(freq_set) # 集合含有元素个数,用于遍历求得组合
freq_set_list = list(freq_set) # 集合转为列表用于索引
for i in range(sets_len):
for j in range(i + 1, sets_len):
l1 = list(freq_set_list[i])
l2 = list(freq_set_list[j])
l1.sort()
l2.sort()
# 若两个集合的前k-2个项相同时,则将两个集合合并
if l1[0:k-2] == l2[0:k-2]:
freq_item = freq_set_list[i] | freq_set_list[j]
new_combinations.add(freq_item)
return new_combinations
def apriori(data_set, min_support, max_len=None):
"""循环生成候选集并计算其支持度"""
print('循环生成候选集')
max_items = 2 # 初始项集元素个数
freq_sets = [] # 保存所有频繁项集
supports = {} # 保存所有支持度
# 候选项1项集
c1 = set()
for items in data_set:
for item in items:
item_set = frozenset([item])
c1.add(item_set)
# 频繁项1项集及其支持度
l1, support1 = generate_freq_supports(data_set, c1, min_support)
freq_sets.append(l1)
supports.update(support1)
if max_len is None:
max_len = float('inf')
while max_items and max_items <= max_len:
# 生成候选集
ci = generate_new_combinations(freq_sets[-1], max_items)
# 生成频繁项集和支持度
li, support = generate_freq_supports(data_set, ci, min_support)
# 如果有频繁项集则进入下个循环
if li:
freq_sets.append(li)
supports.update(support)
max_items += 1
else:
max_items = 0
return freq_sets, supports
def association_rules(freq_sets, supports, min_conf):
"""生成关联规则"""
print('生成关联规则')
rules = []
max_len = len(freq_sets)
# 筛选符合规则的频繁集计算置信度,满足最小置信度的关联规则添加到列表
for k in range(max_len - 1):
for freq_set in freq_sets[k]:
for sub_set in freq_sets[k + 1]:
if freq_set.issubset(sub_set):
frq = supports[sub_set]
conf = supports[sub_set] / supports[freq_set]
rule = (freq_set, sub_set - freq_set, frq, conf)
if conf >= min_conf:
print(freq_set,"-->",sub_set - freq_set,'frq:',frq,'conf:',conf)
rules.append(rule)
return rules
if __name__ == '__main__':
# 读取淘宝用户行为数据
data = pd.read_csv(r'D:\关联分析\关联分析2\tianchi_mobile_recommend_train_user.zip',encoding='ansi')
data = data[['user_id','item_id']]
# 关联规则中不考虑多次购买同一件物品,删除重复数据
data = data.drop_duplicates()
# 初始化列表
data_set = []
# 分组聚合,同一用户购买多种商品的合并为一条数据,只有1件商品的没有意义,需要进行过滤
groups = data.groupby(by='user_id')
for group in groups:
if len(group[1]) >= 2:
data_set.append(group[1]['item_id'].tolist())
# L为频繁项集,support_data为支持度
L, support_data = apriori(data_set, min_support=0.05)
association_rules = association_rules(L, support_data, min_conf=0.3)
# print('关联规则:\n',association_rules)
结果程序运行超过半小时仍未出结果,这是由于Apriori算法使用多重嵌套for循环进行计算,每次更新频繁项集都需要扫描一次整个数据集,当数据量过大时效率不高。这里由于Apriori算法的性能限制,所以考虑用FP-Growth算法寻找频繁项集。
(二) FP-growth算法
1、FP-growth算法基本原理
FP-growth算法基于Apriori构建,但采用了高级的数据结构减少扫描次数,大大加快了算法速度。FP-growth算法只需要对数据库进行两次扫描,而Apriori算法对于每个潜在的频繁项集都会扫描数据集判定给定模式是否频繁,因此FP-growth算法的速度要比Apriori算法快。
优点:一般要快于Apriori。
缺点:实现比较困难,在某些数据集上性能会下降。
适用数据类型:离散型数据。
这里涉及到另外一个指标:提升度(Lift)
Lift(提升度):表示“包含 A 的事务中同时包含 B 的事务的比例”与“包含 B 的事务的比例”的比值。公式表达:Lift = ( P(A & B)/ P(A) ) / P(B) = P(A & B)/ P(A) / P(B)。
提升度反映了关联规则中的 A 与 B 的相关性,提升度 > 1 且越高表明正相关性越高,提升度 < 1 且越低表明负相关性越高,提升度 = 1 表明没有相关性。
2、FP-growth算法实现基本过程如下:
(原文链接:https://blog.csdn.net/youhuakongzhi/article/details/87943503)
1) 第一次扫描数据,得到所有频繁一项集的的计数。然后删除支持度低于阈值的项,将1项频繁集放入项头表,并按照支持度降序排列。
2)第二次扫描数据,将读到的原始数据剔除非频繁1项集,并按照支持度降序排列。至此,通过两次扫描数据建立项头表
3)构建FP树。读入排序后的数据集,插入FP树,插入时按照排序后的顺序,插入FP树中,排序靠前的节点是父节点,而靠后的是子节点。如果有共同的链表,则对应的公用父节点计数加1。插入后,如果有新节点出现,则项头表对应的节点会通过节点链表链接上新节点。直到所有的数据都插入到FP树后,FP树的建立完成。
4)挖掘频繁项集。从项头表的底部项依次向上找到项头表项对应的条件模式基。从条件模式基递归挖掘得到项头表项的频繁项集,同时返回频繁项集对应的节点计数值。
5)如果不限制频繁项集的项数,则返回步骤4所有的频繁项集,否则只返回满足项数要求的频繁项集。
我在这里使用FP-growth算法挖掘频繁项集,把整个FP-growth算法封装到fp_growth包中,
fp_growth.py代码如下:
# encoding: utf-8
"""
fp-growth算法是一个生成频繁项集的算法,其主要利用了FP树的数据结构,
整个生成过程只需要遍历数据集2次
"""
from collections import defaultdict, namedtuple
"""
collections模块中的defaultdict继承自dict,namedtuple继承自tuple
defaultdict会构建一个类似dict的对象,该对象具有默认值
当dict不存在的key时会报KeyError错误,调用defaultdict时遇到KeyError错误会用默认值填充
namedtuple主要用来产生可以使用名称来访问元素的数据对象,通常用来增强代码的可读性
"""
def find_frequent_itemsets(transactions, minimum_support, include_support=False):
"""
挖掘频繁项集,生成频繁项集和对应支持度(频数)
"""
items = defaultdict(lambda: 0) # mapping from items to their supports
# Load the passed-in transactions and count the support that individual
# items have.
for transaction in transactions:
for item in transaction:
items[item] += 1
# Remove infrequent items from the item support dictionary.
items = dict((item, support) for item, support in items.items()
if support >= minimum_support)
# Build our FP-tree. Before any transactions can be added to the tree, they
# must be stripped of infrequent items and their surviving items must be
# sorted in decreasing order of frequency.
def clean_transaction(transaction):
transaction = filter(lambda v: v in items, transaction)
transaction_list = list(transaction) # 为了防止变量在其他部分调用,这里引入临时变量transaction_list
transaction_list.sort(key=lambda v: items[v], reverse=True)
return transaction_list
master = FPTree()
for transaction in map(clean_transaction, transactions):
master.add(transaction)
def find_with_suffix(tree, suffix):
for item, nodes in tree.items():
support = sum(n.count for n in nodes)
if support >= minimum_support and item not in suffix:
# New winner!
found_set = [item] + suffix
yield (found_set, support) if include_support else found_set
# Build a conditional tree and recursively search for frequent
# itemsets within it.
cond_tree = conditional_tree_from_paths(tree.prefix_paths(item))
for s in find_with_suffix(cond_tree, found_set):
yield s # pass along the good news to our caller
# Search for frequent itemsets, and yield the results we find.
for itemset in find_with_suffix(master, []):
yield itemset
class FPTree(object):
"""
构建FP树
所有的项必须作为字典的键或集合成员
"""
Route = namedtuple('Route', 'head tail')
def __init__(self):
# The root node of the tree.
self._root = FPNode(self, None, None)
# A dictionary mapping items to the head and tail of a path of
# "neighbors" that will hit every node containing that item.
self._routes = {}
@property
def root(self):
"""The root node of the tree."""
return self._root
def add(self, transaction):
"""Add a transaction to the tree."""
point = self._root
for item in transaction:
next_point = point.search(item)
if next_point:
# There is already a node in this tree for the current
# transaction item; reuse it.
next_point.increment()
else:
# Create a new point and add it as a child of the point we're
# currently looking at.
next_point = FPNode(self, item)
point.add(next_point)
# Update the route of nodes that contain this item to include
# our new node.
self._update_route(next_point)
point = next_point
def _update_route(self, point):
"""Add the given node to the route through all nodes for its item."""
assert self is point.tree
try:
route = self._routes[point.item]
route[1].neighbor = point # route[1] is the tail
self._routes[point.item] = self.Route(route[0], point)
except KeyError:
# First node for this item; start a new route.
self._routes[point.item] = self.Route(point, point)
def items(self):
"""
Generate one 2-tuples for each item represented in the tree. The first
element of the tuple is the item itself, and the second element is a
generator that will yield the nodes in the tree that belong to the item.
"""
for item in self._routes.keys():
yield (item, self.nodes(item))
def nodes(self, item):
"""
Generate the sequence of nodes that contain the given item.
"""
try:
node = self._routes[item][0]
except KeyError:
return
while node:
yield node
node = node.neighbor
def prefix_paths(self, item):
"""Generate the prefix paths that end with the given item."""
def collect_path(node):
path = []
while node and not node.root:
path.append(node)
node = node.parent
path.reverse()
return path
return (collect_path(node) for node in self.nodes(item))
def inspect(self):
print('Tree:')
self.root.inspect(1)
print
print('Routes:')
for item, nodes in self.items():
print(' %r' % item)
for node in nodes:
print(' %r' % node)
def conditional_tree_from_paths(paths):
"""从给定的前缀路径构建一个条件fp树."""
tree = FPTree()
condition_item = None
items = set()
# Import the nodes in the paths into the new tree. Only the counts of the
# leaf notes matter; the remaining counts will be reconstructed from the
# leaf counts.
for path in paths:
if condition_item is None:
condition_item = path[-1].item
point = tree.root
for node in path:
next_point = point.search(node.item)
if not next_point:
# Add a new node to the tree.
items.add(node.item)
count = node.count if node.item == condition_item else 0
next_point = FPNode(tree, node.item, count)
point.add(next_point)
tree._update_route(next_point)
point = next_point
assert condition_item is not None
# Calculate the counts of the non-leaf nodes.
for path in tree.prefix_paths(condition_item):
count = path[-1].count
for node in reversed(path[:-1]):
node._count += count
return tree
class FPNode(object):
"""FP树节点"""
def __init__(self, tree, item, count=1):
self._tree = tree
self._item = item
self._count = count
self._parent = None
self._children = {}
self._neighbor = None
def add(self, child):
"""Add the given FPNode `child` as a child of this node."""
if not isinstance(child, FPNode):
raise TypeError("Can only add other FPNodes as children")
if not child.item in self._children:
self._children[child.item] = child
child.parent = self
def search(self, item):
"""
Check whether this node contains a child node for the given item.
If so, that node is returned; otherwise, `None` is returned.
"""
try:
return self._children[item]
except KeyError:
return None
def __contains__(self, item):
return item in self._children
@property
def tree(self):
"""The tree in which this node appears."""
return self._tree
@property
def item(self):
"""The item contained in this node."""
return self._item
@property
def count(self):
"""The count associated with this node's item."""
return self._count
def increment(self):
"""Increment the count associated with this node's item."""
if self._count is None:
raise ValueError("Root nodes have no associated count.")
self._count += 1
@property
def root(self):
"""True if this node is the root of a tree; false if otherwise."""
return self._item is None and self._count is None
@property
def leaf(self):
"""True if this node is a leaf in the tree; false if otherwise."""
return len(self._children) == 0
@property
def parent(self):
"""The node's parent"""
return self._parent
@parent.setter
def parent(self, value):
if value is not None and not isinstance(value, FPNode):
raise TypeError("A node must have an FPNode as a parent.")
if value and value.tree is not self.tree:
raise ValueError("Cannot have a parent from another tree.")
self._parent = value
@property
def neighbor(self):
"""
The node's neighbor; the one with the same value that is "to the right"
of it in the tree.
"""
return self._neighbor
@neighbor.setter
def neighbor(self, value):
if value is not None and not isinstance(value, FPNode):
raise TypeError("A node must have an FPNode as a neighbor.")
if value and value.tree is not self.tree:
raise ValueError("Cannot have a neighbor from another tree.")
self._neighbor = value
@property
def children(self):
"""The nodes that are children of this node."""
return tuple(self._children.itervalues())
def inspect(self, depth=0):
print((' ' * depth) + repr(self))
for child in self.children:
child.inspect(depth + 1)
def __repr__(self):
if self.root:
return "<%s (root)>" % type(self).__name__
return "<%s %r (%r)>" % (type(self).__name__, self.item, self.count)
if __name__ == '__main__':
from optparse import OptionParser
import csv
p = OptionParser(usage='%prog data_file')
p.add_option('-s', '--minimum-support', dest='minsup', type='int',
help='Minimum itemset support (default: 2)')
p.add_option('-n', '--numeric', dest='numeric', action='store_true',
help='Convert the values in datasets to numerals (default: false)')
p.set_defaults(minsup=2)
p.set_defaults(numeric=False)
options, args = p.parse_args()
if len(args) < 1:
p.error('must provide the path to a CSV file to read')
transactions = []
with open(args[0]) as database:
for row in csv.reader(database):
if options.numeric:
transaction = []
for item in row:
transaction.append(long(item))
transactions.append(transaction)
else:
transactions.append(row)
result = []
for itemset, support in find_frequent_itemsets(transactions, options.minsup, True):
result.append((itemset, support))
result = sorted(result, key=lambda i: i[0])
for itemset, support in result:
print(str(itemset) + ' ' + str(support))
6)从频繁项集中挖掘关联规则,筛选满足最小置信度的关联规则,计算支持度和置信度。
在这里我最小支持度(频数)设置为50,最小置信度设置为0.3,代码如下:
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
import itertools # itertools用于高效循环的迭代函数集合
import sys,time
sys.path.append(r"D:\关联分析\关联分析2") # 调包失败时,把包所在路径添加到系统路径中
import fp_growth as fpg # 导入fp-growth算法
start =time.time()
def convertData(data):
"""数据转换为同一单号包含多个商品的列表"""
# 关联规则中不考虑多次购买同一件物品,删除重复数据
data = data.drop_duplicates()
# 初始化列表
itemSets = []
it_list = []
# 分组聚合,同一用户购买多种商品的合并为一条数据,只有1件商品的没有意义,需要进行过滤
groups = data.groupby(by=['user_id'])
for group in groups:
if len(group[1]) >= 2:
itemSets.append(group[1]['item_id'].tolist())
return itemSets
def generate_association_rules(patterns, total, min_confidence):
"""
生成关联规则,计算支持度、置信度和提升度
"""
antecedent_list = []
consequent_list = []
support_list = []
confidence_list = []
lift_list = []
count_antecedent = []
p_antecedent = []
count_consequent = []
p_consequent = []
count_ant_con = []
for itemset in patterns.keys():
upper_support = patterns[itemset] # A & B
for i in range(1, len(itemset)):
for antecedent in itertools.combinations(itemset, i):
"""
itertools.combinations()用于创建一个迭代器,
返回iterable中所有长度为r的子序列,
返回的子序列中的项按输入iterable中的顺序排序
"""
antecedent = tuple(sorted(antecedent))
consequent = tuple(sorted(set(itemset) - set(antecedent)))
if antecedent in patterns:
lower_support = patterns[antecedent] # A
consequent_support = patterns[consequent] # B
p_lower_support = lower_support / total # P(A)
p_consequent_support = consequent_support / total # P(B)
support = round(float(upper_support) / total, 6) # 支持度Support = P(A & B)
confidence = float(upper_support) / lower_support # 置信度Confidence = P(A & B)/ P(A)
lift = confidence / p_consequent_support # 提升度Lift = ( P(A & B)/ P(A) ) / P(B) = P(A & B)/ P(A) / P(B)
if confidence >= min_confidence:
antecedent_list.append(list(antecedent))
consequent_list.append(list(consequent))
support_list.append(support)
confidence_list.append(confidence)
lift_list.append(lift)
count_antecedent.append(lower_support) # count(A)
p_antecedent.append(p_lower_support) # P(A)
count_consequent.append(consequent_support) # count(B)
p_consequent.append(p_consequent_support) # P(B)
count_ant_con.append(upper_support) # count(AB)
rules_col = {'antecedent':antecedent_list,
'consequent':consequent_list,
'count_antecedent':count_antecedent,
'antecedent_support':p_antecedent,
'count_consequent':count_consequent,
'consequent_support':p_consequent,
'count_ant_con':count_ant_con,
'support':support_list,
'confidence':confidence_list,
'lift':lift_list}
rules = pd.DataFrame(rules_col)
# col = ['antecedent','consequent','count_antecedent','antecedent_support',
# 'count_consequent','consequent_support','count_ant_con','support',
# 'confidence','lift']
col = ['antecedent','consequent','support','confidence','lift']
rules = rules[col]
rules.sort_values(by=['support','confidence'],ascending=False,inplace=True)
return rules
if __name__ == '__main__':
# 导入数据
data = pd.read_csv(r'D:\关联分析\关联分析2\tianchi_mobile_recommend_train_user.zip',encoding='ansi')
data = data[['user_id','item_id']]
# 转换数据
dataset = convertData(data)
total = len(dataset)
print('总订单数:',total)
'''
find_frequent_itemsets()调用函数生成频繁项集和频数
minimum_support表示设置最小支持度(频数),即频数大于等于minimum_support,保存此频繁项,否则删除
include_support表示返回结果是否包含支持度(频数),若include_support=True,返回结果中包含itemset和support,否则只返回itemset
'''
frequent_itemsets = fpg.find_frequent_itemsets(dataset, minimum_support=50, include_support=True)
result = []
for itemset, support in frequent_itemsets: # 将generator结果存入list
result.append((itemset, support))
result = sorted(result, key=lambda i: i[0]) # 排序后输出
item_list = []
itemset_list = []
support_list = []
for itemset, support in result:
# print(str(itemset) + ' ' + str(support)) #频繁项集和出现次数
item_list.append(itemset) # 保存为列表,用于输出频繁项集结果
itemset = tuple(sorted(itemset)) # 先转换为元组,用于后续生成关联规则
itemset_list.append(itemset)
support_list.append(support)
# 构建字典
patterns = dict(zip(itemset_list,support_list))
print('频繁项集总数:',len(patterns))
# 生成关联规则,计算支持度、置信度和提升度
# min_confidence代表最小置信度
rules = generate_association_rules(patterns,total,min_confidence=0.3)
print('关联规则:\n',rules.head())
print('结果总数:',len(rules))
## 输出结果,输出到同一份excel文件不同的工作表中
# 输出频繁集
sup = {'item_id':item_list,'frequency':support_list}
sup = pd.DataFrame(sup)
sup['support'] = round(sup['frequency'] / float(total), 6)
sup.sort_values(by=['support'],ascending=False,inplace=True)
sup_col = ['item_id','frequency','support']
sup = sup[sup_col]
writer = pd.ExcelWriter(r'D:\关联分析\关联分析2\result\fp-growth-result.xlsx')
sup.to_excel(excel_writer=writer,sheet_name='support',index=False)
# 输出关联规则
rules.to_excel(excel_writer=writer,sheet_name='rules',index=False)
end = time.time()
print('Running time: %s Seconds'%(end-start))
运行结果部分截图如下
频繁集结果:四、结论
业务结论:
1、 频繁项集我按支持度进行了排序,商品编号112921337最受欢迎,远远高于其他商品;
2、 从总体上看,所有组合商品中支持度数值偏低,这是由于平台销售的商品种类繁多;
3、 所有商品组合按支持度从高到低排序,商品组合中 [387911330] à [97655171] 和 [97655171] à [387911330] 支持度最高,但是商品组合[387911330] à [97655171]的置信度最高,表示购买商品编号387911330的用户中有44%会购买商品编号97655171,可以对这两种商品进行捆绑销售;
4、 置信度最高的商品组合是购买商品126902916的用户最大可能会购买商品125666923,但出现的概率偏低;
技术结论:
1、 Apriori和FP-Growth这两个算法都有个特点,就是当选取支持度很小时,计算时间明显变长,性能影响很大;
2、 数据量大时优先考虑FP-Growth算法查找频繁集;
3、 用FP-Growth算法寻找频繁项集,支持度先设的大一些,一般可设项集数目的1/10,这里的项集数目不是指原始表数据,而是事务数据集;
4、 根据业务经验,选择恰当的支持度及可信度,才会分析出恰当的结论。尤其是支持度,选大了,会过滤掉可能关键或有意义的关联规则输出;选小了,会产生太多频繁项集及FP条件树,干扰分析结果;
5、 如果有数据聚合处理需求,应尽量减少for循环使用,尤其是嵌套两层以上的情形,这对性能会是一个灾难。有条件的情况下,可使用DataFrame包的groupby函数、pivot_table函数,以及pandas包的merge函数等,能获得极大的性能提升。
参考链接:https://blog.csdn.net/youhuakongzhi/article/details/87943503