# Market Basket Analysis Using Association Rule-Mining

**What is Market Basket Analysis?**

Market basket analysis is a technique used by retailers to find patterns in customer behaviour from their transaction history. It determines which items are frequently bought together, i.e. placed in the same basket, and uses that purchase information to make sales and marketing more effective. Market basket analysis (MBA) looks for combinations of products that frequently occur together in purchases, and it has been used prolifically since the introduction of electronic point-of-sale systems made it possible to collect immense amounts of transaction data.

**Types Of Market Basket Analysis:**

- Predictive MBA is used to classify sets of item purchases, events and services that largely occur in sequence.
- Differential MBA compares information between different stores, demographics, seasons of the year, days of the week and other factors. It removes a high volume of insignificant results and can lead to very in-depth insights.

MBA is commonly used by online retailers to make purchase suggestions to consumers. For example, when a person buys a particular model of smartphone, the retailer may suggest other products such as phone cases, screen protectors, memory cards or other accessories for that particular phone. This is due to the frequency with which other consumers bought these items in the same transaction as the phone.

MBA is also used in physical retail locations. Due to the increasing sophistication of point of sale systems coupled with big data analytics, stores are using purchase data and MBA to help improve store layouts so that consumers can more easily find items that are frequently purchased together.

## What is Association Rule-Mining?

Association rule mining is a technique to identify frequent patterns and associations among a set of items.

For example, understanding customer buying habits. By finding correlations and associations between different items that customers place in their ‘shopping basket,’ recurring patterns can be derived.

Say Joshua goes to buy a bottle of wine from the supermarket, and he grabs a couple of bags of chips as well. The manager notices that not only Joshua but many customers tend to buy wine and chips together. After spotting this pattern, the manager starts to arrange these items together and notices an increase in sales.

This process of identifying an association between products/items is called association rule mining. To implement association rule mining, many algorithms have been developed. Apriori algorithm is one of the most popular and arguably the most efficient algorithms among them. Let us discuss what an Apriori algorithm is.

# What Is an Apriori Algorithm?

Apriori algorithm assumes that any subset of a frequent itemset must be frequent.

Say, a transaction containing {wine, chips, bread} also contains {wine, bread}. So, according to the principle of Apriori, if {wine, chips, bread} is frequent, then {wine, bread} must also be frequent.

# How Does the Apriori Algorithm Work?

The key concept in the Apriori algorithm is that it assumes all subsets of a frequent itemset to be frequent. Similarly, for any infrequent itemset, all its supersets must also be infrequent.
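The pruning idea above can be sketched in a few lines of plain Python. This is an illustrative toy, not the mlxtend implementation: candidate itemsets of size 2 are generated only from frequent 1-itemsets, because any subset of a frequent itemset must itself be frequent. The transactions and the 50% support threshold are made up for the example.

```python
from itertools import combinations

# Toy transaction database (each transaction is a set of items)
transactions = [
    {"wine", "chips", "bread"},
    {"wine", "bread"},
    {"wine", "chips"},
    {"chips", "bread"},
    {"wine", "chips", "bread"},
    {"bread"},
]
min_support = 0.5  # an itemset is "frequent" if it appears in >= 50% of transactions

def support(itemset):
    # Fraction of transactions that contain every item in `itemset`
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level 1: frequent single items
items = {i for t in transactions for i in t}
frequent = [frozenset([i]) for i in items if support({i}) >= min_support]

# Level 2: candidate pairs are built only from frequent 1-itemsets
# (the Apriori pruning step), then checked against the support threshold
candidates = [a | b for a, b in combinations(frequent, 2)]
frequent_pairs = [c for c in candidates if support(c) >= min_support]
print(sorted(sorted(c) for c in frequent_pairs))
```

Any pair containing an infrequent item is never even generated, which is exactly how Apriori keeps the search space manageable.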

Let us try and understand the working of an Apriori algorithm with the help of a very famous business scenario, market basket analysis.

Here is a dataset consisting of six transactions in an hour. Each transaction is a combination of 0s and 1s, where 0 represents the absence of an item and 1 represents the presence of it.

We can find multiple rules from this scenario. For example, in a transaction of wine, chips, and bread, if wine and chips are bought, then customers also buy bread.

{wine, chips} => {bread}
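Since the six-transaction table itself is shown only as an image, here is a hypothetical 0/1 table of the same shape, used to show how a rule like {wine, chips} => {bread} would be evaluated; the particular values are made up.

```python
import pandas as pd

# Hypothetical 0/1 basket matrix: one row per transaction, one column per item
basket = pd.DataFrame(
    {"wine":  [1, 1, 0, 1, 1, 0],
     "chips": [1, 0, 1, 1, 1, 0],
     "bread": [1, 1, 1, 0, 1, 1]}
)

antecedent = (basket["wine"] == 1) & (basket["chips"] == 1)  # transactions with wine AND chips
both = antecedent & (basket["bread"] == 1)                   # ... that also contain bread

support = both.mean()                        # fraction of all transactions with all three items
confidence = both.sum() / antecedent.sum()   # P(bread | wine, chips)
print(support, confidence)
```

On this made-up table the rule holds in 2 of 6 transactions (support ≈ 0.33) and in 2 of the 3 transactions containing wine and chips (confidence ≈ 0.67).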

In order to select the interesting rules out of multiple possible rules from this small business scenario, we will be using the following measures:

**Support**, **Confidence**, **Lift** and **Conviction**

The three most popular criteria for evaluating the quality or strength of an association rule are **support**, **confidence** and **lift**:

1. Support is the percentage of transactions containing a particular combination of items relative to the total number of transactions in the database. The support for the combination A and B would be,

P(AB), or P(A) for the individual item A

2. Confidence measures how much the consequent (item) depends on the antecedent (item). In other words, confidence is the conditional probability of the consequent given the antecedent,

P(B|A)

where P(B|A) = P(AB)/P(A)

3. Lift (also called improvement or impact) is a measure that overcomes the problems with support and confidence. Lift measures the difference — expressed as a ratio — between the confidence of a rule and the expected confidence. Consider an association rule “if A then B.” The lift for the rule is defined as

P(B|A)/P(B) or P(AB)/[P(A)P(B)].

As shown in the formula, lift is symmetric in that the lift for “if A then B” is the same as the lift for “if B then A.”

Each criterion has its advantages and disadvantages, but in general we look for association rules that have high confidence, high support, and high lift.

As a summary,

Confidence = P(B|A)

Support = P(AB)

Lift = P(B|A)/P(B)
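The summary above can be checked numerically. The probabilities below are made-up illustrative values for two items A and B; the snippet computes the three measures and confirms that lift comes out the same in both directions.

```python
# Illustrative probabilities: P(A), P(B), P(A and B)
p_a, p_b, p_ab = 0.5, 0.5, 0.375

support = p_ab
confidence_a_to_b = p_ab / p_a          # P(B|A)
lift_a_to_b = confidence_a_to_b / p_b   # P(B|A) / P(B)
lift_b_to_a = (p_ab / p_b) / p_a        # P(A|B) / P(A)

# Both directions reduce to P(AB) / (P(A) P(B))
print(lift_a_to_b, lift_b_to_a)  # → 1.5 1.5
```

A lift of 1.5 means A and B appear together 1.5 times as often as they would if the two items were independent.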

## The libraries used here are:

1. NumPy
2. Pandas
3. Matplotlib
4. mlxtend

**Now let’s move on to the implementation:**

## Importing Libraries:

```python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
```

## Loading the CSV file

```python
df = pd.read_csv(r'D:\groceries1.csv', header=None)
df.head()
```

**Output:**

## Data Preparation

```python
records = []
for i in range(0, 9835):
    records.append([str(df.values[i, j]) for j in range(0, 20)])

TE = TransactionEncoder()
array = TE.fit(records).transform(records)

# Build the data frame: rows are boolean transaction vectors,
# columns are the items that have been purchased
df1 = pd.DataFrame(array, columns=TE.columns_)
df1
```

**Now we will remove the ‘nan’ column produced by null values in the dataset**

```python
df_clean = df1.drop(['nan'], axis=1)
df_clean
```

**Now we will find the top 20 selling items and visualize them using Matplotlib**

```python
count = df_clean.loc[:, :].sum()
df2 = count.sort_values(0, ascending=False).head(20)
df2 = df2.to_frame()
df2 = df2.reset_index()
df2 = df2.rename(columns={"index": "items", 0: "count"})
df2
```

**Now we will visualize the top 20 selling items**

Next, we will find the item percentage and cumulative percentage.

```python
tot_item = sum(df_clean.sum())
df2['Item_percent'] = df2['count'] / tot_item
df2['Tot_percent'] = df2.Item_percent.cumsum()
df2.head(20)
```

This shows us that the top five items are responsible for 21.4% of all sales, and the top 20 items alone account for over 50% of sales! This is important, as we don’t want to find association rules for items that are bought very infrequently. With this information we can limit the items we explore when creating our association rules, which also keeps the number of possible itemsets manageable. So we will remove the less frequently sold items.

```python
def prune_dataset(olddf, len_transaction, tot_sales_percent):
    # Delete the helper column tot_items if present
    if 'tot_items' in olddf.columns:
        del olddf['tot_items']

    # Find the item count for each item and the total number of items
    # (this is the same code as in the previous step)
    Item_count = olddf.sum().sort_values(ascending=False).reset_index()
    tot_items = sum(olddf.sum().sort_values(ascending=False))
    Item_count.rename(columns={Item_count.columns[0]: 'Item_name',
                               Item_count.columns[1]: 'Item_count'}, inplace=True)

    # Find item percentage and cumulative (total) percentage, as before
    Item_count['Item_percent'] = Item_count['Item_count'] / tot_items
    Item_count['Tot_percent'] = Item_count.Item_percent.cumsum()

    # Keep only the items within the minimum threshold for total sales percentage
    selected_items = list(Item_count[Item_count.Tot_percent < tot_sales_percent].Item_name)
    olddf['tot_items'] = olddf[selected_items].sum(axis=1)

    # Keep only transactions that meet the minimum length (number of items in a row)
    olddf = olddf[olddf.tot_items >= len_transaction]
    del olddf['tot_items']

    # Return the pruned dataframe and the counts of the selected items
    return olddf[selected_items], Item_count[Item_count.Tot_percent < tot_sales_percent]

output_df, item_counts = prune_dataset(df_clean, 2, 0.4)
print(output_df.shape)
print(list(output_df.columns))
output_df
```

## For implementing the Apriori algorithm we will use mlxtend

First we will find the frequent itemsets together with their support.

```python
frequent_itemsets = apriori(output_df, min_support=0.01, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets
```

Now we will generate the association rules, filtering on lift with a minimum threshold of 1. A lift of at least 1 means the consequent is bought together with the antecedent at least as often as would be expected if the two were independent.

```python
rules_mlxtend = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules_mlxtend
```

Now we will find the rules whose confidence is greater than or equal to 0.3 and whose lift is greater than 1.

```python
rules_mlxtend[(rules_mlxtend['lift'] > 1) & (rules_mlxtend['confidence'] >= 0.3)]
```

Now we will compute the length of the antecedents, as we want to find the rules having two or more antecedent items.

```python
rules_mlxtend["antecedent_len"] = rules_mlxtend["antecedents"].apply(lambda x: len(x))
rules_mlxtend

rules_mlxtend[(rules_mlxtend['antecedent_len'] >= 2) &
              (rules_mlxtend['confidence'] >= 0.3) &
              (rules_mlxtend['lift'] >= 1)]
```

## Conclusion:

- The most popular item in this data set is whole milk followed by vegetables and rolls/buns.
- By applying the Apriori algorithm and association rules we can have a better insight on what items are more likely to be bought together.

## For Code and Dataset:

https://github.com/ghadiyaaysh17601/market_basket_analysis

I am thankful to the mentors at **https://internship.suvenconsultants.com** for providing awesome problem statements and giving many of us a Coding Internship Experience. Thank you www.suvenconsultants.com.