This blog is for all those Data Mining and algorithm freaks
Generally, Data Mining can be done in four ways ::
- Association Rule Learning :- Searches for relationships between variables. For example, a supermarket might gather data on what each customer buys. Using association rule learning, the supermarket can work out which products are frequently bought together, which is useful for marketing purposes. This is sometimes referred to as "market basket analysis".
- Regression :- Attempts to find a function that models the data with the least error. A common method is to use Genetic Programming.
- Classification :- Arranges the data into predefined groups. For example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include Nearest neighbor, Naive Bayes classifier and Neural network.
- Clustering :- Is like classification, but the groups are not predefined, so the algorithm will try to group similar items together.
The above definitions are taken from Wikipedia. In this blog I mainly concentrate on Association Rule Learning (Market Basket Analysis) and the Apriori algorithm.
Association Rule Learning (Market Basket Analysis) :-
Market Basket Analysis (MBA) is a technique that helps you identify relationships between pairs of products purchased together. The prime aim of this technique is to identify cross-selling opportunities, and retail businesses benefit greatly from MBA.
Let us discuss MBA through an example. Using MBA you might uncover the fact that customers tend to buy bread and butter together. Based on this information you might organize your store so that bread and butter are next to each other. Now let us discuss the important measures in MBA.
Frequency :-
The number of times two products were purchased together. Suppose bread and butter are found together in 1700 baskets; then 1700 would be their frequency.
Support :-
Support = Frequency / Total number of orders.
If bread and butter were purchased together in 1700 orders and your store had 2000 orders in total, the support would be calculated as 1700/2000 = 85%.
Confidence :-
Confidence is the ratio of the number of times the pair was purchased to the number of times one of the items in the pair was purchased; in mathematical terms it is a conditional probability. Confidence(bread → butter) = Frequency(bread, butter) / Frequency(bread). In our example, if bread was purchased 1800 times and 1700 of those 1800 purchases also contained butter, we would have a confidence of 1700/1800 = 94.4%.
Now let us look at an MBA report. In this report the user selects the product they are interested in analyzing; in our example it is bread. The report then lists all the products that were purchased together with bread.
| Product | Frequency | Support | Confidence |
|---------|-----------|---------|------------|
| Butter  | 1700      | 85.0%   | 94.4%      |
| Jam     | 1400      | 70.0%   | 77.7%      |
| Beer    | 400       | 20.0%   | 22.2%      |
Based on this report the merchandiser can make the necessary decisions. High confidence means there is a strong relationship between the products. Low support means the pair is not purchased often. A pair with both high confidence and high support is the dream pair.
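To make these measures concrete, here is a minimal Python sketch. The order data and product names below are made up purely for illustration; each order is assumed to be a simple set of product names.

```python
# Hypothetical order data: each order is a set of purchased products.
orders = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "beer"},
    {"butter", "jam"},
]

def frequency(orders, a, b):
    """Number of orders containing both products a and b."""
    return sum(1 for order in orders if a in order and b in order)

def support(orders, a, b):
    """Fraction of all orders containing both a and b."""
    return frequency(orders, a, b) / len(orders)

def confidence(orders, a, b):
    """Of the orders containing a, the fraction that also contain b."""
    orders_with_a = sum(1 for order in orders if a in order)
    return frequency(orders, a, b) / orders_with_a

print(frequency(orders, "bread", "butter"))   # 2
print(support(orders, "bread", "butter"))     # 0.5
print(confidence(orders, "bread", "butter"))  # 0.666...
```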
Algorithms :-
One of the main challenges in database mining is developing fast and efficient algorithms that can handle large volumes of data, because most mining algorithms perform computations over the entire database, and these databases are often very large. For the smooth functioning of any Data Mining project we need strong and efficient algorithms; as we all know, algorithms are the backbone of any software.
Apriori is the best-known algorithm to mine association rules. It uses a breadth-first search strategy to count the support of itemsets.
The part below is exclusively for those who are interested in algorithms.
Apriori Algorithm :-
To explain this algorithm, I cannot give a better example than the one given in Wikipedia.
A large supermarket tracks sales data by SKU (item), and thus is able to know what items are typically purchased together. Apriori is a moderately efficient way to build a list of frequently purchased itemsets from this data. Let the database of transactions consist of the sets {1,2,3,4}, {2,3,4}, {2,3}, {1,2,4}, {1,2,3,4}, and {2,4}. Each number corresponds to a product such as "butter" or "water". The first step of Apriori is to count up the frequencies, called the supports, of each member item separately:
| Item | Support |
|------|---------|
| 1    | 3       |
| 2    | 6       |
| 3    | 4       |
| 4    | 5       |
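The counts in this table can be reproduced with a few lines of Python, a quick sketch using the six transactions listed above:

```python
from collections import Counter

# The example transaction database from above.
transactions = [{1, 2, 3, 4}, {2, 3, 4}, {2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {2, 4}]

# Count how many transactions contain each individual item.
item_support = Counter(item for t in transactions for item in t)
print(item_support)  # Counter({2: 6, 4: 5, 3: 4, 1: 3})
```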
We can define a minimum support level to qualify as "frequent," which depends on the context. For this case, let min support = 3. Therefore, all four items are frequent. The next step is to generate a list of all pairs (2-itemsets) of the frequent items. Had any of the above items not been frequent, it would not have been included as a possible member of a candidate pair. In this way, Apriori prunes the tree of all possible sets.
| Itemset | Support |
|---------|---------|
| {1,2}   | 3       |
| {1,3}   | 2       |
| {1,4}   | 3       |
| {2,3}   | 4       |
| {2,4}   | 5       |
| {3,4}   | 3       |
This is counting up the occurrences of each of those pairs in the database. Since min support = 3, we don't need to generate 3-sets involving {1,3}: because {1,3} is not frequent, no superset of it can possibly be frequent. Keep going:
| Itemset | Support |
|---------|---------|
| {1,2,4} | 3       |
| {2,3,4} | 3       |
Note that {1,2,3} and {1,3,4} were not generated or tested. At this point the algorithm stops, since the only possible 4-set is {1,2,3,4} and its subset {1,3} has already been ruled out as infrequent.
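The pair-counting and pruning steps of this walkthrough can be sketched in Python like this (same example database, min support = 3):

```python
from itertools import combinations

transactions = [{1, 2, 3, 4}, {2, 3, 4}, {2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {2, 4}]
min_support = 3
frequent_items = [1, 2, 3, 4]  # every single item met min_support above

# Count the support of every 2-itemset built from the frequent items.
pair_support = {
    pair: sum(1 for t in transactions if set(pair) <= t)
    for pair in combinations(frequent_items, 2)
}
print(pair_support)
# {(1, 2): 3, (1, 3): 2, (1, 4): 3, (2, 3): 4, (2, 4): 5, (3, 4): 3}

# Keep only the frequent pairs; {1,3} is dropped here, so no 3-itemset
# containing both 1 and 3 ever needs to be generated or counted.
frequent_pairs = [p for p, s in pair_support.items() if s >= min_support]
print(frequent_pairs)  # [(1, 2), (1, 4), (2, 3), (2, 4), (3, 4)]
```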
In much larger datasets, especially those with many items that each appear in few transactions and a few items that appear in many transactions, the gains made by pruning the tree of possible itemsets like this can be very large.
Pseudo Code :-
Join Step :: Ck is generated by joining Lk-1 with itself.
Prune Step :: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, so such candidates are pruned.
Ck : candidate itemsets of size k
Lk : frequent itemsets of size k
L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;  // join step followed by prune step
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
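Putting it all together, here is a minimal runnable sketch of this pseudo code in Python. It is an illustration rather than a production implementation; itemsets are represented as frozensets so they can be used as dictionary keys.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset together with its support count."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}

    def count_frequent(candidates):
        # Keep each candidate whose support count meets min_support.
        return {c: s for c in candidates
                if (s := sum(1 for t in transactions if c <= t)) >= min_support}

    # L1: frequent 1-itemsets.
    level = count_frequent({frozenset([item]) for item in items})
    frequent, k = {}, 1
    while level:
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates having an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level
                             for sub in combinations(c, k - 1))}
        level = count_frequent(candidates)
    return frequent

transactions = [{1, 2, 3, 4}, {2, 3, 4}, {2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {2, 4}]
result = apriori(transactions, 3)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Running this on the example database prints exactly the frequent itemsets derived in the walkthrough above, ending with {1,2,4} and {2,3,4} at support 3.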
References ::
http://en.wikipedia.org/wiki/Data_mining
http://en.wikipedia.org/wiki/Association_rule_learning
Books ::
Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten and Eibe Frank.
Githen Ronney works for MindTree as a BI consultant.