Research Article

Boosting association rule mining in large datasets via Gibbs sampling

Guoqi Qian (a), Calyampudi Radhakrishna Rao (b,c,1), Xiaoying Sun (d), and Yuehua Wu (d)

PNAS May 3, 2016 113 (18) 4958-4963; first published April 18, 2016; https://doi.org/10.1073/pnas.1604553113

aSchool of Mathematics and Statistics, The University of Melbourne, Parkville, VIC 3010, Australia;
bDepartment of Biostatistics, University at Buffalo, The State University of New York, Buffalo, NY 14221-3000;
cCRRAO Advanced Institute of Mathematics, Statistics and Computer Science, Hyderabad-500046, India;
dDepartment of Mathematics and Statistics, York University, Toronto, ON, Canada M3J1P3

1To whom correspondence should be addressed. Email: crr1@psu.edu
Contributed by Calyampudi Radhakrishna Rao, March 19, 2016 (sent for review December 3, 2015; reviewed by Marepalli Rao and Talluri Jagannadha Rao)


Significance

Informative association rule mining is fundamental for knowledge discovery from transaction data, for which brute-force search algorithms, e.g., the well-known Apriori algorithm, have been developed. However, these algorithms become computationally intractable when searching a large rule space. The stochastic search algorithm developed here tackles this challenge using the idea of annealing Gibbs sampling. A rule space of exponential order can still be randomly searched by this algorithm to generate an ergodic Markov chain of viable length. The ergodic chain contains the most informative rules with probability 1, creating a much-reduced rule space for subsequent mining without information loss. This capacity of the algorithm is demonstrated using a carefully designed simulation study and a real genomic dataset containing 1,067 items.

Abstract

Current algorithms for association rule mining from transaction data are mostly deterministic and enumerative. They can be computationally intractable even for mining a dataset containing just a few hundred transaction items, if no action is taken to constrain the search space. In this paper, we develop a Gibbs-sampling-induced stochastic search procedure to randomly sample association rules from the itemset space, and we perform rule mining from the reduced transaction dataset generated by the sample. We also propose a general rule importance measure to direct the stochastic search so that, because the randomly generated association rules constitute an ergodic Markov chain, the overall most important rules in the itemset space can be uncovered from the reduced dataset with probability 1 in the limit. In a simulation study and a real genomic data example, we show how to boost association rule mining by an integrated use of the stochastic search and the Apriori algorithm.

  • association rule
  • Gibbs sampling
  • transaction data
  • genomic data

Association rule mining (1, 2) is an important task in many research areas, such as marketing, politics, and bioinformatics. One of its best-known applications is market basket analysis. An example of an association rule from basket data might be that "90% of all customers who buy bread and butter also buy milk" (1), providing important information for a supermarket's management of stocking and shelving. Instead of mining all association rules from a database, an interesting and useful task is to discover the most important association rules for a given consequent. For instance, a Walmart store manager might be interested in knowing which items most customers purchased, given that they had automotive services done in the store. For a genomic dataset, one might be interested in finding which SNP (single-nucleotide polymorphism at certain loci in a gene) variables, and which of their values, imply a certain disease with the highest probability. The focus of this paper is to identify the most important association rules in a transaction dataset.

Let us formally define the problem of association rule mining using the notation of ref. 3. Define I={I1,I2,…,Im} as a set of m items, called the "item space," and D={t1,t2,…,tn} as a list of transactions, where each transaction in D is a subset of items in I, i.e., tj⊂I, j=1,…,n. An association rule is an implication of the form X⇒Y, where X,Y⊂I and X∩Y=∅. The sets of items (for short, "itemsets") X and Y are called the "antecedent" and the "consequent" of the rule, respectively. The support of an itemset X, supp(X), is defined as the proportion of transactions in D that contain X. The confidence of an association rule is defined as conf(X⇒Y)=supp(X&Y)/supp(X), where X&Y is the itemset obtained by amalgamating X with Y. The support of an itemset measures its commonness, and the confidence of an association rule measures its association strength. By the essential meaning of support, we can also define the support of a rule X⇒Y, which is just supp(X⇒Y)≡supp(Y⇒X)≡supp(X&Y).
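To make these definitions concrete, here is a minimal Python sketch (the toy data and function names are my own, not from the paper) computing the support and confidence of a rule directly from a transaction list:

```python
# Minimal sketch of supp and conf on a toy basket dataset (illustrative data).
def supp(itemset, transactions):
    """Proportion of transactions in D containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def conf(antecedent, consequent, transactions):
    """conf(X => Y) = supp(X & Y) / supp(X)."""
    joint = set(antecedent) | set(consequent)   # amalgamated itemset X & Y
    return supp(joint, transactions) / supp(antecedent, transactions)

D = [{"bread", "butter", "milk"}, {"bread", "butter"},
     {"milk"}, {"bread", "butter", "milk"}]
print(supp({"bread", "butter"}, D))              # 0.75
print(conf({"bread", "butter"}, {"milk"}, D))    # 0.666...
```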

Most current algorithms mine association rules via constraint-based search. For instance, the Apriori algorithm (1) mines all rules satisfying a user-specified minimum support or minimum confidence, and a maximum length. Such an algorithm is difficult to use on a dense dataset: it either searches through too many rules, becoming computationally infeasible, when the constraints are loose, or misses the important rules otherwise. Some rule-mining algorithms use well-defined metrics to identify the most important association rules (4), but they also use deterministic, exhaustive search, and consequently become computationally intractable when applied to a dense dataset with, say, a few hundred items in the item space.

In this paper, we present a stochastic search algorithm to mine the most important, or optimal, association rules from a transaction dataset without information loss. The motivation comes from a genomic dataset of a disease and hundreds of SNP variables, and from the desire to mine the most important association rules for the disease outcome. Because deterministic search algorithms cannot cope with the computational intensity and immensity of this dataset, we have developed a stochastic algorithm to overcome the difficulty.

A New Algorithm Based on Gibbs Sampling

Motivation.

Consider a dataset for supervised learning that contains observations of a response variable and a number of predictor variables from a sample of individual subjects. Such a dataset can be converted into a transaction dataset for association rule mining if both the response and the predictors are categorical. For example, datasets used in genome-wide association studies often consist of observations of a categorical response and categorical predictors on subjects. Here the response is a disease outcome with two categories, case (C) and noncase (NC), and each predictor is a so-called SNP variable with 3 categories corresponding to 0, 1, and 2 copies of the minor allele at the locus. In this case, the response variable can be represented by 2 response items, and each predictor variable can be represented by 3 predictor items.
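As a concrete illustration of this conversion, the following Python sketch (the helper name, column labels, and toy data are my own assumptions) expands each 3-valued SNP variable into 3 binary predictor items and the case/noncase response into 2 response items:

```python
# Hypothetical converter: categorical SNP data -> binary transaction matrix.
import numpy as np

def to_transaction_matrix(snps, response):
    """snps: (n, p) array with entries in {0, 1, 2}; response: (n,) 0/1 array.
    Returns an (n, 3p + 2) binary item matrix and its item labels."""
    n, p = snps.shape
    cols, labels = [], []
    for j in range(p):
        for level in (0, 1, 2):                      # one item per SNP level
            cols.append((snps[:, j] == level).astype(int))
            labels.append(f"SNP{j + 1}={level}")
    cols.append((response == 1).astype(int)); labels.append("I_C")
    cols.append((response == 0).astype(int)); labels.append("I_NC")
    return np.column_stack(cols), labels

snps = np.array([[0, 2], [1, 1], [2, 0]])            # 3 subjects, 2 SNPs
y = np.array([1, 0, 1])                              # 1 = case, 0 = noncase
T, labels = to_transaction_matrix(snps, y)
print(T.shape, labels[:3], labels[-2:])              # (3, 8) ... ['I_C', 'I_NC']
```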

To present the above discussion more explicitly, let n be the total number of transactions and k be the total number of predictor items. Denote the two response items as IC and INC. The association rules of interest then have an antecedent that is a subset of {I1,…,Ik} and a consequent that is either IC or INC. These rules represent associations between values of the various predictor variables and the response variable, which are different from those revealed by a supervised learning model such as a logistic or log-linear regression model.

An alternative representation of the transactions is by binary vectors. Let Js=1 or 0 indicate the presence or absence of item s, for s=1,…,k,C,NC, and denote J=(J1,…,Jk). The components of this binary vector are not necessarily independent of one another, and the dependence involved provides a probabilistic interpretation of the associations among all of the items. Each transaction is an observation of the binary vector. The collection of association rules of interest can therefore be divided into two families, ℛC={J⇒IC, (J1,…,Jk)∈{0,1}^k} and ℛNC={J⇒INC, (J1,…,Jk)∈{0,1}^k}. Let I={I1,I2,…,Ik} and denote the power set of I by 2^I; this is the itemset space consisting of all possible itemsets of I. Given that the consequent is either IC or INC, one has to search through 2^I for all possible association rules. The following two properties clearly hold for such a transaction dataset:

  • Property 1: 0≤supp(J⇒I−)≤conf(J⇒I−)≤1, where I− represents IC or INC.

  • Property 2: Because JC+JNC=1, conf(J⇒IC)+conf(J⇒INC)=1.

Our interest is in finding association rules with high confidence. A constraint-based algorithm like Apriori is computationally challenging to run when the item space is too large, and even more so when the rules with high confidence have very low support. In an example given in ref. 5, the forestry society FallAll applied association rule mining to a dataset of 1,000 observations on marsh sites, to provide advice on draining swamps to grow new forests. The Apriori algorithm was applied to this dataset with the minimum support and confidence specified as 0.05 and 0.80, respectively, but a strong association rule of confidence 1.0 and support 0.04 was missed under this set of constraints. In general, mining association rules in a dense dataset under improper constraints can miss important rules and be misinformed by noninformative ones. This mishap motivates us to propose a random sampling and search procedure that defines a probability distribution over all possible association rules from the power set of I, i.e., 2^I, in order to find those rules with a high combined importance of confidence and support.

New Algorithm.

The probability distribution for sampling and searching important association rules should incorporate both the support and the confidence of the rules. To this end, we first define a new measure for association rules in ℛC∪ℛNC, called the "importance," of the form g(J⇒I−)=g−(J)=f(supp(J⇒I−),conf(J⇒I−)) for a given association rule J⇒I−, with I− being IC or INC. Here f is a user-specified positive increasing function reflecting a combined importance of the support and confidence of the rule; plausible choices of f are the minimum, sum, or product of the support and confidence. Once f is specified, our aim becomes finding the most important association rules in ℛC and ℛNC, which can be achieved by the following random-sampling-based search procedure.

We illustrate the idea of this procedure by focusing on rules in ℛC; the same applies to finding the most important rules in ℛNC. In light of the non-Bayesian optimization idea of ref. 6, we propose a probability distribution defined on ℛC as

$$p_C(J) = P(J \Rightarrow I_C) = \frac{e^{\xi\, g(J \Rightarrow I_C)}}{\sum_{\text{all } J} e^{\xi\, g(J \Rightarrow I_C)}}, \tag{1}$$

where ξ>0 is a tuning parameter. The most important rule in ℛC, denoted Jopt⇒IC, is also the one maximizing pC(J) over ℛC, i.e., Jopt=arg maxJ pC(J). This implies that Jopt can be found (with probability 1) from a random sample of J's generated from pC(J), if the sample size is sufficiently large: it can be proved that, with probability 1, Jopt appears most frequently in the sample and has the largest value of g(J⇒IC). However, generating a random sample from pC(J) is not trivial when k is not small, because the rule space ℛC becomes huge and the normalizing denominator in pC(J) becomes intractable to evaluate. It turns out that Gibbs sampling can be used to generate random samples from pC(J); for this we need all of the conditional probability distributions of Js given J−s:

$$p_C(J_s=1 \mid J_{-s}) = \frac{p_C(J_s=1, J_{-s})}{p_C(J_{-s})} = \frac{p_C(J_s=1, J_{-s})}{p_C(J_s=1, J_{-s}) + p_C(J_s=0, J_{-s})},$$

$$p_C(J_s=0 \mid J_{-s}) = 1 - p_C(J_s=1 \mid J_{-s}),$$

for s=1,2,…,k. Here J−s is the subvector of J with Js removed, and (Js,J−s) is the vector with Js put back into its original position in J.

The Gibbs sampling algorithm for generating draws of J from pC(J) is then as follows (a code sketch is given after the list):

  • i) Arbitrarily choose an initial vector J^(0)=(J_1^(0),…,J_k^(0));

  • ii) For j=1,2,…,M, obtain the antecedent J^(j) of the rule J^(j)⇒IC by generating J_s^(j), s=1,2,…,k, sequentially from the Bernoulli distribution pC(J_s | J_1^(j),…,J_{s−1}^(j), J_{s+1}^(j−1),…,J_k^(j−1));

  • iii) Return (J^(1),…,J^(M)) as the sample of association rules {J^(j)⇒IC; j=1,…,M}.
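Below is a minimal Python sketch of steps i)–iii), assuming the importance g(J⇒IC) = supp × conf used later in the examples; the names X (an n × k binary predictor-item matrix), y (the binary case indicator), and the helpers are my own. Because only ratios of pC enter the full conditionals, the intractable denominator of [1] never needs to be evaluated:

```python
# Sketch of the Gibbs sampler for p_C(J) proportional to exp(xi * g(J => I_C)).
import numpy as np

rng = np.random.default_rng(0)

def importance(J, X, y):
    """g(J => I_C) = supp(J => I_C) * conf(J => I_C)."""
    idx = np.flatnonzero(J)
    contains = np.all(X[:, idx] == 1, axis=1)   # transactions containing the antecedent
    supp_ant = contains.mean()
    if supp_ant == 0:
        return 0.0                              # antecedent never observed
    supp_rule = (contains & (y == 1)).mean()
    return supp_rule * (supp_rule / supp_ant)   # supp * conf

def gibbs_sample_rules(X, y, xi=10.0, M=1000):
    n, k = X.shape
    J = rng.integers(0, 2, size=k)              # i) arbitrary initial vector J^(0)
    sample = np.empty((M, k), dtype=int)
    for j in range(M):                          # ii) sweep s = 1, ..., k
        for s in range(k):
            J[s] = 1; g1 = importance(J, X, y)
            J[s] = 0; g0 = importance(J, X, y)
            # p_C(J_s = 1 | J_-s) = e^{xi*g1} / (e^{xi*g1} + e^{xi*g0});
            # clipping keeps exp() finite for very large xi.
            p1 = 1.0 / (1.0 + np.exp(np.clip(xi * (g0 - g1), -700, 700)))
            J[s] = int(rng.random() < p1)
        sample[j] = J                           # iii) store J^(j)
    return sample
```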

The generated sequence {J^(1),…,J^(M)} is in fact a Markov chain with stationary distribution pC(J), and it can be shown that the most frequently occurring rule in the generated sample converges to Jopt with probability 1 as M→∞. Moreover, the most important association rules in ℛC are more likely to appear most frequently in the generated sample than other, less important ones, provided the sample size M is sufficiently large. In cases where the importance values of many important association rules are large but very close to one another, choosing a larger value of the tuning parameter ξ increases the probability ratio of every two rules, $p_C(J^{1})/p_C(J^{2}) = e^{\xi\,(g(J^{1} \Rightarrow I_C) - g(J^{2} \Rightarrow I_C))}$, which helps differentiate the more important rules from the less important ones.

This is the framework of our random search procedure. We remark that the function g in [1] can be replaced by another measure of association rules of interest, such as lift or leverage (4); a random sample can then just as easily be generated according to that measure.

Once {J^(1),…,J^(M)} is generated, the optimal association rules in ℛC, i.e., those with the highest importance, can be approximated by the association rules with the near-highest frequencies in the sample. The approximation can be made as precise as one wants, provided the sample size is sufficiently large. Note that if the item space is very large, generating a long sample is computationally expensive, and in a random sample of relatively small size M the sampled association rules could all differ from one another, each with the same frequency 1/M; in that case, possibly none of the sampled rules is optimal. Instead, we can compute the frequency of each item that ever appears in the antecedents of the sampled rules: the frequency of item Ii is $\sum_{j=1}^{M} J_i^{(j)}/M$, for i=1,2,…,k. We thereby obtain a subset of items that appear most frequently, and can then apply the Apriori algorithm to the itemset space generated by the selected items to mine the optimal rules (a sketch of this reduction step is given below). Our simulation study shows that the random sample obtained by Gibbs sampling can largely reduce the itemset space to be searched while simultaneously retaining the most frequent predictor items from the optimal association rules. In the next section we elaborate on how to use the generated sample of rules.
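A short sketch of this item-frequency reduction (the 0.75 threshold is an illustrative assumption, not a value prescribed by the paper):

```python
# Reduce the item space using per-item appearance frequencies in the sample.
import numpy as np

def frequent_items(sample, threshold=0.75):
    """sample: (M, k) matrix of Gibbs draws J^(1), ..., J^(M).
    Returns indices i with sum_j J_i^(j) / M >= threshold, plus all frequencies."""
    freq = sample.mean(axis=0)          # frequency of item I_i over the sample
    return np.flatnonzero(freq >= threshold), freq
```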

Simulation Study and Real Data Application

In this section, we present several numerical examples, first based on simulated data, and then on real data to demonstrate the performance of the random-sampling-based search procedure in different scenarios.

Simulation Studies.

A transaction dataset containing strong association rules can be obtained by using the R package MultiOrd (7) to generate a list of binary vectors from a multivariate Bernoulli distribution of correlated binary random variables with a compatible pair of mean vector p and correlation matrix R (8); a rough Python substitute for this step is sketched below. We start with a small dataset to show that our method finds the same optimal association rules as the Apriori algorithm.
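The paper performs this step with MultiOrd in R; as a rough Python substitute, the sketch below thresholds correlated Gaussians (a Gaussian copula). Note that the realized binary correlations only approximate the target R unless the latent correlations are calibrated, in the spirit of ref. 8:

```python
# Approximate generator of correlated binary vectors with means p and
# correlation roughly R (Gaussian-copula substitute for MultiOrd).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def correlated_bernoulli(p, R, n):
    """n draws of a binary vector with marginal means p and correlation ~ R."""
    z = rng.multivariate_normal(np.zeros(len(p)), R, size=n)
    return (z < norm.ppf(p)).astype(int)    # P(Z_i < Phi^{-1}(p_i)) = p_i
```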

Example 1.

Suppose a small transaction dataset has k=3 predictor items I1,I2,I3 and two response items IC,INC. Also suppose that the marginal probability vector of (J1,J2,J3,JC) is p=(0.5,0.5,0.5,0.5) and that the correlation matrix of (J1,J2,J3,JC) is

$$R = \begin{pmatrix} 1 & 0 & 0 & 0.8 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0.2 \\ 0.8 & 0 & 0.2 & 1 \end{pmatrix}.$$

We generate n=100 binary vectors of (J1,J2,J3,JC) according to (p,R) and compute JNC=1−JC, obtaining a transaction dataset of 100 transactions on the 5 items I1,I2,I3,IC,INC. For each response item there are in total 2^k−1=7 possible association rules. We first use the Apriori algorithm (3) to mine all association rules of the form J⇒IC or J⇒INC with support and confidence greater than 0, and summarize the results in Table 1. For illustration purposes we choose g−(J)=supp(J⇒I−)×conf(J⇒I−), with I−=IC or INC depending on whether ℛC or ℛNC is being mined. We then use the proposed Gibbs sampling algorithm to generate three random samples of association rules of size M=1,000 from the transaction dataset, choosing ξ=3, 6, and 10, respectively. The frequency with which each association rule appears in each sample is shown in Table 1. The ranking by frequency conforms to the ranking by importance g−(J), showing the good performance of our method. It is easy to see that the frequencies gain more power to differentiate the most important rules from the less important ones as the value of ξ increases; an illustrative end-to-end run of the code sketches on this example follows.
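For illustration, the earlier sketches can be strung together on Example 1's settings (an approximate reproduction only, given the Gaussian-copula substitute for MultiOrd; reuses correlated_bernoulli and gibbs_sample_rules defined above):

```python
# End-to-end toy run of Example 1.
import numpy as np

p = np.array([0.5, 0.5, 0.5, 0.5])
R = np.array([[1.0, 0.0, 0.0, 0.8],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.2],
              [0.8, 0.0, 0.2, 1.0]])
data = correlated_bernoulli(p, R, n=100)
X, y = data[:, :3], data[:, 3]                     # items I1..I3 and J_C
sample = gibbs_sample_rules(X, y, xi=6.0, M=1000)  # one of the xi values used
rules, counts = np.unique(sample, axis=0, return_counts=True)
for J, c in sorted(zip(rules.tolist(), counts), key=lambda t: -t[1]):
    print(J, c / 1000)                             # empirical rule frequencies
```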

Table 1.

Association rules and their measurements

Next we illustrate how to use the random search procedure and how well it performs on three more complex datasets.

Example 2.

Consider an item space I=(I1,I2,…,I398,IC,INC) with k=398 predictor items and two response items. Set each marginal probability of (J1,J2,J3,JC) to 0.8 and the marginal probability of every other predictor item to 0.2, i.e.,

p = (p1,p2,p3,p4,…,p398,pC) = (0.8, 0.8, 0.8, 0.2, …, 0.2, 0.8).

The correlation matrix R between items is set to the identity matrix, except that R(Js1,Js2)=0.99 for s1≠s2, s1,s2∈{1,2,3,C}. We then generate n=300 binary vectors of (J1,J2,…,J398,JC) according to (p,R). The transaction dataset T1 formed accordingly contains 400 items and 300 transactions, the status of INC in each transaction being completely determined by JNC=1−JC.

Example 3.

The transaction dataset T2 has the same item space, the same number of transactions, and the same correlation matrix as T1, but a different marginal probability vector,

p = (p1,p2,p3,p4,…,p20,p21,…,p398,pC) = (0.8, 0.8, 0.8, 0.5, …, 0.5, 0.2, …, 0.2, 0.8).

Example 4.

The transaction dataset T3 also has 400 items and n=300 transactions. The marginal probability vector is

p = (p1,p2,p3,p4,…,p10,p11,…,p398,pC) = (0.8, 0.8, 0.8, 0.6, …, 0.6, 0.2, …, 0.2, 0.8).

The correlation matrix R is the identity matrix except that

R(Js1,Js2) = 0.9 for s1≠s2, s1,s2∈{1,2,3,C};
R(Js1,Js2) = 0.5 for s1≠s2, s1,s2∈{4,…,10,C};
R(Js1,Js2) = 0.5 for s1∈{1,2,3}, s2∈{4,5,…,10}.

From the settings of T1, T2, and T3, we see that items I1, I2, and I3 have high support, and the antecedents of the important association rules in these datasets most likely contain some of I1, I2, and I3. We now use the Apriori algorithm and the new Gibbs-sampling-based search procedure to see whether we can unveil these attributes in T1, T2, and T3.

To mine the association rules in ℛC of each transaction dataset, a random sample of M=100 association rules is generated from each ℛC using the new algorithm. We find that the larger ξ is, the more frequently the three items I1, I2, I3 appear in the generated sample. When ξ=100, the only items ever appearing in the sample are I1, I2, I3, and I390. The proportions of sampled association rules containing each of I1, I2, I3, and I390 in T1, T2, and T3 are shown in Table 2. The item I390 appears only once in each sample and thus does not seem to have high support in the datasets.

Table 2.

Items appearing in the random sample

We then apply the Apriori algorithm with the constraints of minimum support 0.05 and minimum confidence 0.6. This identifies 31,525, 170,600, and 442,191 association rules from T1, T2, and T3, respectively. The 10 most frequent items appearing in these rules for each dataset, and their respective proportions of appearance, are shown in Table 3. For each dataset, the top 10 identified rules according to the importance g(⋅) are also calculated and presented in Table 4, together with their respective frequencies of appearance in the corresponding generated random sample. The ranking of the top 10 rules by frequency in Table 4 conforms, by and large, to their ranking by the importance measure. We find that even as the dependence structure of the transaction dataset becomes more complicated, our algorithm can generate a random sample containing the most important association rules, as confirmed by the Apriori algorithm.

Table 3.

Top 10 frequent items appearing in the rules identified by the Apriori algorithm for each dataset T1, T2, or T3

Table 4.

Top 10 important association rules from T1 (left), T2 (middle), and T3 (right) and their frequencies in the relevant sample

To mine the association rules in ℛNC, we first use the Apriori algorithm with the minimum support and confidence setting (0.05, 0.6) for each dataset, and find no rules. However, we cannot yet conclude that there are no important association rules in ℛNC, because it is computationally infeasible for the Apriori algorithm to search a larger collection of itemsets under weakened minimum support and confidence constraints. We therefore apply the proposed algorithm with various values of ξ. The number of items ever appearing in the generated samples decreases from 398 to around 100 when ξ=1,000, but cannot be reduced further by increasing ξ, except for T1; we are also still unable to find any important rules from the generated samples. Hence we tend to conclude that there are no important association rules in ℛNC for the simulated datasets T1, T2, and T3. Finally, we apply the Apriori algorithm with minimum support and confidence (0.05, 0.6) to the subset of each transaction dataset that includes only INC and the items ever appearing in the corresponding random sample generated with ξ=1,000, and are again unable to find any important rules of the form J⇒INC. These results conform to the setting used to simulate the transaction datasets T1, T2, and T3, where pNC=0.2 is small.

From Examples 2–4, we see that, by choosing a suitably large value of ξ, our method is capable of finding the most important association rules, which also appear most frequently in the generated random sample. In cases where the item space is large and the support of rules is very low, the proposed algorithm can be combined with the Apriori algorithm to tackle the rule-mining task more efficiently; a sketch of such a combined pipeline is given below.
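One way to realize this combination in Python is sketched below. It assumes the third-party mlxtend package for the Apriori step (the paper's own computations use the R package arules) and reuses the hypothetical helpers gibbs_sample_rules and frequent_items from the earlier sketches; all thresholds are illustrative:

```python
# Combined pipeline: Gibbs-based item reduction, then Apriori on the
# reduced item space (mlxtend assumed; not the paper's own code).
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

def mine_reduced(X, y, labels, xi=1000.0, M=300,
                 min_support=0.05, min_confidence=0.6):
    sample = gibbs_sample_rules(X, y, xi=xi, M=M)     # stochastic search step
    keep, _ = frequent_items(sample, threshold=0.75)  # reduced item space
    cols = [labels[i] for i in keep] + ["I_C"]
    df = pd.DataFrame(np.column_stack([X[:, keep], y]).astype(bool),
                      columns=cols)
    freq = apriori(df, min_support=min_support, use_colnames=True)
    rules = association_rules(freq, metric="confidence",
                              min_threshold=min_confidence)
    # keep only rules whose consequent is exactly the case item I_C
    return rules[rules["consequents"].apply(lambda s: s == frozenset({"I_C"}))]
```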

Real Data Application.

We apply the proposed Gibbs sampling method to analyze a case-control dataset containing genomic observations for n=229 women, 39 of whom are breast cancer cases obtained from the Australian Breast Cancer Family Study (ABCFS) (9) and 190 of whom are controls from the Australian Mammographic Density Twins and Sisters Study (AMDTSS) (10). The dataset is formed by sampling from a much larger data source from ABCFS and AMDTSS. Each woman in the dataset has 366 genetic observations, namely the genotype outcomes (from a Human610-Quad BeadChip array) of 366 SNPs on a specific gene pathway suspected to be susceptible to breast cancer. An SNP variable typically takes a value from 0, 1, and 2, representing the number of minor alleles at the SNP locus; in the current dataset, however, there are 31 SNPs for which only 2 of the 3 possible values are observed. Our task is to find out whether there are any SNPs having significant associations with the risk of breast cancer, and what these SNPs are. One could use a logistic model for this task, but that is difficult because the number of predictor variables (i.e., SNPs) is much larger than the number of observations, and the SNPs are highly associated with one another due to linkage disequilibrium. Because this dataset can easily be turned into a transaction dataset, we can instead use an association rule-mining method. The binary transaction dataset converted from our case-control dataset contains 1,067 predictor (SNP) items (denoted I1,…,I1067) and 2 response items, IC (breast cancer) and INC (no breast cancer). It is easy to see that 0≤supp(J⇒IC)≤0.17 and 0≤supp(J⇒INC)≤0.83. For illustration purposes we choose the importance measure g−(J)=supp(J⇒I−)×conf(J⇒I−), with I− being either IC or INC. Our aim is to find the most important association rules for IC and INC.

To estimate the most important rule in ℛNC, we first generate random samples of M=300 association rules from the converted transaction dataset with ξ=100, 200, and 300, respectively. The frequency of each association rule appearing in each sample is computed and shown in Table 5. The differences among the frequencies of the association rules become larger as ξ increases, which makes the most important association rules stand out. The top 10 items ever appearing in each of the three samples, and their proportions of appearance in the respective samples, are shown in the upper section of Table 6; item I136 is the most frequent item in every generated sample. Moreover, the most important association rule, I136⇒INC, appears most frequently in each random sample (see the top part of Table 5).

Table 5.

Top 10 important association rules in ℛNC for the SNPs data and their sampling frequencies of appearance

Table 6.

Top 10 frequent items appearing in the random samples of association rules for IC (lower section, i.e., the bottom six rows) and INC (upper section)

We then use the Apriori algorithm to find the most important association rules in ℛNC under different specifications of the minimum support and confidence. The top 10 important association rules can be found by the Apriori algorithm with minimum support 0.6 and minimum confidence 0.8; their various measures are shown in Table 5. It can be shown that the association rule I136⇒INC is indeed the most important rule in ℛNC. This is the same rule found by the stochastic search, confirming the good performance of our proposed method.

For the association rules in ℛC, the support of any rule is at most 0.17. Because the support of the rules is so low and the item space is very large, the Apriori algorithm cannot cope with the computational intensity and immensity involved, even with minimum support 0.2 and minimum confidence 1. We therefore use our proposed method to find the most important rules with consequent IC, or at least to reduce the size of the item space. The number of items appearing in the generated samples decreases from 1,067 to about 35 as ξ increases from 10 to 6,000, but cannot be reduced further by larger values of ξ. The top 10 items ever appearing in the generated samples are reported in the lower portion of Table 6. For illustration purposes we choose ξ=6,000, for which the number of distinct items appearing in the random sample is 35. We apply the Apriori algorithm to the subset of the transaction dataset including only these 35 items, specifying minimum support 0.2 and minimum confidence 1; the Apriori algorithm is still not implementable. We then single out the 22 of these 35 items that appeared in at least three-fourths of the sampled association rules, and cut out a new subset of the original transaction dataset by including only these 22 items in the transactions. Specifying minimum support 0.05 and minimum confidence 0.6, a total of 286,188 association rules are found in this new subset; the top 10 important rules among them are reported in Table 7. From Table 7 we see that the importance measures of these association rules are very low and close to one another; it would not be possible to find these rules by applying the Apriori algorithm alone. Our proposed Gibbs-sampling-based algorithm can thus be used to reduce the number of items for mining, and the reduced data subset is exactly where the Apriori algorithm can be applied to find the most important association rules subject to negligible information loss. One could look into these rules, or the frequent items in Tables 6 and 7, to find the biological meaning behind them.

Table 7.

Top 10 association rules for IC after reducing the item space

Acknowledgments

This work is partially supported by the Natural Sciences and Engineering Research Council of Canada. G.Q.’s research is also partly supported by the Australian National Health and Medical Research Council Project Grant APP1033452.

Footnotes

  • 1. To whom correspondence should be addressed. Email: crr1@psu.edu.
  • Author contributions: G.Q., C.R.R., and Y.W. designed research; G.Q., C.R.R., X.S., and Y.W. performed research; G.Q., X.S., and Y.W. analyzed data; and G.Q., C.R.R., X.S., and Y.W. wrote the paper.

  • Reviewers: M.R., University of Cincinnati; and T.J.R., Indian Statistical Institute.

  • The authors declare no conflict of interest.


References

  1. Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. ACM SIGMOD Rec 22(2):207–216.
  2. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Data Bases (Morgan Kaufmann, San Francisco), pp 487–499.
  3. Hahsler M, Grün B, Hornik K (2005) arules – A computational environment for mining association rules and frequent item sets. J Stat Softw 14(15):1–25.
  4. Bayardo RJ, Agrawal R (1999) Mining the most interesting rules. Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, New York), pp 145–154.
  5. Hämäläinen W (2009) StatApriori: An efficient algorithm for searching statistically significant association rules. Knowl Inf Syst 23(3):373–399.
  6. Qian G, Field C (2002) Using MCMC for logistic regression model selection involving large number of candidate models. Monte Carlo and Quasi-Monte Carlo Methods 2000, eds Fang K-T, et al. (Springer, Berlin), pp 460–474.
  7. Amatya A, Demirtas H (2015) MultiOrd: An R package for generating correlated ordinal data. Commun Stat Simul Comput 44(7):1683–1691.
  8. Chaganty NR, Joe H (2006) Range of correlation matrices for dependent Bernoulli random variables. Biometrika 93(1):197–206.
  9. Dite GS, et al. (2003) Familial risks, early-onset breast cancer, and BRCA1 and BRCA2 germline mutations. J Natl Cancer Inst 95(6):448–457.
  10. Odefrey F, et al.; Australian Twins and Sisters Mammographic Density Study (2010) Common genetic variants associated with breast cancer and mammographic density measures that predict disease. Cancer Res 70(4):1449–1458.