The Prediction of Disclosure Risk in the Numerical Database.

Published: Jul 17, 2018

The internal data of an organization may grow rapidly over time. To reduce costs, the organization may choose a third-party storage provider to hold all of its data, and there is a risk of leakage when that provider cannot be trusted. In another scenario, a dealer collects all transaction data and publishes it to a data analysis company for marketing purposes; privacy may be revealed if the company is malicious. For this reason, preserving privacy in the database has become a very important issue. This paper is concerned with prediction disclosure risk in the numerical database. We present an efficient noise generation scheme that relies on the Huffman coding algorithm. We also build a noise matrix that makes adding noise to the original values intuitive. Moreover, we apply a clustering technique before generating noise. The results show that noise generation runs faster with the clustering scheme than with the unclustered scheme.

Technology brings convenience, and cloud computing has risen in recent years. The internal data of an organization may grow rapidly. Although an organization may build its own storage, it may still publish its data to a data analysis company for marketing purposes. Hence, data mining techniques play an important role in Knowledge Discovery in Databases (KDD). However, a malicious data analysis company may record personal data when the organization publishes its statistical database to the company. If the company cannot be trusted, there is a risk of leakage. For these reasons, privacy research has become more popular in recent years. Statistical databases (SDBs) are used to produce statistical aggregates, such as sum, average, max, and min. The results of statistical aggregates do not reveal the content of any single individual tuple. However, a user may issue many legal queries and infer confidential information from the responses the database returns.

In recent years, enhancing the security of statistical databases has received a lot of attention. The security problem in the classical statistical database involves three roles [17]: the statistician, whose interest is to obtain aggregate data; the data owner, who wishes individual records to be secure; and the database administrator, who needs to satisfy both of the other roles. The privacy challenges in statistical databases are classified into two aspects [15]: for the data owner, the database should prevent data theft by hackers, prevent data abuse by the service provider, and restrict user access rights; for the user, it should hide the query content and not reveal query details. Many approaches have been proposed. Navarro-Arribas and Torra organize them into four categories as follows [16]:

1) Perturbative methods, which modify the original data to reach a degree of privacy. The modification is usually called noise;

2) Non-perturbative methods, which mask the data without introducing error. In contrast to perturbative methods, the data is not distorted;

3) Cryptographic methods, which use classical cryptographic systems;

4) Synthetic data generation, which generates random data while retaining a relationship with the original data.

In order to protect confidential information in the database, Statistical Disclosure Control (SDC) is the most widely used privacy-preserving solution for statistical databases. Micro-Aggregation Techniques (MATs) belong to the SDC family and are perturbative methods. Micro-aggregation has many attractive features, including robust performance, consistent responses, and ease of implementation [6]. A user can still obtain useful information because the method loses little of the information in the content. In other words, information loss under this method is minimal.

Furthermore, we review some approaches to preserving privacy [1-5,8,12-14,17]. In particular, micro-aggregation has attracted attention for statistical databases in recent years, because it replaces the original values with low distortion and so helps prevent identity and prediction disclosure. The replaced data does not cause problems for data analysis or data mining applications. Every record in the database can be represented as a data point in a coordinate system.

This paper considers that a combination of two or more nonconfidential attributes, such as age and weight, can be used to link to an individual. Such a set of attributes is collectively called a quasi-identifier. A popular approach to replacing original data is to use a clustering-based technique to prevent identity disclosure, so that the adversary is confused when the original data is replaced by a group measure. However, because the clustering-based technique makes the data in each group homogeneous, a problem of prediction disclosure remains.

2. Proposed scheme.

This paper addresses the prediction disclosure that arises when a quasi-identifier is generalized into homogeneous values by the micro-aggregation method. A quasi-identifier has one or more attributes that may link to an individual. For brevity, we only consider a quasi-identifier with two attributes. First, all values of the quasi-identifier are converted to data points in the coordinate system. To address prediction disclosure, the homogeneous values produced by the original micro-aggregation method are clustered first. Then we generate noise based on the centroids of these groups. To speed up noise injection, all noise values are collected into a set, which is called the noise matrix in this paper. Each original value corresponds to a noise value. In this section, we introduce the concept of micro-aggregation and then illustrate the clustering technique, which is based on Prim’s MST. The main ideas of the paper are noise generation and the noise injection procedure, which are described in the rest of this section.


2.1 Micro-aggregation.

Micro-aggregation belongs to the family of statistical disclosure control techniques and is applied to numerical data, categorical data, sequences, and heterogeneous data [16]. It computes a value to represent a group and uses it to replace the original values in order to confuse the adversary. Each record forms a group with its nearest records. The group size threshold k is a constant preset by the data protector. The higher k is, the higher the degree of privacy but the lower the data quality. In contrast, the lower k is, the lower the degree of privacy but the higher the data quality. It is a trade-off between disclosure risk and information loss. Although this method alters the original data and may distort it, it keeps the distortion low and does not affect database operations. Minimizing the information loss is therefore a major challenge of this method. Micro-aggregation has two main operations, partition and aggregation, which we describe as follows:

Partition: records are partitioned into several disjoint groups, and each group includes at least k records.

Aggregation: each record in the group is replaced by the centroid of the group, which is a computed value to represent the group.
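The aggregation operation can be illustrated with a minimal Python sketch. It assumes that the partition has already been made and that the centroid is the per-attribute mean of the group; the function name `aggregate` and the sample (age, weight) records are hypothetical and not taken from the paper.

```python
from statistics import mean

def aggregate(groups):
    """Replace every record in each group by the group's centroid."""
    result = []
    for group in groups:
        # Assumed centroid: the per-attribute mean of the records in the group.
        centroid = tuple(mean(column) for column in zip(*group))
        result.append([centroid] * len(group))
    return result

# Hypothetical quasi-identifier values (age, weight), already partitioned with k = 2.
groups = [
    [(20, 50), (22, 54)],
    [(40, 70), (44, 80), (42, 75)],
]
masked = aggregate(groups)
print(masked)
```

Every record in a group ends up with the same value, which is the homogeneity that the noise generation described later is meant to address.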

2.2 MST Clustering.

We adopt the clustering technique based on Prim’s minimum-cost spanning tree that was proposed by Laszlo and Mukherjee in 2005 [11].

In the first step, Prim’s minimum-cost spanning tree is constructed over all records in the dataset. Prim’s algorithm is a greedy algorithm that finds a minimum-cost spanning tree for a connected, weighted, undirected graph. It finds a subset of edges that connects all nodes such that the total weight of the edges is minimized. Some notation is defined to facilitate the discussion. Each record in the dataset D, with its attributes, is converted to a data point in the coordinate system and is considered a node u in the minimum-cost spanning tree. The node u can be connected to another node v in the dataset D to form an edge e(u,v), u, v ∈ D. A value can be computed for the edge between any two nodes in the dataset, and that value is used as the weight w of the edge. Prim’s algorithm first selects a single node u ∈ D and builds a tree F = {u} with no edges. The next step selects another node v ∈ D − F that is closest to the set F; here, v is closest to the node u. A new edge e(u,v) is formed by the two nodes u, v ∈ D; node v points to its parent node u, and v is added to the set F, so F = {u,v}. Each node points to its parent node in the tree, except the initial node, which points to null. In this case, the node u points to null. This process iterates until F = D. Prim’s algorithm thus grows a minimum-cost spanning tree from a single node, which is regarded as the root of the tree, and the total weight of all selected edges is minimized. The result of Prim’s MST algorithm is shown in Fig 1, where the nodes of the tree are connected by red lines and the weight is shown next to each edge.
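As a rough illustration of this first step, the following Python sketch grows the tree with parent pointers. It assumes that the edge weight is the Euclidean distance between two data points, which the text does not state explicitly, and the function name `prim_mst` and the sample points are hypothetical.

```python
import math

def prim_mst(points):
    """Return parent[i] for every node; the root's parent is None (null)."""
    n = len(points)
    parent = [None] * n
    best = [math.inf] * n   # cheapest known edge weight linking node i to the tree F
    in_tree = [False] * n   # membership in the set F
    best[0] = 0.0           # node 0 is arbitrarily chosen as the root
    for _ in range(n):
        # Select the node outside F that is closest to F.
        v = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[v] = True
        # Update the cheapest connection of every node still outside F.
        for w in range(n):
            if not in_tree[w]:
                d = math.dist(points[v], points[w])  # assumed weight: Euclidean distance
                if d < best[w]:
                    best[w] = d
                    parent[w] = v
    return parent

# Hypothetical data points built from a two-attribute quasi-identifier.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
parent = prim_mst(points)
print(parent)
```

This simple version scans all nodes in every round, which is adequate for the small instance counts discussed in the experiments; a heap-based variant would be the usual choice for larger inputs.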

In the second step, to partition the nodes of the MST into clusters, we must decide which edges of the MST are removable. The idea is to visit all edges of the MST from longest to shortest and to decide whether to cut each edge while retaining the remaining edges. After edge cutting, the MST is split into several subtrees, each of which forms a cluster. All edges are placed in a priority queue in descending order of weight. We then take the edges from the priority queue in sequence and consider whether each edge e(u,v) is removable, where v is the visiting node and u is the parent node of v. We consider the sizes of the two subtrees on the visiting node's side and on the parent node's side, and check whether each size is at least k, the threshold preset by the protector. The edge is removable when both subtree sizes are at least k; otherwise, it is not removable. First, we obtain the size of the subtree rooted at the visiting node v. Second, we follow the parent pointers from the visiting node up to the root node.

Then we obtain the size of the other subtree, the part of the tree that remains on the parent's side once the edge is cut. For brevity, suppose both subtree sizes are at least k, so the edge is removable. We remove the edge from the priority queue and set the parent pointer of v to null to indicate that v is now the root node of a subtree.

The final step is a simple pass that partitions all nodes into disjoint clusters. Each root of a subtree forms a cluster together with its descendant nodes. We find all nodes whose parent pointer is null and assign them to a set of roots. A null parent pointer marks the root of a subtree, and each subtree forms one cluster. We take a root node from the front of the root set and traverse all of its descendant nodes. After the traversal, the root node and all of its descendants form a new cluster, and the root node is removed from the root set. The next cluster is found by the same procedure. This process iterates until the root set is empty. Finally, all nodes are partitioned into disjoint clusters.
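The edge-cutting and cluster-forming steps can be sketched in Python as follows. This is a simplified reading rather than the authors' implementation: it assumes Euclidean edge weights, uses a sorted list in place of the priority queue, treats the size test as "at least k" in line with the partition rule, and groups nodes by their root instead of traversing descendants. All function names and the sample tree are hypothetical.

```python
import math

def subtree_size(node, parent):
    """Count the nodes whose parent chain passes through `node` (node included)."""
    count = 0
    for start in range(len(parent)):
        cur = start
        while cur is not None:
            if cur == node:
                count += 1
                break
            cur = parent[cur]
    return count

def root_of(node, parent):
    """Follow parent pointers until a node with a null parent is reached."""
    while parent[node] is not None:
        node = parent[node]
    return node

def mst_clusters(points, parent, k):
    parent = list(parent)  # work on a copy of the parent pointers
    # Each edge is identified by its child node; visit edges from longest to shortest.
    edges = sorted(
        (v for v in range(len(parent)) if parent[v] is not None),
        key=lambda v: math.dist(points[v], points[parent[v]]),
        reverse=True,
    )
    for v in edges:
        below = subtree_size(v, parent)                          # visiting node's side
        rest = subtree_size(root_of(v, parent), parent) - below  # parent's side
        if below >= k and rest >= k:
            parent[v] = None  # cut the edge: v becomes the root of a new subtree
    # Every root together with its descendants forms one cluster.
    clusters = {}
    for node in range(len(parent)):
        clusters.setdefault(root_of(node, parent), []).append(node)
    return list(clusters.values())

# Hypothetical points and the parent pointers of their spanning tree.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
parent = [None, 0, 1, 2, 3]
clusters = mst_clusters(points, parent, k=2)
print(clusters)
```

With k = 2, only the long edge between nodes 2 and 3 is cut, because cutting any of the short edges would leave a subtree with fewer than two nodes.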

2.3 Noise Generation.

After clustering all data points, the next step is to generate noise based on the centroids of these groups. Our scheme is based on Huffman coding, which was proposed by Huffman in 1952 [9]. The Huffman coding algorithm is popular in data compression [7][10]. We can identify distinct data points by building a Huffman coding tree, because Huffman coding has the following features: 1) each character has a corresponding Huffman code; 2) a character with higher probability has a shorter Huffman code, whereas a character with lower probability has a longer Huffman code. These features can be used to generate noise that preserves privacy in the database. Data with lower probability reveals privacy more easily, so longer noise is injected into it to confuse the adversary. In other words, data with high probability does not easily reveal personal privacy.
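To make the relationship between probability and code length concrete, here is a small Python sketch of standard Huffman coding applied to data values. How the paper derives probabilities from the centroids is not specified, so the sketch simply assumes that the probability of a value is its relative frequency; the function name `huffman_codes` and the sample values are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(values):
    """Map each distinct value to a Huffman code (a string of '0' and '1')."""
    freq = Counter(values)
    tie = count()  # unique tie-breaker so the heap never has to compare dicts
    heap = [(f, next(tie), {v: ""}) for v, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A single distinct value still needs a non-empty code.
        return {v: "0" for v in heap[0][2]}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # least frequent subtree
        f2, _, right = heapq.heappop(heap)  # second least frequent subtree
        merged = {v: "0" + c for v, c in left.items()}
        merged.update({v: "1" + c for v, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

# Hypothetical centroid values: the first is common, the other two are rare.
values = [(21, 52)] * 5 + [(42, 75)] * 2 + [(60, 90)] * 1
codes = huffman_codes(values)
print(codes)
```

The frequent value receives a one-bit code and the rare values receive two-bit codes, so rarer values, which reveal more, would receive longer noise.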

2.4 Noise Injection procedure.

As mentioned above, the noise is built from the Huffman coding tree based on the probability of each original value, and it is then converted to a set that we call the noise matrix in this paper. Each data point, that is, each original value v, corresponds to a noise value r in the noise matrix. This simplifies the perturbation of the original values and makes the noise injection process easier and more intuitive. After building the noise matrix, we describe the process of noise injection. We put the string of noise into a queue and add ‘1’ sequentially to the original data through the least significant bit (LSB) function until the queue is empty. Because the LSB function is used to perturb the original value, the data distortion may be significantly reduced.
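The text leaves the exact LSB function open, so the following Python sketch shows only one possible reading and should not be taken as the authors' procedure. It assumes integer values, a noise matrix stored as a mapping from original value to noise string, and that each '1' bit taken from the queue adds one unit at the least significant position. The names `inject_noise` and `noise_matrix` and the sample entries are hypothetical.

```python
from collections import deque

def inject_noise(value, noise_bits):
    """Assumed reading: add 1 at the least significant position per '1' bit."""
    queue = deque(noise_bits)
    while queue:                 # continue until the queue is empty
        bit = queue.popleft()
        if bit == "1":
            value += 1           # smallest possible change to an integer value
    return value

# Hypothetical noise matrix: original value -> noise string.
noise_matrix = {21: "1", 42: "01", 60: "0011"}
perturbed = {v: inject_noise(v, bits) for v, bits in noise_matrix.items()}
print(perturbed)
```

Under this reading a noise string made only of '0' bits would leave the value unchanged, which is one reason to treat the sketch as illustrative only.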


We measure the running time of noise generation in milliseconds. To estimate the time precisely, we take the average running time of noise generation over 61 runs. Our experiments explore how the time changes between the unclustered and clustered schemes, where the unclustered scheme does not include the MST clustering technique. Moreover, the clustering scheme is run with various values of k, the group size preset by the data protector. We also discuss how the time changes as the number of instances grows from 10 to 1,000.
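A measurement of this kind could be set up as in the Python sketch below. The timing harness is an assumption about how such an average might be taken, not a description of the authors' setup; `average_runtime_ms` and `dummy_noise_generation` are hypothetical names, and the dummy workload only stands in for the real noise generation.

```python
import time

def average_runtime_ms(func, runs=61):
    """Average wall-clock time of func() over `runs` calls, in milliseconds."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        func()
        total += time.perf_counter() - start
    return total / runs * 1000.0

def dummy_noise_generation():
    # Hypothetical stand-in for the noise generation step being measured.
    sum(i * i for i in range(1000))

avg_ms = average_runtime_ms(dummy_noise_generation)
print(f"average over 61 runs: {avg_ms:.4f} ms")
```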

The experimental results show that noise generation becomes slower as the number of records increases. Noise generation runs faster with the clustering scheme than with the unclustered scheme. In the experiments, we also observe some fluctuation in the running time of the clustering scheme, but overall the growth of time is very smooth. In addition to examining running time, we also explore the data quality after the noise injection procedure. Data quality is measured following the approach Domingo-Ferrer proposed in 2002. The experimental results show less information loss in the unclustered scheme, but none of the information loss results exceeds 50 percent. There is, however, a trade-off between minimum information loss and value disclosure risk. In summary, our proposed scheme generates noise efficiently to preserve privacy in the database.


Cloud computing has become very popular in recent years, and technology brings convenience. As a result, the internal data of organizations grows rapidly. The concept of a database as a service was proposed in 2002. However, personal privacy may be revealed when all the data is published to a third-party service provider that cannot be trusted. Another scenario is when a dealer collects personal transaction data and publishes it to a data analysis company for research or marketing purposes, but the company is malicious. This also carries a risk of leakage. For this reason, how to preserve privacy in the database has become more important in recent years. Although security in the database is a broad problem, this paper is only concerned with prediction disclosure, in which an adversary is able to predict the confidential value of an individual. We present an efficient noise generation scheme that relies on the Huffman coding algorithm. We also build a noise matrix that makes adding noise to the original values intuitive. Moreover, records in the dataset are partitioned into disjoint clusters before noise is generated. Our scheme can only be used in a numerical database or statistical database. In the future, we will consider non-numeric values and propose a conversion mechanism that is adapted to our scheme and converts between non-numeric and numeric values. Once all non-numeric values can be converted to numeric values, they can be handled by our scheme, which extends this study.

This essay was reviewed by
Alex Wood

Cite this Essay

The prediction of disclosure risk in the numerical database. (2018, December 04). GradesFixer. Retrieved April 13, 2024, from