Application of Text Mining for Converting Text Data to Structured Format


In many collaborative networks, such as e-commerce portals, social media platforms, and feedback systems, data is produced in textual form. When such media reach a large body of stakeholders, the data grows big over time. These data are stored in distributed storage mechanisms, but most existing databases lack an eco-system for handling large volumes of unstructured data such as text, which makes predictive analysis challenging. This proposed study aims to develop a grammar-based algorithm that filters redundant terms from textual sentences and builds tabular datapoints from a textual dataset. The performance of the algorithm will be analysed for consistency by observing the data conversion rate. The methodology adopted is mathematical modeling and simulation on a numerical computing platform. The study outcome is useful for developing efficient predictive models for textual data.


The advancements in mobile computing and communication systems, together with cloud computing paradigms, provide a true platform for ubiquitous computing in real-time scenarios. This eco-system makes it possible to create and access distributed application data at any time, from any location, and at any scale. Operational cost and capital investment both reduce drastically when organizations use cloud services. The data generated by users through smart devices, application portals, surveillance systems, and social media and microblogging platforms, when stored, processed, and accessed on cloud infrastructure, provides the foundation for a new dimension of digging insights from data, building many applications, and supporting many decisions.

A hospital use case demonstrates the varied forms such data can take: bills in text or PDF format, admission records in rich text format, images from radiology, logs and medical advice in XML formats, and readings from many sensors that measure critical information about the patient's body. Storing these varied data continuously yields large volumes over time, with a variety of unstructured and semi-structured content. The eco-system to store these data directly is neither efficient nor robust enough to handle them in a schema-less data structure, which makes analytics tasks less accurate. There exist many such examples where organizations generate immense amounts of this type of data, termed large or Big Data, whose sizes are perceived to be in petabytes or zettabytes. The massive volume of Big Data is the primary problem, and it gives rise to other associated problems. The biggest impact is on analysis, as it comprises various distinct stages of implementation: i) acquisition and recording, ii) extraction, cleaning, and annotation, iii) integration, aggregation, and representation, iv) analysis and modeling, and v) interpretation.

To date, there have been various investigations into big data in the cloud, but the focus of the majority of researchers is on the analytical or modeling phase. Although this is an important stage of big data analysis, the other stages are left unattended, which causes various research problems to evolve. A closer study of the existing analytical approaches to the Big Data problem shows that the complexities are not visualized with full clarity in association with multi-tenancy clusters. It should be noted that multi-tenancy clustering is widely practiced because it offers cost savings to clients.

Although we have already stepped into the era of Big Data, the approaches for performing analysis and overcoming the research gaps are still vague. We believe that only a form of analysis that efficiently addresses all the problems of big data in combination, at least to some extent, would be really helpful. However, present studies address only a few of the problems associated with big data analytics. Therefore, we discuss fundamental information about big data and investigate the existing research work to understand the effectiveness of existing optimization techniques in Big Data.

Review of Literature

The term big data consistently imposes challenges on existing organizations, not only in terms of storage but also in terms of applying analytical operations (Trovati et al. 2016; Mazumder et al. 2017). Although storage is not difficult to achieve using the cloud environment, performing analytical operations on big data is still an unsolved problem (Marjani et al. 2017; Lv et al. 2017). This is because data can only be termed big data when it is characterized by the 5Vs, i.e. volume, variety, veracity, velocity, and value (Marr et al. 2015; Li et al. 2017). There are also various reports, e.g. Puthal et al. (2018) and Prasad et al. (2017), claiming that various sensory applications of the Internet-of-Things (IoT) generate a massive volume of big data. Organizations such as IBM have made significant contributions most recently by introducing the IBM Watson project for investigating big data analytics over IoT (Put AI to Work, retrieved 17th August 2018). Such initiatives in big data analytics often contribute to cost reduction, enhanced decision formulation, and new avenues for services and products. However, some open challenges remain to be sorted out, e.g. i) mechanisms for storing unstructured data while retaining maximum data quality, ii) data privacy, iii) the origination of heterogeneous data from diverse sources, iv) processes for effective segmentation of unstructured data and effective filtering of useful data, v) precise analysis of structured as well as unstructured data, and vi) the fact that big data is a new concept in data science and organizations lack the skilled expertise to handle big data analytics (Han et al. 2018). Hence, there is a need for dedicated research work that focuses on addressing the problems associated with big data analytics.

Besides this, there are various forms of software and tools offered by Apache for processing big data, which are at present under study. Some of them have already started being used by industry, while some are still in the investigation stage. Hadoop and MapReduce are the most highly adopted by researchers, followed by HBase and Neo4j. Other types of tools are not reported to have been implemented recently. Hence, in a nutshell, NoSQL-based tools are making progressive entry into Big Data management. A NoSQL database management system is basically used for storage operations as well as information extraction. The data structures deployed in such systems are very different from those of conventional SQL-based systems, in order to achieve faster response times. At present, there are different forms of NoSQL database management systems, as shown in Figure 1. Apart from the above-mentioned database management techniques for big data, there are various other solutions too, e.g. Sybase, Teradata, Essbase, etc. At present, much preference is given to parallel storage processing systems, including massively parallel processing. The names of organizations that possess parallel storage processing systems are shown in the Table.

At present, there are various research works on big data-based approaches for sensory applications. The problems associated with veracity, volume, and velocity are addressed by designing a recommendation system as a unique analytical approach, as seen in the work of Habibzadeh et al. (2018). The use of fog intelligence is proven to offer better analytical performance for sensor-based big data; research in this direction was carried out by Raafat et al. (2017), where a statistical approach was used for extracting sensory data. There are existing reports of the use of big data analytics in IoT applications for developing management models of smart home appliances (Ali et al. 2017), where business intelligence is used. Rehman et al. (2018) and Yang et al. (2017) have also discussed the importance of big data analytics using a concentric computation framework. Zhang et al. (2017) have presented a big data approach for analyzing mobile sensory data with an explicit focus on effective data management. The performance of the analytical operation is reported to increase when a clustering approach is adopted to perform data fusion, as seen in the work of Din et al. (2017). The work carried out by Cheng et al. (2017) discusses the complexity of obtaining datasets for addressing energy and accuracy problems. There is also certain literature where Big Data is said to be efficiently analyzed by adopting a hybridized mechanism of different existing distributed storage models (Ebner et al. 2014). Hu et al. (2017) have discussed a scheduling process used for analyzing big data. A study using a scheduling approach was also implemented by Ren et al. (2017) to address delay and energy problems during the analysis of sensory data. Energy problems during data aggregation using a big data approach are presented by Takaishi et al. (2014). A nearly similar form of research on data aggregation was discussed by Karim and Al-Kahtani (2016), considering data priority.
The work carried out by Jeong et al. (2015) uses a big data approach for analyzing radiation signals. There are reported works on enhancing security systems with big data analytics (Kandah et al. 2017; Zhu et al. 2017), clustering using principal component analysis (Li et al. 2016), decision making with a case study on avionics (Miao et al. 2017), analysis of weather information (Onal et al. 2017), ecological monitoring (Wiska et al. 2016), etc. There are therefore various research works leveraging big data analytics.

Problem Description

The existing research-based approaches to big data analytics exhibit the following implementation patterns: i) hypothetical modeling to address a specific problem associated with an application, ii) more focus on performance enhancement without considering many real-time constraints of sensor networks or even IoT, and iii) adoption of tools that are already reported to have issues. Such approaches are found not to address a significant amount of data complexity in a cost-effective manner. This casts doubt on the applicability of the existing research to practical implementation scenarios. Moreover, there is no joint treatment of the significant problems of the 5Vs. All these problems lead to the generation of highly unstructured data that is quite difficult to analyze.

The evolution of Big Data and its associated technologies is not even half a decade old. Hence, it is quite evident that the management of big data is still in a nascent stage of research and development. After a series of research works, we find that these studies give very constructive guidelines for addressing problems in Big Data. However, some open research issues have to be identified so that they can be addressed in upcoming research work. The open research issues are outlined briefly below:

Lesser Focus on Algorithm Complexity: It is widely known that low-powered communicating devices are responsible for generating a massive amount of data. Hence, the various research works discussed to date must find their point of implementation, which could be either network-based protocols or system-based protocols. Network-based approaches do not have any dependencies on devices, but system-based approaches to big data management do have significant dependencies. Sophisticated mining algorithms are normally considered to reside on the device, and in such conditions there is a possibility of the device being over-burdened by the algorithmic operation. This can be avoided only if the algorithms possess very low time and space complexity. Unfortunately, all the existing research approaches were found to lack comprehensive algorithm complexity testing, which gives no clue about the adaptability of an algorithm to low-powered devices.

Need for a More Effective Optimization Technique: We find that the majority of existing optimization techniques use convex optimization to solve problems related to big data performance. While many researchers have used open-source distributed software along with this, it was found that this does not solve the scheduling configuration problem pertaining to big data in the cloud. Another bigger problem is optimizing the clustering technique during performance optimization. There is no doubt that various potential clustering techniques have been significantly investigated in the process of incorporating effective mining operations. However, the effectiveness of such clustering approaches has never been well defined in a way that establishes concrete research problems in cloud computing. From a software engineering viewpoint, conventional software architectures are adopted to cater to certain complex requirements of data processing; unfortunately, no techniques discussed to date solve this problem. To solve the scheduling configuration problem, the converging point of the algorithm must incorporate a certain level of intelligence rather than hard-coded values. An extensive investigation is required into algorithms that construct an objective function and control it based on the dynamic environment of a data stream. In this way, the heterogeneity and veracity problems of big data can be solved.

Lesser Extent of Benchmarking: Different forms of Big Data benchmarks are already available. The process of benchmarking big data follows planning, data generation, test generation, execution, and analysis/evaluation. Unfortunately, the majority of existing studies are not found to use this. Another way of exhibiting an effective research approach is to perform comparative analysis. We find that existing research techniques include a very limited extent of comparative analysis, which leaves some uncertainty about the applicability of the research work in a different environment or to a varied problem. The adoption of varied performance parameters for similar research problems in big data research is another impediment to benchmarking existing research techniques.
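As an illustration only, the benchmarking stages named above (planning, data generation, test generation, execution, analysis/evaluation) can be sketched as a minimal harness; the function names and the toy workload are hypothetical and not drawn from any existing benchmark suite:

```python
import time

def benchmark(workload, generate_data, analyse):
    """Run one benchmark pass: generate data, execute the
    workload under a timer, then hand off to analysis."""
    data = generate_data()                  # data-generation stage
    start = time.perf_counter()             # execution stage
    result = workload(data)
    elapsed = time.perf_counter() - start
    return analyse(result, elapsed)         # analysis/evaluation stage

# Hypothetical usage: time a simple aggregation over generated data.
report = benchmark(
    workload=lambda d: sum(d),
    generate_data=lambda: list(range(1000)),
    analyse=lambda r, t: {"result": r, "seconds": t},
)
```

Reusing one harness like this across studies, with agreed performance parameters, is what would make comparative analysis between techniques possible.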

Research Method and Specification

The prime objective of the proposed research work is to introduce an alternative solution for addressing the problems associated with the variety of text data. This problem is mainly directed at solving the unstructured data problem in big data. It is also meant to address similar problems in IoT applications by performing data processing that makes the data eligible for mining. The proposed schematic diagram is shown in Fig. 2.

The design of the proposed system is carried out using an analytical approach in which an algorithm is constructed to carry out an effective transformation operation. This process converts unstructured data to semi-structured and finally to structured data using the proposed technique, where a simple mining operation is also introduced to extract the significant attributes from sensory data. The design of the proposed system considers massive text data with a specific number of fields. The prime objective of the proposed algorithm is to address the problems associated with the data variables of the text data. The phases of the algorithm construction are briefed below with reference to Fig. 2:
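Under the assumption that the raw text carries simple "field: value" lines (the paper does not specify the input layout), the overall transformation can be sketched as:

```python
def convert(raw_text):
    """Hypothetical pipeline mirroring the proposed flow:
    unstructured text -> semi-structured pairs -> structured record."""
    # Stage 1: unstructured -> semi-structured (field/value pairs).
    semi = [line.split(":", 1) for line in raw_text.splitlines() if ":" in line]
    # Stage 2: semi-structured -> structured tabular datapoint.
    return {field.strip(): value.strip() for field, value in semi}

convert("temp: 36.6\npulse: 72")
# -> {'temp': '36.6', 'pulse': '72'}
```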

Constructing the fields and their values: The term field refers to a specific category of text data. It is assumed that a single data item has numerous fields f1, f2, …, fn, which bear discrete information about the text data captured during the data aggregation process, where n is the maximum number of fields. As repeated fields are present in the database, for effective analysis the proposed system groups the fields as g = {f1, f2, …, f9}, and the empirical representation of the groups in database db1 is Gd1 = {g1, g2, …, gm}, where m is the total number of groups (m >> n).
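The grouping of repeated fields can be sketched as follows; the sample records, field names, and the `group_fields` helper are all hypothetical, since the paper only defines the notation:

```python
from collections import defaultdict

def group_fields(records):
    """Collect every observed value of each repeated field, so that
    each group corresponds to one field category f_i."""
    groups = defaultdict(list)
    for record in records:
        for field, value in record.items():
            groups[field].append(value)
    return dict(groups)

# Hypothetical aggregated records with repeated fields.
raw = [{"temp": "36.6", "pulse": "72"},
       {"temp": "37.1", "pulse": "75"}]

gd1 = group_fields(raw)
# -> {'temp': ['36.6', '37.1'], 'pulse': ['72', '75']}
```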

Extraction of field elements: As it is assumed that all the streams of text data are collected in a cloud environment, it is essential to discretize the fields accurately for proper identification of the domain information. Hence, the proposed system extracts Gd1, followed by the extraction of all the respective fields within each group. It is quite likely that the number of groups may differ between consecutive databases.
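Extraction of the groups per database, and the observation that consecutive databases may hold different numbers of groups, can be sketched as follows (the database contents are hypothetical):

```python
def extract_group_names(databases):
    """List the group (field-category) names present in each
    consecutive database; the counts need not match."""
    return [sorted(db.keys()) for db in databases]

dbs = [{"temp": ["36.6"], "pulse": ["72"]},
       {"temp": ["37.1"], "pulse": ["75"], "bp": ["120/80"]}]

extract_group_names(dbs)
# -> [['pulse', 'temp'], ['bp', 'pulse', 'temp']]
```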

Obtaining semi-structured data: One of the essential contributions of the proposed system is its capability to yield semi-structured data. This process is followed by a simple transformation process that significantly reduces the complexity associated with the text data.
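One way to materialize this stage is to emit the grouped fields as JSON records; this is a sketch under the assumption that JSON stands in for the semi-structured form, which the text does not specify:

```python
import json

def to_semi_structured(groups):
    """Turn grouped field values into a JSON array of records,
    one record per aligned observation index."""
    n = min(len(values) for values in groups.values())
    records = [{field: groups[field][i] for field in groups}
               for i in range(n)]
    return json.dumps(records)

to_semi_structured({"temp": ["36.6", "37.1"], "pulse": ["72", "75"]})
# one JSON record per observation: [{"temp": "36.6", "pulse": "72"}, ...]
```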

Applying tokenization and extracting significant text data: Applying tokenization means extracting the terms corresponding to fundamental grammatical syntax. A majority of the sensory data is in the form of strings; hence, tokenization offers a better basis for inferring meaning, followed by the extraction of significant text data at the end.
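A minimal sketch of tokenization followed by redundant-term filtering is shown below; the stop list is a stand-in only, since the actual grammar rules of the proposed algorithm are not given here:

```python
import re

# Stand-in stop list; the proposed grammar-based filter would
# instead decide redundancy from grammatical syntax.
REDUNDANT = {"the", "a", "an", "is", "are", "of", "and", "in"}

def tokenize(sentence):
    """Split a sentence into lowercase word/number tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

def significant_terms(sentence):
    """Drop redundant terms, keeping only significant tokens."""
    return [t for t in tokenize(sentence) if t not in REDUNDANT]

significant_terms("The pulse of the patient is 72 and rising")
# -> ['pulse', 'patient', '72', 'rising']
```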

Source: Application of Text Mining for Converting Text Data to Structured Format. GradesFixer, 14 May 2019.