Thursday, October 3, 2019
Comprehensive Study on Big Data Technologies and Challenges
Abstract: Big Data is at the heart of modern science and business. Big Data has recently emerged as a new paradigm for hosting and delivering services over the Internet. It offers huge opportunities to the IT industry. Big Data has become a valuable source and mechanism for researchers to explore the value of data sets in all kinds of business scenarios and scientific investigations. New computing platforms such as Mobile Internet, Social Networks and Cloud Computing are driving the innovations of Big Data. The aim of this paper is to provide an overview of the concept of Big Data and to address various Big Data technologies and the challenges ahead. It also explores certain services of Big Data over the traditional IT service environment, including data collection, management, integration and communication.

Keywords: Big Data, Cloud Computing, Distributed System, Volume

I. INTRODUCTION

Big Data has recently gained popularity and developed into a major trend in IT. Big Data is generated on a daily basis from Earth observations, social networks, model simulations, scientific research, application analyses, and many other sources. Big Data is a data analysis methodology enabled by a new generation of technologies and architectures which support high-velocity data capture, storage, and analysis. Data sources extend beyond the traditional corporate database to include email, mobile device output, sensor-generated data, and social media output. Data is no longer restricted to structured database records but includes unstructured data. Big Data requires huge amounts of storage space. A typical Big Data storage and analysis infrastructure will be based on clustered network-attached storage. This paper first defines the Big Data concept and describes its services and main characteristics.
"Big Data" is a term encompassing the use of techniques to capture, process, analyze and visualize potentially large datasets in a reasonable timeframe not accessible to standard IT technologies.

II. BACKGROUND

Need of Big Data

Big Data refers to large datasets that are challenging to store, search, share, visualize, and analyze. On the Internet, the volume of data we deal with has grown to terabytes and petabytes. As the volume of data keeps growing, the types of data generated by applications become richer than before. As a result, traditional relational databases are challenged to capture, share, analyze, and visualize data. Many IT companies attempt to manage Big Data challenges using a NoSQL database, such as Cassandra or HBase, and may employ a distributed computing system such as Hadoop. NoSQL databases are typically key-value stores that are non-relational, distributed, horizontally scalable, and schema-free. We need a new methodology to manage Big Data for maximum business value.

Data storage scalability was one of the major technical issues data owners were facing. Nevertheless, a new breed of efficient and scalable technology has been incorporated, and data management and storage is no longer the problem it used to be. In addition, data is constantly being generated, not only by use of the Internet, but also by companies generating large amounts of information coming from sensors, computers and automated processes. This phenomenon has recently accelerated further thanks to the increase in connected devices and the worldwide success of social platforms. Significant Internet players like Google, Amazon, Facebook and Twitter were the first to face these increasing data volumes and designed ad hoc solutions to cope with the situation. Those solutions have since partly migrated into the open source software communities and have been made publicly available.
This was the starting point of the current Big Data trend, as it was a relatively cheap solution for businesses confronted with similar problems.

Dimensions of Big Data

Fig. 1 shows the four dimensions of Big Data. They are discussed below.

Fig. 1 Dimensions of Big Data

Volume refers to the fact that Big Data involves analyzing huge amounts of information, typically starting at tens of terabytes. It ranges from terabytes to petabytes and up. The NoSQL database approach is a response to the need to store and query huge volumes of heavily distributed data.

Velocity refers to the rate at which data is collected, acquired, generated or processed. Real-time data processing platforms are now considered by global companies as a requirement to get a competitive edge. For example, the data associated with a particular hashtag on Twitter often has a high velocity.

Variety describes the fact that Big Data can come from many different sources, in various formats and structures. For example, social media sites and networks of sensors generate a stream of ever-changing data. As well as text, this might include geographical information, images, videos and audio.

Veracity includes known data quality, type of data, and data management maturity, so that we can understand how correct and accurate the data is.

Big Data Model

The Big Data model is an abstract layer used to manage the data stored in physical devices. Today we have large volumes of data with different formats stored in global devices. The Big Data model provides a visual way to manage data resources, and creates a fundamental data architecture so that we can have more applications to optimize data reuse and reduce computing costs.

Types of data

Data is typically categorized into three different types: structured, unstructured and semi-structured. Structured data is well organized; there are several choices for abstract data types, and references such as relations, links and pointers are identifiable.
Unstructured data may be incomplete and/or heterogeneous, and often originates from multiple sources. It is not organized in an identifiable way, and typically includes bitmap images or objects, text and other data types that are not part of a database. Semi-structured data is organized, containing tags or other markers to separate semantic elements.

III. BIG DATA SERVICES

Big Data provides an enormous number of services. This paper explains some of the important ones, which are given below.

Data Management and Integration

An enormous volume of data in different formats, constantly being collected from sensors, is efficiently accumulated and managed through the use of technology that automatically categorizes the data for archive storage.

Communication and Control

This comprises three functions for exchanging data with various types of equipment over networks: communications control, equipment control and gateway management.

Data Collection and Detection

By applying rules to the data that is streaming in from sensors, it is possible to conduct an analysis of the current status. Based on the results, decisions can be made, with navigation or other required procedures performed in real time.

Data Analysis

The huge volume of accumulated data is quickly analyzed using a parallel distributed processing engine to create value through the analysis of past data or through future projections or simulations.

IV. BIG DATA TECHNOLOGIES

Internet companies such as Google, Yahoo and Facebook have been pioneers in the use of Big Data technologies and routinely store hundreds of terabytes and even petabytes of data on their systems. There are a growing number of technologies used to aggregate, manipulate, manage, and analyze Big Data. This paper describes some of the more prominent technologies, but the list is not exhaustive, especially as more technologies continue to be developed to support Big Data techniques. They are listed below.
Bigtable: Proprietary distributed database system built on the Google File System. It was the inspiration for HBase.

Business intelligence (BI): A type of application software designed to report, analyze, and present data. BI tools are often used to read data that has been previously stored in a data warehouse or data mart. BI tools can also be used to create standard reports that are generated on a periodic basis, or to display information on real-time management dashboards, i.e., integrated displays of metrics that measure the performance of a system.

Cassandra: An open source database management system designed to handle huge amounts of data on a distributed system. This system was originally developed at Facebook and is now managed as a project of the Apache Software Foundation.

Cloud computing: A computing paradigm in which highly scalable computing resources, often configured as a distributed system, are provided as a service through a network.

Data Mart: Subset of a data warehouse, used to provide data to users, usually through business intelligence tools.

Data Warehouse: Specialized database optimized for reporting, often used for storing large amounts of structured data. Data is uploaded using ETL (extract, transform, and load) tools from operational data stores, and reports are often generated using business intelligence tools.

Distributed system: A distributed file system or network file system allows client nodes to access files through a computer network. This way, a number of users working on multiple machines are able to share files and storage resources. The client nodes cannot access the block storage directly but interact with it through a network protocol. This enables restricted access to the file system depending on the access lists or capabilities on both servers and clients, which in turn depends on the protocol.

Dynamo: Proprietary distributed data storage system developed by Amazon.
Google File System: Proprietary distributed file system developed by Google; part of the inspiration for Hadoop.

Hadoop: Apache Hadoop is used to handle Big Data and Stream Computing. Its development was inspired by Google's MapReduce and Google File System. It was originally developed at Yahoo and is now managed as a project of the Apache Software Foundation. Apache Hadoop is open source software that enables the distributed processing of large data sets across clusters of commodity servers. It can be scaled up from a single server to thousands of machines, with a very high degree of fault tolerance.

HBase: An open source, free, distributed, non-relational database modeled on Google's Bigtable. It was originally developed by Powerset and is now managed as a project of the Apache Software Foundation as part of Hadoop.

MapReduce: A software framework introduced by Google for processing huge datasets on certain kinds of problems on a distributed system, also implemented in Hadoop.

Mashup: An application that uses and combines data presentation or functionality from two or more sources to create new services. These applications are often made available on the Web, and frequently use data accessed through open application programming interfaces or from open data sources.

Data Intensive Computing: A type of parallel computing application which uses a data parallel approach to process Big Data. It works on the principle of collocating data with the programs used to perform the computation. Parallel and distributed systems that work together as a single integrated computing resource are used to process and analyze Big Data.

IV. BIG DATA USING CLOUD COMPUTING

The Big Data journey can lead to new markets, new opportunities and new ways of applying old ideas, products and technologies. Cloud Computing and Big Data share similar features such as distribution, parallelization, space-time, and being geographically dispersed.
Utilizing these intrinsic features would help to provide Cloud Computing solutions for Big Data to process and obtain unique information. At the same time, Big Data creates grand challenges as opportunities to advance Cloud Computing. In the geospatial information science domain, many scientists have conducted active research to address urban, environment, social, climate, population, and other problems related to Big Data using Cloud Computing.

V. TECHNICAL CHALLENGES

Many of Big Data's technical challenges also apply to data in general. However, Big Data makes some of these more complex, as well as creating several fresh issues. They are given below.

Data Integration

Organizations might also need to decide if textual data is to be handled in its native language or translated. Translation introduces considerable complexity, for example the need to handle multiple character sets and alphabets. Further integration challenges arise when a business attempts to transfer external data to its system. Whether this is migrated as a batch or streamed, the infrastructure must be able to keep up with the speed or size of the incoming data. The IT organization must be able to estimate capacity requirements effectively. Companies such as Twitter and Facebook regularly make changes to their application programming interfaces which may not necessarily be published in advance. This can result in the need to make changes quickly to ensure the data can still be accessed.

Data Transformation

Another challenge is data transformation. Transformation rules will be more complex between different types of system records. Organizations also need to consider which data source is primary when records conflict, or whether to maintain multiple records. Handling duplicate records from disparate systems also requires a focus on data quality.

Historical Analysis

Historical analysis could be concerned with data from any point in the past.
That is not necessarily last week or last month; it could equally be data from 10 seconds ago. While IT professionals may be familiar with such an application, its meaning can sometimes be misinterpreted by non-technical personnel encountering it.

Search

Searching unstructured data might return a large number of irrelevant or unrelated results. Sometimes, users need to conduct more complicated searches containing multiple options and fields. IT organizations need to ensure their solution provides the right type and variety of search interfaces to meet the business's differing needs. And once the system starts to make inferences from data, there must also be a way to determine the value and accuracy of its choices.

Data Storage

As data volumes increase, storage systems are becoming ever more critical. Big Data requires reliable, fast-access storage. This will hasten the demise of older technologies such as magnetic tape, but it also has implications for the management of storage systems. Internal IT may increasingly need to take a similar, commodity-based approach to storage as third-party cloud storage suppliers do today. It means removing rather than replacing individual failed components until they need to refresh the entire infrastructure. There are also challenges around how to store the data, whether in a structured database or within an unstructured system, and how to integrate multiple data sources.

Data Integrity

For any analysis to be truly meaningful, it is important that the data being analyzed is as accurate, complete and up to date as possible. Erroneous data will produce misleading results and potentially incorrect insights. Since data is increasingly used to make business-critical decisions, consumers of data services need to have confidence in the integrity of the information those services are providing.

Data Replication

Generally, data is stored in multiple locations in case one copy becomes corrupted or unavailable.
This is known as data replication. The volumes involved in a Big Data solution raise questions about the scalability of such an approach. However, Big Data technologies may take alternative approaches. For example, Big Data frameworks such as Hadoop are inherently resilient, which may mean it is not necessary to introduce another layer of replication.

Data Migration

When moving data in and out of a Big Data system, or migrating from one platform to another, organizations should consider the impact that the size of the data may have. To deal with data in a variety of formats, the volumes of data will often mean that it is not possible to operate on the data during a migration.

Visualisation

While it is important to present data in a visually meaningful form, organizations need to consider the most appropriate way to display the results of Big Data analytics so that the data does not mislead. IT should take into account the impact of visualisations on the various target devices, on network bandwidth and on data storage systems.

Data Access

The final technical challenge relates to controlling who can access the data, what they can access, and when. Data security and access control are vital in order to ensure data is protected. Access controls should be fine-grained, allowing organizations not only to limit access to data, but also to limit knowledge of its existence. Enterprises therefore need to pay attention to the classification of data. This should be designed to ensure that data is not locked away unnecessarily, but equally that it does not present a security or privacy risk to any individual or company.

VI. CONCLUSION

This paper reviewed the technical challenges, various technologies and services of Big Data. Big Data describes a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture.
Linked Data databases will become more popular and could potentially push traditional relational databases to one side due to their increased speed and flexibility. This means businesses will be able to change, develop and evolve applications at a much faster rate. Data security will always be a concern, and in future data will be protected at a much more granular level than it is today. Currently Big Data is seen predominantly as a business tool. Increasingly, though, consumers will also have access to powerful Big Data applications. In a sense, they already do, through Google and various social media search tools. But as the number of public data sources grows and processing power becomes ever faster and cheaper, increasingly easy-to-use tools will emerge that put the power of Big Data analysis into everyone's hands.
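
As a small illustration of the MapReduce model listed among the technologies above, the following is a minimal single-machine sketch in Python of its three stages (map, shuffle, reduce) applied to word counting. It is not Hadoop's API: the function names map_phase, shuffle_phase, reduce_phase and word_count are hypothetical and chosen only for this example, and a real framework would run the map and reduce steps in parallel across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle and reduce in sequence (hypothetical helper)."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    docs = ["big data big value", "data velocity"]
    print(word_count(docs))
```

Because each document is mapped independently and each key is reduced independently, both steps could in principle be spread over many machines, which is the property that makes the model suitable for very large datasets.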