In this article, we will figure out what counts as big data and what does not, and how to store, process and benefit from this information.
Definition of Big Data
Big data is petabytes (and more) of complex, raw information that is constantly being updated.
For example: readings from IoT sensors on industrial equipment in factories, records of bank customers' transactions, or search queries from different devices. Sometimes the term also covers the methods and technologies used to process such data.
The term "big data" appeared in 2008, but companies worked with big data even before the definition existed. For example, according to Viktor Bulgakov, head of the management information department, business analysts at VimpelCom were already working with big data in 2005.
To judge more precisely whether data qualifies as big data, analysts look at the properties of the information (defined by the META Group in 2001):
- Volume – the sheer amount of data (on the order of a petabyte or more).
- Velocity – the data arrives and is updated continuously.
- Variety – the data may be unstructured or come in heterogeneous formats.
Two more factors are often added to these:
- Variability – spikes and dips in the data flow that require special technologies to handle.
- Value – the varying complexity of the information. For example, data about social network users and data about transactions in a banking system differ in complexity.
Note. These definitions are approximate, because there is no exact, agreed way to define big data. Some Western experts even believe the term has been discredited and suggest abandoning it.
How Big Data is collected
Sources can be:
- the Internet – from social networks and media to the Internet of Things (IoT);
- corporate data: logs, transactions, archives;
- other devices that collect information, such as smart speakers.
Collection. The technology and process of extracting useful information from collected data is called data mining.
Collection is carried out through services such as Vertica, Tableau, Power BI and Qlik. The collected data can come in different formats: text, Excel spreadsheets, SAS datasets.
In the process of collecting, the system gathers petabytes of information, which is then processed with data mining methods that reveal patterns. These include neural networks, clustering algorithms, algorithms for detecting associative links between events, decision trees, and some machine learning methods.
Briefly, the process of collecting and processing information looks like this:
- the analytical program receives the task;
- the system collects the necessary information, at the same time preparing it: deletes irrelevant information, clears garbage, decodes;
- a model or algorithm for analysis is selected;
- the program trains the chosen algorithm and analyzes the patterns it finds.
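The preparation step above (deleting irrelevant input, clearing garbage, decoding) can be sketched in a few lines of Python. The record format and field names here are invented for illustration; real pipelines run the same idea at petabyte scale:

```python
# A minimal sketch of the "collect and prepare" stage: decode raw records,
# discard garbage, and drop records with missing values.
import json

raw_feed = [
    '{"user": "a1", "amount": 120.5}',
    'not-json-garbage',                    # cannot be decoded
    '{"user": "b2", "amount": null}',      # incomplete record
    '{"user": "c3", "amount": 87.0}',
]

def prepare(records):
    """Decode records, drop garbage and rows with missing values."""
    cleaned = []
    for line in records:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # garbage that cannot be decoded is discarded
        if rec.get("amount") is None:
            continue  # incomplete records are dropped
        cleaned.append(rec)
    return cleaned

prepared = prepare(raw_feed)
# prepared keeps only the two complete records (users "a1" and "c3")
```

Only after such cleaning does it make sense to hand the data to an analysis model.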
How Big Data is stored
Most often, "raw" data is stored in a data lake. There it sits in different formats and degrees of structure:
- rows and columns from a database – structured;
- CSV, XML, JSON files, logs – semi-structured;
- documents, mail messages, PDFs – unstructured;
- video, audio and images – binary.
Different tools are used to store and process information in the data lake:
- Hadoop – a data management platform containing one or more clusters. It is typically used to process, store and analyze large amounts of non-relational data: log files, web traffic records, sensor data, JSON objects, images and social media messages.
- HPCC (also known as DAS) – developed by LexisNexis Risk Solutions. A supercomputing platform that processes information both in batch mode and in real time.
- Storm – a real-time stream-processing framework written in Clojure.
A data lake is not only storage. A "lake" can also include a software platform (for example, Hadoop), clusters of storage and processing servers, tools for integrating with data sources and consumers, systems for data preparation and management, and sometimes machine learning tools. A data lake can also be scaled up to thousands of servers without stopping the cluster.
From the lake, information flows into "sandboxes" – data exploration areas where scenarios for solving various business problems are developed.
A data lake is more often located in the cloud than on a company's own servers: 73% of companies use cloud services to work with big data, according to the 2018 report "Overview of Trends and Issues of Big Data". Big data processing requires a lot of computing power, and cloud technologies reduce the cost of the work, which is why companies turn to cloud storage.
Cloud technologies can be an alternative to running your own data center, because it is hard to predict the exact load on the infrastructure. If you buy equipment "in reserve", it sits idle and causes losses; if the equipment is underpowered, it will not be enough for storage and processing.
- The cloud can store more data than physical servers: storage space does not run out.
- The company can create its own cloud structure or lease capacity from a provider.
- The cloud is cost-effective for companies with rapidly growing workloads or businesses where various hypotheses are often tested.
How big data works
When the data has been received and saved, it must be analyzed and presented in a form the client can understand: graphs, tables, images or ready-made algorithms. Traditional methods are not suitable because of the volume and complexity of processing. With big data, you need to:
- process the entire data array (petabytes of it);
- search for correlations throughout the array (including hidden ones);
- process and analyze information in real time.
Therefore, separate technologies have been developed to work with big data.
Initially, these were tools for processing loosely structured data: NoSQL DBMSs, MapReduce algorithms, Hadoop.
MapReduce is a framework for parallel computation over very large datasets (up to several petabytes), introduced by Google in 2004.
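The MapReduce idea fits in a few lines: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. A single-machine word-count sketch (the classic MapReduce example; in a real cluster each phase runs in parallel on many nodes):

```python
# Toy single-process MapReduce: count word occurrences across documents.
from collections import defaultdict

def map_phase(documents):
    for doc in documents:          # each mapper processes one document
        for word in doc.split():
            yield (word, 1)        # emit (key, value) pairs

def shuffle(pairs):
    groups = defaultdict(list)     # group values by key
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big value", "big clusters"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts maps each word to its total number of occurrences
```

On a real cluster, Hadoop takes care of distributing the map and reduce tasks and moving the intermediate data between nodes.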
NoSQL (from "Not Only SQL") helps work with disparate data and solves scalability and availability problems, often by relaxing strict requirements on data atomicity and consistency.
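The "disparate data" point is easy to picture: in a document-oriented NoSQL store, records in the same collection need not share a schema. A toy in-memory illustration (the store and documents are invented for the example):

```python
# Toy schemaless document store: documents in one collection may have
# completely different fields, unlike rows in a relational table.
store = {}

def put(doc_id, document):
    store[doc_id] = document  # no schema is enforced on the document

put("u1", {"name": "Ann", "city": "Oslo"})
put("u2", {"name": "Bob", "last_login": "2024-01-01", "tags": ["vip"]})

# Queries must tolerate missing fields:
vips = [d for d in store.values() if "vip" in d.get("tags", [])]
```

Real NoSQL systems (MongoDB, Cassandra and others) add distribution, replication and indexing on top of this basic idea.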
Hadoop is a project of the Apache Software Foundation: a set of utilities, libraries and frameworks used to develop and run distributed programs on clusters of hundreds or thousands of nodes. We have mentioned it already, but that is because hardly any big data project gets by without Hadoop.
The toolset also includes the R and Python programming languages and other Apache products.
Methods and tools for working with big data
These include data mining, machine learning, crowdsourcing, predictive analytics, visualization and simulation. There are dozens of techniques:
- mixing and integration of heterogeneous data, for example, digital signal processing;
- predictive analytics – uses historical data and predicts future events;
- simulation modeling – builds models that describe processes as if they were happening in reality;
- spatial and statistical analysis;
- visualization of analytical data: pictures, graphs, diagrams, tables.
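Predictive analytics from the list above can be illustrated with the simplest possible model: fit a least-squares line to historical values and extrapolate it one period ahead. The monthly sales figures are invented for the example:

```python
# Predictive analytics sketch: ordinary least-squares line fit over
# historical data, then extrapolation to forecast the next period.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

months = [1, 2, 3, 4]
sales = [100, 120, 140, 160]          # hypothetical historical data
slope, intercept = fit_line(months, sales)
forecast = slope * 5 + intercept      # predicted value for month 5
```

Production systems use far richer models, but the principle is the same: learn a pattern from history, then project it forward.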
For example, machine learning is an AI method that teaches a computer to "think" on its own: after training, it analyzes information and makes decisions rather than following explicitly programmed commands.
Learning algorithms need structured data from which the computer can learn. For example, if you play checkers against a machine and win, a naive machine only memorizes the winning moves without analyzing the course of the game. But if you leave the computer to play against itself, it learns the flow of the game, develops a strategy, and a human player starts losing to it. At that point the machine does not just make moves – it "thinks".
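The difference between "memorizing winning moves" and learning can be sketched as a value update: after each self-play game, the estimated value of every move played is nudged toward the final outcome. The moves and learning rate below are invented for illustration:

```python
# Sketch of learning from game outcomes: instead of memorizing moves,
# nudge each played move's estimated value toward the final result.
def update_values(values, moves_played, outcome, lr=0.5):
    for move in moves_played:
        old = values.get(move, 0.0)
        values[move] = old + lr * (outcome - old)  # move toward outcome
    return values

values = {}
update_values(values, ["e2e4", "d2d4"], outcome=1.0)  # after a win
update_values(values, ["e2e4"], outcome=0.0)          # after a loss
# "e2e4" now reflects a blend of both results rather than one memory
```

Over many self-play games, such estimates converge toward how good each move actually is, which is the essence of reinforcement-style learning.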
Deep learning is a separate kind of machine learning in which self-learning programs are built from artificial neural networks that mimic the neural networks of the human brain. Such systems process unstructured data, analyze it, draw conclusions, sometimes make mistakes and learn from them – almost like humans.
The results of deep learning are used in image processing, speech recognition, machine translation and other technologies. The pictures drawn by Yandex neural networks and Alice's witty answers to your questions are products of deep learning.
This is the "human" part of working with big data. A Data Engineer prepares the infrastructure and the data for the Data Scientist:
- develops, tests and maintains databases, storage and bulk processing systems;
- cleans and prepares data for use – creates a data processing pipeline.
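A data processing pipeline like the one a Data Engineer builds can be pictured as a chain of stages, each a small transformation that records flow through in order. The stages and field names here are hypothetical:

```python
# Sketch of a data-processing pipeline: each stage is a function,
# and every record passes through all stages in order.
def strip_whitespace(rec):
    return {k: v.strip() if isinstance(v, str) else v
            for k, v in rec.items()}

def normalize_email(rec):
    rec["email"] = rec["email"].lower()
    return rec

PIPELINE = [strip_whitespace, normalize_email]

def run(record, stages=PIPELINE):
    for stage in stages:
        record = stage(record)
    return record

clean = run({"email": "  Alice@Example.COM "})
# clean holds the trimmed, lowercased address
```

Real pipelines (in Airflow, Spark and similar tools) follow the same shape, just distributed and fault-tolerant.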
After the Data Engineer, the Data Scientist steps in: using machine learning algorithms and neural networks, they create and train predictive (and other) models, helping businesses find hidden patterns, predict the development of events and optimize business processes.
Where Big Data is used
The main promise of big data is to quickly give the user information about objects, phenomena or events. To do this, machines build variant models of the future and track the results, which is useful for commercial companies.
The banking industry uses big data technologies for fraud prevention, process optimization and risk management. For example, VTB, Sberbank or Tinkoff are already using big data to check the reliability of borrowers (scoring), manage staff and predict queues at branches.
Collecting big data helps to more accurately assess the client’s risk profile, which ultimately reduces the likelihood of loan defaults.
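A common way to turn collected data into a risk profile is a scoring model: combine borrower features into a weighted sum and squash it into a probability with a logistic function. The feature names and weights below are invented purely for illustration, not a real bank's model:

```python
# Hedged sketch of credit scoring with a logistic function.
# Weights and features are hypothetical illustration values.
import math

WEIGHTS = {"late_payments": 0.9, "debt_ratio": 1.5, "years_employed": -0.3}
BIAS = -1.0

def default_probability(features):
    """Weighted sum of features squashed into a 0..1 probability."""
    z = BIAS + sum(WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS)
    return 1 / (1 + math.exp(-z))

risky = default_probability(
    {"late_payments": 3, "debt_ratio": 0.8, "years_employed": 1})
safe = default_probability(
    {"late_payments": 0, "debt_ratio": 0.1, "years_employed": 10})
# risky comes out well above safe
```

In practice the weights are not hand-picked but learned from historical loan outcomes, which is exactly where the big data comes in.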
Tinkoff uses EMC Greenplum, SAS Visual Analytics and Hadoop to analyze risks, identify customer needs, and leverage big data in scoring, marketing and sales.
VTB uses big data to make decisions about opening new offices. The bank has created its own internal geo-analytical platform. Machine learning methods have made it possible to identify the demand for banking services in different areas of the city.
Business development strategy is chosen based on the results of data analysis. Big data helps process huge volumes of information and identify a direction for development. The analysis results show which products are in demand on the market and help increase customer loyalty.
Hypermarket Hoff uses big data to create personalized offers for customers.
The CarPrice service cuts costs by optimizing traffic: thanks to big data, users make decisions faster and service quality has improved.
The Zarina brand increased its revenue by 28% by personalizing the delivery of recommendations to the customers of the online store.
Netflix deserves a special mention here: personalization is at its core. The service, with an audience in the millions, offers content that in 80% of cases is based on the viewer's own usage history and information from Facebook and Twitter. To optimize recommendations, it uses the user's search queries, browsing history, repeat views, pauses and rewinds. Netflix processes this data with Hadoop, Teradata and proprietary solutions (Lipstick and Genie).
For example, when Netflix created House of Cards, it ordered two seasons at once instead of just a pilot, based on its analysis. The series was an overwhelming success: data analysis had shown that viewers were delighted with actor Kevin Spacey and producer David Fincher.
Big data provides a great toolbox for marketers. Data analysis helps identify customer needs, test new ways to increase loyalty, and determine which products will be in demand.
For example, the RTB service helps you set up retargeting: cross-channel, search, and product retargeting. So companies can advertise products not to everyone, but only to the target audience.
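Selecting such a target audience from raw behavioral data boils down to set logic over event logs. A toy sketch (the event shape and user IDs are invented): show the product ad only to users who viewed something but never bought:

```python
# Retargeting audience sketch: from raw event logs, pick users who
# viewed a product but did not buy -- they get the follow-up ad.
events = [
    {"user": "u1", "action": "view", "item": "sofa"},
    {"user": "u1", "action": "buy", "item": "sofa"},
    {"user": "u2", "action": "view", "item": "sofa"},
    {"user": "u3", "action": "view", "item": "lamp"},
]

viewers = {e["user"] for e in events if e["action"] == "view"}
buyers = {e["user"] for e in events if e["action"] == "buy"}
audience = sorted(viewers - buyers)  # viewed but never bought
```

At big data scale the same set operations run over billions of events, but the targeting logic stays this simple.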
Services such as Crossss, Alytics and 1C-Bitrix BigData enable end-to-end analytics, raise the average check, improve ad conversion and increase the personalization of offers – all with the help of big data.
Problems and prospects of Big Data
The main problems are the sheer amount of information, the required processing speed, and the lack of structure.
Storing large amounts of data requires special conditions, and the required processing speed demands new methods of analysis. The world still lacks established practice in accumulating big data, and the data itself is scattered and sometimes unreliable, which gets in the way of solving business problems effectively.
The big data industry is just gaining momentum, and there are not enough specialists – Data Engineers, for example – because the profession did not exist until recently.
Prospects. Big data keeps evolving: it helps recognize fraud in banks, measure the effectiveness of advertising campaigns, recommend a movie, and even diagnose a patient based on collected medical history. Banks, process manufacturing and professional services companies invest the most in big data.
In 2016, the global market for software, equipment and services in business intelligence and big data amounted to $130.1 billion, of which $17 billion came from the banking sector. Investments from government bodies and commercial companies each accounted for approximately 7.5%. In 2018, revenue from sales of programs and services on the global market reached $42 billion, and the market keeps growing.
Experts believe that the technology will soon be used in the transport sector, oil production, and energy. IDC predicts that revenues related to big data will exceed $ 260 billion by 2022 with an annual market growth of 11.9%. The largest market segments will be manufacturing, finance, healthcare, environmental protection and retail, according to Frost & Sullivan forecasts.
The development of big data will change our daily life. Systems will be able to analyze daily routes, frequent orders and recurring payments. In the future, technologies will probably make it possible to pay loans and utility bills automatically, or summon a car to take you home from work, where a dinner of your favorite dishes will already be waiting on the table.