Big Data

The term Big Data became popular around 2010/2011 as a way to denote very large data sets of a magnitude that business analysts had not commonly worked with before. Thanks to progress in hardware and software, such very large data sets are now within the reach not only of scientists but also of practitioners in many companies, who use them for data mining and modeling. Big Data sets are usually larger than two gigabytes (1 gigabyte = 1,000 megabytes), the file size limit of the 32-bit operating systems that were most common in business applications until 64-bit operating systems replaced them toward the end of the first decade of the 21st century. Some data repositories may grow to thousands of terabytes, i.e., into the petabyte range (1 petabyte = 1,000 terabytes). Beyond petabytes, data set size can be measured in exabytes (1 exabyte = 1,000 petabytes, or one quintillion bytes). For example, the manufacturing sector worldwide is estimated to have stored a total of 2 exabytes of new information in 2010 (Manyika et al., 2011).

1 terabyte = 1,000 gigabytes

1 petabyte = 1,000 terabytes

1 exabyte = 1,000 petabytes (one quintillion bytes)
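
Each of these decimal (SI) units is a factor of 1,000 larger than the previous one. The short Python sketch below is purely illustrative (it is not part of the original discussion) and simply prints this ladder of byte units:

    # Illustrative only: print the decimal (SI) ladder of byte units listed above.
    UNITS = ["kilobyte", "megabyte", "gigabyte", "terabyte", "petabyte", "exabyte"]

    for power, name in enumerate(UNITS, start=1):
        print(f"1 {name} = {1000 ** power:,} bytes")
    # e.g., the last line printed is "1 exabyte = 1,000,000,000,000,000,000 bytes".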

In many cases, when the data set is expected to be homogeneous and no subset detection or stratification analysis is planned, it is not necessary to process data sets of Big Data magnitude, because statistical sampling can produce practically identical results much faster (sometimes thousands of times faster). However, the important advantages of being able to process the complete, very large data set without sampling include 1) the ability to identify relatively small homogeneous subsets that reveal distinctive and identifiable patterns of interest, and 2) the ability to detect rare anomalies.
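
The following Python sketch, which uses simulated data and is illustrative rather than anything from the original text, makes this trade-off concrete: on a largely homogeneous data set, a small random sample estimates an aggregate statistic such as the mean about as well as a pass over every record, but a handful of rare anomalies is almost certainly missed unless the complete data set is scanned.

    import random

    random.seed(0)

    # Simulated "large" homogeneous data set with a handful of rare anomalies.
    N = 1_000_000
    data = [random.gauss(100.0, 15.0) for _ in range(N)]
    for i in random.sample(range(N), k=5):
        data[i] = 10_000.0  # rare, extreme values

    # 1) Aggregate estimate: a 0.1% random sample comes very close to the
    #    full-data mean, at a small fraction of the processing cost.
    sample = random.sample(data, k=N // 1000)
    print(f"full-data mean: {sum(data) / len(data):.2f}")
    print(f"sample mean:    {sum(sample) / len(sample):.2f}")

    # 2) Rare-anomaly detection: the sample almost certainly contains none of
    #    the 5 anomalies, while a complete scan over all records finds them all.
    threshold = 1_000.0
    print(f"anomalies in sample:    {sum(x > threshold for x in sample)}")
    print(f"anomalies in full scan: {sum(x > threshold for x in data)}")

The same contrast carries over to genuinely Big Data settings, where the "complete scan" becomes a distributed processing job rather than a simple loop over an in-memory list.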