Distributed File System

Big data (many terabytes or petabytes) can be stored and organized in distributed file systems. While implementations differ in their details, in the most general terms, information is stored on one of many (sometimes thousands of) hard drives attached to standard off-the-shelf computers; an index or map keeps track of where (on which computer/drive) a specific piece of information is stored. In practice, for failover redundancy and robustness, each piece of information is usually stored multiple times, e.g., as triplets (three identical copies on different machines).
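To make the idea concrete, here is a minimal Python sketch of such an index (a toy model, not any real system's API; the node names, block IDs, and class names are purely illustrative). Each block of a file is assigned to three randomly chosen storage nodes, and the map records where the replicas live:

```python
import random

REPLICATION_FACTOR = 3  # each block is stored in triplicate

class BlockIndex:
    """Toy 'master map': tracks which nodes hold each block of a file."""

    def __init__(self, nodes):
        self.nodes = list(nodes)   # available storage nodes
        self.locations = {}        # block_id -> list of replica nodes

    def place(self, block_id):
        # Choose three distinct nodes at random to hold the replicas.
        replicas = random.sample(self.nodes, REPLICATION_FACTOR)
        self.locations[block_id] = replicas
        return replicas

    def lookup(self, block_id):
        return self.locations[block_id]

index = BlockIndex(nodes=[f"node-{i:03d}" for i in range(100)])
index.place("sales.log/block-0")
print(index.lookup("sales.log/block-0"))  # e.g. ['node-042', 'node-007', 'node-091']
```

Real systems add much more (heartbeats, re-replication when a node dies, rack awareness), but the core bookkeeping is essentially this lookup table.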

So, for example, suppose you collect individual transactions across a large retail chain. Details of each transaction will be stored as triplets on different servers and hard drives, with a master table or map keeping track of exactly where each transaction's details can be retrieved. By using standard off-the-shelf hardware and open-source software for managing this distributed file system (such as Hadoop), reliable petabyte-scale data repositories can be built relatively easily, and such storage systems are quickly becoming commonplace.
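Retrieval works the same way in reverse, and the triplet copies are what make it robust. The short sketch below (again a toy model; the node names and transaction ID are made up) looks up a block's replica list in the master map and reads from the first reachable copy, so the loss of a single drive or server does not lose the data:

```python
def read_block(locations, block_id, node_is_up):
    """Try each replica in turn; any one healthy copy suffices."""
    for node in locations[block_id]:
        if node_is_up(node):
            return f"<contents of {block_id}, read from {node}>"
    raise IOError(f"all replicas of {block_id} are unreachable")

# The master map says this transaction block lives on three nodes;
# suppose the first of them has crashed.
locations = {"txn-0001": ["node-042", "node-007", "node-091"]}
down = {"node-042"}
print(read_block(locations, "txn-0001", lambda n: n not in down))
# Prints a read from node-007: one failed drive does not lose the data.
```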