I heard data lakes can store any kind of data: relational, NoSQL, pictures/images, Adobe PDF, Excel. How is the data stored: in a NoSQL format, or in a binary tree? Or is it just saved the way a regular hard drive saves files? If so, why not just call it storage instead of a data lake? I am trying to find the exact storage mechanism for a 'data lake'.
1 Answer
A data lake is a system or repository of data stored in its natural format,[1] usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).
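To make "stored in its natural format" concrete, here is a minimal sketch in Python. `ToyDataLake` and its methods are hypothetical names invented for illustration, not the API of any real product; the sketch only assumes that an object store maps flat string keys to raw bytes plus a little metadata, and that structure is applied when the data is read.

```python
import csv
import hashlib
import io
import json


class ToyDataLake:
    """Hypothetical in-memory stand-in for an object store (not a real API)."""

    def __init__(self):
        self._objects = {}  # key -> (raw bytes, metadata dict)

    def put(self, key, data, content_type):
        # Bytes are stored exactly as received; no schema is enforced on write.
        metadata = {
            "content_type": content_type,
            "size": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        }
        self._objects[key] = (data, metadata)

    def get(self, key):
        return self._objects[key][0]

    def head(self, key):
        return self._objects[key][1]

    def list_keys(self, prefix=""):
        # Keys are flat strings; "folders" are only a naming convention.
        return sorted(k for k in self._objects if k.startswith(prefix))


lake = ToyDataLake()
lake.put("raw/sales/2018-09.csv", b"id,amount\n1,9.50\n2,20.00\n", "text/csv")
lake.put(
    "raw/events/click.json",
    json.dumps({"user": 7, "page": "/home"}).encode(),
    "application/json",
)
# Only the 8-byte PNG signature, standing in for a real image.
lake.put("raw/images/logo.png", b"\x89PNG\r\n\x1a\n", "image/png")

# Schema on read: the consumer, not the store, decides how to parse the bytes.
csv_text = lake.get("raw/sales/2018-09.csv").decode()
rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(float(row["amount"]) for row in rows)
event = json.loads(lake.get("raw/events/click.json"))
```

In this sketch the CSV, the JSON document, and the image bytes all sit side by side as opaque blobs; the store itself is neither a NoSQL database nor a tree of records.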
Examples: One example of technology used to host a data lake is the distributed file system used in Apache Hadoop.
Many companies also use cloud storage services such as Azure Data Lake and Amazon S3.[9] Academic interest in the concept of data lakes is also growing; for instance, Personal DataLake[10] at Cardiff University is a new type of data lake that aims to manage the big data of individual users by providing a single point for collecting, organizing, and sharing personal data.[11]
Earlier data lakes (Hadoop 1.0) had limited capabilities: batch-oriented processing (MapReduce) was the only processing paradigm associated with them. Interacting with the data lake meant you needed expertise in Java with MapReduce, or in higher-level tools like Apache Pig and Apache Hive (which were themselves batch-oriented). With the arrival of Hadoop 2.0 and the separation of duties, with resource management taken over by YARN (Yet Another Resource Negotiator), new processing paradigms such as streaming, interactive, and online processing have become available via Hadoop and the data lake.
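As a rough illustration of the batch-oriented MapReduce paradigm mentioned above, the following is a single-process sketch in plain Python. It is not Hadoop's Java API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_batch` are made up for this example, and a real cluster would spread each phase across many machines.

```python
from collections import defaultdict


def map_phase(record):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    # Group all values by key, which happens between the map and reduce steps.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Collapse all values for one key into a single result.
    return key, sum(values)


def run_batch(records):
    # The whole input is processed in one pass; results exist only at the end.
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = run_batch(["data lake stores raw data", "raw files stay raw"])
```

The point of the sketch is that nothing is available until the whole batch finishes, which is why streaming and interactive engines were a notable change.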

-
One more question: how is this any different from a hard drive storing different kinds of files? Or is it just the same thing with a trendy marketing name? Thanks. – Sep 09 '18 at 17:29
-
Well, a hard drive is a kind of server. A server is a computer together with the programs needed to share data and tasks with other computers and programs (usually across some form of network). It may include one or more hard drives, just as any computer may include one or more of its own. Usually a server is dedicated to being used by other computers, and usually it runs without needing a user to constantly tell it what to do. This is the major difference. – ASH Sep 10 '18 at 00:25
-
The definition of a server sounds similar to a data lake, with 'a hard drive being a kind of server'. – Sep 10 '18 at 00:27
-
Go ahead and mark my answers as helpful if this indeed did help you. Thanks! – ASH Sep 10 '18 at 00:31
-
OK, will do. I will probably post another question asking about the distinctions between a storage server and a data lake. – Sep 10 '18 at 00:43
-
It's best to ask one specific question at a time, rather than asking one question and then going off on a few somewhat related tangents after it is answered. That helps people who are searching for a specific question find a specific solution. – ASH Sep 10 '18 at 02:18
-
If you copy from Wikipedia, please also add the link: https://en.wikipedia.org/wiki/Data_lake – Root Loop Dec 15 '21 at 04:08