How to build a data lake on premises

The storage that I would build to meet these requirements would be an object store. It'll be capable of many things, so let's see what capabilities we should bring here. Great. [00:27:30] All right, let me get set up here. Let's see here. Okay. With that, we'll switch over to Q and A.

A RAID array of SSDs is that direct-attached storage architecture that we spoke about, where you have compute and storage co-located in a single node and you're just scaling the nodes, and we spoke about some of the disadvantages of that, right? You'd have to have people who are extremely technically smart to manage petabytes and petabytes [00:35:00] of data storage, and with Pure you can just literally set it and forget it. It will just exist, and you don't have to manage it, you don't have to tune performance for it. It's just going to keep delivering that simplicity, performance and scale, simply. That's it.

So let's start with an architecture or [marketecture 00:10:19] diagram, which shows the various layers. On top, you have your applications, right? Yeah.
So [00:00:30] we're going to start with some challenges with legacy data structure today, and how a modern data architecture solves some of these problems. Then we're going to talk about some requirements for modern infrastructure to create these cloud data lakes on premises, and then we'll show you how to accelerate data insights at your organization. Finally, we'll conclude with some pointers to where you can find more technical, [00:01:00] in-depth resources about what I'm talking about, to show you some of the proof points and some examples of how other people have done it. So let's get started. Yeah. All right. And by the way, he corrected himself: instead of B-O-C, it was B-O-X.

He has held various positions formerly at Databricks, Splunk, and Cisco Systems. MinIO is just like the quick, cheap and dirty version of that, and it's basically yes, but you could use our Portworx software to completely manage all your storage. [00:05:00] That's going to be a problem for you to be agile and create value quickly for your end-user teams. So what you want to do is run some nodes and just run the operating system and the basic functions on the local SSDs, on the local drives, and keep all your data on a centralized file and object store that we call UFFO. That way you create an open data layer that can be used by any application, so you're not locking yourself into silos. To me, that's the difference between FlashBlade and a RAID array of SSDs.

It's going to bring you consistent disaggregation of compute and storage, so you can bring a lot of compute to a problem for a few minutes, and then take it away for another problem, right? So let me introduce to you Unified Fast File and Object storage. This is not the name of a product; this is what we call a storage platform that meets those requirements that I just outlined, right? A lot of this content was developed by Joshua Robinson, who's a chief [00:27:00] technologist at Pure, and he's written a very detailed blog describing it. With containers, you get that elasticity and agility that you need.

If you have more nodes, then you have frequent failures. Again, managing hundreds of nodes [00:06:30] is complex, patching them and securing them, and there are going to be lots of failures happening all the time. Either one has its problems. Let's go ahead and open it up for Q and A. There's no need for any of that. It also needs dynamic scalability. [00:17:30] As you scale data, you are usually faced with more complexity and more performance issues, and you don't want to deal with that. As you scale your data, you don't want to have downtime, and you want the performance to go up with scale.

So basically what the user [00:33:30] is asking, what the customer is asking, is: how is it different from just having a bunch of SSDs put together? Right? So this is something that's essential to create the architecture that you need for today's modern data services. So let's talk about how, in the context of Dremio, these applications and this architecture are going to help you. It can be used for building, automating and protecting your cloud-native applications, with modules for core storage, backup, disaster recovery, application data migration, security, and infrastructure automation. All of that is taken care of [00:21:00] by this hundred percent software solution called the Kubernetes data services platform. It could be some kind of deep learning software. You have to keep performance tuning, and users are always complaining about query speeds or something not functioning, so you have to keep performance tuning. All of these cost complexity, and you guys are well aware of that. All right. All right. I don't know what BOC stands for.

Naveen: We'll try and get that [00:32:30] over Slack, maybe.

Dave: Yeah.

And so this diagram brings the whole architecture together. It shows the application layer we spoke about, the Kubernetes layer and then the data management services. Kubernetes can be industry. That's the old architecture, the old way of doing things, [00:28:30] the hyper-converged architecture, where you have a fixed amount of compute and a fixed amount of storage, and if you need to add storage, the compute just comes along with it. Let's say I have very few queries, but I'm getting more data and I need to add storage. I'd have to add a couple of extra nodes there, right?
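To put rough numbers on the hyper-converged point above, here is a minimal sizing sketch. The function names, the per-node capacities and the workload figures are all hypothetical and chosen only for illustration; they do not come from the talk or from any vendor's sizing guide.

```python
import math

def hyperconverged_nodes(data_tb, compute_units, tb_per_node, compute_per_node):
    """Nodes needed when every node carries a fixed slice of both storage and compute."""
    for_storage = math.ceil(data_tb / tb_per_node)
    for_compute = math.ceil(compute_units / compute_per_node)
    # The larger requirement wins, and the other resource comes along with it.
    return max(for_storage, for_compute)

def disaggregated_nodes(data_tb, compute_units, tb_per_storage_node, compute_per_compute_node):
    """Storage nodes and compute nodes are sized independently of each other."""
    storage = math.ceil(data_tb / tb_per_storage_node)
    compute = math.ceil(compute_units / compute_per_compute_node)
    return storage, compute

# Hypothetical workload: data grows from 100 TB to 300 TB, query load stays at 4 units.
before = hyperconverged_nodes(100, 4, tb_per_node=20, compute_per_node=2)
after = hyperconverged_nodes(300, 4, tb_per_node=20, compute_per_node=2)

# Compute units that were bought only because more storage was needed.
idle_compute = after * 2 - 4

print(before, after, idle_compute)
print(disaggregated_nodes(300, 4, tb_per_storage_node=20, compute_per_compute_node=2))
```

Under these assumed numbers the hyper-converged cluster grows from 5 to 15 nodes even though the query load did not change, while the disaggregated layout keeps the compute tier at 2 nodes and grows only the storage tier.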
There's like MLOps, that's also a super big buzzword in the industry right now. Folks, if you have any other questions, let's go ahead and get them in. So box.

Naveen: [crosstalk 00:35:52] That's essentially what I answered.

So let's double-click into that storage layer. I'm from Pure Storage, so obviously I'm going to double-click into that storage layer and find out: what are some [00:12:00] of the requirements of that storage layer in this modern data analytics world? And what are some of the key market drivers, actually not just for the storage layer, but for modern data delivery today? Good afternoon. This is the new world that we're headed into, the open data world. So, we've kind of spoken in theory about the various aspects: we spoke about the challenges, and we spoke about how open data architectures are addressing some of the challenges. And again, go check out that Field Day by Brian Gold, from Pure Storage, and he's going to explain how we built this ground-up architecture to scale. Okay. So you can do this over time, making sure that you're [00:23:30] leveraging the latest S3 protocols and at the same time keeping your users happy with zero downtime. They say, "We're using MinIO, would that work with this type of platform?" It's on Medium, I can throw that in the chat, and we also have a glossy solution sheet to walk you through the solution and some of the benefits to you.
Whether it's an AI/ML project or just BI dashboarding or something else. And finally, you want to have open data systems, so you don't get locked into a tool and then later find out that you have to go to another tool and migrate all this data, which is pretty painful. So [00:03:30] in another year or two, you may be working on tools that are yet to be invented. Unified Fast File and Object storage helps you bridge the gap between existing infrastructure, which may be an HDFS cluster, and a modern data lake. And finally, each piece of analytics software you have in your pipeline, whether it's Spark or Splunk [00:07:00] or Elastic or whatever it may be, Dremio.

I'm sure you've seen several slides like this throughout this conference. Everybody starts with one of these slides, and everybody knows that today's environment is in silos: you have data warehouses, you have a team working on streaming analytics, there's a backup copy [00:01:30] of some data somewhere, a data lake, there's a team working on AI and ML. Many times you have to create copies of your data into all of these different environments, and these different environments have different teams managing them, different levels of service, different reliability standards and different security, right? It's not only storage that's going to do it; there's storage, [00:35:30] compute and networking built into every blade in FlashBlade, so that you have that linear increase in performance as you scale, and you'll see the details in that video.
Second thing, there's a lack of agility. Requirements change on you at any given time. When you start building something, the requirements change and the tools change, so if your infrastructure is rigid, if your data is rigid and you have a certain set of resources allocated to you: "Oh, you've got these 10 nodes and you've got two terabytes of data. That's all you have," and you have to work within rigid infrastructure.

Object storage, but you may have legacy software that may be using NFS, or even current software using the NFS or SMB protocol. So whatever protocol your application is using to access that data, that protocol should be available. And it should also be native to that platform so the performance is good, no matter what protocol the application is using [00:18:30] to access data. It's like, you take a bunch of SSDs, put them together in a box and it becomes a FlashBlade, right?

Below the Kubernetes layer, you're going to have a layer for data management services for Kubernetes. So as a container is spun up or spun down, the data management services layer is going to provide the storage to the Kubernetes layer. And then you're going to have a layer, [00:11:30] which is your modern data lake layer, which is based on open data formats, and this software layer is going to be built on top of block or object storage, or it could be more legacy systems. It's going to be built on a [inaudible 00:11:49]. There's a lot of design that has gone into FlashBlade to create the three things, right?
We all know this is not the way we want to be. We all want to shift to something that's [00:02:00] more sane, secure and reliable, that we can take from experimentation to production pretty rapidly. Everybody knows this, you've seen many slides like this, so what is the state that we want to be in today? And I know DataOps is a very buzzy word right now. This is a fantastic shift; it really brought elasticity and agility to the cloud world.

What we're seeing in 2020 and beyond, especially with innovators like Dremio, is [00:09:00] cloud data lakes that are built on open data, where there's a separation between compute and data, where you have an open data layer on top of your storage that may be built on open metadata standards, open file formats like Parquet, and open table formats such as Delta Lake and Iceberg and other data formats. You've got this open data layer on top of your storage, [00:09:30] and that open data layer is accessed by various applications, via Dremio, Spark or [inaudible 00:09:37] or whatever the application may be. Or what are some of the challenges that we face today?

[00:26:00] And finally, it simplifies operations as you scale. Like I said, Pure [inaudible 00:26:05] is managed from the cloud, it can be consumed as a service, it's completely storage as a service, you only pay for what you use, and you never have to be down for any upgrade or patching. Even if you need to do a controller upgrade, that's all covered with Pure's Evergreen guarantee. I can try to answer if people are there, I can still try and answer that question.
So you can bring compute to whatever you need, rather than allocating specific compute silos. Hey guys. One is, most people buy FlashBlade for simplicity. If you just put a bunch of SSDs together, you need to manage those, and performance, like you said, is going to be inconsistent; you have to tune it for the different application workloads. Here, I'll do it for you: I'll paste that question into your Slack channel and I'll post the link, just give me a second here. [inaudible 00:32:35] Slack. So MinIO is similar to FlashBlade, except FlashBlade is [inaudible 00:29:53] software. For those of you who are still here, it looks like there are still 20 or so people here, just in the chat here.

Naveen: Yeah.

So this is what most data teams want, and we know that, but what are the infrastructure challenges that are preventing us from getting there? FlashBlade is built from the ground up to be a very reliable and [00:30:00] performant object [inaudible 00:30:05] store. Again, if you have any questions, you can use the button in the upper right-hand corner to share your audio or video, and you'll automatically be put into a queue. If for some reason you're having trouble with that, you can just ask your question in the chat. You want faster time to insight, you want to build these pipelines to create business value, right?
And if you're using multiple clusters, different types of clusters, you may be underutilizing resources in one area and overutilizing resources in another area, and you cannot keep trying to rebalance those. So [00:33:00] all right, let me copy the link. [00:36:30] See you on Slack. Let's get started with the agenda.

It is difficult to plan capacity ahead of time. When you plan for something and then you add a node to or remove a node from a cluster, suddenly your data starts rebalancing and you have to move data from one location to another. You need to install a patch. It's just complex, [00:05:30] and the complexity scales with the data. You start with a few terabytes of data or less, and you start scaling to more users, more data, more clusters and nodes, and what happens is that complexity goes through the roof along with your scale.

FlashBlade is completely managed from the cloud, so as you want to add capacity, you just keep slipping in new blades and it adds capacity with no downtime. It's super simple, and there's no need for tuning. It works with any storage, any infrastructure. With FlashBlade, you're getting a much better version, a much more ground-up built version [00:30:30] for your specific needs: low latency and other characteristics, simplicity and other characteristics.

Dave: Okay. Okay, so we've got a couple of questions.
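The rebalancing cost mentioned above, where adding or removing a node makes data move, can be illustrated with a small sketch. The talk does not say how any particular system places data, so this assumes the simplest possible scheme, hashing each object name and taking the remainder by node count. Real clusters typically use smarter placement (for example consistent hashing or a background balancer) that moves less data. The helper names `node_for` and `fraction_moved` are hypothetical.

```python
import hashlib

def node_for(key, node_count):
    """Place a key on a node by hashing it and taking the remainder (naive placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count

def fraction_moved(keys, old_nodes, new_nodes):
    """Share of keys whose home node changes when the cluster is resized."""
    moved = sum(1 for k in keys if node_for(k, old_nodes) != node_for(k, new_nodes))
    return moved / len(keys)

# Hypothetical cluster: 10,000 objects, growing from 10 nodes to 11.
keys = [f"object-{i}" for i in range(10_000)]
moved = fraction_moved(keys, old_nodes=10, new_nodes=11)
print(f"{moved:.0%} of objects would change nodes")
```

With this naive scheme, adding a single node to a 10-node cluster relocates roughly ten elevenths of the objects, which is why placement strategy and rebalancing traffic become an operational concern as node counts grow.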
You had nodes, these hyper-converged nodes, and you were given a certain number of nodes for a particular application, whether it's Hadoop or Spark or whatever [00:08:00] application that may be, and you had these nodes that you just [inaudible 00:08:06] to scale, like hundreds of nodes, to 200 nodes, 300 nodes.

And in 2015 to 2020, we moved into this cloud data warehouse world, where you were in a cloud and there was separation of compute and storage. The whole storage became a sort of cloud S3 layer, and you had cloud data warehouses, which would [00:08:30] separate compute, so they bring compute to a query, and if you're in a cloud it would bring unlimited compute to a particular query for a few minutes, and then spin it down when you don't need it. We got one more question here.

The second: it needs to be an intelligent architecture built on today's technologies, and today's storage demands flash, [00:16:00] right? If you have higher-capacity nodes, it's going to cost more to rebalance. [00:26:30] And finally, you can support all these multi-tenant applications at scale and make everything self-service, and that's our vision: to make analytics and AI scalable, self-service and automated. His team curates best practices to simplify management while delivering performance at petabyte scale for software such as Elasticsearch, Apache Spark, Apache Kafka, TensorFlow, etc. And finally, multi-protocol support: you don't want to bank all your dollars on one particular protocol.

So please join us there, or you can go to the [inaudible 00:31:20] and check out the booths and the demos that you'll find there and some awesome giveaways. Thanks everyone, enjoy the rest of the conference, and thanks Naveen.

Naveen: Thank you so [00:31:30] much for the session.
Finally, from an organizational perspective, from an environmental perspective, you're seeing security becoming a big concern, because that data is now the new oil, that is your IP, and you have to protect [00:14:00] it. There are ransomware attacks everywhere, locking up your data and demanding ransom, so you want to keep it safe. As you support various use cases and more data sources, going from simple dashboards to machine learning to actually [productionizing 00:25:31] [00:25:30] machine learning based software, right? So thanks again, Naveen, for sticking around a little extra and thanks for your talk. [00:34:30] We have several large fan companies using petabytes and petabytes of data to do machine learning on top of Pure's storage devices.

Modern infrastructure is going to be built on containers, these applications are going to run on containers, and so hopefully you have something that's like a container as a service or a cluster as a service, a platform as a service layer that has containers and virtual machines, right? This is what you have in mind, so let's look at [00:11:00] storage and how we bring this paradigm to storage.

If we didn't get your questions, I think we've got them all, but if we didn't, then you can hit up Naveen in Slack. [00:31:00] But before you go, we would appreciate it if you would please fill out the super short Slido session survey, which you'll find in the chat. The next session is coming up; I think we have a panel actually, or a keynote, a fireside chat, I believe.

We've kind of divided this into three trends, depending on whether you look at it from a business angle or from a data angle. [00:12:30] First, we've got workloads that are shifting towards more AI and ML workloads, and your data is more machine-generated data. You cannot do that and just manage it with like one or two guys and forget storage, right? The second shift that we're seeing is, of course, with the cloud: people are moving towards object storage, with more unstructured data. We want you to keep the storage that you have and never pay for storage that you already bought. All of these new data sources that are growing exponentially are machine-generated, [00:13:00] and people are doing AI and ML on them. It's clear that in 10 years, as forward-thinking companies say, most of the code generated would be AI and ML code.

Good morning. First, we have unpredictable performance. You've got data pipelines that service various teams with various requirements, and their jobs [00:04:00] might be slow, their queries might be slowing them down. Anybody that has a query that's stuck is going to just give up and not use the system, right?
