We’ve seen them mentioned in the IT news headlines: data lakes. The new hope of IT organisations to enable their business units with actual content and valuable insights, rather than just offering servers and empty storage. Almost all companies of size and renown have embarked on this new journey and are building these lakes to sail upon them. Or maybe not?
Recent concerns raised around the data lake use case, especially since the dawn of GDPR, have made people rethink the share-everything-with-everyone mindset behind these lakes. They have also raised the matter of data ownership, retention, deletion and correction. Most data lake scenarios are viewed from a primarily technical perspective because that is where the idea comes from. Inevitably, and luckily not AFTER the actual release of many of these lakes into production, the legal and compliance departments have woken up.
As we are involved in quite a number of these projects, we wanted to share the main aspects to consider when building your data lake with GDPR in mind. So here we go:
You can argue, of course, that once you work for a company, your data belongs to them. But it’s not that easy. First of all, this is a concept that may or may not apply, depending on the country. Secondly, storing employee data is one thing, but using it for analytical purposes may require the employee’s consent. And that is where you run into challenges. In Germany, for example, companies all have one thing in common: they hold extensive employee data, and are rarely allowed to use it to their advantage because of the current legislation. With the introduction of GDPR, this type of scrutiny will be imposed on all EU countries and hence become a challenge for many more businesses.
Customer data should be the most traditional use case in data privacy and protection, and it is one of the key reasons why the GDPR debate is so viral and vibrant these days: it concerns almost all companies. There’s a lot to discuss around this particular point, but one aspect of particular note is the “Right to Explanation”. If you use machine learning on user data, the GDPR states that “meaningful information about the logic” behind machine learning models must be made available to users.
Many machine learning models are black boxes, but the type of data used to train them should be made clear to users, so that they can make an informed decision to opt out. Users should at all times be offered the option not to have their data used as part of machine learning and artificial intelligence applications.
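To make the opt-out idea concrete, here is a minimal Python sketch of filtering records before they reach any model training step. The `UserRecord` type and its `ml_opt_out` flag are hypothetical names chosen for illustration, and the sketch assumes the user’s choice is stored alongside the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserRecord:
    user_id: str
    ml_opt_out: bool  # hypothetical flag: True if the user declined ML use
    features: tuple


def training_rows(records):
    """Return only the records whose owners have not opted out of ML use."""
    return [r for r in records if not r.ml_opt_out]


# Illustrative sample data, not taken from any real system.
records = [
    UserRecord("u1", False, (0.2, 1.0)),
    UserRecord("u2", True, (0.9, 0.1)),
    UserRecord("u3", False, (0.5, 0.5)),
]

allowed = training_rows(records)
print([r.user_id for r in allowed])  # ['u1', 'u3']
```

Applying the filter at the point where training data is assembled, rather than inside each model, means a later opt-out only has to be honoured in one place.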
With IoT and Connected-X, we all feel like we can’t really participate in modern society without sacrificing some of our privacy to devices and gadgets. From a legal perspective, the providers of services ask for your consent when you install mobile apps or sign up for a SaaS-type service. This is the easy part. Now, imagine you are a car manufacturer who could gain a great deal of insight and competitive advantage by collecting device or car data in the field, and who has all the technology to make that happen, but is not allowed to do it.
In actual fact, this is an issue. People used to buy cars without signing a data privacy agreement. Recently, privacy agreements have become an actual necessity in order to even operate connected car services. As a business, you always have to keep in mind that just because it is device data, it does not follow that you can harvest and use it to your advantage. There is a human being or an organisation behind that device who’s using it. You need their consent, otherwise no data can be legally processed.
Prevent the Drought
So does that mean there is a chance your data lake could dry out very soon? Don’t worry, here are some relatively easy ways to address this challenge:
Anonymisation of data is one way to solve this. This means that the data is stripped of all potential identifiers of human beings and of actual end-user-facing devices, so that only statistical data for very specific use cases is collected. If that isn’t possible in your given use case, it’s a different story. Either way, anonymisation must become an inherent part of all the data processing in the solution you design, and it isn’t bound to the data lake at all: it sits within your application.
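As a rough illustration of stripping identifiers and keeping only statistics, the Python sketch below removes direct identifier fields from each event inside the application and then aggregates per region. The field names, the sample events and the `min_group_size` threshold are all assumptions made for this example. Dropping direct identifiers alone does not guarantee anonymity, which is why very small groups are suppressed here as well.

```python
from collections import defaultdict

# Hypothetical field names that directly identify a person or a device.
IDENTIFIER_FIELDS = {"name", "email", "device_id", "vin"}


def strip_identifiers(event):
    """Return a copy of the event without direct identifier fields."""
    return {k: v for k, v in event.items() if k not in IDENTIFIER_FIELDS}


def aggregate_by_region(events, min_group_size=2):
    """Average speed per region, dropping groups too small to hide individuals."""
    groups = defaultdict(list)
    for event in events:
        clean = strip_identifiers(event)
        groups[clean["region"]].append(clean["speed_kmh"])
    return {
        region: sum(values) / len(values)
        for region, values in groups.items()
        if len(values) >= min_group_size
    }


# Illustrative sample data only.
events = [
    {"vin": "A1", "name": "Ann", "region": "north", "speed_kmh": 80},
    {"vin": "B2", "name": "Bob", "region": "north", "speed_kmh": 100},
    {"vin": "C3", "name": "Cem", "region": "south", "speed_kmh": 60},
]

stats = aggregate_by_region(events)
print(stats)  # {'north': 90.0}
```

The "south" region is left out of the result because a single vehicle would otherwise be identifiable from its own average.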
Encryption of data can be a very easy and elegant way to address the challenge without even building much of a solution into your cloud platforms. Most of the public cloud platforms provide several mechanisms that allow encryption on various layers of the platform at no additional cost. The great thing is that you can automate remediation actions based on alerts whenever data is stored unencrypted in the cloud. With that in place, non-compliance with this standard becomes practically impossible.
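The alert-and-remediate loop can be sketched in a few lines of Python. This is a platform-neutral illustration only: the storage inventory is a plain list of dictionaries, and the `encrypted` key and both function names are hypothetical. In a real setup the check and the fix would be calls to your cloud provider’s own configuration and policy services.

```python
def find_unencrypted(stores):
    """Return the names of stores that are not marked as encrypted."""
    return [s["name"] for s in stores if not s.get("encrypted", False)]


def remediate(stores):
    """Switch encryption on for every flagged store and return their names."""
    fixed = []
    for store in stores:
        if not store.get("encrypted", False):
            store["encrypted"] = True  # stand-in for a real platform call
            fixed.append(store["name"])
    return fixed


# Illustrative inventory; a missing "encrypted" key is treated as unencrypted.
stores = [
    {"name": "raw-events", "encrypted": True},
    {"name": "hr-exports", "encrypted": False},
    {"name": "tmp-uploads"},
]

flagged = find_unencrypted(stores)
fixed = remediate(stores)
print(flagged)                   # ['hr-exports', 'tmp-uploads']
print(find_unencrypted(stores))  # []
```

Treating a missing setting as unencrypted is a deliberate choice in this sketch: a store has to prove it is compliant rather than being assumed so.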
Setting up a Data Management Practice is a general requirement in order to make sure you have full visibility and (access) control over all the data your company holds, manages or has access to. It is also important to run a proper metadata scheme, as complete as possible, across all data types, so that the data is searchable and can be clustered.
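A metadata scheme that supports searching and clustering could look roughly like the Python sketch below. The `DatasetEntry` fields (owner, personal-data flag, retention period, tags) are assumptions about what such a scheme might record, and the entries and retention values are purely illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    name: str
    owner: str
    contains_personal_data: bool
    retention_days: int  # illustrative values, not legal guidance
    tags: set = field(default_factory=set)


def search(catalog, tag):
    """Return the names of all datasets carrying the given tag."""
    return [d.name for d in catalog if tag in d.tags]


def cluster_by_owner(catalog):
    """Group dataset names by the business unit that owns them."""
    clusters = defaultdict(list)
    for d in catalog:
        clusters[d.owner].append(d.name)
    return dict(clusters)


catalog = [
    DatasetEntry("crm_contacts", "sales", True, 730, {"customer", "pii"}),
    DatasetEntry("hr_payroll", "hr", True, 3650, {"employee", "pii"}),
    DatasetEntry("car_telemetry", "iot", False, 365, {"device"}),
]

print(search(catalog, "pii"))     # ['crm_contacts', 'hr_payroll']
print(cluster_by_owner(catalog))
```

Even a small catalogue like this answers the questions that ownership, retention and deletion requests raise: which datasets hold personal data, who owns them, and how long they may be kept.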
There are many more use cases in the Big Data field that require your attention, but we hope we’ve made our point. Just because you have data (in your lake) does not necessarily mean you can actually use it. GDPR demands that you have customer and employee consent before using any form of data collected. At Nordcloud, we combine strong expertise in the Big Data, Machine Learning and IoT fields with years of AWS and Azure project delivery, all wrapped up in a deep awareness of data protection and security.
Please feel free to reach out to us if you think the above sounds familiar but perhaps too complex to tackle on your own. We’re here to help.