A dedicated data platform for all
Nowadays, everyone is looking for a place where data can be obtained, shared and visualised. Data platforms address exactly this need. At EFS, we have been working in this field for over 10 years and have noticed that every data platform faces the same challenges:
- Which IT infrastructure to use
- How to upload data
- How to make the data searchable
- What metadata is available for the data
- How to store the metadata and generate a search index
- How to visualise the data
- How the platform services communicate with each other
- How to connect IoT devices
- How to manage roles and rights
- How to process data (virtual machines, clusters on demand)
So why not solve these fundamental problems and use the platform to drive innovation?
That is why we want to create a platform for everyone
We use many open source frameworks and will give the result back to the world as open source as soon as the platform has reached sufficient maturity to publish. All services are developed as Infrastructure as Code, so that everyone can build their own platform!
What has been implemented so far?
- Automatic provisioning of IT infrastructure (AWS, Azure, OTC, local)
- Workflow engine (all automations are realised with workflows from the beginning)
- Search index (metadata is held in an OpenSearch cluster)
- Data visualization with Kibana, Grafana
- Upload client
- Authentication service
Use case: Storing and analyzing large files in the system
Each data set consists of the data you want to make available via the platform and the metadata describing that data. Data can be uploaded to the processing zone of the platform via an upload app. This requires authentication against a Keycloak server. If the user is authorised to upload the data, it is stored in the processing zone. As soon as the data has been uploaded, a workflow is started which evaluates the metadata. The metadata is used to decide which further processing logic is triggered.
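The upload flow described above can be sketched in a few lines of Python. The Keycloak token endpoint is the standard OpenID Connect path; the upload URL, the form-field names and the password grant are assumptions for illustration, not the SDK's actual API:

```python
import json

import requests


def token_endpoint(base_url: str, realm: str) -> str:
    """Standard Keycloak OpenID Connect token endpoint.

    Note: Keycloak versions before 17 prefix this path with /auth.
    """
    return f"{base_url}/realms/{realm}/protocol/openid-connect/token"


def fetch_token(base_url: str, realm: str, client_id: str,
                username: str, password: str) -> str:
    """Obtain an access token via the resource-owner password grant."""
    resp = requests.post(
        token_endpoint(base_url, realm),
        data={
            "grant_type": "password",
            "client_id": client_id,
            "username": username,
            "password": password,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]


def upload_dataset(upload_url: str, file_path: str,
                   metadata: dict, token: str) -> dict:
    """Send a data file plus its metadata to the processing zone.

    upload_url and the field names are hypothetical placeholders.
    """
    with open(file_path, "rb") as fh:
        resp = requests.post(
            upload_url,
            headers={"Authorization": f"Bearer {token}"},
            files={"file": fh},
            data={"metadata": json.dumps(metadata)},
            timeout=60,
        )
    resp.raise_for_status()
    return resp.json()
```

If the token request succeeds, the returned bearer token authorises the subsequent upload; the platform then reacts to the new object by starting the metadata workflow.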
The figure shows an example workflow for data in the EFS SDK. Once the data arrives in the processing zone, the information is anonymised: names are removed from the metadata and assigned to an abstract ID. In the second step, meta-information is extracted from the additional data sets provided. This is done in the processing service, which provides, for example, a VM or a Docker container to execute the processing logic. The data itself is then moved into the actual target data pool. The meta-information is copied into OpenSearch and included in the search index. This enables a Google-like search to find the data sets again.
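The anonymisation and indexing steps could look roughly like this. The list of fields treated as personal is an assumption, and the REST call uses OpenSearch's standard document API rather than the SDK's internal services:

```python
import uuid

import requests


def anonymise(metadata: dict) -> tuple[dict, dict]:
    """Replace personal names in the metadata with an abstract ID.

    Returns the cleaned metadata and the ID-to-name mapping, which
    would be kept in a separate, access-restricted store.
    """
    cleaned = dict(metadata)
    mapping = {}
    # Which fields count as personal is an assumption for this sketch.
    for field in ("name", "author", "driver"):
        if field in cleaned:
            abstract_id = str(uuid.uuid4())
            mapping[abstract_id] = cleaned.pop(field)
            cleaned[f"{field}_id"] = abstract_id
    return cleaned, mapping


def index_metadata(host: str, index: str, doc_id: str, doc: dict,
                   auth=None) -> dict:
    """Write extracted metadata into OpenSearch via its document REST API
    (PUT /<index>/_doc/<id>), so it becomes part of the search index."""
    resp = requests.put(
        f"{host}/{index}/_doc/{doc_id}", json=doc, auth=auth, timeout=10
    )
    resp.raise_for_status()
    return resp.json()
```

Keeping the ID-to-name mapping separate from the indexed metadata is what allows the search index to stay anonymous while authorised users can still resolve an abstract ID back to a person.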
For the visualisation, Kibana is used. Here, data from OpenSearch can be displayed and aggregated. In order to visualise the contents of the data sets stored in the data pool, an analysis can be performed in the platform. For this purpose, it is possible to provision a virtual machine or, in the case of larger analyses, a Spark cluster.
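The aggregations behind such a dashboard are ordinary OpenSearch query-DSL requests. The sketch below builds one; the `@timestamp` field name is an assumption about how the documents are mapped:

```python
def kpi_aggregation_query(field: str, interval: str = "1d") -> dict:
    """Build an OpenSearch query body that buckets documents over time
    and averages a numeric field -- the kind of aggregation a Kibana
    panel runs under the hood."""
    return {
        "size": 0,  # only the aggregation buckets are wanted, not the hits
        "aggs": {
            "over_time": {
                "date_histogram": {
                    "field": "@timestamp",          # assumed timestamp field
                    "calendar_interval": interval,  # e.g. "1d", "1h"
                },
                "aggs": {
                    "avg_value": {"avg": {"field": field}},
                },
            }
        },
    }
```

Posting this body to an index's `_search` endpoint returns one bucket per interval with the averaged value, ready to plot as a time series.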
A virtual machine provides a Windows or Linux environment in which users can install the analysis tools they are already used to from their local computers. The machines can be accessed via remote desktop. If large amounts of data need to be processed at the same time, a Spark cluster can be started up to carry out a big-data analysis. On the cluster, Jupyter notebooks can be used to perform the analysis.
The resulting KPIs or time series are stored in an OpenSearch index and then visualised in live dashboards. Regularly recurring analyses can be transferred into a workflow. In this way, new data that match a certain pattern can be automatically analysed and added to the live dashboard.
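As a minimal sketch of this last step: reduce a measured series to a few KPIs and serialise them into the newline-delimited format that OpenSearch's `_bulk` endpoint expects. The chosen KPIs and index name are illustrative, not prescribed by the SDK:

```python
import json
import statistics


def compute_kpis(samples: list[float]) -> dict:
    """Reduce a time series to a few KPIs to be stored in an
    OpenSearch index and shown on a live dashboard."""
    return {
        "count": len(samples),
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.fmean(samples),
    }


def to_bulk_body(index: str, docs: list[dict]) -> str:
    """Serialise documents into the newline-delimited JSON format
    expected by the OpenSearch _bulk endpoint: an action line
    followed by a source line per document, with a trailing newline."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"
```

A recurring workflow step would run `compute_kpis` on each new matching data set and POST the bulk body to `/_bulk`, so the dashboard updates without manual intervention.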
Interested in driving innovation? Then let's talk about how we can learn from each other!