Data Engineering: Processing millions of sensor measurements on AWS

By Lars Haferkamp on May 1, 2023

In a project for a major car manufacturer, my team and I implemented highly scalable Big Data jobs and infrastructure on AWS to process millions of daily sensor measurements from cars around the world.

Data Lake Zones

In this project we implemented the approach of data lake zones similar to the one described in the DZone article “Data Lake Governance Best Practices”.

The first zone, where all the data is stored in the format and with the content exactly as it was ingested from the source, is called the “raw zone”. This raw zone is one of the original data lake concepts that became popular when cloud storage became cheap. It needs the highest level of privacy protection: access should be tightly restricted and strong security measures, such as encryption at rest, should be enforced.
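As a rough illustration (not the project’s actual setup), default encryption and a full public-access block can be enforced on a raw-zone bucket with boto3; the bucket name and the KMS key alias below are assumptions:

```python
import boto3

s3 = boto3.client("s3")
RAW_BUCKET = "datalake-raw-zone"  # hypothetical bucket name

# Enforce default server-side encryption with a customer-managed KMS key
s3.put_bucket_encryption(
    Bucket=RAW_BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "alias/raw-zone-key",  # assumed key alias
            },
            "BucketKeyEnabled": True,
        }]
    },
)

# Block any form of public access to the raw zone
s3.put_public_access_block(
    Bucket=RAW_BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```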

In the next step, large data processing jobs running on clusters convert the data from the raw zone into a more standardised format, checking data quality and validating the records along the way. This second zone is often called the ‘trusted’ or ‘pre-processed’ zone.
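A minimal PySpark sketch of such a raw-to-trusted job could look like the following; the S3 paths, column names and quality rules are assumptions, and the real jobs contained far more validation logic:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("raw-to-trusted").getOrCreate()

# Raw zone: JSON exactly as ingested (assumed path layout and schema)
raw = spark.read.json("s3://datalake-raw-zone/sensors/2023/05/01/")

trusted = (
    raw
    # basic quality checks: required keys present, coordinates plausible
    .filter(F.col("vehicle_id").isNotNull() & F.col("timestamp").isNotNull())
    .filter(F.col("lat").between(-90, 90) & F.col("lon").between(-180, 180))
    # standardisation: typed timestamp, duplicate measurements removed
    .withColumn("measured_at", F.to_timestamp("timestamp"))
    .drop("timestamp")
    .dropDuplicates(["vehicle_id", "measured_at", "sensor"])
    .withColumn("date", F.to_date("measured_at"))
)

# Trusted zone: columnar, partitioned Parquet (assumed bucket)
trusted.write.mode("overwrite").partitionBy("date").parquet(
    "s3://datalake-trusted-zone/sensors/"
)
```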

Based on the data in this second zone, specific use cases can be implemented, such as calculating business-relevant KPIs or training machine learning models. These use cases usually require only a subset of the data and perform more specific pre-processing, such as aggregating it. The results are then stored in a use-case-specific data zone that only the people working on that use case can access. In addition, differential-privacy techniques can be applied to the use-case-specific data, for example injecting noise that does not materially change the KPI or the distribution of the data.
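To illustrate the noise-injection idea (a simplified sketch, not a complete differential privacy mechanism), Laplace noise could be added to an aggregated KPI like this; the paths, the privacy budget and the sensitivity value are assumptions:

```python
import numpy as np
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("use-case-kpi").getOrCreate()

trusted = spark.read.parquet("s3://datalake-trusted-zone/sensors/")  # assumed path

# Use-case-specific aggregation: measurement counts per region
kpi = trusted.groupBy("region").agg(F.count("*").alias("measurements"))

epsilon = 1.0      # assumed privacy budget
sensitivity = 1.0  # each measurement contributes at most one row to the count

@F.udf(DoubleType())
def add_laplace_noise(value):
    # classic Laplace mechanism: noise scaled to sensitivity / epsilon
    return float(value + np.random.laplace(0.0, sensitivity / epsilon))

kpi_private = kpi.withColumn("measurements", add_laplace_noise("measurements"))
kpi_private.write.mode("overwrite").parquet("s3://datalake-usecase-zone/kpi/")
```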

Data processing with Spark on AWS

To transform large data sets between different zones, I implemented Spark jobs and ran them on AWS EMR.

Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
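One common way to run such jobs on EMR is to submit them as steps to a running cluster, for example via boto3. The cluster ID, script location and arguments below are placeholders, not the project’s actual values:

```python
import boto3

emr = boto3.client("emr", region_name="eu-central-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    Steps=[{
        "Name": "raw-to-trusted",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # lets EMR run spark-submit as a step
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://my-artifacts/jobs/raw_to_trusted.py",  # assumed script location
                "--date", "2023-05-01",
            ],
        },
    }],
)
print(response["StepIds"])
```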

Spark is very flexible, but for more standardised transformations a managed service such as AWS Glue can be a better fit. Another advantage of Spark jobs is that they are cloud-vendor agnostic; however, how often you actually switch cloud vendors is questionable, and I appreciate the advantages of standardised, fully integrated cloud-specific tools. For me, that is one of the key value propositions of a cloud provider.

AWS has also adapted to the challenges that Spark poses as a distributed processing framework. For example, writing to S3 is not straightforward when multiple compute nodes write at the same time and assume a certain state of the “file system”. S3 is not a POSIX-compliant file system, and its consistency model (it was only eventually consistent until late 2020) posed challenges when distributed, parallel jobs stored and read data in quick succession.
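On EMR, one piece of that adaptation is the EMRFS S3-optimized committer, which avoids the rename-based commit protocol. The sketch below shows the relevant Spark settings; the exact configuration keys and defaults depend on the EMR release, so treat them as assumptions:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-friendly-writes")
    # EMRFS S3-optimized committer: commits Parquet output without slow,
    # error-prone S3 rename operations (enabled by default on newer EMR releases)
    .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
    # the optimized committer is bypassed with dynamic partition overwrites,
    # so the default static mode is kept here
    .config("spark.sql.sources.partitionOverwriteMode", "static")
    .getOrCreate()
)
```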

In addition, AWS offers very nice integration with spot instances: Spark jobs on AWS can take advantage of these cheap instances and remain resilient when spot capacity is reclaimed during processing.
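As an illustration, an EMR instance-fleet configuration could mix stable on-demand core nodes with cheap spot task capacity; the instance types, capacities and timeout values below are assumptions:

```python
# Instance fleets for boto3's emr.run_job_flow(Instances={"InstanceFleets": ...});
# a MASTER fleet is omitted here for brevity.
INSTANCE_FLEETS = [
    {
        "Name": "core-on-demand",
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": 2,  # stable capacity for HDFS / shuffle data
        "InstanceTypeConfigs": [{"InstanceType": "r5.2xlarge"}],
    },
    {
        "Name": "task-spot",
        "InstanceFleetType": "TASK",
        "TargetSpotCapacity": 8,  # cheap, interruptible workers
        "InstanceTypeConfigs": [
            {"InstanceType": "r5.2xlarge"},
            {"InstanceType": "r5a.2xlarge"},  # multiple types improve spot availability
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": 20,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    },
]
```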

Processing geospatial data with Spark

I implemented Spark jobs that processed a lot of geospatial data (car sensor readings with their associated GPS positions). This posed some challenges because we had to do map matching, i.e. find out which street a GPS position belongs to. Instead of loading the map data for all possible coordinates (Europe and North America) into memory on every node, we solved this with clever partitioning and Spark’s mapPartitions() method, so that each partition only loads the map data it actually needs.
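The pattern looks roughly like this in PySpark; load_map_tile() and match_to_street() are hypothetical helpers standing in for the actual map-matching logic, and the input is assumed to be an RDD of dictionaries already partitioned by a geo key:

```python
def match_partition(rows):
    rows = list(rows)
    if not rows:
        return
    # Load the map data for this partition's region once, not once per record
    street_index = load_map_tile(rows[0]["geohash"])  # hypothetical helper
    for row in rows:
        # hypothetical helper returning the street a GPS position belongs to
        street_id = match_to_street(street_index, row["lat"], row["lon"])
        yield {**row, "street_id": street_id}

# partitioned_rdd: RDD of measurement dicts, partitioned by geohash (assumed)
matched = partitioned_rdd.mapPartitions(match_partition)
```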

Not only is the geodata distributed all over the world and needs to be partitioned, but the sensor readings are also unevenly distributed: in cities there are many more cars driving and sending data than in the countryside. So one standard geo-partition (e.g. a geohash) might receive millions of measurements per day while another receives very few. We implemented our own Spark partitioner for this unevenly distributed geodata; the idea is sketched below. Spark can also be extended with efficient spatial SQL, similar to what we know from PostGIS.
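Reduced to a sketch with assumed hot-spot geohashes and partition counts, the core idea behind the skew handling is to salt the busy keys so that a single city spreads over several partitions:

```python
import random
from pyspark.rdd import portable_hash

NUM_PARTITIONS = 400
HOT_GEOHASHES = {"u33d", "u09t", "dr5r"}  # assumed city-level hot spots

def salt_key(record):
    geohash, measurement = record
    if geohash in HOT_GEOHASHES:
        # busy keys get a random salt so one city spreads over several partitions
        return ((geohash, random.randint(0, 15)), measurement)
    return ((geohash, 0), measurement)

# keyed_rdd: RDD of (geohash, measurement) pairs (assumed)
balanced = keyed_rdd.map(salt_key).partitionBy(
    NUM_PARTITIONS, lambda key: portable_hash(key) % NUM_PARTITIONS
)
```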

Another option for processing big data and geodata on AWS is Athena (based on Presto), which also has support for some standard geospatial queries. It’s a SQL-based tool and provides ODBC/JDBC access, so it’s often used as an integration layer to other (BI) tools.
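A hedged example of what such a query could look like when started via boto3; the database, table, columns and output location are assumptions, while ST_Point and ST_Distance are Athena’s standard geospatial functions:

```python
import boto3

athena = boto3.client("athena", region_name="eu-central-1")

query = """
SELECT vehicle_id,
       -- planar distance in coordinate units, for illustration only
       ST_Distance(ST_Point(lon, lat), ST_Point(13.4050, 52.5200)) AS dist_to_berlin
FROM measurements
WHERE measurement_date = DATE '2023-05-01'
LIMIT 100
"""

result = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "sensor_trusted"},  # assumed database
    ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/"},
)
print(result["QueryExecutionId"])
```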

Summary

The concept of data lake zones, while not new, is still relevant and is supported by newer services such as AWS Lake Formation, where data access and security policies can be specified much more easily than before.

Spark, as one of the first popular big data frameworks after Hadoop, is still relevant if you work with specialised data such as geodata, and it is also the engine behind more standardised tools such as AWS Glue.

Attribution

Header photo by Samuele Errico Piccarini on Unsplash
