AWS Glue: A Cost-Effective, Scalable ETL Solution

Fraser Sequeira
5 min read · Apr 30, 2020
Our Reporting Architecture using AWS Glue

In this article I would like to share the challenges we faced when choosing AWS Glue as our reporting and compliance data store. We are a company that deals with IoT devices that continuously send chunks of data to the cloud, and we can also execute remote commands from our cloud service. Each event/command payload is less than 5 MB, but the frequency of solicited/unsolicited events received and of commands being fired means we have to maintain anywhere between 190 and 200 GB of data per day. We were looking for a fully managed solution that could retain around 15 months of data and allow us to query it in a cost-optimized way.

AWS Glue is a promising managed Spark service that can handle loads of data, analyze it and transform it into a compressed, query-friendly (Parquet) format. It drastically reduced our data-source management, upgrade and deployment effort.

The following are some important things to consider at the design stage when starting with AWS Glue:

  1. Partitioned Data: When choosing a set of partitioning keys, ensure that the number of partitions doesn't grow exponentially over time. Bad partitioning can lead to a large number of small files, request throttling from S3 (S3 "slow down" errors) and long-running Glue jobs. For our architecture we chose device-type/operation as part of the partitioning key, in addition to year/month/day. Let's say we have around 30 device types, each supporting 20 operations executed daily; that results in around 600 partitions a day, 18K partitions a month and roughly 200K partitions a year, well below the limit on the number of partitions a Glue table can support, which is around 10 million. And now that our data is stored by device type and operation, we can fetch 15 months of installation (operation) data for device type A in a cost-optimized way:
     datasink4 = glueContext.write_dynamic_frame.from_options(
         frame=applymapping2,
         connection_type="s3",
         connection_options={"path": s3_output_path,
                             "partitionKeys": ["year", "month", "day", "devicetype", "operation"]},
         format="parquet",
         transformation_ctx="datasink4")
  2. Bad Column Names: The AWS Glue crawler cannot handle non-alphanumeric characters. If a crawler encounters a special character in a Parquet schema it simply terminates, throwing an internal service exception. If possible, add strict checks on your event schema so that non-alphanumeric characters don't propagate through your system, or manage it in the Glue job itself (see the sanitization sketch after this list). You can validate your Parquet files with this Windows tool: https://www.microsoft.com/en-us/p/apache-parquet-viewer/9pgb0m8z4j2t . Some examples of bad column names in our payloads:
    payload.assetInfo.firmware.ApplicationSoftwareDSS(DB)
    payload.assetInfo.firmware.afp-diag-server
    payload.assetInfo.firmware.afp-pumpmonitor-server
    payload.assetInfo.firmware.CPB539 U-Boot
    payload.assetInfo.firmware.LAM (Legal Authority Module)
    payload.assetInfo.firmware.Core-FWR[Build]
  3. Data Conversion (Parquet): Converting our JSON data to Parquet helped reduce data storage by 95%. This columnar format is best suited for querying via Athena and for QuickSight analysis. Our raw JSON data of about 194 GB/day shrinks to around 10 GB/day once converted to Parquet. Hence even a bad Athena query that scans an entire month's 300 GB of data would cost us only around $1.5 in data scanned (Athena charges $5 per TB scanned).
  4. Relationalize: This transform helps unnest or flatten your JSON, so that each key in the JSON translates to a column in the Glue table. Use it if you want to query every field of a deeply nested JSON. Also, if your payload has arrays that you wish to query, you would have to write each array out to an S3 location, which can then be crawled, joined to your main table and queried using Athena. The only drawback of Relationalize is that it doesn't retain the original payload, but you can work around that by keeping the original payload as a string, e.g. json.dumps(dynamicRecord["payload"]) (see the Relationalize sketch after this list). Our production table has around 469 columns that can be queried through Athena.
  5. Small files are a nightmare: We have a Logstash script that dumps data into S3, which is then picked up by the Glue job. To speed up a job run and reduce cost, it's best to have larger files rather than a huge set of small ones. We modified our Logstash configuration to push files of at least 25 MB, which improved our Glue job performance (see also the file-grouping sketch after this list).
  6. Job Run Cost: We have set up two Glue jobs with bookmarks enabled, using the G.1X worker type with a maximum capacity of 40 and 11 DPUs respectively. The jobs run every 2 hours, after which the crawler (conditional trigger) adds new partitions to the table. The entire Glue workflow has an execution time of 25 minutes and consumes about 3,000 DPU-hours, costing roughly $1,400 a month (Glue billed around $0.44 per DPU-hour at the time of writing).
  7. Glue Job Metrics: We have enabled job metrics to track data reads/writes and executor performance. They also help analyze how many executors are necessary/active for a job run.
Glue Job metrics
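
Regarding point 2, below is a minimal sketch of how column names could be sanitized inside the Glue job itself, before the Parquet files are written. It assumes the incoming DynamicFrame is called source_dyf (a placeholder name); the helper simply replaces anything that is not alphanumeric or an underscore.

    import re
    from awsglue.transforms import Map

    def _clean(obj):
        # Recursively rebuild nested dicts/lists with sanitized keys.
        if isinstance(obj, dict):
            return {re.sub(r"[^0-9A-Za-z_]", "_", k): _clean(v) for k, v in obj.items()}
        if isinstance(obj, list):
            return [_clean(v) for v in obj]
        return obj

    def sanitize_keys(rec):
        # rec behaves like a dict inside Map.apply; rewrite its keys and nested keys.
        for key in list(rec.keys()):
            value = _clean(rec[key])
            clean_key = re.sub(r"[^0-9A-Za-z_]", "_", key)
            if clean_key != key:
                del rec[key]
            rec[clean_key] = value
        return rec

    sanitized_dyf = Map.apply(frame=source_dyf, f=sanitize_keys, transformation_ctx="sanitize1")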
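
And for point 4, a sketch of the workaround mentioned above: keep the raw payload as a string column, then flatten with Relationalize. Here source_dyf and temp_dir are placeholders for the job's input frame and a staging path.

    import json
    from awsglue.transforms import Map, Relationalize

    def keep_raw_payload(rec):
        # Preserve the original nested payload as a JSON string before flattening.
        rec["payload_raw"] = json.dumps(rec["payload"])
        return rec

    with_raw = Map.apply(frame=source_dyf, f=keep_raw_payload, transformation_ctx="keepraw1")

    # Relationalize returns a collection: the flattened root frame plus one frame per nested array.
    flattened = Relationalize.apply(frame=with_raw, staging_path=temp_dir, name="root", transformation_ctx="rel1")
    root_dyf = flattened.select("root")
    array_frame_names = [k for k in flattened.keys() if k != "root"]  # write these to S3, crawl and join via Athena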
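
On point 5, fixing the producer is the real cure, but Glue can also group many small S3 objects into larger chunks at read time, which softens the small-file penalty. A sketch, assuming the raw JSON sits under s3_input_path; the 128 MB group size is just an illustrative value.

    # Group small input files into ~128 MB chunks per task instead of one task per tiny file.
    source_dyf = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={
            "paths": [s3_input_path],
            "recurse": True,
            "groupFiles": "inPartition",
            "groupSize": "134217728"
        },
        format="json")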

  8. ETL Workflow: We used an ETL workflow to configure our Glue setup, as we had two dependent Glue jobs and a crawler to trigger on success of the second job. A workflow is far more maintainable than individually configuring triggers for every part of your setup, and it also helped us understand the execution time taken by the entire ETL pipeline (a boto3 sketch follows the workflow figure below).

ETL workflow setup
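
For reference, a rough boto3 sketch of wiring up a workflow like ours: a scheduled trigger starts the first job, a conditional trigger starts the second job when the first succeeds, and another conditional trigger runs the crawler at the end. All resource names here are placeholders, not our actual jobs or crawler.

    import boto3

    glue = boto3.client("glue")

    glue.create_workflow(Name="etl-workflow", Description="2-hourly reporting ETL pipeline")

    # Kick off the first job every 2 hours.
    glue.create_trigger(
        Name="start-every-2h", WorkflowName="etl-workflow", Type="SCHEDULED",
        Schedule="cron(0 */2 * * ? *)",
        Actions=[{"JobName": "raw-to-parquet"}], StartOnCreation=True)

    # Run the second job only when the first one succeeds.
    glue.create_trigger(
        Name="after-first-job", WorkflowName="etl-workflow", Type="CONDITIONAL",
        Predicate={"Conditions": [{"LogicalOperator": "EQUALS",
                                   "JobName": "raw-to-parquet", "State": "SUCCEEDED"}]},
        Actions=[{"JobName": "aggregate-job"}], StartOnCreation=True)

    # Finally, crawl the output so the new partitions land in the Glue table.
    glue.create_trigger(
        Name="after-second-job", WorkflowName="etl-workflow", Type="CONDITIONAL",
        Predicate={"Conditions": [{"LogicalOperator": "EQUALS",
                                   "JobName": "aggregate-job", "State": "SUCCEEDED"}]},
        Actions=[{"CrawlerName": "partition-crawler"}], StartOnCreation=True)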

Conclusion: AWS Glue now comes with two new features: processing data from Kinesis streams and reduced start-up times. We have successfully migrated from our legacy data storage system to this AWS managed service, saving around $6,000/month. It has also helped us retain data for the full 15 months without worrying about cluster/node scalability or indexing issues. You will still need to learn a bit of PySpark, as you sometimes have to switch between AWS DynamicFrames, Spark DataFrames and RDDs in your Glue job.
