Energy API Data Ingestion and Migration

Arenko CleanTech Portfolio: Datalake Architecture and Historical Data Migration

Project Overview

Objective:
At Arenko CleanTech (an energy trading company), we architected a robust and scalable datalake on AWS. The project handled telemetry data from various external partners and IoT devices across the UK. Our primary goals were to ensure data integrity, facilitate access based on usage, and provide valuable insights to data consumers. Additionally, as a cost-saving measure, we migrated historical telemetry (time-series) data from Apache Cassandra and PostgreSQL into AWS S3 buckets, where it is queried through AWS Athena.
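
To illustrate the migration target: Athena queries files that live in S3, and a common way to lay out time-series data there is Hive-style date partitioning. The snippet below is a minimal sketch of that idea, not the project's actual layout. The `telemetry` prefix, the `year=/month=/day=` scheme, and the row fields `ts` and `value` are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def partition_prefix(prefix: str, ts: datetime) -> str:
    """Return a Hive-style date partition path for one timestamp.

    Assumes `ts` is timezone-aware; it is normalised to UTC so that a
    reading always lands in the same daily partition.
    """
    ts = ts.astimezone(timezone.utc)
    return f"{prefix}/year={ts.year:04d}/month={ts.month:02d}/day={ts.day:02d}"


def group_by_partition(prefix: str, rows: list) -> dict:
    """Group telemetry rows by the partition path they would be written to."""
    groups = defaultdict(list)
    for row in rows:
        groups[partition_prefix(prefix, row["ts"])].append(row)
    return dict(groups)


# Hypothetical readings; the second one is just after midnight in UTC+1,
# which is still the previous day in UTC.
rows = [
    {"ts": datetime(2023, 1, 5, 12, 0, tzinfo=timezone.utc), "value": 1.5},
    {"ts": datetime(2023, 1, 6, 0, 30, tzinfo=timezone(timedelta(hours=1))), "value": 2.0},
    {"ts": datetime(2023, 1, 6, 9, 0, tzinfo=timezone.utc), "value": 2.5},
]
groups = group_by_partition("telemetry", rows)
for path, members in sorted(groups.items()):
    print(path, len(members))
```

Each group would then be written as one or more files under its path, so that a query filtered on date only needs to read the matching directories.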

Tech Stack

  • Languages and Frameworks: Python, Bash, PySpark, SQL
  • Databases: Cassandra, PostgreSQL, TimescaleDB
  • Cloud Services: AWS (Glue, Lake Formation, RDS, CloudWatch, Athena, S3)
  • Orchestration and Infrastructure: Prefect, Terraform
  • Version Control and Project Management: GitLab, Jira, Confluence
  • Monitoring: Grafana, AWS CloudWatch, PagerDuty

Key Responsibilities

  1. Architecting the Datalake:
  • Designed and implemented a datalake architecture using AWS Glue, Athena, and S3 to ingest and store telemetry data from diverse sources.
  • Developed Python and PySpark scripts to process and transform raw data based on specific business rules.
  • Established data governance and access controls using AWS Lake Formation to ensure data security and compliance.
  2. Data Migration:
  • Migrated historical telemetry data from Cassandra and PostgreSQL/TimescaleDB to S3 for querying with AWS Athena, optimizing for cost efficiency and query performance.
  • Validated data integrity and consistency post-migration so that consumers could rely on the migrated data.
  3. System Architecture and Documentation:
  • Created detailed reference diagrams for the system architecture, illustrating data flow and integration points.
  • Developed and maintained comprehensive runbooks for various processes so that other engineering team members could easily replicate solutions.
  4. Cluster Maintenance and Troubleshooting:
  • Managed and troubleshot Cassandra and PostgreSQL/TimescaleDB clusters, ensuring high availability and performance.
  • Performed routine maintenance and resolved issues with instances and nodes to maintain operational stability.
  5. Infrastructure Management:
  • Used Terraform to create and update infrastructure, ensuring scalable and reliable deployment of resources.
  • Implemented infrastructure monitoring using Grafana and AWS CloudWatch, and managed alerts and incident responses through PagerDuty.
  6. Production Support and Code Review:
  • Provided production support and participated in on-call rotations to ensure system reliability and quick resolution of issues.
  • Conducted code reviews to maintain code quality and adherence to best practices within the engineering team.
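
As an illustration of the post-migration validation step, one generic approach is to compare a row count and an order-independent checksum between the source and the migrated copy. The sketch below shows that technique in plain Python. It is not the validation code used on the project, and the function names and sample rows are hypothetical.

```python
import hashlib
from typing import Iterable, Mapping, Tuple


def row_digest(row: Mapping) -> int:
    """Hash one row into a 64-bit integer using a canonical key order.

    Assumes values were already normalised to the same Python types on
    both sides (e.g. float vs Decimal would otherwise hash differently).
    """
    canonical = "|".join(f"{key}={row[key]!r}" for key in sorted(row))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def table_fingerprint(rows: Iterable[Mapping]) -> Tuple[int, int]:
    """Return (row_count, checksum); the checksum ignores row order.

    Digests are summed modulo 2**64 rather than XOR-ed, so duplicated
    rows do not cancel each other out.
    """
    count, checksum = 0, 0
    for row in rows:
        count += 1
        checksum = (checksum + row_digest(row)) % 2**64
    return count, checksum


def validate_migration(source_rows, target_rows) -> dict:
    """Compare fingerprints of the source and the migrated data."""
    src_count, src_sum = table_fingerprint(source_rows)
    tgt_count, tgt_sum = table_fingerprint(target_rows)
    return {
        "counts_match": src_count == tgt_count,
        "checksums_match": src_sum == tgt_sum,
    }


# Hypothetical sample: same rows in a different order, then one row missing.
source = [
    {"device_id": "a1", "ts": "2023-01-05T12:00:00Z", "value": 1.5},
    {"device_id": "a1", "ts": "2023-01-05T12:01:00Z", "value": 1.7},
    {"device_id": "b2", "ts": "2023-01-05T12:00:00Z", "value": 9.0},
]
reordered = list(reversed(source))
incomplete = source[:2]

ok_report = validate_migration(source, reordered)
bad_report = validate_migration(source, incomplete)
print(ok_report)
print(bad_report)
```

In practice the same comparison would usually be run per partition (for example per day), so that a mismatch points to a small slice of data rather than the whole table.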

Achievements

  • Successfully architected a scalable and secure datalake on AWS, providing a centralized repository for telemetry data from multiple sources.
  • Achieved significant cost savings by migrating historical data out of the database clusters into S3, where it is queried through AWS Athena, and by optimizing query performance.
  • Ensured high availability and performance of database clusters through proactive maintenance and troubleshooting.
  • Enhanced team productivity and knowledge sharing by creating detailed runbooks and system architecture documentation.
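
Because Athena bills by the amount of data a query scans, restricting a query to the relevant date partitions reduces both cost and runtime. The sketch below is a back-of-the-envelope illustration of that effect. The partition sizes are made-up numbers, not measurements from this project, and `bytes_scanned` is a hypothetical helper.

```python
from datetime import date


def bytes_scanned(partition_sizes: dict, start: date, end: date) -> int:
    """Sum the sizes of the daily partitions within [start, end], inclusive."""
    return sum(size for day, size in partition_sizes.items() if start <= day <= end)


# Hypothetical: 30 daily partitions of 200 MB each.
sizes = {date(2023, 1, d): 200_000_000 for d in range(1, 31)}

full_scan = sum(sizes.values())
pruned_scan = bytes_scanned(sizes, date(2023, 1, 10), date(2023, 1, 12))

print(f"full scan:   {full_scan:,} bytes")
print(f"pruned scan: {pruned_scan:,} bytes ({pruned_scan / full_scan:.0%} of full)")
```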

At Arenko CleanTech, we leveraged a diverse tech stack and a strategic approach to deliver a robust data infrastructure that supports the company’s mission of providing clean and efficient energy solutions.

Posted on July 13, 2024