Data Engineer

Open-ended contract

Mission

The Data Engineer is responsible for building and maintaining the infrastructure that supports the organization’s data architecture. The role involves creating and managing Airflow data pipelines for data extraction, processing and loading, and ensuring their maintenance, monitoring and stability.

The engineer will work closely with data analysts and end-users to provide accessible and reliable data.

What we expect from the candidate

  • Candidate must be comfortable working on Unix: using standard commands to check processes, read files and run bash commands, and accessing a Unix server to run commands there. If some process is not running (for example, a Hadoop/YARN daemon or an Airflow container that is not up), the candidate needs to check the server and investigate what might be going on.
  • Candidate must know how to list Docker containers, build Docker images, change existing images to add or remove things, and use and map volumes. They must be able to set up and maintain a distributed Airflow environment using Docker, including building custom Docker images with the Airflow image as a base.
  • We strongly expect the candidate to know Airflow and its components, to identify and fix possible issues on the servers, and to add more workers to the cluster. They need to make sure the containers are running fine on the servers and be able to fix any issue that appears (a minimal DAG sketch is given after this list).
  • Candidate must know how to maintain a Hadoop/YARN cluster with Spark: which processes need to run on the servers, how to set up the XML configuration files for Hadoop and YARN, and how to run HDFS commands. They need to be able to add a new worker to the Hadoop cluster, fix possible issues on the servers when necessary, and read the logs from YARN and HDFS. They must understand how Spark works with YARN as the resource manager (see the PySpark sketch after this list).
  • Candidate must know how to develop in Python, manage packages with pip, review pull requests from other people in the team, and maintain and use a Flask API (see the Flask sketch after this list).
  • Candidate must know SQL, including queries with CTEs and window functions, mainly on Oracle databases (see the query sketch after this list).
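
To give a concrete idea of the Airflow work referenced above, below is a minimal sketch of an extract/process/load DAG. The DAG id, task names and callables are hypothetical placeholders and only illustrate the expected structure, not the project’s actual pipelines.

    # Minimal, hypothetical ETL DAG sketch; all names are illustrative only.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        ...  # pull data from the source system

    def process():
        ...  # transform the extracted data

    def load():
        ...  # load the result into the target database or data lake

    with DAG(
        dag_id="example_etl",             # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        process_task = PythonOperator(task_id="process", python_callable=process)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> process_task >> load_task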
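
For Spark running on YARN, the sketch below shows, under stated assumptions, how a PySpark session would be created with YARN as the resource manager. The application name, HDFS paths, column name and memory setting are hypothetical; in practice such a script is typically launched with spark-submit against a properly configured Hadoop/YARN cluster.

    # Hypothetical PySpark job using YARN as the resource manager.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("example_job")                  # hypothetical name
        .master("yarn")                          # YARN schedules the executors
        .config("spark.executor.memory", "4g")   # illustrative resource setting
        .getOrCreate()
    )

    # Read Parquet data from HDFS, aggregate it and write the result back.
    df = spark.read.parquet("hdfs:///data/input")                  # hypothetical path
    result = df.groupBy("some_column").count()                     # hypothetical column
    result.write.mode("overwrite").parquet("hdfs:///data/output")  # hypothetical path

    spark.stop()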
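
Regarding the Flask API mentioned above, below is a minimal sketch of the kind of service to maintain; the endpoint and response are hypothetical.

    # Minimal hypothetical Flask API sketch.
    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/health")
    def health():
        # Simple health-check endpoint, e.g. for monitoring.
        return jsonify(status="ok")

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)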
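
As an illustration of the SQL expectations, the sketch below runs an Oracle query combining a CTE with a window function from Python using the python-oracledb driver. The table, columns and connection details are hypothetical assumptions.

    # Hypothetical query with a CTE and a window function on Oracle.
    import oracledb

    SQL = """
        WITH daily_totals AS (
            SELECT load_date, SUM(amount) AS total_amount
            FROM   sales                                   -- hypothetical table
            GROUP  BY load_date
        )
        SELECT load_date,
               total_amount,
               SUM(total_amount) OVER (ORDER BY load_date) AS running_total
        FROM   daily_totals
        ORDER  BY load_date
    """

    # Connection details are placeholders.
    with oracledb.connect(user="app_user", password="***", dsn="dbhost/service") as conn:
        with conn.cursor() as cur:
            for row in cur.execute(SQL):
                print(row)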

 

Main Tasks:

  • Responsible for maintaining the infrastructure that supports the current data architecture
  • Responsible for creating data pipelines in Airflow for data extraction, processing and loading
  • Responsible for data pipeline maintenance, monitoring and stability
  • Responsible for providing data access to data analysts and end-users
  • Responsible for the DevOps infrastructure
  • Responsible for deploying Airflow DAGs to the production environment using DevOps tools
  • Responsible for code and query optimization
  • Responsible for code review
  • Responsible for improving the current data architecture and DevOps processes
  • Responsible for delivering data in useful and appealing ways to users
  • Responsible for performing and documenting analysis, review and study of specified regulatory topics
  • Responsible for understanding business changes and requirement needs, and assessing their impact and cost

Profile

Technical Skills:

 

  • Python
  • Experience in creating APIs in Python
  • PySpark
  • Spark environment architecture
  • SQL, Oracle database
  • Experience in creating and maintaining distributed environments using Hadoop and Spark
  • Hadoop ecosystem - HDFS + YARN
  • Containerization - Docker is mandatory
  • Data lakes - experience in organizing and maintaining data lakes - S3 is preferred
  • Experience with the Parquet file format
  • Apache Airflow - experience in both pipeline development and deploying Airflow in a distributed environment
  • Apache Kafka
  • Experience in automating application deployment using DevOps tools - Jenkins is mandatory, Ansible is a plus

 

Language Skills

  • English                                                                                                                                      

Organization

Inetum is a European leader in digital services. Inetum’s team of 28,000 consultants and specialists strive every day to make a digital impact for businesses, public sector entities and society. Inetum’s solutions aim at contributing to its clients’ performance and innovation as well as the common good.

Present in 19 countries with a dense network of sites, Inetum partners with major software publishers to meet the challenges of digital transformation with proximity and flexibility.

Driven by its ambition for growth and scale, Inetum generated sales of 2.5 billion euros in 2023.

Country

Portugal

Location

Lisbon

Contract type

Open-ended contract

Apply