Developing and Maintaining Data Pipeline Architecture at Realinflo
Hello! I’m Imen Chaieb, Data Engineer Intern at Realinflo with a background in Business Analytics and Information Technology.
Realinflo is a data analytics platform that provides granular, instant and verified building and transaction data on markets across Asia-Pacific. I’m currently developing and maintaining a data pipeline architecture, mainly using Python and Apache Airflow (an ETL workflow orchestration tool), with the aim of optimizing the performance of our company’s big data ecosystem.
In my machine learning projects, I had always taken the data as a given and never thought about how critical good data engineering practices are. As you know, “garbage in, garbage out”: having good datasets is as important as having good models. And that’s what Data Engineers are there for; they gather, validate, clean, and structure data before it is used.
Most Data Engineers come from a software engineering background, so you can tell how much I had to learn to carry out my tasks. That’s why I’d like to share with you what I have learnt during my first month.
1 – Communication (can’t stress this enough!)
Communication might not be our strongest skill, but it is crucial, especially when you’re new to an industry (e.g. real estate), to make sure you understand the data well and how it will be used later on. Don’t be afraid to ask questions and clarify any ambiguities.
2 – Pay attention to details
Working with unstructured data can feel like stepping on a Lego: it is mentally challenging. So always take your time to check the quality of your output. It’s worth it, trust me! This will save you the headache of dealing with task failures in your data pipelines later on.
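To make this a bit more concrete, here is a minimal sketch of the kind of output check I mean. It is not our actual pipeline code: the field names (`building_id`, `city`, `price`) and the helper functions are hypothetical and only there for illustration.

```python
from typing import Any

# Hypothetical required fields for a building record (illustrative only).
REQUIRED_FIELDS = ("building_id", "city", "price")


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or value == "":
            problems.append(f"missing {field}")
    # A numeric price that is zero or negative is treated as suspicious.
    price = record.get("price")
    if isinstance(price, (int, float)) and price <= 0:
        problems.append("price must be positive")
    return problems


def split_valid_invalid(records: list[dict[str, Any]]):
    """Separate clean records from the ones that need a closer look."""
    valid, invalid = [], []
    for record in records:
        problems = validate_record(record)
        if problems:
            invalid.append((record, problems))
        else:
            valid.append(record)
    return valid, invalid
```

Running a check like this before a task hands its output to the next one means bad rows show up with a clear reason attached, instead of surfacing later as a failed task somewhere downstream.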
3 – Google is your friend
Most problems you encounter already have a solution, if not a plethora of them. Search for them on the internet, read research papers if necessary, then choose the best solution for your problem and adjust it to your case. Building solutions from scratch is great, but why lose time and effort when one already exists?
4 – Optimize your code
As a former Data Analyst, I had never paid attention to how efficiently my code used the CPU. Now that I’m aware of it, I’m becoming increasingly obsessed with it. Design patterns such as computed / caching patterns spare the CPU from repeating the same work and save you a lot of time.
“Design Patterns: Elements of Reusable Object-Oriented Software” is a great book if you’d like to learn more about design patterns.
I hope you’ve found my first month’s takeaways useful. Stay tuned for my next blog post here, where we’ll get more technical!