The Data Janitor returns | Daniel Molnar

A talk on data engineering challenges, machine learning, and the role of data science in solving business problems. It addresses common pitfalls, metrics, and the industry’s skills shortage, with a focus on ETL tooling and the work of the data janitor.

Key takeaways
  • Data is still dirty and full of garbage
  • Authorization and ETL tooling are important in data engineering
  • The biggest problem is probably too little data rather than too much noise
  • People tend to overengineer and try to do too many things at once
  • You don’t need a huge team to do machine learning, but you need at least one person
  • You don’t need a whole company to get started with data science, but having a dedicated team is better
  • Data scientists are not going to solve your business problems; they just help you answer questions
  • NPS (Net Promoter Score) is an important metric for measuring customer loyalty
  • A/B testing doesn’t always give accurate results; beware of Simpson’s paradox
  • You can’t always trust data; there are many potential biases
  • There are not enough people in the world who know how to deal with data; not even 0.1% have the skills
  • There are also not enough data science jobs to go around, nor enough to solve all the problems
  • Business intelligence and data engineering are still quite separate disciplines
  • There are more and more problems with distributed systems
  • The state of data engineering is “okay”, with some exceptions
  • Some projects are just hype, some are solving real problems
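
The warning about Simpson’s paradox in A/B testing can be made concrete with a small sketch. The counts below are hypothetical, invented for illustration and not taken from the talk; the variant names ("A", "B") and segment names ("mobile", "desktop") are likewise assumptions. The sketch shows one variant converting better in every segment while still losing on the pooled numbers, because traffic is split unevenly across segments.

```python
# Hypothetical A/B test counts (not from the talk):
# (conversions, visitors) for each variant within each segment.
results = {
    "A": {"mobile": (81, 87), "desktop": (192, 263)},
    "B": {"mobile": (234, 270), "desktop": (55, 80)},
}


def segment_rates(data):
    """Conversion rate per variant and per segment."""
    return {
        variant: {seg: conv / vis for seg, (conv, vis) in segments.items()}
        for variant, segments in data.items()
    }


def overall_rates(data):
    """Conversion rate per variant after pooling all segments."""
    pooled = {}
    for variant, segments in data.items():
        conversions = sum(conv for conv, _ in segments.values())
        visitors = sum(vis for _, vis in segments.values())
        pooled[variant] = conversions / visitors
    return pooled


by_segment = segment_rates(results)
overall = overall_rates(results)

for seg in ("mobile", "desktop"):
    print(f"{seg:8s} A={by_segment['A'][seg]:.3f}  B={by_segment['B'][seg]:.3f}")
print(f"{'overall':8s} A={overall['A']:.3f}  B={overall['B']:.3f}")
```

With these made-up counts, variant A has the higher rate in both segments, yet variant B has the higher pooled rate. Comparing only the overall figure would pick the variant that is worse in every segment, which is why it helps to check results per segment before trusting an aggregate.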