ITNEXT Summit 2019, Data Engineering track recap

Adi Polak · Published in ITNEXT · 5 min read · Oct 31, 2019

The 2019 edition of ITNEXT Summit took place on Oct 30, 2019, in Amsterdam, NL.

The event drew more than 450 passionate attendees who came to learn from great experts and get inspired.

There were 3 tracks:

  • JavaScript
  • DevOps
  • Data Engineering

Each track was led and curated by an MC:

Content curators and organizers, from right to left: Tara Ojo (JavaScript), Holden Karau and Boo (Data Engineering), and Thiago de Faria (DevOps)

Looking at the agenda, the DevOps track was packed with serverless sessions and the Data Engineering track had an emphasis on machine learning. This reflects what is going on in the industry, and not only in Europe: machine learning at scale is taking off globally, and it seems like many companies and open source projects are trying to solve its challenges. Pretty exciting times!

The Data Engineering track had 5 sessions:

1. Bias in AI — choose your data wisely by Katrin Strasser:

Katrin Strasser kicked off the track by highlighting the unconscious bias that we all suffer from, a direct reflection of stereotypes, culture, education, surroundings, and more. She continued with examples of how AI is used in institutions such as banks, insurance companies, courts, public services, and the military. Here are some use-case scenarios:

  • Applying for a loan — the probability that a person will pay back the loan
  • Court decisions — guilty/not guilty
  • Hospitals — what are the chances of a person being healed
  • Public Employment Service — is it worth investing in the person to get a job or not?
  • Classifying people

Why is data so important? Because AI knows nothing on its own!

AI is strictly bound to the data we feed into it.

Where does the bias come from? Data can be biased! Most of the data for classifying humans is produced by… HUMANS! This creates a vicious cycle where bias is amplified by AI.

Word embeddings from the Natural Language Processing (NLP) domain illustrate how stereotype bias comes from the corpus:
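A toy sketch of how the classic analogy test surfaces that bias. The vectors below are invented for illustration (they are not a real trained embedding, and the word list is made up); a real embedding such as word2vec or GloVe would learn whatever stereotypes its training corpus contains:

```python
import math

# Invented word vectors with a "gender" direction baked into the last
# component, standing in for what a corpus-trained embedding would learn.
emb = {
    "man":        [0.8, 0.1,  0.9],
    "woman":      [0.8, 0.1, -0.9],
    "programmer": [0.2, 0.9,  0.7],
    "engineer":   [0.3, 0.8,  0.8],
    "homemaker":  [0.2, 0.9, -0.7],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Solve a : b :: c : ? by vector arithmetic (b - a + c)."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# The embedding "answers" with the stereotype present in its (toy) data.
print(analogy("man", "programmer", "woman"))
```

Because the bias lives in the vectors themselves, every downstream model built on top of them inherits it.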

Katrin showed how the human classification algorithm is mostly based on features that we don’t have control over.

Scoring criteria from the Public Employment Service — it rates a person’s chances on the job market

Katrin continued with face recognition: even an error rate of 0.34%, applied to 300K people per day, adds up to huge numbers.

What can we do to fix it? KNOW YOUR DATA!

Know your use cases and be aware.

Check the models! Understand the common pitfalls.

2. How to use Reinforcement Learning to solve The Abbey of the Crime by Juantomas Garcia:

In his session, Juantomas Garcia walked us through the process of using Reinforcement Learning to win games; his example focused on The Abbey of the Crime, a Spanish game. Combining beer and tech projects can result in bad engineering:

Steps and notes from the project:

  1. Ported the game from Z80 assembly to C++.
  2. Measured the problem space and how many steps it takes to win.
  3. Created an embedded web server to capture steps and the game matrix. Gathering the information took 95% of the work time; this was the hardest part of the project!
  4. Your computer doesn’t have enough processing power, so containerize everything and deploy it to the public cloud. The public cloud offers many out-of-the-box AI features and scale-out solutions. Here is a $200 credit to start with Azure.
  5. The game and the project produced 1 TB of data.
  6. You will probably write only 10 lines of AI; the rest is APIs and data work:

7. The project is a work in progress, not done yet! You can join the fun on GitHub.
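The "10 lines of AI" in step 6 is barely an exaggeration. A minimal sketch of tabular Q-learning, the kind of reinforcement learning update typically used to teach an agent to win simple games, on a made-up 1-D corridor (not the actual Abbey of the Crime state space):

```python
import random

# A made-up 1-D corridor: start at cell 0, goal at the last cell,
# actions are "move left" (-1) and "move right" (+1).
N_STATES, ACTIONS = 6, (-1, 1)
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1   # learning rate, discount, exploration
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def greedy(s):
    # Pick the best-known action, breaking ties randomly.
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])

random.seed(0)
for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        a = random.choice(ACTIONS) if random.random() < EPS else greedy(s)
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == N_STATES - 1 else 0.0   # reward only at the goal
        # The core of the "AI": the one-line Q-learning update.
        Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# After training, the greedy policy should point toward the goal everywhere.
policy = [greedy(s) for s in range(N_STATES - 1)]
print(policy)
```

The learning itself really is a handful of lines; as Juantomas noted, capturing the game state and moving the data around is where the real work goes.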

3. Make your data FABulous by Philipp Krenn:

At this stage of the day, I was in desperate need of coffee.

Here is the session tweet thread by Holden Karau:

4. Organizing Data Engineering around Kubeflow by Mátyás Manninger:

The session started off with an interactive question for the audience: what are your data challenges? The audience filled in a form and we saw the results live on the screen, which was pretty awesome!

We continued with the moving parts of building a machine learning pipeline and how to distribute the effort across teams, as well as how to use Kubeflow to visualize all the deployments and working pieces:

How we can integrate with the cloud:

It was a great session that walked us through Kubeflow, ML pipelines and how to make everyone work together.
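The organizing idea behind that decomposition can be sketched in plain Python (the step names and data below are made up; in Kubeflow each step would run as its own containerized component, but the contract is the same): each team owns one step with a clear input/output boundary, and an orchestrator wires the steps together.

```python
def ingest(source):
    # Data engineering team: pull and clean raw records.
    return [x for x in source if x is not None]

def featurize(records):
    # Feature team: turn cleaned records into model inputs.
    return [float(x) * 2 for x in records]

def train(features):
    # ML team: "train" a trivial stand-in model (here, just the mean).
    return sum(features) / len(features)

# The pipeline is just an ordered list of independently owned steps.
PIPELINE = [ingest, featurize, train]

def run(pipeline, data):
    for step in pipeline:
        data = step(data)   # each step's output feeds the next step
    return data

model = run(PIPELINE, [1, None, 2, 3])
print(model)
```

Swapping a function for a container image and the `run` loop for Kubeflow's pipeline engine is, roughly, what the session was about.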

The closing session for the Data Engineering track was:

5. Stream Processing beyond Streaming Data by Stephan Ewen

Stephan is one of the co-creators of Flink. He walked us through what stream processing is and what the future brings:

The talk walked us through 3 stream processing concepts:

  • Stream processing with SQL — filtering, enriching, aggregations, joins and more.
  • Stateful event-driven processing / stateful functions — coordination and maintaining state.
  • FaaS — functions as a service; it comes from the serverless world, is stateless most of the time, and compute-centric.
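The middle bullet is the one that distinguishes Flink-style processing: state for each key survives across events. A minimal plain-Python sketch of that idea (the event shape and field names are made up for illustration; in Flink the keyed state would be managed and checkpointed by the runtime):

```python
from collections import defaultdict

# Per-key state: one record per driver, kept across events.
state = defaultdict(lambda: {"rides": 0, "total": 0.0})

def on_event(event):
    """Handle one event; the state for the event's key persists between calls."""
    s = state[event["driver"]]
    s["rides"] += 1
    s["total"] += event["fare"]
    return s["total"] / s["rides"]   # running average fare for this driver

stream = [
    {"driver": "d1", "fare": 10.0},
    {"driver": "d2", "fare": 6.0},
    {"driver": "d1", "fare": 14.0},
]
averages = [on_event(e) for e in stream]
print(averages)   # one running average per event
```

A stateless FaaS handler could not produce the running average without an external store; keeping the state next to the computation is exactly the point of stateful functions.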

Stephan finished off with an example of how to combine all of these techniques into one system; in the demo he showed a high-level architecture of a ridesharing backend:

An overall great session where we got a glimpse of where Flink is going and what the industry needs.

This is a wrap!

This was only a glimpse of ITNEXT Summit 2019. Overall, the conference had wonderful sessions, up-to-date content, and an open, welcoming community atmosphere.

Thank you to everyone who made ITNEXT Summit 2019 possible!

👩‍💻 Software Engineer 📚 Author of Scaling Machine Learning with Spark (O'Reilly) 🗣️ Keynote Speaker 💫 Databricks ambassador