Jiří Moravčík - The data behind the success of (not only) large language models [PyData Prague]
Explore the data behind large language models, including GPT-3, and learn about filtering, biased data sets, and open-source platforms.
- The speaker shared a story about being unable to recognize their own handwriting, highlighting the limitations of human recognition capabilities.
- The presentation focused on the data behind the success of large language models (LLMs), including GPT-3 and others.
- The data sets used to train LLMs include Common Crawl, a large corpus of scraped web pages, and book corpora such as Project Gutenberg (a minimal sketch of reading Common Crawl data follows this list).
- The speaker stressed the importance of filtering and cleaning data to remove noise and duplicates (see the deduplication sketch after this list) and highlighted the challenges of dealing with biased data sets.
- The presentation also touched on facial recognition and its applications, including the use of large-scale data sets such as MS-Celeb-1M and VGGFace2.
- The speaker discussed the transformer architecture and its advantages in processing sequential data (a minimal self-attention sketch follows this list), and mentioned the use of convolutional neural networks (CNNs) in image recognition tasks.
- The presentation highlighted the importance of open-source data sets and platforms, such as GitHub and Stack Exchange, in facilitating collaboration and innovation in the AI community.
- The speaker also mentioned the need for transparency and accountability in the development and use of AI models, particularly in applications involving personal data.
- The presentation concluded with a demo of a facial recognition system built on the Apify platform, which combines computer vision and machine learning algorithms to identify individuals (a generic embedding-based sketch follows this list).
- The speaker emphasized the importance of considering the ethical implications of AI applications, particularly in areas like facial recognition and data privacy.
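
Common Crawl, referenced above, distributes its crawls as WARC archives. The following is a minimal sketch of streaming one record at a time, assuming the `requests` and `warcio` libraries; the archive path is hypothetical, since real paths are published in each crawl's `warc.paths.gz` index.

```python
import requests
from warcio.archiveiterator import ArchiveIterator

# Hypothetical archive path; real WARC paths come from the crawl's warc.paths.gz index.
WARC_URL = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/example.warc.gz"

# Stream the archive rather than downloading it whole; a single WARC file is around 1 GB.
response = requests.get(WARC_URL, stream=True)

for record in ArchiveIterator(response.raw):
    # "response" records hold fetched pages; other record types carry request/metadata info.
    if record.rec_type == "response":
        url = record.rec_headers.get_header("WARC-Target-URI")
        html = record.content_stream().read()  # raw HTTP payload, usually HTML
        print(url, len(html))
```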
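The cleaning step mentioned above can take many forms; one of the simplest is exact-duplicate removal by hashing normalized text. This is a minimal sketch of that idea, not the speaker's actual pipeline:

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing content hashes."""
    seen: set[str] = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

print(deduplicate(["Hello   world", "hello world", "something else"]))
# ['Hello   world', 'something else']
```

Real training pipelines additionally use fuzzy methods such as MinHash to catch near-duplicates that exact hashing misses.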
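The transformer's edge on sequential data comes from self-attention, which relates every token to every other token in a single parallel step instead of processing the sequence one element at a time. Below is a minimal NumPy sketch of scaled dot-product attention, the core operation, with toy dimensions; a real transformer adds learned projections, multiple heads, and feed-forward layers.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # pairwise token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key axis
    return weights @ V  # each output is a weighted mix of value vectors

# Toy example: a sequence of 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)  # (4, 8)
```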
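The demo itself cannot be reproduced from this summary, but the general technique behind such systems, comparing faces by embedding distance, can be sketched with the open-source `face_recognition` library; this library and the image file names are assumptions for illustration, not the tooling shown in the talk.

```python
import face_recognition

# Hypothetical file names: one known reference photo and one photo to identify.
known_image = face_recognition.load_image_file("known_person.jpg")
unknown_image = face_recognition.load_image_file("unknown.jpg")

# Each detected face is mapped to a 128-dimensional embedding vector.
# Indexing [0] assumes a face was actually detected in the reference photo.
known_encoding = face_recognition.face_encodings(known_image)[0]
unknown_encodings = face_recognition.face_encodings(unknown_image)

# Faces "match" when their embeddings fall within a distance threshold.
for encoding in unknown_encodings:
    match = face_recognition.compare_faces([known_encoding], encoding)[0]
    print("match" if match else "no match")
```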