Jiří Moravčík - The data behind the success of (not only) large language models [PyData Prague]

AI

Explore the data behind large language models, including GPT-3, and learn about data filtering, biased data sets, and open-source data platforms.

Key takeaways
  • The speaker shared a story about being unable to recognize their own handwriting, highlighting the limitations of human recognition capabilities.
  • The presentation focused on the data behind the success of large language models (LLMs), including GPT-3 and others.
  • The data sets used to train LLMs include Common Crawl, a large corpus of web pages, and book corpora such as Project Gutenberg.
  • The speaker stressed the importance of filtering and cleaning data to remove noise and duplicates (see the deduplication sketch after this list), and highlighted the challenges of dealing with biased data sets.
  • The presentation also touched on facial recognition and its applications, including the use of large-scale data sets like MS-Celeb-1M and VGGFace2.
  • The speaker discussed the transformer architecture and its advantages for processing sequential data (see the attention sketch after this list), and mentioned the use of convolutional neural networks (CNNs) in image recognition tasks.
  • The presentation highlighted the importance of open-source data sets and platforms, such as GitHub and Stack Exchange, in facilitating collaboration and innovation in the AI community.
  • The speaker also mentioned the need for transparency and accountability in the development and use of AI models, particularly in applications involving personal data.
  • The presentation concluded with a demo of a facial recognition system built on the Apify platform, which combines computer vision and machine learning algorithms to identify individuals (see the face-matching sketch after this list).
  • The speaker emphasized the importance of considering the ethical implications of AI applications, particularly in areas like facial recognition and data privacy.
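The filtering and deduplication step mentioned above can be made concrete with a short sketch. The snippet below is a minimal, hypothetical example rather than code from the talk: it drops very short or mostly non-alphabetic documents and removes exact duplicates by hashing normalized text, a simplified stand-in for the fuzzy deduplication used in real LLM data pipelines.

```python
import hashlib


def clean_and_deduplicate(documents):
    """Filter noisy documents and drop exact duplicates.

    A simplified stand-in for the heuristics used when preparing
    web-scale corpora such as Common Crawl for LLM training.
    """
    seen_hashes = set()
    kept = []
    for text in documents:
        normalized = " ".join(text.lower().split())
        # Heuristic quality filters: skip very short documents and
        # documents that are mostly non-alphabetic (menus, markup debris, ...).
        if len(normalized) < 200:
            continue
        alpha_ratio = sum(c.isalpha() or c.isspace() for c in normalized) / len(normalized)
        if alpha_ratio < 0.8:
            continue
        # Exact deduplication via a hash of the normalized text;
        # production pipelines typically use fuzzy methods such as MinHash.
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        kept.append(text)
    return kept
```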
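To illustrate the transformer point, here is a minimal scaled dot-product attention function in NumPy. It is an illustrative sketch of the core operation that lets transformers process a whole sequence in parallel, not code shown in the talk.

```python
import numpy as np


def scaled_dot_product_attention(queries, keys, values):
    """Core transformer operation: each position attends to every other position.

    queries, keys, values: arrays of shape (sequence_length, d_model).
    Returns the attended values with shape (sequence_length, d_model).
    """
    d_k = queries.shape[-1]
    # Similarity of every position with every other position, scaled for stability.
    scores = queries @ keys.T / np.sqrt(d_k)
    # Softmax over the key dimension turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values
```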
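The facial recognition demo can be approximated with the open-source face_recognition library. This is a hypothetical sketch of the general approach (comparing face embeddings against a gallery of known faces); the library choice and function names are assumptions, not the stack used in the actual demo.

```python
import face_recognition


def identify(known_images, unknown_image_path):
    """Match an unknown face against a gallery of known faces.

    known_images: dict mapping a person's name to an image file path.
    Returns the name of the first matching person, or None if nobody matches.
    """
    known_encodings, names = [], []
    for name, path in known_images.items():
        image = face_recognition.load_image_file(path)
        encodings = face_recognition.face_encodings(image)
        if encodings:
            known_encodings.append(encodings[0])
            names.append(name)

    unknown = face_recognition.load_image_file(unknown_image_path)
    unknown_encodings = face_recognition.face_encodings(unknown)
    if not unknown_encodings:
        return None

    # Compare the unknown embedding against every known embedding.
    matches = face_recognition.compare_faces(known_encodings, unknown_encodings[0])
    for name, matched in zip(names, matches):
        if matched:
            return name
    return None
```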