Jiří Moravčík - The data behind the success of (not only) large language models [PyData Prague]
Explore the data behind large language models, including GPT-3, and learn about filtering, biased data sets, and open-source platforms.
- The speaker shared a story about being unable to recognize their own handwriting, highlighting the limitations of human recognition capabilities.
- The presentation focused on the data behind the success of large language models (LLMs), including GPT-3 and others.
- The data sets used to train LLMs include Common Crawl, a large corpus of scraped web pages, and book corpora such as Project Gutenberg (a minimal sketch of reading Common Crawl data follows this list).
- The speaker stressed the importance of filtering and cleaning data to remove noise and duplicates (see the deduplication sketch after this list) and highlighted the challenges of dealing with biased data sets.
- The presentation also touched on facial recognition and its applications, including the use of large-scale data sets such as MS-Celeb-1M and VGGFace2.
- The speaker discussed the transformer architecture and its advantages in processing sequential data (a minimal self-attention sketch follows this list), and mentioned the use of convolutional neural networks (CNNs) in image recognition tasks.
- The presentation highlighted the importance of open-source data sets and platforms, such as GitHub and Stack Exchange, in facilitating collaboration and innovation in the AI community.
- The speaker also mentioned the need for transparency and accountability in the development and use of AI models, particularly in applications involving personal data.
- The presentation concluded with a demo of a facial recognition system built on the Apify platform, which combines computer vision and machine learning algorithms to identify individuals (a generic embedding-based sketch follows this list).
- The speaker emphasized the importance of considering the ethical implications of AI applications, particularly in areas like facial recognition and data privacy.
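
Common Crawl, referenced above, distributes its crawls as WARC archives. The following is a minimal sketch of streaming one record at a time, assuming the `requests` and `warcio` libraries; the archive path is hypothetical, since real paths are published in each crawl's `warc.paths.gz` index.

```python
import requests
from warcio.archiveiterator import ArchiveIterator

# Hypothetical archive path; real WARC paths come from the crawl's warc.paths.gz index.
WARC_URL = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/example.warc.gz"

# Stream the archive rather than downloading it whole; a single WARC file is around 1 GB.
response = requests.get(WARC_URL, stream=True)

for record in ArchiveIterator(response.raw):
    # "response" records hold fetched pages; other record types carry request/metadata info.
    if record.rec_type == "response":
        url = record.rec_headers.get_header("WARC-Target-URI")
        html = record.content_stream().read()  # raw HTTP payload, usually HTML
        print(url, len(html))
```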
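The cleaning step mentioned above can take many forms; one of the simplest is exact-duplicate removal by hashing normalized text. This is a minimal sketch of that idea, not the speaker's actual pipeline:

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing content hashes."""
    seen: set[str] = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

print(deduplicate(["Hello   world", "hello world", "something else"]))
# ['Hello   world', 'something else']
```

Real training pipelines additionally use fuzzy methods such as MinHash to catch near-duplicates that exact hashing misses.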
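The transformer's edge on sequential data comes from self-attention, which relates every token to every other token in a single parallel step instead of processing the sequence one element at a time. Below is a minimal NumPy sketch of scaled dot-product attention, the core operation, with toy dimensions; a real transformer adds learned projections, multiple heads, and feed-forward layers.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # pairwise token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key axis
    return weights @ V  # each output is a weighted mix of value vectors

# Toy example: a sequence of 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)  # (4, 8)
```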
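The demo itself cannot be reproduced from this summary, but the general technique behind such systems, comparing faces by embedding distance, can be sketched with the open-source `face_recognition` library; this library and the image file names are assumptions for illustration, not the tooling shown in the talk.

```python
import face_recognition

# Hypothetical file names: one known reference photo and one photo to identify.
known_image = face_recognition.load_image_file("known_person.jpg")
unknown_image = face_recognition.load_image_file("unknown.jpg")

# Each detected face is mapped to a 128-dimensional embedding vector.
# Indexing [0] assumes a face was actually detected in the reference photo.
known_encoding = face_recognition.face_encodings(known_image)[0]
unknown_encodings = face_recognition.face_encodings(unknown_image)

# Faces "match" when their embeddings fall within a distance threshold.
for encoding in unknown_encodings:
    match = face_recognition.compare_faces([known_encoding], encoding)[0]
    print("match" if match else "no match")
```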