Google Bard AI - What Sites Were Used To Train It? - Search Engine Journal

Last updated Friday, February 10, 2023 06:05 ET, Source: NewsService

Google's Bard AI is trained using website content, but little is known about how that content was collected and whose content was used

Details of websites used to train Bard/LaMDA are shrouded in secrecy
50% of training data is from public forums
Programming Q&A websites and tutorial sites used for training

Google’s Bard is based on the LaMDA language model, which was trained on Infiniset, a dataset built from Internet content. Very little is known about where that data came from or how Google obtained it.

The 2022 LaMDA research paper lists percentages of different kinds of data used to train LaMDA, but only 12.5% comes from a public dataset of crawled content from the web and another 12.5% comes from Wikipedia.
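The arithmetic behind that point can be made explicit. The short sketch below uses only the two percentages quoted above; the names `identified_sources` and `unattributed_share` are illustrative and do not come from the LaMDA paper.

```python
# Shares of LaMDA's training data with a named origin, as quoted
# in the article (percent). The labels are illustrative only.
identified_sources = {
    "public web-crawl dataset": 12.5,
    "Wikipedia": 12.5,
}


def unattributed_share(sources, total=100.0):
    """Return the percentage not covered by the named sources."""
    return total - sum(sources.values())


identified = sum(identified_sources.values())
remainder = unattributed_share(identified_sources)
print(f"Identified: {identified}%  Unattributed: {remainder}%")
```

On these figures, a quarter of the training data has a named origin, which leaves three quarters that the article describes as coming from sources Google does not specify.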

Google is purposely vague about where the rest of the scraped data comes from, but there are hints about which sites are in those datasets.

Google’s Infiniset Dataset

Google Bard is based on a language model called LaMDA, which is an acronym for Language Model for Dialogue Applications.

LaMDA was trained on a dataset called Infiniset.

Infiniset is a blend of Internet content that was deliberately chosen to enhance the model’s ability to engage in dialogue.

The LaMDA research paper (PDF) explains why the researchers chose this composition of content:

“…this composition was chosen to achieve a more robust performance on dialog...

Read Full Story: https://news.google.com/rss/articles/CBMiRWh0dHBzOi8vd3d3LnNlYXJjaGVuZ2luZWpvdXJuYWwuY29tL2dvb2dsZS1iYXJkLXRyYWluaW5nLWRhdGEvNDc4OTQxL9IBAA?oc=5
