Websites Are Making It Harder for AI Models to Access Their Data. Is That Good or Bad?

The AI data commons, a key resource for developing the technology, is allegedly facing an “emerging crisis in data consent.” According to a study from the Data Provenance Initiative, a significant portion of web data sources now restrict AI-related activities. Web publishers are increasingly adding files, such as robots.txt, that tell crawlers they’re not welcome to scrape data.

This shift may affect the quality and quantity of data available for AI training, the report warns. As restrictions proliferate, AI training datasets risk becoming skewed, which could in turn distort model output. But not every AI company is affected equally: Toronto’s Cohere, for example, is far more likely to gain access to a website’s data than ChatGPT maker OpenAI.
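
To make the mechanism concrete, here is a minimal sketch of how a robots.txt file can turn away one AI crawler while leaving the door open to others. The rules, the example URL, and the helper name `allowed_agents` are illustrative assumptions, not taken from the study; `GPTBot` and `cohere-ai` are used here as stand-ins for the crawler names a publisher might list.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one AI crawler, allows everyone else.
# These rules are an illustrative assumption, not data from the report.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt, agents, url="https://example.com/article"):
    """Return a dict mapping each crawler name to whether it may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


access = allowed_agents(ROBOTS_TXT, ["GPTBot", "cohere-ai"])
for agent, ok in access.items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Because robots.txt rules are written per crawler, a site that names only the best-known bots ends up restricting those companies more than others, which is one plausible reading of the uneven access described above. Compliance is also voluntary: the file states the publisher's wishes, and it is up to each crawler to honor them.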

Want to know more? Check out the source code here.