Data science and machine learning are making inroads in just about every industry. Corporate adoption of AI continues to grow, and AI developers are working to demonstrate the value machine learning can add across different parts of a company. Not surprisingly, journalism, an industry whose primary focus is the communication of ideas in both text and visual form, has adopted the tools and techniques of data science to put power behind its analysis and visualization of data.
The New York Times (NYT) has had a data science group since 2012, but only recently has this group moved out of the experimental phase and taken a major role in the company, adding value through machine learning. Colin Russel, Director of Data Science at the New York Times, will share some of the insights learned by the NYT data science team at the upcoming Data for AI event on November 4, 2021. Colin draws on his background in predictive modeling and in designing and applying machine learning algorithms to turn the Times’ vast quantities of data into models and visualizations that serve different segments of the company. In this article, we share some of his insights into where data science is heading at the NYT and beyond, along with insights the NYT previously shared at the Data for AI conference in 2020.
Applications of AI
Colin Russel, New York Times
The NYT has invested in building out different machine learning teams that combine aspects of data science, data analytics, and engineering. These teams are centralized, with some data science groups working with the newsroom, others with marketing, and others with different business operations. Although each of these teams is focused on a different aspect of the company’s overall mission, they all aim to build a shared machine learning platform that centralizes the overlapping deployment and infrastructure work for company-wide use.
Traditionally, the newsroom and editorial operations are kept separate from the business side of the company, for the obvious reasons of avoiding conflicts of interest and maintaining a wall between revenue-generating and news-generating activities. This separation between the data journalism side and the data science side of the organization also produces a separation of culture. As a result, working in AI at a large company is often challenging, and clear, constant communication about the process and goals of AI implementation is crucial.
The use of data to drive decision-making and insights nevertheless spans the entire organization, with data analysis powering business decisions as well as journalistic and editorial insights. The newsroom is very interested in data and in understanding the NYT’s audience in a world where many people get their news from social media. Likewise, operations wants data-driven insights to improve advertising performance, deliver optimized content to readers, and gain more visibility into various operations and offerings.
Technology for AI
While many companies outsource their AI tools, the NYT focuses on building rather than buying. Implementing the AI technology itself is often not the hardest part of a project; the real challenge is engineering, organizing, and manipulating the data into a form that can be modeled efficiently. Years ago, data was scattered across the company, and a data scientist who wanted to use data from different sectors needed separate credentials for each one. Add the difficulty of obtaining the data to the difficulty of deciding which parts of it are appropriate for a model, and the actual AI technology becomes the smaller issue.
Because different parts of the company have different areas of focus and priorities, AI developers must figure out how to balance these competing concerns. The NYT recently went through an overhaul to consolidate its data in the cloud. This gave the team an opportunity to start fresh and make it easy to bring in data from every part of the company.
Dealing with Variability
Data science and machine learning models are verified and evaluated both to measure baseline performance and to test model improvements under development. One of the main difficulties in taking advantage of AI is quantifying the goal and choosing the metric to optimize. In the news and journalism industry, there is a lot of variability driven by news cycles. For example, the Covid-19 pandemic has changed the company considerably, as it now gives free access to Covid-19-related news. A subscription business that wants as many subscribers as possible now has a public-service component, reflecting the belief that free access to a certain level of information is very important.
Certain types of recommendation algorithms perform better in certain types of news cycles. Models are retrained as a matter of protocol, and a model’s performance must be interpreted in the context of the current news cycle. To evaluate a model’s quality, performance must be measured over a longer period to account for news cycles and environmental effects. Figuring out which models to use in each news cycle is a challenge that Colin and his team are looking to solve.
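As a minimal illustration of why a metric must be measured over a longer period, the sketch below smooths a daily recommendation metric with a trailing rolling mean so that a single news-cycle spike does not dominate the evaluation. All names and numbers here are hypothetical, invented for illustration; this is not NYT data or the team’s actual methodology.

```python
def rolling_mean(values, window):
    """Mean of each trailing `window`-sized slice of `values`."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Hypothetical daily click-through rates; day 4 simulates a news-cycle spike.
daily_ctr = [0.040, 0.041, 0.039, 0.090, 0.042, 0.040, 0.041]

# A 3-day trailing mean dampens the one-day spike, giving a steadier signal
# for judging whether the model itself improved or the news cycle shifted.
smoothed = rolling_mean(daily_ctr, window=3)
print([round(x, 3) for x in smoothed])
```

Comparing the raw and smoothed series side by side makes it easier to separate a genuine model improvement from a transient burst of news interest; in practice the window would be tuned to the length of typical news cycles.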
Implementing AI and ML algorithms is a challenge in any company, and determining the right technology, metrics, and data is difficult. The New York Times handles these issues daily, and greater detail and further insights will be shared at the upcoming Data for AI event.