How Reliable Are AI Tools for News and Research? A Comparative Study

Inside this Article

Key Takeaways
How Did We Do It?
Performance Analysis of AI Tools
Methodology
Trends and Key Insights
Conclusion

Key Takeaways

ChatGPT produced up-to-date, accurate, and well-organized information in over 90% of our tests. However, it didn’t provide any sources or references for its data, greatly impacting its reliability for news and research.
Perplexity AI listed multiple credible sources for the information it provided in the first prompt, making it the most reliable tool for conducting research out of all AI search engines included in our study.
Bing yielded quality information for all the questions except for one — which it failed to answer. Bing included several reputable sources in over half (56%) of its responses, but it cited unreliable sources in 4 of its outputs (36%).
Mistral AI and Claude didn’t offer real-time information in any of their responses, making them largely unreliable for researching current events and rapidly evolving topics. Mistral AI was last updated in October 2023, while Claude only has information up to April 2024.
Google Gemini’s responses were up-to-date and easy to understand, but it only presented sources to back up its claims in 2 out of 11 responses (18%). It is also the only AI tool that failed to answer 2 of our questions.
SearchGPT usually listed reliable sources in the first prompt request, all of them with links for further reading. Although it scored the lowest on clarity (an average of 2.9 out of 4), most of its answers were still well-structured and easy to understand.

The advent of artificial intelligence (AI) technology has increased efficiency and productivity in many industries, perhaps none more so than online content creation and research. Due to their uncanny ability to understand commands and produce human-like text, AI search engines, especially ChatGPT, have changed the way researchers write and discover academic literature as well as news and current events.

But how much can we rely on AI to produce accurate, up-to-date, and unbiased information from reputable sources? We at Website Planet performed an analysis of the top AI tools to see how reliable they are for newsgathering and conducting research. You can read our findings below.

How Did We Do It?

In our study, we included some of the most popular and widely used AI search engines currently on the market, including ChatGPT, Perplexity AI, Bing, Claude, Mistral AI, and Google Gemini. SearchGPT was included last, as it was released after we conducted our initial study.

We wanted to make our analysis as comprehensive as possible and truly test the reliability of AI search engines and large language models. So, we asked each AI tool a series of questions across 11 different categories, including historical research, product comparisons, and niche technical information.

Then, we analyzed each AI response against a set of criteria, such as how up-to-date the information is, whether it lists any sources or references, and how reliable the sources are. You can find the full set of questions and criteria in the Methodology section of this article.

Depending on the AI’s performance, each criterion was rated on a four-point scale as poor, fair, good, or excellent. We averaged the results for each category and each criterion, and then combined them into an overall score for each AI tool. Finally, we compared the tools’ performance against each other.

Performance Analysis of AI Tools

In the section below, you can see how each AI tool performed in our study.

ChatGPT

ChatGPT is an AI chatbot and virtual assistant developed by OpenAI and launched in November 2022. It was built on transformer models, specifically the GPT architecture, a type of neural network designed for processing and generating text.

To evaluate ChatGPT’s reliability for news and research, we asked it 11 different questions across various categories. For instance, we prompted it to elaborate on the election issues that people in the United States are most concerned about.

While it provided relevant and accurate information on the topic, it didn’t present sources or references. It also didn’t offer multiple perspectives, especially on controversial issues like immigration and gun control. This greatly reduces the AI’s reliability, and it impacts the depth and trustworthiness of the result.

This trend continued with the rest of ChatGPT’s responses — it didn’t offer the sources for any of its answers, but the information provided was perfectly clear 10 out of 11 times.

For example, when asked to define the AI Safety Bill, it gave a clear and informative answer, avoiding overly technical jargon and addressing current information on global legislative efforts related to AI. However, the absence of sources, supporting data, and diverse perspectives weakened the output’s overall reliability and depth.

We noticed a similar trend in the rest of the categories we tested ChatGPT in, including medical research, niche technical information, linguistic diversity, cryptocurrency, and educational resources.

Perplexity AI

Perplexity AI is a powerful research and writing tool that has been on the market since 2022. It can assist with finding relevant sources, synthesizing information, editing, proofreading, and other tasks. It uses advanced natural language processing (NLP) and transformer-based language models to understand and process user queries.

When we tested how reliable Perplexity AI is for conducting research, we found its responses to be up-to-date, reflecting the latest information and trends in the subject matter.

Furthermore, the responses were clear and lacked overly technical jargon, making them easy to understand for a broad audience. All the outputs seemed unbiased, and they were fully consistent with information found in other reputable sources over 70% of the time.

This AI search engine also provided multiple reliable references immediately in the first response, allowing users to further explore the topic they’re researching. For instance, when asked to define the AI Safety Bill, Perplexity AI cited sources like the New York Times, LA Times, and Politico, which are credible for news on legislation and technology.

However, it sometimes offered links to more general websites, such as Wikipedia, which are unsuitable for academic-level research. This slightly lowered the overall reliability of Perplexity AI’s sources, which scored 3.1 out of 4 in this category.

Most of Perplexity AI’s answers included a variety of perspectives, but it could sometimes benefit from offering different viewpoints on the topic. For instance, when asked about language diversity in Europe, the answer covered various language types (such as official, regional, and immigrant languages) but it didn’t expand on specific challenges derived from language diversity or mention any relevant policies in different countries.

Sometimes its answers lacked specific data points or detailed statistics to fully support its claims. For instance, when asked to provide the economic causes for the fall of the Roman Empire, Perplexity AI’s answer failed to provide more specific evidence, such as quantitative data on taxation or inflation.

Bing

Bing, also known as Copilot, is an AI-powered chatbot owned and developed by Microsoft. Beyond its search engine capabilities, Bing can also perform various other tasks, such as generating content, providing answers to simple and complex questions, and conducting research. It’s based on OpenAI’s GPT-4 generative machine learning model, though its outputs aren’t the same as ChatGPT’s.

Over the course of our analysis, we found that Bing provided up-to-date and relevant information that reflected the latest developments and discussions on the subject matter. Bing’s information was also unbiased and presented clearly, with a well-organized structure and no overly technical language in 90% of its answers.

Bing also provided multiple credible sources in the first prompt, like MIT Technology Review, TechRadar, and Popular Science. However, in 40% of the answers, its sources lacked the authority of more academic or governmental platforms. For instance, when asked about the latest advancements in quantum error correction techniques, Bing provided sources from preprint servers like arXiv, which aren’t yet peer-reviewed.

While Bing’s responses were comprehensive and informative, they sometimes lacked multiple perspectives or opposing viewpoints. For instance, when asked what causes high blood pressure, it listed various causes (e.g., age, stress, obesity) but failed to provide a deeper analysis or alternative perspective on risk factors (e.g., genetic predispositions vs. lifestyle choices).

In addition, Bing failed to answer the question about which election issues people in the United States are most concerned about. Instead, it stated the following: “I wish we could talk about elections, but it’s a complex topic that goes beyond my training. Sorry!”

Bing also stated that, while it can provide factual information about elections and political events, it avoids promoting any political views or engaging in partisan discussions.

When asked about which other topics it cannot discuss, Bing listed medical, health, and legal advice, stating that it can only offer general information on the topics but is unable to provide specific advice. Additionally, Bing avoids discussing personal data and privacy, harmful or dangerous content, and sensitive social issues, such as race, religion, and gender.

Claude

Claude is an AI assistant created by Anthropic. It can be used for various tasks, including analyzing information, answering questions, helping with math and coding, creative writing, and discussing a wide range of topics. According to Anthropic, Claude was trained on a mix of public information, datasets licensed from third-party businesses, and data shared by users.

While conducting our study, we found that Claude only has information up to April 2024. So, using it to explore current events or look up recent developments in any field may be unproductive. Although it can be a good option for general inquiries, such as asking it to provide the best prompts for creative writing, Claude is an overall poor choice for conducting research.

For instance, when asked about trends in cryptocurrency, Claude gave a clear, well-organized answer, but it didn’t cite any sources, making it difficult to verify its claims. It also offered no links or suggestions for further reading, so users can’t easily expand on their research based on Claude’s answer.

Mistral AI

This French AI platform was founded in 2023 by former Google DeepMind and Meta employees. As a large language model, Mistral was likely trained on a dataset similar to those of other models like Claude and GPT-4. However, the specific details of Mistral’s training data are proprietary information, and the developers have not disclosed the exact sources or methodologies used.

During our analysis, we found that Mistral AI was last updated in October 2023. As a result, the responses did not reflect the most current developments. For instance, when asked about the biggest tech-related news of 2024, Mistral addressed the query to an extent by outlining potential highlights, but it failed to offer concrete news from the year in question.

In addition, it didn’t cite any sources or references to indicate where the information is derived from. Mistral’s responses also lacked concrete evidence or specific data that supports the general claims made.

Overall, even though Mistral AI provided easy-to-understand and straightforward responses for 95% of the queries, the lack of sources or references made it difficult to verify their accuracy. As a result, this AI search engine isn’t a good choice for researching current events or rapidly evolving topics.

It can, however, be used for other tasks. For instance, it provides translation in multiple languages, defines words, and provides synonyms or antonyms. It can also help with simple tasks, such as calculations or conversions.

Google Gemini

Google Gemini is a generative AI model that was released in March 2023. It performs several functions, from content creation to image generation, code writing, brainstorming, and interacting with Google’s other apps and services. It is based on the LLM of the same name, which uses a transformer model-based neural network architecture.

During our research, Google Gemini provided up-to-date responses that reflected the latest developments on the subject at hand. The information was also presented in a clear and understandable manner, with minimal jargon and easy-to-follow explanations.

On the other hand, more than half of Gemini’s answers (55%) lacked multiple perspectives, which led to biased responses. For example, when asked about the best 3D printers for beginners, it focused only on the positive aspects of the recommended devices and failed to mention any drawbacks.

Perhaps Google Gemini’s biggest downside is that, in most cases, it didn’t offer sources or references for its information. During our research, sources were only provided for two of our questions.

More specifically, when asked to provide the most recent advancements in quantum error correction, it referenced reliable sources like IBM, Nature, and academic institutions. However, not all claims were directly linked to these sources in the first response.

When prompted to provide the causes of high blood pressure, Google Gemini also provided reputable sources in the initial prompt, including websites like Johns Hopkins and Mayo Clinic. However, no sources or references were provided for the rest of our questions. That makes it the only AI tool in our research that sometimes provided sources and sometimes didn’t.

Gemini is also the only tool in this study that failed to answer two questions: what the AI Safety Bill is and what election issues people in the United States are concerned about. Instead, it stated the following: “I can’t help with that right now. I’m trained to be as accurate as possible but I can make mistakes sometimes. While I work on perfecting how I can discuss elections and politics, you can try Google Search.”

SearchGPT

SearchGPT is a generative AI search engine developed by OpenAI and launched in 2024. It leverages the advanced technology of GPT-4, combining traditional search engine features with generative pre-trained transformers to generate responses.

During our research, SearchGPT provided up-to-date and relevant information, reflecting recent developments in the subject in question. This AI tool also offered sources in the first output for each one of our prompts, providing credibility and context for the information provided.

Overall, most of SearchGPT’s sources were reputable, including the Wall Street Journal, PCMag, and World Economic Forum, though some lacked academic rigor, like Wikipedia, The Sun, and ABC News. The best thing about SearchGPT, though, is that it lists sources as links, which makes it easy to follow up on most pieces of data.

Like many of the other AI tools in this research, SearchGPT often failed to include multiple perspectives in its responses. Nearly half (45%) of the answers provided lacked a broader analysis or contrasting scholarly interpretations on the issue at hand. Similarly, SearchGPT did not provide specific data such as rankings or statistics in around 90% of its responses, and yet it earned the second-highest score on this criterion.

Our Final Rating of AI Tools for Research

Looking at the results above, we can conclude that Perplexity AI is the most suitable AI search engine for news and research, earning 3.5 out of 4 points across all categories.

It offered the most consistent quality responses, with credible sources and references that allow the user to further explore the topic and check the accuracy of the information. Its information was both up-to-date and easy to understand, reflecting the most recent developments in each of the categories we tested it in.

SearchGPT comes second in our rating, earning an average of 2.9 points. Although it sometimes lacks a broad perspective, SearchGPT is more likely to offer multiple sources in the first prompt request and all of them come with links. This makes it one of the best options on our list if you are interested in following up on the information provided.

Microsoft’s Bing ranks third, with a score of 2.8 out of 4 possible points. Like Perplexity AI, it offered up-to-date and clear information, providing credible sources in the first response. It did, however, fail to provide an answer to our question about the elections in the United States, which means it’s not always the best choice for covering more complex topics.

ChatGPT comes fourth, with 2.4 points. While it provided accurate, up-to-date information, its biggest downfall was the lack of citations or evidence for its data, which makes it difficult for users to verify and further explore its claims.

Google Gemini comes fifth, earning an average of 2.2 points across all 11 categories. Much like ChatGPT, it offered quality and up-to-date information, but it was rarely supported by sources or evidence, which greatly affected its score. In addition, it’s the only AI tool that failed to answer two of our questions.

Mistral AI and Claude come last in our ranking, each earning 2 points. These AI tools didn’t offer real-time information, which can be critical when conducting comprehensive research. However, their responses were clear and well-organized, so these AIs may be more suitable for general inquiries that aren’t necessarily time-sensitive.

Methodology

To determine how reliable the AI search engines are for news and research, we gave each tool 11 prompts across the following categories:

Current Events: What election issues are Americans concerned about?
Breaking News: Highlights of the biggest tech-related news of 2024
Niche Technical Information: What are the latest advancements in quantum error correction techniques for fault-tolerant quantum computing?
Product Comparisons: Top 3D printers for beginners
Legal and Regulatory Information: In one sentence tell me what the AI Safety Bill is
Historical Research: The economic causes of the fall of the Roman Empire
Medical Information: What causes high blood pressure?
Cultural and Linguistic Diversity: What are the language diversities in Europe?
Creative Queries: Best prompts for creative writing
Financial and Economic Analysis: Trends in cryptocurrency regulation
Educational Resources: Top universities for AI research

Testing these topics gave us a comprehensive view of the strengths and weaknesses of each AI search engine and allowed us to understand how effective they are in delivering accurate, up-to-date, and objective information in a user-friendly format.

Evaluation Criteria

We used the following criteria to analyze each response:

Currentness: Is the information up to date?
Clarity: Is the answer clear and understandable?
Pertinence: Does the result address the specific question or query?
Diversity: Does the result include multiple perspectives?
Objectivity: Is the result free from bias or misleading information?
Consistency: Is the answer consistent with other reliable sources?
References: Does the AI include sources in the first prompt request?
Reliability: Is the source reliable?
Evidence: Is there evidence or data to support claims?
Links: Is the information linked to relevant sources for further reading?

Then, we graded each response according to the following scoring system:

Excellent (4 points)
Good (3 points)
Fair (2 points)
Poor (1 point)

We evaluated the AI’s performance in each category based on the average score for the 10 criteria. For instance, here is how Google Gemini performed in the Creative Queries category:

AI Tool	Category	Currentness	Clarity	Pertinence	Diversity	Objectivity	Consistency	References	Reliability	Evidence	Links	Average
Google Gemini	Creative Queries	4	4	4	3	3	2	1	2	1	1	2.5

We also evaluated the AI’s performance on each criterion based on the average score for the 11 questions. For example, below you can see how objective ChatGPT’s responses were:

AI Tool	Criterion	Q1	Q2	Q3	Q4	Q5	Q6	Q7	Q8	Q9	Q10	Q11	Average score
ChatGPT	Objectivity	3	3	3	3	3	3	3	3	4	3	3	3.1
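As a rough illustration, the two averages shown in the tables above can be reproduced with a few lines of Python. This is a minimal sketch, not the tooling used in the study: the variable and function names are hypothetical, and it assumes a simple unweighted mean rounded to one decimal place, which matches the figures in both tables.

```python
from statistics import mean

# Scores copied from the two example tables above (1 = poor, 4 = excellent).
gemini_creative_queries = {
    "Currentness": 4, "Clarity": 4, "Pertinence": 4, "Diversity": 3,
    "Objectivity": 3, "Consistency": 2, "References": 1, "Reliability": 2,
    "Evidence": 1, "Links": 1,
}
chatgpt_objectivity = [3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3]  # Q1 to Q11


def average_score(scores):
    """Unweighted mean, rounded to one decimal (the rounding is an assumption)."""
    return round(mean(scores), 1)


# Category average: one question, averaged over the 10 criteria.
category_avg = average_score(list(gemini_creative_queries.values()))

# Criterion average: one criterion, averaged over the 11 questions.
criterion_avg = average_score(chatgpt_objectivity)

print(category_avg)   # 2.5
print(criterion_avg)  # 3.1
```

The first value is 25 points spread over 10 criteria, and the second is 34 points spread over 11 questions (about 3.09 before rounding).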

Finally, we averaged those numbers to come up with the overall rating for each AI tool.

Excellent (3.5–4.0 Points): The results are highly accurate, reliable, well-referenced, and objective across most or all criteria.
Good (2.5–3.4 Points): The results are mostly reliable and clear, but there may be minor issues in one or two criteria (e.g., lack of multiple perspectives).
Fair (1.5–2.4 Points): The results are somewhat useful but may lack depth, clarity, or accuracy in several areas.
Poor (1.0–1.4 Points): The results are unreliable, biased, or missing critical information like sources or relevant perspectives.
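The rating bands above can be expressed as a small lookup. The sketch below is illustrative only: `rating_label` is a hypothetical helper, and it assumes that scores are rounded to one decimal place before a band is applied, since the bands as listed leave gaps between, for example, 2.4 and 2.5. The overall scores are the ones reported in our final rating.

```python
def rating_label(score):
    """Map an overall score on the 1-4 scale to one of the rating bands."""
    score = round(score, 1)  # assumption: round to one decimal before banding
    if not 1.0 <= score <= 4.0:
        raise ValueError("score must be between 1.0 and 4.0")
    if score >= 3.5:
        return "Excellent"
    if score >= 2.5:
        return "Good"
    if score >= 1.5:
        return "Fair"
    return "Poor"


# Overall scores as reported in the final rating section of this article.
overall_scores = {
    "Perplexity AI": 3.5,
    "SearchGPT": 2.9,
    "Bing": 2.8,
    "ChatGPT": 2.4,
    "Google Gemini": 2.2,
    "Mistral AI": 2.0,
    "Claude": 2.0,
}

for tool, score in overall_scores.items():
    print(f"{tool}: {score} -> {rating_label(score)}")
```

Under these assumptions, Perplexity AI lands in the “Excellent” band, SearchGPT and Bing in “Good”, and the remaining four tools in “Fair”.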

Trends and Key Insights

Looking at the results of our research, it’s obvious that AI can be a powerful tool for conducting research and news gathering, but with a few caveats. First, not every AI tool is suitable for every type of research.

For instance, if you want your information to reflect the most recent developments, some AI search engines, like Mistral AI and Claude, are not a good fit for the simple reason that they don’t offer real-time information (Claude only has information up to April 2024, while Mistral AI was last updated in October 2023). This can lead to missing out on key information and developments in the topic you’re researching.

Other AI search engines may offer up-to-date responses but often lack sources or references to prove the accuracy of the information they provide. ChatGPT and Google Gemini are good examples of this. None of ChatGPT’s responses contained any sources or references, while only 18% of Google Gemini’s responses did.

Looking up every piece of information manually to see how accurate it is and where it came from can be extremely time-consuming, so these AI tools are also not ideal for performing research, especially academic research.

Even Perplexity AI, Bing, and SearchGPT, which performed best in our research, fell short in some areas. While offering up-to-date, accurate, and objective information from multiple credible sources, they often failed to provide multiple perspectives or opposing viewpoints in their responses. Bing, for instance, earned an overall “fair” score (2.1) in this category.

In addition, the credibility of their sources was questionable at times, with sources like Wikipedia clearly lacking academic rigor. For example, 80% of Perplexity AI’s answers cited some lesser-known or unreliable sources. That is also the case with 64% of SearchGPT’s answers.

Despite these shortcomings, the latest research on the topic shows that researchers are increasingly using AI. For instance, a 2024 Oxford University Press (OUP) survey of 2,345 respondents found that over three-quarters of researchers use some form of AI tool in their research.

AI-powered research tools or search engines like the ones included in our study were the third most popular AI tools, used by up to 25% of respondents, right behind machine translation tools (49%) and chatbot tools (43%). The survey found that AI tools were used across all stages of research and were most useful for discovering, editing, and summarizing existing research.

While our research found that most AI search engines are far from ideal for conducting research, it’s encouraging that some companies are making an effort to provide accurate information. A good example of this is Infactory, a fact-checking AI startup that specializes in structuring and refining data for AI applications, aiming to ensure accurate and high-quality results.

The company, founded by two former Humane executives, has already raised $4 million in seed funding. It is targeted toward developers who build AI apps, offering features such as data mapping, source tracking, and auditable answer trails. This could help build trust in AI outputs, vastly improving their reliability in the near future.

Conclusion

AI presents many advantages and opportunities that can considerably improve the quality of research. It can help researchers save valuable time by summarizing lengthy content, highlighting important information, and providing a large volume of data in a relatively short time span.

That said, AI tools should still be used with caution, as they could also relay fake news, biased content, and misinformation. The human element is an invaluable part of the research process, so it is essential that researchers fact-check and critically evaluate the outputs of AI, applying their expertise to discern the accuracy, sensitivity, and relevance of the information provided.