A new chatbot can outperform PhD students and postdoctoral researchers at conducting scientific literature reviews, according to a study published in Nature. The large language model (LLM) behind it generates dependable summaries for less than a penny each.
The research, which aimed to improve accuracy and reduce the frequent "hallucinations" typical of models like ChatGPT, enlisted experts from computer science, physics, neuroscience, and biomedicine to evaluate literature summaries produced by two variants of an advanced model, OpenScholar, using the accompanying ScholarQABench evaluation benchmark. These were compared against reviews written by PhD candidates.
The findings, released on February 4, reveal that domain specialists, themselves PhD holders or postdoctoral fellows, preferred the outputs of the two OpenScholar variants 51% and 70% of the time, respectively. The preference is largely down to the models' broader and deeper coverage: their reviews averaged 1,447 and 706 words, well above the 424-word average of the human-written summaries.
By contrast, summaries generated by ChatGPT were favored in only 31% of cases, primarily because they lacked comprehensive coverage, according to the study, titled "Synthesizing Scientific Literature with Retrieval-Augmented Language Models."
One of OpenScholar's most significant advantages is its lack of hallucinations. Models such as GPT-4 and Llama produce false citations in up to 90% of cases when asked to reference recent literature across disciplines including computer science and biomedicine, whereas no hallucinations were detected in the reviews OpenScholar generated for those fields. Other LLMs often produce what appear to be credible reference lists yet fabricate titles in 78–98% of cases, with biomedicine worst affected. Even when citations correspond to real papers, the claims they support are frequently not validated by the associated abstracts, leading to a near-total failure in citation accuracy.
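The study's own evaluation pipeline is not described here, but the simplest form of the check is easy to illustrate. The sketch below, which is not taken from the paper, uses the public Semantic Scholar search API to test whether a generated citation title corresponds to a real indexed paper; the helper name and exact-match rule are assumptions for illustration.

```python
import requests

S2_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search"

def citation_exists(title: str) -> bool:
    """Return True if a paper with a matching title is indexed.

    Illustrative only: a thorough evaluation would also verify that the
    cited abstract supports the claim, not just that the title exists.
    """
    resp = requests.get(
        S2_SEARCH,
        params={"query": title, "fields": "title", "limit": 5},
        timeout=10,
    )
    resp.raise_for_status()
    hits = resp.json().get("data") or []
    # Naive match: exact title, case-insensitive. A production checker
    # would add fuzzy matching and author/year cross-checks.
    return any(hit["title"].strip().lower() == title.strip().lower() for hit in hits)

if __name__ == "__main__":
    print(citation_exists("Attention Is All You Need"))  # real paper, should print True
```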
Unlike many other large language models, which draw on vast swathes of internet data, OpenScholar pairs an 8-billion-parameter model with a carefully curated datastore of 45 million scientific articles, retrieving relevant passages rather than relying on memorized text. A "self-feedback loop" then refines each draft to improve factual accuracy, coverage, and citation integrity. Since its demonstration launch, OpenScholar has attracted more than 30,000 users and handled nearly 90,000 queries.
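OpenScholar's actual pipeline is more involved, but the control flow of a retrieval-augmented self-feedback loop can be sketched in a few lines. Everything below (`retrieve`, `generate`, `critique`) is a hypothetical stand-in rather than OpenScholar's API: the model drafts an answer from retrieved passages, critiques its own draft, fetches extra evidence for any gaps, and revises.

```python
from typing import Callable, List

def self_feedback_answer(
    question: str,
    retrieve: Callable[[str], List[str]],       # hypothetical: fetch passages from the datastore
    generate: Callable[[str, List[str]], str],  # hypothetical: draft a cited answer from passages
    critique: Callable[[str, str], List[str]],  # hypothetical: return follow-up queries, [] if satisfied
    max_rounds: int = 3,
) -> str:
    """Minimal retrieval-augmented loop with self-feedback (illustrative sketch)."""
    passages = retrieve(question)
    draft = generate(question, passages)
    for _ in range(max_rounds):
        gaps = critique(question, draft)        # e.g. missing coverage, unverifiable claims
        if not gaps:
            break
        for query in gaps:                      # gather extra evidence for each identified gap
            passages.extend(retrieve(query))
        draft = generate(question, passages)    # revise against the larger evidence set
    return draft
```

The design point the paper's framing suggests is that citations come from retrieved documents, so every reference in the final draft can be traced back to a real paper in the datastore.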
As for costs, the study indicates that a literature review with OpenScholar runs from 1 cent (0.7 pence) to 5 cents (3.5 pence) depending on the pricing structure, cheap enough for scholars to perform thousands of searches each month.
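At those prices, the monthly arithmetic is straightforward; the snippet below simply restates the article's quoted 1-5 cent range.

```python
# Back-of-envelope monthly cost at the quoted per-query prices.
LOW, HIGH = 0.01, 0.05  # USD per query, from the study's reported range

for queries_per_month in (100, 1_000, 5_000):
    low, high = queries_per_month * LOW, queries_per_month * HIGH
    print(f"{queries_per_month:>5} queries: ${low:,.2f} - ${high:,.2f} per month")
```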
The authors argue that these results, together with the substantial reduction in citation hallucinations, underscore OpenScholar's potential to ease and accelerate future research. While acknowledging that the system has limitations and cannot fully automate the synthesis of scientific literature, they are releasing both OpenScholar and the ScholarQABench benchmark to the academic community to encourage further exploration and improvement.
What do you think about the implications of AI models like OpenScholar in academia? Could they truly replace traditional methods of literature review, or do you believe there are still essential human elements in research that technology cannot replicate? Share your thoughts in the comments!