Our study draws on a novel dataset that links the Twitter profiles of 99,274 scholars worldwide to their academic records, covering 2016 to 2022. This combination of social media and academic data offers an in-depth look at how academics express political and behavioral stances online.
Using scalable large language model (LLM)-based classification techniques, we label the content along two dimensions: the substantive content (the "what") and the language and tone of expression (the "how"). Our analysis covers topics such as the climate crisis, economic policy, and cultural issues, as well as the tone of the language used.
We began with a dataset of 300,000 academics on Twitter, matched to their OpenAlex profiles; OpenAlex provides extensive metadata on publications, citations, affiliations, co-authors, and fields of study. Using the Twitter API, we collected the complete available timelines of these academics, including original tweets, retweets, quote tweets, and replies.
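The paper does not specify its collection tooling; the sketch below shows one plausible way to pull a user's full available timeline with the Tweepy v2 client. The bearer token is a placeholder.

```python
import tweepy

# Hypothetical credentials; the exact collection pipeline is not specified.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN", wait_on_rate_limit=True)

def fetch_timeline(user_id: str) -> list[dict]:
    """Collect a user's available timeline (tweets, retweets, quote tweets,
    and replies) via the v2 user-tweets endpoint, paging 100 at a time."""
    tweets = []
    for page in tweepy.Paginator(
        client.get_users_tweets,
        id=user_id,
        max_results=100,
        tweet_fields=["created_at", "referenced_tweets", "text"],
    ):
        if page.data:
            tweets.extend(t.data for t in page.data)
    return tweets
```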
To identify relevant topics in tweets, we employed GPT-4 to create keyword dictionaries for topics such as climate action, immigration, and socio-economic policy. This dynamic approach helps the dictionaries track evolving discourse. Filtering the dataset to tweets matching these dictionaries reduced the corpus to around 22 million tweets.
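As an illustration of this step, the sketch below uses the OpenAI Chat Completions API to generate a dictionary and a simple token-overlap filter to apply it; the prompt wording and the matching rule are our assumptions, not the authors' exact pipeline.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def build_keyword_dictionary(topic: str) -> set[str]:
    """Ask GPT-4 for topic keywords; the prompt is illustrative only."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"List 50 keywords and hashtags commonly used in tweets "
                f"about {topic}, one per line, no commentary."
            ),
        }],
    )
    lines = resp.choices[0].message.content.splitlines()
    return {line.strip().lower() for line in lines if line.strip()}

def mentions_topic(tweet_text: str, keywords: set[str]) -> bool:
    """Crude filter: keep a tweet if any token matches the dictionary."""
    return bool(set(tweet_text.lower().split()) & keywords)
```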
Using GPT-3.5 Turbo, we classified the stance of each tweet toward its topic as pro, anti, neutral, or unrelated. This step relied on a carefully crafted prompt that we refined iteratively. Validation against human-labeled datasets yielded F1 scores of roughly 84-92%, confirming the robustness of our stance detection.
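A minimal sketch of such a stance call, assuming a zero-temperature prompt; the production prompt was iteratively refined and is only paraphrased here. Validation can then use the standard scikit-learn F1 metric against the human labels.

```python
from openai import OpenAI
from sklearn.metrics import f1_score

client = OpenAI()
LABELS = ("pro", "anti", "neutral", "unrelated")

def classify_stance(tweet: str, topic: str) -> str:
    """Single-tweet stance classification; prompt wording is an assumption."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                f"Classify the stance of this tweet toward {topic}. "
                f"Answer with exactly one word from {LABELS}.\n\nTweet: {tweet}"
            ),
        }],
    )
    answer = resp.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else "unrelated"

# Validation against a human-labeled sample (variable names hypothetical):
# f1 = f1_score(human_labels, model_labels, average="weighted")
```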
For climate action tweets, we further classified narratives into techno-optimism, behavioral adjustment, both, or neither. This nuanced classification helps us understand the specific perspectives within broader stances on climate action.
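This second pass can be expressed as a classification call parallel to the stance step; again, the prompt below is an assumption rather than the authors' exact wording.

```python
from openai import OpenAI

client = OpenAI()
NARRATIVES = ("techno-optimism", "behavioral adjustment", "both", "neither")

def classify_climate_narrative(tweet: str) -> str:
    """Second-pass label for climate-action tweets (illustrative prompt)."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Does this tweet frame climate action mainly as "
                "'techno-optimism' (technology will solve it), "
                "'behavioral adjustment' (lifestyles must change), "
                "'both', or 'neither'? Answer with one of those labels.\n\n"
                f"Tweet: {tweet}"
            ),
        }],
    )
    answer = resp.choices[0].message.content.strip().lower()
    return answer if answer in NARRATIVES else "neither"
```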
We measured egocentrism by counting self-referential terms in tweets, a straightforward but effective proxy. Toxicity was assessed with Google's Perspective API, which returns a probability-like score for various toxic attributes. Emotionality versus reasoning was analyzed with the Linguistic Inquiry and Word Count (LIWC) dictionary, which categorizes words into cognitive and affective terms.
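Both measures are easy to reproduce in outline. The sketch below pairs a simple self-referential token ratio (the exact term list and normalization are assumptions) with the documented Perspective API TOXICITY request.

```python
from googleapiclient import discovery

# Hypothetical API key; Perspective access requires registration.
perspective = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey="YOUR_PERSPECTIVE_KEY",
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

def toxicity(text: str) -> float:
    """Probability-like TOXICITY score from the Perspective API."""
    body = {"comment": {"text": text}, "requestedAttributes": {"TOXICITY": {}}}
    resp = perspective.comments().analyze(body=body).execute()
    return resp["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

SELF_TERMS = {"i", "me", "my", "mine", "myself"}  # illustrative term list

def egocentrism(text: str) -> float:
    """Share of tokens that are self-referential (term list is an assumption)."""
    tokens = text.lower().split()
    return sum(t.strip(".,!?") in SELF_TERMS for t in tokens) / max(len(tokens), 1)
```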
We inferred gender by classifying author names with an LLM. Locations and institutional affiliations were derived from OpenAlex data, and we categorized institutions by their QS World University Rankings position. Fields of study were assigned from the hierarchical concept classification in OpenAlex, allowing us to group academics into broad categories such as Humanities and STEM.
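For the field grouping, a rough sketch against the public OpenAlex API might look as follows; the mapping from level-0 concepts to broad groups is illustrative, not the paper's exact scheme.

```python
import requests

# Illustrative level-0 concept buckets; the paper's grouping may differ.
STEM = {"Physics", "Chemistry", "Biology", "Mathematics", "Computer science",
        "Engineering", "Medicine", "Materials science", "Geology",
        "Environmental science"}

def broad_field(openalex_author_id: str) -> str:
    """Map an author's top-ranked level-0 OpenAlex concept to a broad group."""
    author = requests.get(
        f"https://api.openalex.org/authors/{openalex_author_id}"
    ).json()
    top = [c["display_name"] for c in author.get("x_concepts", [])
           if c["level"] == 0]
    if not top:
        return "Unknown"
    return "STEM" if top[0] in STEM else "Humanities/Social"
```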
To ensure a consistently active sample, we included only academics who tweeted at least once in both the first and the last six months of our study period. The same criterion was applied to the general US Twitter population sample of 61,259 users.
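A pandas sketch of this activity filter, with hypothetical column names:

```python
import pandas as pd

def active_throughout(tweets: pd.DataFrame,
                      start="2016-01-01", end="2022-12-31") -> pd.Index:
    """Keep users with at least one tweet in both the first and the last
    six months of the study window (column names are assumptions)."""
    ts = pd.to_datetime(tweets["created_at"])
    first6 = tweets.loc[ts < pd.Timestamp(start) + pd.DateOffset(months=6),
                        "user_id"]
    last6 = tweets.loc[ts >= pd.Timestamp(end) - pd.DateOffset(months=6),
                       "user_id"]
    return pd.Index(set(first6) & set(last6))
```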
This methodological framework enabled us to conduct a detailed analysis of both the content and style of academic expression on social media. Our findings provide insights into the political stances, narrative structures, and behavioral traits of academics online, highlighting significant trends and disparities.