Our devices track our behavior in ways that create vast amounts of information that can be useful for understanding human behavior. Not surprisingly, the widespread availability of these data parallels an increasing interest in data science and related topics (see Figure 1). Social scientists are increasingly using “big data” to examine theoretically grounded research questions. Yet few social scientists, especially psychologists, have the skills and training experiences needed to engage with this rapidly growing area of expertise called “data science.”
Ideally, data science skills should be linked with traditional forms of scientific inquiry, including social science theory, research methods, and statistics. Below, we briefly describe this exciting new area of social science inquiry, covering big data acquisition, processing, and visualization, as well as two entry-level analytic techniques.
Computational social science (CSS) lies at the intersection of social science theory, traditional statistical and research methods, and computer science (see Figure 2) and is rapidly gaining traction in psychological science (e.g., Eiler et al., 2018; Jones et al., 2017; Ritter et al., 2014; Tamburrini et al., 2015; Ueda et al., 2017; Youyou, Kosinski, & Stillwell, 2015). Students with CSS skills can use data to examine theoretically sound research questions in virtually any area that interests them. These skills also improve their post-baccalaureate opportunities.
Figure 1. Google Search Trends
Figure 2. Primary (gray) and secondary (white) learning goals.
Unfortunately, most psychology curricula do not incorporate data science skills, and the curriculum at our university is no exception. As a workaround, during the 2017-18 academic year we designed a year-long research lab with a set of learning goals, objectives, and activities to introduce, practice, and apply basic CSS skills.
Unlike a typical classroom setting, the lab offered a collaborative, experiential learning atmosphere that employed several pedagogical strategies, including self-directed learning, peer mentoring, guest lecturing, in-lab presentations, and dissemination at scientific conferences. Students worked collaboratively with peers outside of the lab and relied on the guidance and leadership of more senior mentors on all projects. They also independently sought out and used other online resources and shared them with the group. Social media data (e.g., Twitter, Reddit) were selected because of their broad applicability across interests, their textual nature, and their amenability to network analysis.
Below, we briefly discuss the CSS skills we identified as important and include links, reference articles and example papers that can be used by anyone entering this arena.
While there are many programming languages, R seems to offer the shortest learning curve relative to its capability. It is free and open-source, which means resources abound. R is well suited to data manipulation, visualization, and analysis. Moreover, annotating code, which is easy to do within the R environment, enhances transparency and replicability. Good starter learning resources can be found at the R Project for Statistical Computing website, Cookbook for R, and R for Data Science.
Data Acquisition and Ethics
A wide range of tools is available for ethically gathering organic online and social media data. Most websites provide documentation (e.g., a robots.txt file or licensing documentation) or an application programming interface (API) that specifies how one can interact with the data. A variety of R packages that facilitate data acquisition are available at the Comprehensive R Archive Network (CRAN) or through RStudio directly.
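To illustrate what consulting a robots.txt file involves, the following base R sketch checks whether a path falls under any Disallow rule. The rules, the paths, and the helper name is_path_allowed are hypothetical, and the check is deliberately simplified: it ignores user-agent sections, Allow overrides, and wildcards. For real projects, a dedicated package such as robotstxt on CRAN is likely a better choice.

```r
# Hypothetical robots.txt contents; in practice these lines would be read
# from the site's own robots.txt file (e.g., with readLines()).
robots_lines <- c(
  "User-agent: *",
  "Disallow: /private/",
  "Disallow: /search"
)

# Simplified check: TRUE if `path` does not start with any Disallow prefix.
is_path_allowed <- function(path, robots_lines) {
  disallow <- grep("^Disallow:", robots_lines, value = TRUE, ignore.case = TRUE)
  prefixes <- trimws(sub("^Disallow:", "", disallow, ignore.case = TRUE))
  prefixes <- prefixes[nzchar(prefixes)]  # an empty Disallow blocks nothing
  !any(startsWith(path, prefixes))
}

is_path_allowed("/posts/2018/", robots_lines)       # TRUE
is_path_allowed("/private/data.csv", robots_lines)  # FALSE
```

A robots.txt file only states what a site permits automated tools to access; terms of service and licensing documentation still need to be read separately.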
Data Visualization and Manipulation
Data visualization skills are relevant for all stages of research, from initially exploring large amounts of data to displaying results. While Excel can be useful for simple visualizations, the capabilities of R far exceed those of Excel. A good starting point is the R package tidyverse, a collection of data science tools that share an underlying grammatical structure.
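To give a flavor of that shared grammar, the sketch below summarizes a small data set with dplyr and plots the result with ggplot2, both of which are part of the tidyverse. The data frame posts, its columns, and its values are invented for illustration only.

```r
library(dplyr)
library(ggplot2)

# Invented example data: word counts for six posts from two platforms.
posts <- data.frame(
  platform = c("Twitter", "Twitter", "Reddit", "Reddit", "Twitter", "Reddit"),
  n_words  = c(18, 25, 140, 95, 22, 210)
)

# Manipulation: one row per platform with the number of posts and mean length.
platform_summary <- posts %>%
  group_by(platform) %>%
  summarise(n_posts = n(), mean_words = mean(n_words))

# Visualization: build the plot in layers, starting from the summary table.
p <- ggplot(platform_summary, aes(x = platform, y = mean_words)) +
  geom_col() +
  labs(x = "Platform", y = "Mean words per post")

print(p)
```

The manipulation step chains verbs with the pipe (%>%) and the plotting step adds layers with +, so both read as a sequence of small, named operations.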
Text and Network Analysis
Linguistic analysis software (LAS) identifies language indicators including themes, emotion, sentiment, evaluation, and structure, among others. Try the Linguistic Inquiry and Word Count (LIWC; Tausczik & Pennebaker, 2010), the Sentiment Analysis and Social Cognition Engine (SEANCE; Crossley, Kyle, & McNamara, 2017), and the Evaluative Lexicon (Rocklage, Rucker, & Nordgren, 2018). Network analysis is a natural partner to LAS in that correlations between variables derived from language analysis can be used to form the network (e.g., Eiler et al., 2018). Minimally priced point-and-click programs are available (UCINET, Borgatti, Everett, & Freeman, 2002; NetDraw, Borgatti, 2002) in addition to the more advanced analytics available in R (e.g., statnet; Handcock et al., 2016). For an introduction, see Hanneman and Riddle’s (2005) free online book.
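The link between language variables and networks can be sketched in a few lines of base R. The example below is a toy version of dictionary-based word counting followed by a correlation network. It is not the procedure of any tool or study named above, and the texts, the lexicon, the helper score_text, and the 0.3 threshold are all invented for illustration.

```r
# Invented mini-corpus and lexicon, for illustration only.
texts <- c(
  "I love my friend and we had a great day",
  "So sad and angry about the news",
  "Happy to see family again, we missed you",
  "I hate waiting alone",
  "Great game with a friend"
)
lexicon <- list(
  positive = c("happy", "love", "great"),
  negative = c("sad", "angry", "hate"),
  social   = c("friend", "family", "we")
)

# Proportion of a text's words that fall in each lexicon category.
score_text <- function(text, lexicon) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nzchar(words)]
  sapply(lexicon, function(terms) sum(words %in% terms) / length(words))
}

# Rows = texts, columns = language variables.
scores <- t(sapply(texts, score_text, lexicon = lexicon, USE.NAMES = FALSE))

# Correlations between language variables define the network: two variables
# are linked when |r| meets an arbitrary threshold (0.3 here).
cor_mat <- cor(scores)
adjacency <- abs(cor_mat) >= 0.3
diag(adjacency) <- FALSE

# Edge list with each linked pair listed once.
idx <- which(adjacency & upper.tri(adjacency), arr.ind = TRUE)
edges <- data.frame(
  from = rownames(cor_mat)[idx[, "row"]],
  to   = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
edges
```

With real data, the scores matrix would come from LAS output, and the resulting adjacency matrix or edge list would be passed to network software for visualization and analysis.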
For interested readers, we have described the CSS lab experience course in detail, complete with an example syllabus and detailed explanations of learning goals, lab activities and required exercises. Please contact us for a pre-print. We welcome the opportunity to interact with others who want guidance on how to get started.
Below are some helpful references, organized by topic.
General Articles on CSS and Big Data
Chen, E. E., & Wojcik, S. P. (2016). A practical guide to big data research in psychology. Psychological Methods, 21(4), 458–474. doi:10.1037/met0000111
Eiler, B. A., Doyle, P. C., Al-Kire, R. L., & Wayment, H. A. (2018). Teaching introductory computational psychology skills to undergraduate students in a research experience setting. Manuscript submitted for publication.
Grolemund, G. & Wickham, H. (2017). R for data science. Retrieved from: http://r4ds.had.co.nz/
Gorakala, S. K. (October, 2013). Fetch Twitter data using R. Retrieved from https://www.r-bloggers.com/fetch-twitter-data-using-r/
Hargittai, E. (2015). Is bigger always better? Potential biases of big data derived from social network sites. The Annals of the American Academy of Political & Social Science, 659, 63-76.
Kim, A. E., Hansen, H. M., Murphy, J., Richards, A. K., Duke, J., & Allen, J. A. (2013). Methodological considerations in analyzing Twitter data. Journal of the National Cancer Institute Monographs, 2013(47), 140-146. doi:10.1093/jncimonographs/lgt026
Kosinski, M., Wang, Y., Lakkaraju, H., & Leskovec, J. (2016). Mining big data to extract patterns and predict real-life outcomes. Psychological Methods, 21(4), 493-506. doi:10.1037/met0000105
Landers, R. N., Brusso, R., Cavanaugh, K., & Collmus, A. B. (2016). A primer on theory-driven web scraping: Automatic extraction of big data from the internet for use in psychological research. Psychological Methods, 21(4), 475-492. doi:10.1037/met0000081
Psychological Studies using CSS Skills
Cavolo, K. M., Eiler, B. A., & Wayment, H. A. (under review). Novel examination of self-evaluation processes through data mining and textual analysis. Journal of Language and Social Psychology.
Eiler, B. A., Al-Kire, R. L., Doyle, P. C., & Wayment, H. A. (in press). Power and trust dynamics of sexual violence: A textual analysis of Nassar victim impact statements and #MeToo disclosures on Twitter. Journal of Clinical Sport Psychology.
Gillath, O., Karantzas, G. C., & Selcuk, E. (2017). A net of friends: Investigating friendship by integrating attachment theory and social network analysis. Personality and Social Psychology Bulletin, 43(11), 1546-1565. doi:10.1177/0146167217719731
Jones, N. M., Thompson, R. R., Dunkel Schetter, C., & Silver, R. C. (2017). Distress and rumor exposure on social media during a campus lockdown. Proceedings of the National Academy of Sciences of the USA, 114, 11663-11668. doi:10.1073/pnas.1708518114
Jones, N., Wojcik, S. P., Sweeting, J., & Silver, R. C. (2016). Tweeting negative emotion: An investigation of Twitter data in the aftermath of violence on college campuses. Psychological Methods, 21(4), 526-541. doi:10.1037/met0000099
Ritter, R. S., Preston, J. L., & Hernandez, I. (2014). Happy tweets: Christians are happier, more socially connected, and less analytical than atheists on Twitter. Social Psychological and Personality Science, 5(2), 243-249. doi:10.1177/1948550613492345
Tamburrini, N., Cinnirella, M., Jansen, V. A., & Bryden, J. (2015). Twitter users change word usage according to conversation-partner social identity. Social Networks, 40, 84-89. doi:10.1016/j.socnet.2014.07.004
Ueda, M., Mori, K., Matsubayashi, T., & Sawada, Y. (2017). Tweeting celebrity suicides: Users’ reaction to prominent suicide deaths on Twitter and subsequent increases in actual suicide. Social Science & Medicine, 189, 158-166. doi:10.1016/j.socscimed.2017.06.032
Youyou, W., Kosinski, M., & Stillwell, D. (2015). Computer-based personality judgments are more accurate than those made by humans. Proceedings of the National Academy of Sciences of the United States of America, 112(4), 1036-1040. doi:10.1073/pnas.1418680112
Text Analysis
Crossley, S. A., Kyle, K., & McNamara, D. S. (2017). Sentiment Analysis and Social Cognition Engine (SEANCE): An automatic tool for sentiment, social cognition, and social-order analysis. Behavior Research Methods, 49(3), 803–821. doi:10.3758/s13428-016-0743-z
Gefen, D., Endicott, J. E., Fresneda, J. E., Miller, J., & Larsen, K. R. (2017). A guide to text analysis with latent semantic analysis in R with annotated code: Studying online reviews and the Stack Exchange community. Communications of the Association for Information Systems, 41(1), 450–496. doi:10.17705/1CAIS.04121
Pennebaker, J. W., Chung, C. K., Ireland, M., Gonzales, A., & Booth, R. J. (2007). The development and psychometric properties of LIWC2007. Austin, TX: University of Texas at Austin.
Rocklage, M. D. & Fazio, R. H. (2015). The evaluative lexicon: Adjective use as a means of assessing and distinguishing attitude valence, extremity, and emotionality. Journal of Experimental Social Psychology, 56, 214-227. doi:10.1016/j.jesp.2014.10.005
Rocklage, M. D., Rucker, D. D., & Nordgren, L. F. (2018). The evaluative lexicon 2.0: The measurement of emotionality, extremity, and valence in language. Behavior Research Methods, 50, 1327-1344. doi:10.3758/s13428-017-0975-6
Network Analysis
Borgatti, S. P. (2002). NetDraw software for network visualization. Lexington, KY: Analytic Technologies.
Borgatti, S. P., Everett, M. G., & Freeman, L. C. (2002). Ucinet for Windows: Software for social network analysis. Harvard, MA: Analytic Technologies.
Borgatti, S. P., Mehra, A., Brass, D. J., & Labianca, G. (2009). Network analysis in the social sciences. Science, 323(5916), 892–895. doi:10.1126/science.1165821
Butts, C. T. (2008). Social network analysis: A methodological introduction. Asian Journal of Social Psychology, 11(1), 13–41. doi:10.1111/j.1467-839X.2007.00241.x
Handcock, M. S., Hunter, D. R., Butts, C. T., Goodreau, S. M., & Morris, M. (2003). statnet: Software tools for the statistical modeling of network data. http://statnetproject.org
Hanneman, R. A. & Riddle, M. (2005). Introduction to social network methods. Riverside, CA: University of California, Riverside. Retrieved from: http://faculty.ucr.edu/~hanneman/
Montazeri, F., de Bildt, A., Dekker, V., & Anderson, G. M. (2018). Network analysis of anxiety in the autism realm. Journal of Autism and Developmental Disorders, 1–12. doi:10.1007/s10803-018-3474-4