
Wednesday, August 21, 2013

Facebook Likes and Human Behavior

Earlier in 2013, the Psychometrics Centre of the University of Cambridge published an impressive study of how private attributes can be predicted from digital records of human behavior, such as Facebook Likes. Digitally mediated behaviors like Facebook Likes can easily be recorded and analyzed, fueling the emergence of computational social science and new services such as personalized search engines and targeted online marketing. However, the widespread availability of extensive records of individual behavior, together with the desire to learn more about customers and citizens, presents serious challenges related to privacy and data ownership.

The researchers (Kosinski, Stillwell & Graepel) showed that easily accessible digital records of behavior, Facebook Likes, can be used to automatically and accurately predict a range of highly sensitive personal attributes, including sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parental separation, age, and gender. The analysis is based on a dataset of over 58,000 volunteers who provided their Facebook Likes, detailed demographic profiles, and the results of several psychometric tests. The proposed model uses dimensionality reduction to preprocess the Likes data, which are then entered into linear or logistic regression to predict individual psychodemographic profiles. The model correctly discriminates between homosexual and heterosexual men in 88% of cases, between African Americans and Caucasian Americans in 95% of cases, and between Democrats and Republicans in 85% of cases.


Predicting individual traits and attributes based on various cues, such as samples of written text, answers to a psychometric test, or the appearance of the spaces people inhabit, has a long history. The migration of human activity to digital environments makes it possible to base such predictions on digital records of behavior. It has been shown that age, gender, occupation, education level, and even personality can be predicted from people’s Web site browsing logs. Similarly, it has been shown that personality can be predicted from the contents of personal Web sites, music collections, properties of Facebook or Twitter profiles such as the number of friends or the density of friendship networks, or the language used by their users. Furthermore, location within a Facebook friendship network was shown to be predictive of sexual orientation.


The research 



The Psychometrics Centre, University of Cambridge

The study was based on a sample of 58,466 volunteers from the United States, obtained through the myPersonality Facebook application, which included their Facebook profile information, a list of their Likes (n = 170 Likes per person on average), psychometric test scores, and survey information. Users and their Likes were represented as a sparse user–Like matrix, the entries of which were set to 1 if there existed an association between a user and a Like and 0 otherwise. The dimensionality of the user–Like matrix was reduced using singular-value decomposition (SVD).
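The matrix construction and the SVD step described above can be illustrated with a minimal sketch. The user names, the toy Likes, and the helper svd_components are hypothetical and chosen only for illustration; the study's own code is not shown in this post. A dense NumPy array is used here for brevity, whereas a dataset of the size described would presumably need a sparse representation and a truncated solver.

```python
import numpy as np

# Hypothetical toy data: each user maps to the set of pages they Liked.
user_likes = {
    "user_a": {"Science", "Thunderstorms"},
    "user_b": {"Science", "The Colbert Report"},
    "user_c": {"Harley Davidson", "Sephora"},
    "user_d": {"Sephora", "Lady Antebellum"},
}

users = sorted(user_likes)
likes = sorted(set().union(*user_likes.values()))
like_index = {name: j for j, name in enumerate(likes)}

# Binary user-Like matrix: 1 if the user Liked the page, 0 otherwise.
matrix = np.zeros((len(users), len(likes)))
for i, user in enumerate(users):
    for name in user_likes[user]:
        matrix[i, like_index[name]] = 1.0


def svd_components(m, k):
    """Return each row's coordinates along the top-k singular directions."""
    u, s, _vt = np.linalg.svd(m, full_matrices=False)
    return u[:, :k] * s[:k]


# Each user is now described by k numbers instead of one column per Like.
components = svd_components(matrix, k=2)
print(components.shape)  # (4, 2)
```

With k = 100 (or k = 30 for the smaller subsamples), the same idea compresses a very wide matrix of Likes into a compact set of features per user.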

Numeric variables such as age or intelligence were predicted using a linear regression model, whereas dichotomous variables such as gender or sexual orientation were predicted using logistic regression. In both cases, the researchers applied 10-fold cross-validation and used the k = 100 top SVD components. For sexual orientation, parents’ relationship status, and drug consumption, only the k = 30 top SVD components were used because of the smaller number of users for whom this information was available.
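To make the cross-validation step concrete, here is a rough sketch of 10-fold cross-validated linear regression. It runs on synthetic data: the array components stands in for SVD components and age for a numeric trait, and both are randomly generated assumptions rather than the study's data. The function name cross_validated_predictions is likewise made up for this example, and only the linear case is covered, not the logistic one.

```python
import numpy as np


def cross_validated_predictions(features, target, n_folds=10, seed=0):
    """Predict each row's target with a linear model fitted on the other folds."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(target))
    folds = np.array_split(order, n_folds)
    predictions = np.empty(len(target))
    # Prepend a column of ones so the model has an intercept term.
    design = np.column_stack([np.ones(len(target)), features])
    for test_idx in folds:
        train_mask = np.ones(len(target), dtype=bool)
        train_mask[test_idx] = False
        # Ordinary least squares on the training folds only.
        coef, *_ = np.linalg.lstsq(
            design[train_mask], target[train_mask], rcond=None
        )
        predictions[test_idx] = design[test_idx] @ coef
    return predictions


# Synthetic stand-in data (an assumption, not the study's dataset).
rng = np.random.default_rng(42)
components = rng.normal(size=(500, 5))
true_weights = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
age = 30 + components @ true_weights + rng.normal(scale=1.0, size=500)

predicted = cross_validated_predictions(components, age)
r = np.corrcoef(predicted, age)[0, 1]
print(f"out-of-fold Pearson r = {r:.2f}")
```

Because every prediction comes from a model that never saw that user during fitting, the resulting correlation is an estimate of how well the model would do on new users.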

Prediction of Dichotomous Variables

The Psychometrics Centre, University of Cambridge
The figure above shows the prediction accuracy of dichotomous variables expressed in terms of the area under the receiver-operating characteristic curve (AUC), which is equivalent to the probability of correctly classifying two randomly selected users, one from each class (e.g., male and female). The highest accuracy was achieved for ethnic origin and gender. African Americans and Caucasian Americans were correctly classified in 95% of cases, and males and females were correctly classified in 93% of cases, suggesting that patterns of online behavior as expressed by Likes differ significantly between those groups, allowing for nearly perfect classification. Christians and Muslims were correctly classified in 82% of cases, and similar results were achieved for Democrats and Republicans (85%). Sexual orientation was easier to distinguish among males (88%) than females (75%), which may suggest a wider behavioral divide (as observed from online behavior) between heterosexual and homosexual males.
Good prediction accuracy was achieved for relationship status and substance use (between 65% and 73%). The relatively lower accuracy for relationship status may be explained by its temporal variability compared with other dichotomous variables (e.g., gender or sexual orientation).
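The pairwise reading of AUC given above can be checked directly with a small sketch. The scores below are hypothetical model outputs invented for illustration, and pairwise_auc is a name made up for this example; ties are counted as half a correct pair, which is the usual convention.

```python
from itertools import product


def pairwise_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for pos, neg in product(positive_scores, negative_scores):
        if pos > neg:
            wins += 1.0
        elif pos == neg:
            wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical model scores for users drawn from two classes.
scores_class_a = [0.9, 0.8, 0.4]
scores_class_b = [0.7, 0.3, 0.2]

print(pairwise_auc(scores_class_a, scores_class_b))
```

Here 8 of the 9 possible pairs are ranked correctly, so the AUC is 8/9, or about 0.89. A model that scores users at random would get about half the pairs right, which is why 0.50 is the guessing baseline.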

Predictive Power of Likes


Individual traits and attributes can be predicted to a high degree of accuracy based on records of users’ Likes. The best predictors of high intelligence include “Thunderstorms,” “The Colbert Report,” “Science,” and “Curly Fries,” whereas low intelligence was indicated by “Sephora,” “I Love Being A Mom,” “Harley Davidson,” and “Lady Antebellum.” Good predictors of male homosexuality included “No H8 Campaign,” “Mac Cosmetics,” and “Wicked The Musical,” whereas strong predictors of male heterosexuality included “Wu-Tang Clan,” “Shaq,” and “Being Confused After Waking Up From Naps.” 



Accuracy of selected predictions as a function of the number of available Likes. Accuracy is expressed as AUC (gender) and Pearson’s correlation coefficient (age and openness). About 50% of users in this sample had at least 100 Likes and about 20% had at least 250 Likes. Note that for gender (a dichotomous variable) the random guessing baseline corresponds to an AUC of 0.50. The Psychometrics Centre, University of Cambridge.


Moreover, note that few users were associated with Likes explicitly revealing their attributes. For example, fewer than 5% of users labeled as gay were connected with explicitly gay groups, such as “No H8 Campaign,” “Being Gay,” “Gay Marriage,” “I love Being Gay,” and “We Didn’t Choose To Be Gay We Were Chosen.” Consequently, predictions rely on less informative but more popular Likes, such as “Britney Spears” or “Desperate Housewives” (both moderately indicative of being gay).

Conclusions


The similarity between Facebook Likes and other widespread kinds of digital records, such as browsing histories, search queries, or purchase histories, suggests that the potential to reveal users’ attributes is unlikely to be limited to Likes. Moreover, the wide variety of attributes predicted in this study indicates that, given appropriate training data, it may be possible to reveal other attributes as well.

Predicting users’ individual attributes and preferences can be used to improve numerous products and services. For instance, digital systems and devices (such as online stores or cars) could be designed to adjust their behavior to best fit each user’s inferred profile. Also, the relevance of marketing and product recommendations could be improved by adding psychological dimensions to current user models. For example, online insurance advertisements might emphasize security when facing emotionally unstable (neurotic) users but stress potential threats when dealing with emotionally stable ones. 



Moreover, digital records of behavior may provide a convenient and reliable way to measure psychological traits. Automated assessment based on large samples of behavior may not only be more accurate and less prone to cheating and misrepresentation but may also permit assessment across time to detect trends. In addition, inference based on observations of digitally recorded behavior may open new doors for research in human psychology.


On the other hand, the predictability of individual attributes from digital records of behavior may have considerable negative implications, because it can easily be applied to large numbers of people without obtaining their individual consent and without them noticing. Commercial companies, governmental institutions, or even one’s Facebook friends could use software to infer attributes such as intelligence, sexual orientation, or political views that an individual may not have intended to share. One can imagine situations in which such predictions, even if incorrect, could pose a threat to an individual’s well-being, freedom, or even life. Importantly, given the ever-increasing amount of digital traces people leave behind, it becomes difficult for individuals to control which of their attributes are being revealed.


There is a risk that the growing awareness of digital exposure may negatively affect people’s experience of digital technologies, decrease their trust in online services, or even completely deter them from using digital technology. It is our hope, however, that the trust and goodwill among parties interacting in the digital environment can be maintained by providing users with transparency and control over their information, leading to an individually controlled balance between the promises and perils of the Digital Age.