A design and policy proposal for improving the democratic quality of social media Marc Smith…
10.30am A categorical model for discovering latent structure in social annotations (Said Kashoob)
Given a collection of web objects, users and tags, can we model the underlying tag generation process?
-Discover implicit communities of interest?
-Categories of related tags?
-For a given category, identify the most relevant objects
Initial thoughts: content-based topic modeling (Latent Dirichlet Allocation, LSA). Recent work applying LDA models to tags (Wu 2006, Zhou 2008)
Modeling social annotations: the process that generates content is fundamentally different from the annotation process (many authors per “document” = tag collection, not aware of each other)
Community based categorical annotation model (CCA). Communities are groups forming around interests, etc. Each community has a number of categories as its world-view. For each object, a community draws tags from the appropriate underlying categories
Object annotations are generated by communities. Each community selects tags from its category set.
-use Gibbs sampling to recover a joint distribution of tags, categories and communities
-can do inference to find most likely tags per category, per community
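The generative story above can be sketched in a few lines. This is only an illustration of the sampling direction (community, then category, then tag), not the inference step. The uniform choice of community, the function name and the toy weights are all assumptions made for the example.

```python
import random

def sample_annotations(communities, categories, n_tags, rng):
    """Draw n_tags (community, category, tag) triples for one object.

    communities: community name -> {category: weight}  (its "world-view")
    categories:  category name  -> {tag: weight}
    Assumption: communities are picked uniformly at random.
    """
    triples = []
    names = list(communities)
    for _ in range(n_tags):
        community = rng.choice(names)
        cat_weights = communities[community]
        # The community draws one of its own categories...
        category = rng.choices(list(cat_weights), weights=list(cat_weights.values()))[0]
        tag_weights = categories[category]
        # ...and the category emits a tag.
        tag = rng.choices(list(tag_weights), weights=list(tag_weights.values()))[0]
        triples.append((community, category, tag))
    return triples

# Hypothetical toy data, invented for illustration only.
communities = {
    "puzzlers": {"games": 0.7, "math": 0.3},
    "teachers": {"math": 0.8, "education": 0.2},
}
categories = {
    "games": {"puzzle": 0.5, "cube": 0.5},
    "math": {"group-theory": 0.6, "algorithm": 0.4},
    "education": {"lesson": 1.0},
}
draws = sample_annotations(communities, categories, 20, random.Random(0))
```

Gibbs sampling would run this story in reverse: given only the observed tags, it resamples the hidden community and category of each tag until the assignments stabilize.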
Content-based topics vs. tag-based categories
Exploring content vs. annotation: for pairs of objects that are similar in category space, how topically similar are they? Result: objects with similar content do not necessarily have similar tags and vice versa
Rubik’s cube example: objects similar both in category and topic are solutions to the puzzle, objects that are only similar in category are puzzles / games, objects that are only similar in topic are math pages
11am Content-based summarization and categorization in the blogosphere (Ahmed Hassan)
How can we decide which blogs are more important / influential? Given a set of blogs related to a particular topic, find a subset of blog feeds to read that have continued interest in the topic.
Can we use hyperlink popularity based algorithms for speeches and blogs? Yes, but they might not work very well
Use textual similarity to link posts instead of hyperlinks: maybe blog A affects blog B? Given a set of blogs, build a graph where nodes represent posts/feeds and edges link posts/feeds with similar text
Use a pagerank-like measure to calculate importance score of a blog in the similarity network
How can we select nodes that are important but diverse? Add discounting factor based on similarity of node to neighbors
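A rough sketch of the three steps above (similarity graph, PageRank-like score, diversity discount). The talk did not give the exact formulas, so the bag-of-words cosine similarity, the edge threshold, the damping value and the greedy discount rule are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity_graph(texts, threshold=0.1):
    """Symmetric weight matrix; an edge exists when similarity >= threshold."""
    vecs = [Counter(t.lower().split()) for t in texts]
    n = len(vecs)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine(vecs[i], vecs[j])
            if s >= threshold:
                sim[i][j] = sim[j][i] = s
    return sim

def pagerank(sim, damping=0.85, iters=50):
    """Power iteration on the similarity-weighted graph."""
    n = len(sim)
    rank = [1.0 / n] * n
    row_sums = [sum(row) for row in sim]
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            if row_sums[i] == 0:
                # Isolated node: spread its score uniformly.
                for j in range(n):
                    new[j] += damping * rank[i] / n
            else:
                for j in range(n):
                    if sim[i][j]:
                        new[j] += damping * rank[i] * sim[i][j] / row_sums[i]
        rank = new
    return rank

def select_diverse(sim, rank, k):
    """Greedy pick of k nodes; neighbors of a picked node are discounted
    in proportion to their similarity to it (an assumed discount rule)."""
    scores = list(rank)
    chosen = []
    for _ in range(min(k, len(scores))):
        best = max((i for i in range(len(scores)) if i not in chosen),
                   key=lambda i: scores[i])
        chosen.append(best)
        for j in range(len(scores)):
            scores[j] *= 1.0 - sim[best][j]
    return chosen

# Hypothetical posts: two about an election, two about a camera.
texts = [
    "election poll results today",
    "election poll results tonight",
    "new camera lens review",
    "camera lens review and price",
]
sim = similarity_graph(texts)
rank = pagerank(sim)
chosen = select_diverse(sim, rank, 2)
```

With these toy posts the discount pushes the second pick into the other topic cluster, which is the behavior the diversity factor is meant to produce.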
dataset: TREC blog dataset
Evaluation: use linear threshold diffusion model! How many blogs covered (activated) by first k blogs in rank. Also split data by time to see how valid is rank(t) for predicting coverage at t+1. Approach also does a little better at precision-at-k on the TREC blog dataset
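The coverage measure can be sketched as follows: seed the diffusion with the top-k ranked blogs and count how many nodes end up activated. Fixed thresholds are used here for reproducibility; the linear threshold model usually draws them at random, and the graph, weights and names below are hypothetical.

```python
def linear_threshold_coverage(weights, thresholds, seeds):
    """Number of nodes activated when starting from `seeds`.

    weights[u][v]: influence of u on v (incoming weights per node should sum to <= 1)
    thresholds[v]: node v activates once the summed influence of its
                   active in-neighbors reaches this value
    """
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v in thresholds:
            if v in active:
                continue
            incoming = sum(out.get(v, 0.0) for u, out in weights.items() if u in active)
            if incoming >= thresholds[v]:
                active.add(v)
                changed = True
    return len(active)

# Hypothetical four-blog network.
weights = {"a": {"b": 0.6, "c": 0.3}, "b": {"c": 0.4}, "c": {}, "d": {"c": 0.1}}
thresholds = {"a": 0.5, "b": 0.5, "c": 0.6, "d": 0.5}

coverage_a = linear_threshold_coverage(weights, thresholds, ["a"])
coverage_d = linear_threshold_coverage(weights, thresholds, ["d"])
```

Here seeding with "a" reaches "b" directly and "c" through the combined influence of "a" and "b", while seeding with "d" activates nothing else, so a ranking that puts "a" first scores higher on coverage.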
11.30am Supervised ranking of syntactic configurations for finding targets of semantic expressions (Jason Kessler)
Trying to find targets of sentiment phrases (“while the dealership was friendly, the *car* was a disappointment”)
Sentiment expressions only link to physical targets (“Bill likes to drive the car”)
Two-domain corpus: cars and cameras
Baselines – proximity, one-hop with dependency parser
Approach – learn to target from a corpus. Supervised ranking instead of classification. Uses linear-kernel RankSVM, off-the-shelf approach
Results – supervised ranking does better than proximity, one-hop, approaches interannotator agreement
Future work: inter-sentential target
3pm Stochastic models of user-contributory web sites (Tad Hogg)
Focus: describe aggregate group behavior
-determines structure and usefulness of user-participatory sites
-predicting user behaviors
-incentivizing user participation
Stochastic modeling summary:
-Start with individual user behavior, specify states and transitions between states
-Determine collective behavior (details in paper)
Illustration: Stochastic Model of Digg
-Phenomenology: users submit and vote on news stories, Digg promotes popular stories to front page, allows social networking (friends, fans)
Model of Digg voting behavior: visibility and interestingness -> votes. Extension to prior model (Lerman ’07).
– “law of surfing” for viewing web pages (Huberman et al. 98)
– incremental average growth in number of voters’ fans
– construct equation for dynamics of vote volume for a story from state diagram that formalizes visibility and interestingness. Params for vis and interest estimated from story sample. Estimate viewers watching stories from models and data.
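As a toy illustration of the "visibility and interestingness -> votes" idea, the vote rate can be taken as interestingness times the number of users who see the story per time step. This is not the equation from the paper: fans are ignored, visibility is a simple two-level step function, and every parameter value is invented.

```python
def simulate_votes(interest, v_upcoming, v_front, promote_at, steps, dt=1.0):
    """Euler integration of d(votes)/dt = interest * visibility.

    interest:   probability that a viewer votes (assumed constant)
    v_upcoming: viewers per step while the story is in the upcoming queue
    v_front:    viewers per step once the story is on the front page
    promote_at: vote count at which the story is promoted (assumed rule)
    """
    votes = 1.0  # the submitter's own vote
    history = [votes]
    for _ in range(steps):
        visibility = v_front if votes >= promote_at else v_upcoming
        votes += interest * visibility * dt
        history.append(votes)
    return history

history = simulate_votes(interest=0.1, v_upcoming=5, v_front=100,
                         promote_at=10, steps=30)
```

Even this crude version shows the qualitative shape described in the talk: slow growth in the upcoming queue, then a sharp increase in slope once the story is promoted.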
Data: front page and upcoming stories since May 06
Modeling story visibility: story location, navigating web sites, number of fans. Each voter enables fans to see story via friends interface.
Modeling story interestingness: topic, novelty, popularity. Can estimate from web-based experiments, e.g. Salganik et al. 06, but can also estimate from models and data.
Results: model captures qualitative features – slow growth initially, influence of fans on promotion, rapid growth if story promoted (much more visible to users)
Results: the number of fans who have not yet seen the story drops, number of votes on story grows significantly after story gets promoted. “Promotion line” in number of fans / interestingness splits stories into will be / won’t be promoted with 95% accuracy
Predictions from early behavior: can predict #votes from first 4 votes (similar to results for YouTube), but “law of surfing” and incremental growth important parts of model
Conclusions: stochastic process approach connects user and system behaviors, applicable to social media in general when users have limited information and actions, limited use of personalized history.
3.30pm Personal information management vs resource sharing: towards a model of information behavior in social tagging systems (Markus Heckner)
Why do people tag?
Tagging: a fourth layer of indexing? (On top of author keywords, intellectual indexing by information professionals, and auto-tagging)
media type influences tagging: differences in number, language, function of tags btw Connotea, Flickr, YouTube, Delicious
Method: Scientific crowdsourcing using Mechanical Turk
Assumption: Different motivations for tagging
-Organization of one’s own digital content, i.e. personal informational management (Delicious, Connotea), vs. information sharing (Flickr, YouTube)
Questionnaire Design: Question Types
-online questionnaire posted as “human intelligence task.” asks general information, general motivation, tagging motivation and understanding, social bookmarking and search, recent usage.
Data: ~150 subjects, users of Flickr, YouTube, Delicious, Connotea
Motivation: YouTube is significantly weaker-motivated for PIM, Delicious much weaker for sharing, Flickr and Connotea about even
Perception of tagging: Connotea users perceive tagging as easiest, followed by YouTube, Delicious, Flickr (not significant). Connotea users agree very strongly that tagging is a useful feature
Towards a model of tagging behavior: Shneiderman’s approach towards social software (social spheres), etc.
4pm Motivational, structural and tenure factors that impact online community photo sharing (Oded Nov)
Why do people in online communities share? Can we quantify the drivers for sharing (or not sharing) and their effect on actual behavior?
Three types of questions as framework:
Why – drivers of sharing (Motivation, structural properties, personality, privacy concerns)
What – type of information shared (code, content/facts, meta-info (tags), photos)
Where – context of sharing (OSS, Wikipedia, Flickr)
Creation vs. sharing: the act of sharing is separate from the act of creation
People take photos regardless of the sharing act (really?), the “second act” of sharing photos is optional, separate from the “first act” of photo creation
Identifying the factors in sharing: motivational (extrinsic vs. intrinsic), structural factors (position of user in community network), tenure in community
Motivations: enjoyment (self/intrinsic), commitment to the community (others/intrinsic), self-development (self/extrinsic), reputation (others/extrinsic)
Response variable: artifact sharing per tenure year, IV: motivational vars + structural (number of contacts) + tenure (years since started sharing)
Method: combine user-reported (survey) data and system data: what people say + what people do. N=278, used only “pro” users (>200 photos) with at least 3 months’ tenure on Flickr.
Results: significant positive effect of commitment, negative of self-development, positive of number of contacts, negative of tenure, rest not significant.
Why is enjoyment not correlated with sharing? Users may be motivated more by “fun” of creation rather than content sharing.
Why is correlation between self-development and photo sharing negative? A tradeoff between contribution quality and quantity? Greater self-development motivation -> focus on the quality of artifacts shared, at the expense of quantity
quality / quantity tradeoff, fun is not an issue?, diminishing sharing
4.30pm Modeling Blog Dynamics (Michaela Goetz)
Blogosphere is a system of interactions: Entities: Bloggers, Posts, Topics
Model: simple set of rules (followed by blogger) that creates these interactions
Evaluation: creating a synthetic blogosphere, comparing it to real blogosphere
Motivation: forecasting, advertising
How is this different from modeling social network? 2 networks combined: Blog vs. Post network, complex temporal dynamics
goal: model micro-level interactions to observe macro-level interactions in blogosphere
Properties of the blogosphere:
-Topological – blog and post networks both follow power law distributions
-Temporal – user posting activity, popularity over time (link creation)
Burstiness (Slope = 1 of aggregation level vs. entropy) & Self-similarity (Linearity of aggregation level vs. entropy)
Inter-posting time follows a power law
Time t vs. number of in-links t days after publishing follows a power law
Desired model: simple (no parameters), intuitive (local rules), creates realistic topology and dynamics
-First try: inter-posting times sampled from an exponential distribution, links created using preferential attachment – leads to exponential inter-posting distribution, Poisson degree distribution (really?), etc.
Second-try solution (Zero-Crossing):
In every round, for every blog, user u takes a random walk step, if he reaches 0, he decides to post P
when he posts, he can make a link or not, if he makes a link, he can choose a neighbor based on frequency of links or non-neighbor, then chooses some post of neighbor and links to random posts upward in the cascade
This model accurately reproduces both the topological and temporal patterns (at a qualitative level – same distributions, different though relatively close exponents. Biggest difference: 1.5(sim) vs. 0.7(real) in inter-posting time exponent)
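The posting rule of the second-try model can be sketched for a single blogger as below. It covers only the random-walk timing, not the link-creation step, and the unit step size and function names are assumptions.

```python
import random

def posting_times(rounds, rng):
    """Rounds at which a +/-1 random walk started at 0 returns to 0.

    Each return to 0 is treated as a decision to post.
    """
    position = 0
    times = []
    for t in range(1, rounds + 1):
        position += rng.choice((-1, 1))
        if position == 0:
            times.append(t)
    return times

def inter_posting_gaps(times):
    """Number of rounds between consecutive posts."""
    return [b - a for a, b in zip(times, times[1:])]

times = posting_times(10000, random.Random(0))
gaps = inter_posting_gaps(times)
```

Return times of a simple symmetric random walk are heavy-tailed with exponent 1.5, which is consistent with the simulated inter-posting exponent quoted above and explains why this rule gives bursty posting without any tunable parameter.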