Weibo Visualization

Data

Weibo data includes two object types: users and statuses. The original data set was obtained from the Sina Weibo OpenAPI, but we use only a small part of it as a test case. The two object types have the following attributes:

  1. User
    1. Name
    2. Number of Followers
    3. Number of Friends
    4. Location: Province, City
    5. Gender
  2. Status
    1. Text
    2. Time
    3. Number of Reposts
    4. Number of Comments

The relations between types of data objects are:

  1. User-Status Relation (Authorship)
  2. User-User Relation (Follow)
  3. Status-Status Relation (Repost)
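
As an illustration of the data model above, the two object types and three relations could be represented as plain JavaScript objects. This is only a sketch: the field names (`authorId`, `repostOf`, `follows`, and so on) and the sample values are assumptions, since the source lists the attributes but not a schema.

```javascript
// Hypothetical field names and sample records; not the actual schema.
const users = [
  { id: 'u1', name: 'alice', followers: 1200, friends: 150,
    province: 'Beijing', city: 'Haidian', gender: 'f', follows: ['u2'] },
  { id: 'u2', name: 'bob', followers: 80, friends: 40,
    province: 'Guangdong', city: 'Shenzhen', gender: 'm', follows: [] },
];

const statuses = [
  { id: 's1', text: 'original post', time: '2012-12-30T09:00:00+08:00',
    reposts: 12, comments: 5, authorId: 'u2', repostOf: null },
  { id: 's2', text: 'repost of s1', time: '2012-12-30T09:30:00+08:00',
    reposts: 4, comments: 1, authorId: 'u1', repostOf: 's1' },
];

// User-Status relation (authorship): all statuses written by one user.
function statusesBy(userId) {
  return statuses.filter((s) => s.authorId === userId);
}

// User-User relation (follow): ids of the users who follow the given user.
function followersOf(userId) {
  return users.filter((u) => u.follows.includes(userId)).map((u) => u.id);
}

// Status-Status relation (repost): walk repostOf links back to the original.
function originalOf(statusId) {
  const byId = new Map(statuses.map((s) => [s.id, s]));
  let current = byId.get(statusId);
  while (current && current.repostOf) {
    current = byId.get(current.repostOf);
  }
  return current;
}
```

Storing each relation as an id reference keeps the records flat, which suits a document store; the lookups are then simple filters or map accesses.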

Data Selection and Acquisition

For our test case, we chose several popular topics and keywords:

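  1. 火车票 (train tickets): 火车票, 12306, 抢票 (ticket grabbing)
  2. 天气 (weather): 天气, PM2.5, 空气 (air)
  3. 世界末日 (end of the world): 玛雅 (Maya), 世界末日, 2012年12月21日 (December 21, 2012)
  4. 食品安全 (food safety): 食品安全, 速生鸡 (fast-growing chicken), 肯德基 (KFC)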

We searched the whole collection of statuses for the given keywords. A large portion of the statuses in the database have zero reposts and comments; we consider these unimportant and filter them out of our test case.

We used these criteria to search the status database:

  1. Keywords
  2. Posted prior to Jan 1, 2013
  3. Repost count greater than 3
  4. At most 2,000 statuses returned
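
The criteria above can be sketched as a filter over an in-memory array of statuses. This is not the query we ran against the database; the function names, the field names (`text`, `time`, `reposts`), the China Standard Time cutoff, and the reading of the 2,000 limit as a total rather than a per-topic cap are all assumptions made for illustration.

```javascript
// Keyword groups from the topic list, keyed by topic name.
const TOPICS = {
  '火车票': ['火车票', '12306', '抢票'],
  '天气': ['天气', 'PM2.5', '空气'],
  '世界末日': ['玛雅', '世界末日', '2012年12月21日'],
  '食品安全': ['食品安全', '速生鸡', '肯德基'],
};

// Assumption: the cutoff is midnight on Jan 1, 2013, China Standard Time.
const CUTOFF = new Date('2013-01-01T00:00:00+08:00');
const MIN_REPOSTS = 3; // strict lower bound: reposts must be greater than 3
const LIMIT = 2000;    // assumption: a cap on the total, not per topic

// Return the first topic whose keyword list matches the text, or null.
function topicOf(text) {
  for (const [topic, keywords] of Object.entries(TOPICS)) {
    if (keywords.some((k) => text.includes(k))) return topic;
  }
  return null;
}

// Apply the four criteria in order and tag each hit with its topic.
function selectStatuses(all) {
  const selected = [];
  for (const s of all) {
    if (selected.length >= LIMIT) break;
    const topic = topicOf(s.text);
    if (topic === null) continue;              // 1. keywords
    if (new Date(s.time) >= CUTOFF) continue;  // 2. posted prior to Jan 1, 2013
    if (s.reposts <= MIN_REPOSTS) continue;    // 3. repost count greater than 3
    selected.push({ ...s, topic });            // 4. capped by LIMIT above
  }
  return selected;
}
```

Tagging each status with its topic at selection time means the views do not have to repeat the keyword matching later.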

We then fetched the authors of these statuses.

The final dataset contains 1,425 users and 1,930 statuses posted between Dec 30 8:18 and Dec 23:59, 2012. It is 3.7 MB in size.

Design

What We Want to Reveal

Our main goal is to reveal how popular topics evolve over a period of time, as well as the correlations between users and topics. Our design goals are:

  1. To show the popularity and evolution of topics and the users related to them.
  2. To show the distribution in terms of location, gender, topic and time.
  3. To discover the users and statuses that are critical to these topics.
  4. To show the sentiment distribution and trend.

Views

  1. Bubble chart of statuses, grouped by topics and time
    1. Y position: time, grouped by intervals
    2. X position: first group by topic, then sort by sentiment
    3. Color: topic
    4. Size: number of reposts and comments
  2. Bubble chart of users, grouped by location (province)
    1. Color: gender
    2. Size: number of followers
    3. Links between users and statuses
  3. Stacked area chart to show overview of topic evolution, both in absolute value and in percentage
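
Two of the encodings above involve a small calculation, sketched below without D3 so that the arithmetic is visible. The function names and field names are hypothetical. The sketch also assumes that bubble area (not radius) is proportional to reposts plus comments, and that the stacked area chart counts statuses per topic in fixed-width time bins; the source does not state either detail.

```javascript
// Radius chosen so that bubble AREA is proportional to reposts + comments.
function bubbleRadius(status, maxCount, maxRadius) {
  if (maxCount <= 0) return 0;
  const count = status.reposts + status.comments;
  return maxRadius * Math.sqrt(count / maxCount);
}

// Count statuses per (time bin, topic) and return, for each bin in time order,
// both the absolute counts and each topic's share of the bin total.
function topicSeries(statuses, topics, binMs) {
  const bins = new Map();
  for (const s of statuses) {
    const bin = Math.floor(new Date(s.time).getTime() / binMs) * binMs;
    if (!bins.has(bin)) {
      bins.set(bin, Object.fromEntries(topics.map((t) => [t, 0])));
    }
    const counts = bins.get(bin);
    if (!(s.topic in counts)) continue; // ignore topics that are not charted
    counts[s.topic] += 1;
  }
  return [...bins.keys()].sort((a, b) => a - b).map((bin) => {
    const counts = bins.get(bin);
    const total = topics.reduce((sum, t) => sum + counts[t], 0);
    const percent = Object.fromEntries(
      topics.map((t) => [t, total === 0 ? 0 : counts[t] / total])
    );
    return { bin, counts, percent };
  });
}
```

The `counts` rows would feed the absolute-value chart and the `percent` rows the percentage chart, so both variants can share one pass over the data.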

Interaction

  1. Foldable time intervals: expand to see detailed time-wise distribution of statuses
  2. Focus on a single user and their posts
  3. Multiple selection of topics
  4. Multiple selection of provinces
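
The selection interactions above can be modeled as a small state object that decides which statuses stay visible. The sketch below is one possible shape for that logic, not the code used in the project; the names `makeSelection`, `toggle` and `visibleStatuses`, and the convention that an empty set means "no filter", are assumptions.

```javascript
// Selection state: multiple topics, multiple provinces, at most one focused user.
function makeSelection() {
  return { topics: new Set(), provinces: new Set(), userId: null };
}

// Add the value to the set if absent, remove it if present (click to toggle).
function toggle(set, value) {
  if (set.has(value)) set.delete(value);
  else set.add(value);
}

// A status is visible when it passes every active filter.
// Assumption: an empty set (or a null userId) means that filter is inactive.
function visibleStatuses(statuses, usersById, sel) {
  return statuses.filter((s) => {
    if (sel.userId !== null && s.authorId !== sel.userId) return false;
    if (sel.topics.size > 0 && !sel.topics.has(s.topic)) return false;
    if (sel.provinces.size > 0) {
      const author = usersById.get(s.authorId);
      if (!author || !sel.provinces.has(author.province)) return false;
    }
    return true;
  });
}
```

Because the filters combine with a logical AND, selecting a user and a topic together narrows the view to that user's posts on that topic.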

Screenshots

Expand time groups.

Filter users by province.

Filter statuses by topic.

Select a user and their posts.

Filter by user and topic.

Implementation

Technologies

This visualization is implemented using web technologies: HTML, CSS and JavaScript, with the D3.js library. The data is stored in MongoDB, and is processed and served by a Node.js web server.
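
To make the server side of this architecture concrete, here is a minimal sketch of a Node.js request handler that returns statuses as JSON for the browser to draw. It is not the project's server: the `/statuses` route, the `topic` query parameter and port 3000 are hypothetical, and an in-memory array stands in for the MongoDB collection so that the example stays self-contained.

```javascript
const http = require('http');

// In-memory stand-in for the MongoDB collection (assumption for this sketch).
const statuses = [
  { id: 's1', topic: '天气', text: 'sample status', reposts: 5, comments: 2 },
  { id: 's2', topic: '火车票', text: 'another sample', reposts: 9, comments: 0 },
];

// Serve GET /statuses, optionally filtered by ?topic=...; anything else is a 404.
function handleRequest(req, res) {
  const url = new URL(req.url, 'http://localhost');
  if (url.pathname !== '/statuses') {
    res.writeHead(404, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ error: 'not found' }));
    return;
  }
  const topic = url.searchParams.get('topic');
  const result = topic ? statuses.filter((s) => s.topic === topic) : statuses;
  res.writeHead(200, { 'Content-Type': 'application/json; charset=utf-8' });
  res.end(JSON.stringify(result));
}

const server = http.createServer(handleRequest);
// server.listen(3000); // uncomment to serve on the hypothetical port 3000
```

On the client, the page would request this JSON and bind it to the D3.js views; filtering on the server keeps the payload small when only some topics are selected.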

Findings

  1. Users from Beijing, Guangdong and Shanghai contribute most to the selected topics.
  2. Activity is low between 2am and 6am, presumably because most users are asleep.
  3. People talked more about the Apocalypse towards the end of the year, usually in posts summarizing 2012.
  4. Ads and promotions get lots of attention and responses.
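
A finding such as the quiet hours can be checked with a simple histogram of statuses by hour of day. The sketch below is an illustration rather than the analysis we ran; it assumes a `time` field that `Date` can parse and that hours should be read in China Standard Time (UTC+8).

```javascript
// Assumption: hours of day are interpreted in China Standard Time (UTC+8).
const CST_OFFSET_HOURS = 8;

// Return an array of 24 counts, one per local hour of day (index 0 = midnight).
function hourlyCounts(statuses) {
  const counts = new Array(24).fill(0);
  for (const s of statuses) {
    const utcHour = new Date(s.time).getUTCHours();
    counts[(utcHour + CST_OFFSET_HOURS) % 24] += 1;
  }
  return counts;
}
```

Comparing the counts at indices 2 to 5 with the rest of the array would show how pronounced the overnight dip is.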

What We Learned

  1. Understanding the data: clarify what to expect from the visualization before anything else (design, data preparation, etc.).
  2. With a huge number of elements in the visualization, UI performance becomes an issue. Consider unconventional UI programming hacks for performance, though they come at the cost of code extensibility and maintainability (for example, violating the DRY principle).
 
public_course/visclass_f12/project/group_1/start.txt · Last modified: 2013/01/23 00:45 by ye.cheng