Twitter

From Just Solve the File Format Problem
Revision as of 16:13, 26 February 2013 by Dan Tobias (Talk | contribs)

File Format
Name Twitter
Ontology

Twitter is a popular social-networking and messaging service, accessible through the web and mobile device apps, that lets users write messages of up to 140 characters, publicly or privately. The messages often include hyperlinks, which get routed through URL shorteners (so they may suffer linkrot if the shortening services go away). Some of the conventions of the service are discussed in the article on Hashtags, at-signs, retweets, etc.

Of interest to archivers is the fact that, as of late 2012, Twitter has started rolling out a feature to permit users to save their entire tweet history as an archive file.

Twitter is also one of the engulf-and-devour Internet megacorporations that have swallowed up, digested, and excreted other Internet services, a 2013 example being Posterous.

Downloaded Twitter archive

If you have been given the option to download your Twitter history (it is being rolled out gradually, so you may not have it yet, but probably will in the future), it appears as a button at the bottom of the "Settings" page in your account. Pressing it queues the generation of an archive of your tweets, and when it is finished (minutes? hours? whenever?) you are e-mailed, at the address registered to the account, a link to retrieve your archive. There, you can download it as a ZIP archive (tweets.zip) containing this file and directory structure:

  • README.txt: an ASCII text file (with long lines that scroll way off to the right if your text viewer doesn't wrap long lines) giving some information about the format
  • index.html: HTML file which, when loaded in a browser, lets you view your tweets. The tweets themselves aren't actually in this file, but it pulls in a bunch of JavaScript from the subdirectories, which in turn load the tweets from data files.
  • css: Subdirectory with Cascading Style Sheets.
    • application.min.css: Stylesheet (formatted in a hard-to-read manner with no line breaks).
  • data: Subdirectory with data files.
    • csv: Subdirectory with CSV files.
      • YYYY_MM.csv: A series of files named by year and month with the tweets in the form of comma-separated values (CSV). The columns are: "tweet_id", "in_reply_to_status_id", "in_reply_to_user_id", "retweeted_status_id", "retweeted_status_user_id", "timestamp", "source", "text", "expanded_urls". The timestamp is in UTC, in the format YYYY-MM-DD HH:MM:SS +0000.
    • js: Subdirectory with JavaScript (user-specific, encoding details about the tweets).
      • payload_details.js
      • tweet_index.js
      • user_details.js
      • tweets
        • YYYY_MM.js: A series of files named by year and month with the tweets in JSON form, with a one-line header turning each file into a JavaScript variable assignment. (Strip that line if using the JSON data elsewhere.)
  • img: Subdirectory with graphics.
    • bg.png: A PNG graphic used as a background.
    • sprite.png: A PNG graphic with sprites used by the scripts.
  • js: Subdirectory with JavaScript.
    • application.min.js: Script used in displaying tweets (formatted in a hard-to-read manner with no line breaks).
  • lib: Subdirectory with various 'library' files used by the scripts.
    • bootstrap: various JavaScript, CSS, and graphics.
    • hogan: Contains another JavaScript file.
    • jquery: Contains another JavaScript file.
    • twt: Contains some more JavaScript, CSS, and graphics.
    • underscore: Contains another JavaScript file.
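
The two tweet data formats described above (the monthly CSV files and the monthly .js files) can be read with a few lines of Python. The following is a minimal sketch, not a definitive parser. It assumes that each CSV file begins with a header row carrying the column names listed above, and that the "one-line header" of a .js file is exactly the first line of the file, with everything after it being plain JSON. The function names are made up for illustration.

```python
import csv
import io
import json
from datetime import datetime


def parse_timestamp(value):
    """Parse a 'YYYY-MM-DD HH:MM:SS +0000' string into a timezone-aware datetime."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S %z")


def read_tweets_csv(text):
    """Parse the contents of one YYYY_MM.csv file into a list of dicts.

    Assumption: the first row is a header naming the columns
    ("tweet_id", "timestamp", "text", and so on).
    """
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # Replace the raw timestamp string with a datetime object.
        row["timestamp"] = parse_timestamp(row["timestamp"])
        rows.append(row)
    return rows


def load_tweets_js(text):
    """Parse the contents of one YYYY_MM.js file.

    Assumption: the JavaScript variable assignment occupies exactly the
    first line; it is dropped and the remainder is parsed as JSON.
    """
    _header, _, body = text.partition("\n")
    return json.loads(body)
```

Both functions take the file contents as a string, so the caller decides how to open the files; reading them as UTF-8 is a reasonable guess but is not confirmed by the archive's description here.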

