Create a corpus
“Corpus” is just a fancy-schmancy word for “a bunch of text”. twittermarkov expects a corpus that’s a text file with one tweet per line. Several thousand lines are needed to get decent results; with fewer than 100 or so, it won’t work at all.
The twittermarkov corpus command will create such a file from a Twitter archive, with options to ignore replies or retweets, and to filter out mentions, URLs, media, and/or hashtags.
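To check whether a corpus file is likely to be big enough, a minimal sketch along these lines counts the non-blank lines. The function name corpus_size_check and both thresholds are assumptions made for illustration; twittermarkov itself does not necessarily use these exact numbers.

```python
def corpus_size_check(lines, minimum=100, comfortable=3000):
    """Classify a corpus by its number of non-blank lines.

    Both thresholds are illustrative stand-ins for "100 or so" and
    "several thousand"; they are not values taken from twittermarkov.
    """
    count = sum(1 for line in lines if line.strip())
    if count < minimum:
        return count, "too small"
    if count < comfortable:
        return count, "may give poor results"
    return count, "ok"


# Small in-memory example: two tweets and one blank line.
sample = ["first tweet\n", "\n", "second tweet\n"]
print(corpus_size_check(sample))
```

An open file object can be passed in place of the list, since iterating over a file yields one line at a time.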
When reading an archive, the filtering options use the tweet’s metadata to precisely strip the unwanted content. This may not work well for tweets posted before 2011 or so. For text files or older tweets, a regular expression search is used instead.
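The regular expression fallback can be pictured with a rough sketch like the one below. The patterns and the function name strip_entities are hypothetical and written only for illustration; the expressions twittermarkov actually uses may differ. Media is left out because it cannot be reliably recognized from plain text alone.

```python
import re

# Hypothetical patterns for illustration; twittermarkov's own
# expressions may differ.
PATTERNS = {
    "urls": re.compile(r"https?://\S+"),
    "mentions": re.compile(r"@\w+"),
    "hashtags": re.compile(r"#\w+"),
}


def strip_entities(text, no_urls=False, no_mentions=False, no_hashtags=False):
    """Remove the selected kinds of content from one line of text."""
    # URLs go first so that an "@" or "#" inside a link is not
    # mistaken for a mention or hashtag.
    if no_urls:
        text = PATTERNS["urls"].sub("", text)
    if no_mentions:
        text = PATTERNS["mentions"].sub("", text)
    if no_hashtags:
        text = PATTERNS["hashtags"].sub("", text)
    # Collapse the whitespace left behind by removed tokens.
    return " ".join(text.split())


tweet = "hey @friend look at https://example.com #wow"
print(strip_entities(tweet, no_urls=True, no_mentions=True))
```

A pattern-based approach like this is cruder than using tweet metadata: for example, the mention pattern would also clip part of an email address.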
All the filtering options:
--no-retweets - skip retweets
--no-replies - filter out replies (keeps the tweet, just removes the starting username)
--no-mentions - filter out mentions
--no-urls - filter out URLs
--no-media - filter out media
--no-hashtags - filter out hashtags
If you’re using a Twitter archive, the archive argument should be the tweet.csv file found in the archive folder (which usually has a long name like