Using big data for one’s research is considered the “it” thing nowadays, especially in geography. However, understanding the process can be rather difficult for people who are somewhat tech savvy but have little to no programming background. There’s so much information out there, but it can be hard for a newbie to follow. Many articles say they are “for the beginner,” yet they can still go beyond one’s existing knowledge base.
By the beginning of the year, I would like to set up an automated collection of tweets about the Veterans Health Administration. I’ve been working on this for a couple of months now and I was finally able to figure it out. I’ll give a rough outline of the process I went through here and explain in detail what I did in subsequent posts. I’ll also post a list of important keywords on a separate page for those who, like me, are just getting started with all of this.
- Choose the programming language you want to use. For me it was R vs Python. R won because I had to use it for a class I took this past semester, so I am more familiar with it.
- Decide whether you want a constant collection of tweets or not. Either way, you will have to use something called an Application Programming Interface (generally referred to as an API). I think of an API as the middleman between me and Twitter: I make a request, and the API accesses Twitter to process it. Twitter has pretty detailed information on how its API works and I highly recommend you read it. I would focus on the OAuth, REST APIs and Streaming API sections.
- Write the script for the program. You will have to download and install packages that give you access to the Twitter API in whatever programming language you choose. Look over the package documentation so you have an idea of how the package works. If you don’t completely understand it, it’s okay! Having even a slight idea of how the package works is better than having no idea at all.
- Do a few test runs so you can get an idea of how your code works. Try to collect tweets for five minutes, ten minutes, an hour and three hours. Notice the quality and quantity of the information you get based on how long you keep your program running.
- Find out how to automate your code. This really depends on the programming language and operating system. I am concerned with automating an R script on a Mac. If you are using a Mac, you will need to use something called launchd to do this. I would recommend purchasing a program called LaunchControl, which makes things a lot easier. You will need to create a job with this software that runs your program at a certain time interval.
- Do a test run of the program using launchd or LaunchControl for a certain time period. I ran the program for a day just to see the average number of tweets.
- Purchase an external hard drive to store your tweets so you don’t eat up all the storage space on your computer.
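For the automation step, it may help to see roughly what a launchd job looks like under the hood. A job is described by a small property list (plist) file, which is also what a tool like LaunchControl edits for you. The following is only a minimal sketch under assumptions: the label `com.example.tweet-collector`, the location of `Rscript`, the script path, the log paths and the one-hour interval are all placeholders, so swap in your own values.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <!-- Unique name for the job (placeholder) -->
    <key>Label</key>
    <string>com.example.tweet-collector</string>

    <!-- Command to run: Rscript followed by the script path.
         Both paths are assumptions; run "which Rscript" in Terminal to find yours. -->
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/Rscript</string>
        <string>/Users/yourname/tweets/collect_tweets.R</string>
    </array>

    <!-- Start the job every 3600 seconds (one hour); placeholder interval -->
    <key>StartInterval</key>
    <integer>3600</integer>

    <!-- Where printed output and error messages go (placeholder paths) -->
    <key>StandardOutPath</key>
    <string>/tmp/tweet-collector.out</string>
    <key>StandardErrorPath</key>
    <string>/tmp/tweet-collector.err</string>
</dict>
</plist>
```

A file like this would typically be saved in `~/Library/LaunchAgents/` and loaded with `launchctl load ~/Library/LaunchAgents/com.example.tweet-collector.plist`. Checking the two log files after the first scheduled run is a quick way to see whether the script actually started.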
I will warn you that you will get frustrated with the process. Don’t give up! I definitely got frustrated, and I started to backtrack here and there or tried to add more elements to the process than I needed. If you get overly frustrated, just walk away from your computer for a while and come back to it once you’re in a better state of mind.