IBM researchers have developed an algorithm that predicts your home location using your last 200 tweets.
Geotagging tweets raises privacy issues, particularly
when users are unaware, or forget, that their tweets are geotagged.
Various celebrities are thought to have given away their home locations
in this way. And in 2007, four Apache helicopters belonging to the US
Army were destroyed by mortars in Iraq when insurgents worked out their
location using geotagged images published by American soldiers.
Perhaps
these kinds of concerns are the reason why so few tweets are geotagged.
Several studies have shown that less than one per cent of tweets
contain location metadata.
But the absence of geotagging
data does not mean your location is secret. Today, Jalal Mahmud and a
couple of pals at IBM Research in Almaden, California, say they’ve
developed an algorithm that can analyse anybody’s last 200 tweets and
determine their home city location with an accuracy of almost 70 per
cent.
That could be useful for researchers, journalists,
marketers and so on wanting to identify where tweets originate. But it
also raises privacy issues for those who would rather their home
location remained private.
Mahmud and co’s method is
relatively straightforward. Between July and August 2011, they filtered
the Twitter firehose for tweets that were geotagged with any of the
biggest 100 cities in the US until they had found 100 different users
in each location.
They then downloaded the last 200
tweets posted by each user, rejecting those who posted privately. That
left them with over 1.5 million geotagged tweets from almost 10,000
people.
They then divided this data set in two, using 90
per cent of the tweets to train their algorithm and the remaining 10
per cent to test it against.
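A 90/10 split of this kind can be sketched in a few lines of Python. The article does not say how the split was made, so the random shuffle, the fixed seed and the name `split_dataset` are all assumptions for illustration.

```python
import random

def split_dataset(items, train_fraction=0.9, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder data standing in for the real tweets.
tweets = [f"tweet {i}" for i in range(1000)]
train, test = split_dataset(tweets)
print(len(train), len(test))  # 900 100
```

Holding back data the algorithm has never seen is what makes the later accuracy figures meaningful.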
The basic idea behind their
algorithm is that tweets contain important information about the
probable location of the user. For example, over 100,000 tweets in the
dataset were generated by the location-based social networking site
Foursquare and so contained a link that gave the exact location. And
almost 300,000 tweets contained the names of cities listed in the US
Geological Survey gazetteer.
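Spotting city names of this kind is simple to sketch. The tiny `GAZETTEER` set and the function `find_city_mentions` below are hypothetical stand-ins, not the researchers' code, and a real gazetteer would list far more places.

```python
import re

# Hypothetical miniature gazetteer, for illustration only.
GAZETTEER = {"boston", "new york", "san jose", "chicago"}

def find_city_mentions(tweet, gazetteer=GAZETTEER):
    """Return the gazetteer cities mentioned in a tweet, matched on word boundaries."""
    text = tweet.lower()
    found = []
    for city in sorted(gazetteer):
        # \b stops "boston" from matching inside a longer word such as "bostonian".
        if re.search(r"\b" + re.escape(city) + r"\b", text):
            found.append(city)
    return found

print(find_city_mentions("Stuck in traffic again, classic New York."))  # ['new york']
```

A mention is only a clue, of course: people also tweet about cities they do not live in, which is why such matches are best treated as evidence rather than answers.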
Other tweets contained
clues to their location in phrases such as “Let’s Go Red Sox”, a
reference to the Boston-based baseball team. And
Mahmud and co say that the distribution of tweets throughout the day is
roughly constant across the US, shifted by time zone. So a user’s
pattern of tweets throughout the day can give a good indication of which
time zone they’re in.
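One simple way to exploit that shift, which may differ from how the paper actually uses posting times, is to slide a reference activity profile across the candidate time zones and keep the best match. The `REFERENCE` numbers below are invented for illustration, `guess_time_zone` is a hypothetical name, and the offsets ignore daylight saving time.

```python
from collections import Counter

# Hypothetical reference profile: relative tweet volume for each local hour 0-23.
# The numbers are made up (quiet overnight, busiest in the evening).
REFERENCE = [1, 1, 1, 1, 1, 2, 3, 4, 5, 5, 5, 6, 6, 5, 5, 5, 6, 7, 8, 9, 9, 7, 4, 2]

# Standard-time UTC offsets for the four contiguous US time zones.
US_OFFSETS = {"Eastern": -5, "Central": -6, "Mountain": -7, "Pacific": -8}

def guess_time_zone(utc_hours, reference=REFERENCE, offsets=US_OFFSETS):
    """Pick the zone whose shifted reference profile best matches the user's tweet hours."""
    counts = Counter(utc_hours)

    def score(offset):
        # Convert each UTC hour to local time and weight it by the reference volume.
        return sum(n * reference[(hour + offset) % 24] for hour, n in counts.items())

    return max(offsets, key=lambda zone: score(offsets[zone]))

# A user who tweets mostly at 03:00 and 04:00 UTC, which is evening on the west coast.
print(guess_time_zone([3, 4, 3, 4]))  # Pacific
```

With only a handful of tweets the guess is fragile, but across 200 tweets the daily rhythm becomes much harder to hide.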
So the question these guys
set out to answer was whether it was possible to use this information
to predict a user’s home location, a result they could test by matching
it against the user’s geotagged metadata.
Mahmud and co
used an algorithm known as a Naive Bayes Multinomial to do the number
crunching. They trained it by feeding it the training dataset along with
the geolocation data.
They then tested the algorithm on the remaining 10 per cent of the data to see whether it could predict the geolocation.
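For readers curious about what a multinomial naive Bayes classifier does, here is a minimal sketch. It is not the researchers' implementation: using individual words as features, the add-one smoothing and the made-up training tweets are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

class MultinomialNaiveBayes:
    """Tiny multinomial naive Bayes text classifier with add-one smoothing."""

    def fit(self, documents, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter(labels)      # label -> number of documents
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocabulary.update(words)
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Start from the log prior, then add the log likelihood of each
            # known word; words never seen in training are skipped.
            score = math.log(doc_count / total_docs)
            for word in text.lower().split():
                if word in self.vocabulary:
                    score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up training tweets, purely for illustration.
train_tweets = ["lets go red sox", "fenway tonight with friends",
                "cubs game at wrigley", "deep dish pizza tonight"]
train_cities = ["Boston", "Boston", "Chicago", "Chicago"]
model = MultinomialNaiveBayes().fit(train_tweets, train_cities)
print(model.predict("red sox at fenway"))  # Boston
```

The classifier simply asks which city's past tweets make the new words most likely, which is why a single distinctive phrase can tip the balance.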
The
results are interesting. They say that when they exclude people who are
obviously travelling, their algorithm correctly predicts people’s home
cities 68 per cent of the time, their home state 70 per cent of the time
and their time zone 80 per cent of the time. And they say their
algorithm takes less than a second to do this for any individual.
That
could be a useful tool. Journalists, for example, could use it to
determine which tweets were coming from a region involved in a crisis,
such as an earthquake, and those that were just commenting from afar.
Marketers might use it to work out the popularity of their products in
certain cities.
And it also suggests ways that people can improve their privacy: by not mentioning their home location, of course.
Mahmud
and co say their algorithm could do better in future. For example, they
think they can get more fine-grained detail by searching tweets for
mentions of local landmarks that can be pinpointed more accurately.
Whether that turns out to be possible, we’ll have to wait and see.
An
interesting corollary to all this is that our notion of privacy is more
fragile than most of us realise. Just how we can strengthen and protect
it should be the subject of considerable public debate.
Ref: arxiv.org/abs/1403.2345 : Home Location Identification of Twitter Users