This is a post based on the talk given at Warsaw Data Science Meetup on April 12th 2016.
It’s difficult to make predictions, especially about the future.
Nowadays, whenever we hear about a recent breakthrough in science, technology or business, it is more than likely that so-called ‘big data’ is part of the success. Whether it’s a new medical algorithm to detect early stages of cancer or a new method for reducing traffic in big cities, massive amounts of digital data are very often a ‘secret ingredient’. As a matter of fact, every one of us has probably heard at least a couple of estimates of how much data humankind creates every second, every minute or every hour. If you still have trouble grasping the scale of this process, have a look at http://pennystocks.la/internet-in-real-time/ where they show the ever-increasing amount of information created online in real time.
Now that I have convinced you that there is an abundance of data at our fingertips, how about explaining why it might be interesting to predict which part of this content will be popular? By popular, I mean accessed, shared, modified, retrieved, etc. multiple times. Imagine that you are storing your data on your local computer and people who want to download it need to get it directly from your hard drive. How much bandwidth you will need to serve all the interested parties depends on how popular your data will be. Let’s take another, more realistic example (at least for the 1.5 billion people who use Facebook): you take a picture and you upload it to Facebook. How many likes will your picture get?
As this is clearly a vital problem for humans across the world, many researchers have looked into it. One of the most interesting works, entitled “What makes an image popular” by A. Khosla, A. Das Sarma and R. Hamid from MIT, tries to answer this question by looking into visual and social features of images from Flickr. They gathered a dataset of over 2.3M images from different users, together with the corresponding popularity metric: the number of views. Below you can find a set of example pictures from their dataset:
The first thing they analysed was the popularity distribution. With so many pictures online, it is impossible for all of them to become popular. Only a small percentage will actually become popular, while the majority will be seen by only a few people. This is reflected in the long-tail character of the popularity distribution graph. To deal with this variation within the data, the authors transformed the view counts using a logarithmic function and normalized them by the time passed since publication. The resulting distribution curve looks much more balanced, and now they could get back to the original question: what makes an image popular?
Khosla et al. started by looking into visual features such as the colors present in an image, edge and gradient distributions, as well as the outputs of deep convolutional neural networks - machine learning algorithms that have proved successful in image classification tasks such as the ImageNet Challenge. The results show that:
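To make the normalization step concrete, here is a minimal sketch. The exact formula is my assumption (a base-2 logarithm of views per day, plus one so that zero views map to a score of zero), and the function name and sample numbers are hypothetical.

```python
import math


def normalized_popularity(views, days_online):
    """Log-scaled views per day; the +1 keeps zero-view photos at a score of 0."""
    if days_online <= 0:
        raise ValueError("days_online must be positive")
    return math.log2(views / days_online + 1)


# Hypothetical photos: (view count, days since upload).
photos = [(12, 30), (3_500, 400), (1_200_000, 900)]
for views, days in photos:
    score = normalized_popularity(views, days)
    print(f"{views} views over {days} days -> score {score:.2f}")
```

The logarithm compresses the long tail, so a photo with a million views no longer dwarfs everything else, and dividing by the time online keeps old photos from winning just because they had longer to collect views.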
- greenish and blueish colors tend to have lower importance when predicting image popularity compared to more reddish colors. This may be due to the fact that more striking colors attract more attention
- convolutional neural networks and their outputs (object detection results) provide important insights into the popularity of an image. For instance, objects like miniskirt, bikini or perfume have a strong positive impact, while spatula, plunger or laptop have a rather negative impact on image popularity
- when using all visual features available before the publication of the photo, the authors obtained a Spearman rank correlation of up to 0.4 (on a scale whose magnitude ranges from 0 to 1), which suggests that visual content plays a role in an image’s popularity.
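If you have not met the Spearman rank correlation before, it compares the ordering of two lists of scores (for example predicted and actual popularity) and ignores their absolute values. Below is a small pure-Python sketch of the standard definition (the Pearson correlation of the ranks); the function names and sample scores are hypothetical and this is not the authors’ code.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical predicted vs. actual popularity scores for six photos.
predicted = [0.2, 1.5, 0.9, 3.1, 2.4, 0.4]
actual = [0.1, 0.8, 1.9, 2.7, 3.0, 0.3]
print(f"Spearman correlation: {spearman(predicted, actual):.2f}")
```

A value of 1 means the two lists rank the photos in exactly the same order, 0 means there is no monotonic relationship, and -1 means the order is reversed.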
They also analysed social cues, such as the mean number of views received by a user’s photos, the number of tags and the length of the photo description. They concluded that social cues play an even more important role in popularity prediction than visual cues, with the corresponding correlation reaching up to 0.77.
Finally, they implemented a demo that anyone can use to predict the popularity of their photo before it is published online. They also showed a few sample images ordered according to their true popularity as well as their predicted popularity:
When talking about pictures online, one cannot forget about one of the most popular types of pictures (if not THE most popular one): #selfies. So how should we take our selfie to make it popular? Andrej Karpathy tackled this exciting problem in his blog post, where he trained a deep convolutional neural network to analyse hundreds of thousands of selfies and figure out which ones are popular and why.
He gathered a dataset of over 2 million selfie images from the Internet and split them into two categories: good and bad, according to their popularity. The resulting sample of pictures looks like this:
He then fine-tuned a pre-trained deep convolutional neural network that won one of the recent ImageNet challenges, namely VGGNet, to perform a binary classification of good/bad selfies. He held out a test dataset of 50 000 pictures and asked the neural network for its opinion on them. These are the results:
All of the top 100 images according to the neural network feature women. Most of the subjects took up more than a third of the image and, what is especially interesting, most of the people shown in the pictures had their foreheads cut off. When looking into male selfies, the network picked pictures of guys with fancy hairstyles combed upwards, many of them naked, with both shoulders within the frame. Well… apparently that’s what makes you look better!
Karpathy also looked at the worst 100 selfies and quickly discovered that they share several characteristics: most of them were under-exposed, had heads that were too large for the frame, or were group shots.
The network was also deployed as a Twitter bot to help users find the best crop of their selfie (according to the network). The results show that in some cases, the best selfie… is a selfie without the author at all! :)
Data Science @ Tooploox
At Tooploox, we work with companies that span multiple domains, countries and continents. One of our clients is a media company focused on providing the best video content to its users via social media. Therefore, we also looked into the problem of popularity prediction for online videos.
As with photos, some videos get popular while others don’t. A good example of the former is a video of a sneezing baby panda that has been watched over 220 million times.
Now, what can we do to predict the popularity of a video published online? Some early works suggest that one way is to look at the popularity just after publication, as early view patterns reflect long-term interest. For example, Szabo et al. in their paper entitled “Predicting the popularity of online content” analysed the view counts of YouTube videos within 30 days after publication. The resulting correlation suggests that the popularity after 30 days can be predicted from the first 7 days with fairly high accuracy.
In our work, we extended this analysis and computed several visual features of the video, such as dominant color, scene dynamics, clutter metrics and textual features present in the video. Since the videos that we analysed are distributed via social media, we also looked into social features such as the number of comments, shares and likes. We plugged all the available data (view counts, social features and visual features) into a Support Vector Regression (SVR) model and compared the resulting prediction accuracy against state-of-the-art methods. SVR based on view counts alone already provides a fairly good estimate, but it can be improved by extending the input features with social and visual cues.
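To illustrate the idea of predicting long-term popularity from early views, here is a minimal sketch that fits a straight line between the logarithm of the views after 7 days and the logarithm of the views after 30 days. The least-squares fit in log space is my simplification and not necessarily the exact model from the paper; the function names and all numbers are made up for the example.

```python
import math


def fit_log_linear(early_views, final_views):
    """Least-squares fit of log(final) = slope * log(early) + intercept.

    All view counts must be positive, because we take their logarithm.
    """
    xs = [math.log(v) for v in early_views]
    ys = [math.log(v) for v in final_views]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_final_views(early, slope, intercept):
    """Map an early view count to a predicted later view count."""
    return math.exp(slope * math.log(early) + intercept)


# Hypothetical training data: views after 7 days and after 30 days.
views_day7 = [120, 950, 4_300, 18_000, 75_000]
views_day30 = [310, 2_100, 9_800, 41_000, 160_000]

slope, intercept = fit_log_linear(views_day7, views_day30)
estimate = predict_final_views(10_000, slope, intercept)
print(f"A video with 10,000 views after a week: ~{estimate:.0f} views after 30 days")
```

Working in log space matters here for the same reason as with photos: view counts are long-tailed, so a fit on raw counts would be dominated by a handful of viral videos.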
To put it in a nutshell, if you want to make predictions in the world of online content, don’t forget that it’s not only who takes a picture or video that matters, but also who shares it and who comments on it. Thank you for reading and good luck with making your predictions! As they say: it’s not that difficult to make predictions, especially about the images and videos online ;)