Visual and textual representation of immigration in the Hungarian online media
In the autumn of 2016, the referendum on the so-called forced settlement of migrants was looming over us. The media was certainly putting enormous pressure on society, but what was all the fuss about? Although modern computational linguistics cannot give exact answers, it can help us get an idea of the wide range of emotions stirred up by various sites. The Precognox research team presents its in-depth analysis of how the media is attempting to sell the referendum.
Originally published: nyest.hu, September 29, 2016
It is indisputable that a huge number of refugees reached Hungary, yet we did not bump into them around every corner, since most of them left the country almost immediately. For the public, the media is the direct link to the refugees, so we wanted to find out how news on migration is presented in the online media. We analyzed more than 40,000 articles published between September 27, 2014 and July 11, 2016 with text mining and image processing methods. The texts and their metadata are available in searchable form on our dashboard. In this article, we give a broad outline of the possibilities the dashboard provides. We also try to present the information content of the images in an easily understandable form.
In contrast to the mostly qualitative research widely applied in media content and media representation analysis, we used methods that support the automatic processing of large amounts of data as well as the simultaneous analysis of visual and textual content. This way the research period can be extended and the number of content providers increased. The simultaneous analysis is to be completed in our intern’s thesis on fine-tuning the application and evaluation of cluster analysis. The aim is to make the interpretation of both textual content and the increasing proportion of visual content easier in the future.
The necessary data was collected from 25 online news sites, including the prominent index.hu and origo.hu, the online versions of mno and hvg, and minor portals. The selected sites cover a wide spectrum of the Hungarian online media; articles were taken from pestisracok.hu, abcug.hu and kuruc.info as well as from popular tabloid sites. We also collected data from online TV channels (atv.hu, rtl.hu, hirek.hu) and the official police reports from police.hu.
The number of articles in the corpus based on content providers
On most sites we could find the articles related to migration with the site’s own search engine, but in some cases this was not feasible. In the absence of a search function, or in addition to it, labels and headings guided us to the relevant content. First, we collected the article URLs manually with the Link Klipper Google Chrome extension. Then, using these references, we automated the crawling of both the visual and the textual content.
The number of articles in the corpus based on keywords
To interpret the composition of the corpus, it is essential to describe how the URLs and the content were filtered, since filtering reduced the reference list by 30 thousand items. Several methods were used to ensure that only relevant and unique articles were added to the corpus. Using simhash, we got rid of invalid links that led to either recommendations or pages listing search results. Duplicates within one domain were filtered with similarity measures based on tf-idf statistics. We also removed duplicated URLs. We applied a simple heuristic to duplicates: the article published earliest was kept in the corpus. We also discarded articles with no timestamp, since their date of publication could not be identified. Although we did our best to eliminate irrelevant articles with our statistical tools, we cannot be certain that only proper content remains. It is also important to remember that the composition of the corpus allows only cautious conclusions about the number of articles published on a certain subject or about which site was the most active in covering a given topic. The reason is that an article entered the corpus only if it was actually published, the site could be crawled, and the content met the filter criteria: it was not identified as a duplicate or an invalid link, and it had a timestamp. At the end of the process we had a corpus of 42,845 articles.
We worked with the following keywords: “immigration”, “immigrant”, “migrant”, “migration”, “refugee” and “asylum-seeker”. Almost half of the articles in the corpus are hits for the keyword “refugee”. The second most frequent was “immigration”, followed by “immigrant”, “migrant”, “migration” and “asylum-seeker”. Unique labels and headings were used for three sites. On kuruc.info we also collected articles under the heading “immigrant crime”, in addition to the keywords mentioned above. On “kettősmérce.blog.hu” the column “immigrant affairs” was a great help in finding the relevant articles. On blikk.hu the label “refugee crisis” was used to get the news on this topic.
Word usage is a crucial element of media representation research. The modality of expressions can be alienating or fear-provoking. It would certainly be wrong to jump to far-reaching conclusions without context and to judge the strategies content providers used to present refugee affairs based exclusively on the keywords. Below, however, we can see the hits for our keywords broken down by site: which expressions were preferred and which were ignored.
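The duplicate-filtering step can be sketched roughly as follows. This is a minimal illustration in plain Python, not the pipeline actually used: the whitespace tokenizer, the smoothed idf formula, the `threshold` of 0.9 and the field names `url`, `published` and `text` are all assumptions made for the example, and for brevity it compares all articles with each other rather than only those within one domain.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Build one sparse tf-idf vector ({term: weight}) per tokenized text."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid division by zero for empty texts
        vectors.append({
            # smoothed idf (an assumed variant) keeps every weight positive
            term: (count / length) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_near_duplicates(articles, threshold=0.9):
    """Keep the earliest article of every group of near-duplicates.

    `articles` is a list of dicts with hypothetical keys 'url',
    'published' (ISO date string or None) and 'text'.
    """
    # Articles without a timestamp are discarded, as described in the text.
    dated = [a for a in articles if a.get("published")]
    # ISO date strings sort chronologically, so the earliest copy comes first.
    dated.sort(key=lambda a: a["published"])
    vectors = tfidf_vectors([a["text"].lower().split() for a in dated])
    kept = []  # indices of articles that survive the filter
    for i in range(len(dated)):
        if all(cosine(vectors[i], vectors[j]) < threshold for j in kept):
            kept.append(i)
    return [dated[i] for i in kept]
```

Because the articles are sorted by date before comparison, whichever copy appeared first is the one that stays, which mirrors the "earliest published" heuristic. The pairwise comparison is quadratic, so a corpus of this size would need blocking (for example by domain) or a hashing scheme such as simhash in practice.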
Hit results for keywords based on sites
To gain a deeper understanding, after the descriptive analysis of the corpus we also carried out a content analysis of the articles. In the preprocessing phase, the first step was to remove parts with incorrectly encoded characters. Then, using magyarlánc, we stemmed the words and carried out part-of-speech tagging: we classified words into their parts of speech and labelled them. To achieve more relevant results, we used a stopword list to remove the words that are most frequent simply because of natural language usage.
To present the results of our complex analysis, we created an interactive dashboard, which hopefully completes, corrects or refines our general impressions of the representation of the refugee crisis and gives an overall picture of the world the Hungarian online media shows about immigration.
We can easily get an idea of how the media reacted to immigration from the timing and number of the articles published. This is clearly visible on the dashboard created for the text corpus, which shows an increase in the number of published articles from May 2015. Most of them were published between the end of August and the middle of September 2015. From October 2015 to May 2016 articles were published at an even rate; then in July 2016, right at the end of the collection period, another rise can be seen.
Time distribution of all news
The words and expressions used in the articles can be found with the Search field. For instance, if we search for the words “refugee”, “migrant”, “migration” or “immigrant”, we get a trend fairly similar to the overall one. However, there are expressions that were not typically used during the whole period. One such instance is “immigrant-for-a-living”, which can be found by searching for “living” AND “immigrant”; another is “migrant crime” in the keywords category. If we check the timelines of these expressions, we can see that the phrase “immigrant-for-a-living” was favored roughly until the middle of 2015, mostly in the news of nepszava.hu. The tag “immigrant crime” became a pet expression on kuruc.info from early 2016.
Time distribution of articles with the phrase “immigrant-for-a-living”
Time distribution of “immigrant crime” tag
We can also find words that are connected to the topic more generally, such as “immigration”, whose time distribution peaks in several places, indicating that the phenomenon was unfolding well before it received media attention.
Time distribution of articles containing the phrase “immigration”
It is important to identify the emotions and attitudes evoked by events when analyzing the discourse of online media. Although journalists generally aim to be objective and neutral, the phrases they use often give away their mindset, not to mention articles where the author’s opinion is not disguised at all.
The two tabs of the dashboard make it possible to study the sentiments and emotions identified in the articles. In the sentiment and emotion analysis our goal was to identify the opinions, attitudes and emotions expressed in the articles. Sentiment analysis normally uses three categories (negative, neutral and positive) or gradations of them, while emotion analysis tries to detect the six basic human emotions (sadness, anger, joy, disgust, fear and surprise). We used our Precognox dictionaries to identify sentiments and emotions. The sentiment dictionaries are freely available for research purposes. Although the emotion dictionaries can still be improved and should therefore be used with care, they are appropriate for a rough analysis.
To obtain the sentiment or emotion value of an article, we divided the number of words identified by our dictionaries by the total number of words. For each article this gave a value between 0 and 1 for negative and for positive sentiment, as well as for each of the emotions of sadness, anger, joy, disgust, fear and surprise. We then combined the positive and negative values, with the negative value counted with a negative sign, so the cumulative sentiment of an article lies between -1 and 1. On the dashboard, however, the values of the articles published on a given day are summed, which is why the daily sentiment values may range from -8 to 10.
The cumulative sentiment of the news on immigration is neither positive nor negative in nature. The daily value is rather neutral, with only one or two peaks. When positive and negative sentiment values are considered separately, we can see that both are strongly represented. When they are summed, however, they cancel each other out. This means that the sentiments of the collected sources cover a wide spectrum and, with some exceptions, they are balanced.
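The scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word sets in the usage example are toy stand-ins rather than the Precognox dictionaries, the function names are hypothetical, and the cumulative value is computed as positive minus negative, which is our reading of how a range of -1 to 1 arises.

```python
from collections import defaultdict


def article_scores(tokens, positive_words, negative_words):
    """Return (positive, negative, cumulative) scores for one article.

    Each score is the share of tokens found in the respective dictionary;
    the cumulative score is positive minus negative (assumed combination).
    """
    total = len(tokens)
    if total == 0:
        return 0.0, 0.0, 0.0
    positive = sum(1 for t in tokens if t in positive_words) / total
    negative = sum(1 for t in tokens if t in negative_words) / total
    return positive, negative, positive - negative


def daily_cumulative(dated_scores):
    """Sum the cumulative scores of all articles published on the same day.

    `dated_scores` is an iterable of (date, cumulative_score) pairs.
    """
    per_day = defaultdict(float)
    for date, score in dated_scores:
        per_day[date] += score
    return dict(per_day)
```

For example, `article_scores(["important", "problem", "war", "border"], {"important"}, {"problem", "war"})` yields a positive share of 0.25, a negative share of 0.5 and a cumulative value of -0.25. Summing per day explains why the dashboard values can leave the -1 to 1 interval: a day with many mildly positive articles adds up to a large positive total.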
Time distribution of cumulative sentiments
Time distribution of negative sentiments
Time distribution of positive sentiments
Based on the emotion timelines, sadness and fear are the first emotions to stand out in the news. Since the dictionaries differ in length, however, the volume of emotions should be compared with care. By selecting a date with the Time window panel, we can read the news published on that date and find out what event triggered the increase in emotions. For instance, on 31 August 2015 both sadness and fear peaked. We can see that many articles focused on the following topics: a humanitarian catastrophe due to the refugees gathered at the Keleti station, the congestion on both public roads and railways, the negative reception of Hungary’s immigration policy, the rejection of the quota system, the number of refugees entering the country, the high alert of border control and the impossible situation of volunteers in the transit zones.
Time distribution of sadness and fear
It is also worth checking the domains to see which sentiment or emotion dominates on each online news portal. Take 444.hu, where all of them except surprise fluctuate sharply and constantly, as does the cumulative value, which swings dramatically between positive and negative.
Besides the timelines, the words belonging to given sentiments and emotions are also shown on the dashboard. Let’s look at two examples: expressions like “unpleasantness”, “problem”, “war”, “terrorist” and “illness” are typical in news where negative sentiments are dominant. In articles where the emotion “fear” is powerful, words like “concern”, “dread”, “terror” and “worry” appear most often.
Word cloud of negative sentiments
Word cloud of fear
To make the content of more than 40,000 news items more manageable, we created thematic groups sharing the same semantic features. For this we used the Latent Dirichlet Allocation (LDA) topic model of the Mallet tool. The LDA algorithm groups documents based on how words are distributed in them; naming the topics is left to the analysts. The algorithm outputs two lists: one containing the most typical words of each topic, and another showing the proportion in which the various topics are represented in each document. We obtained 47 topics altogether, which were named based on either their keywords or their most typical news items. In a topic model, each news item is assigned to every topic to some extent: one to three topics may be prominent for an item, while the others are relatively insignificant. For the sake of simplicity, each news item was assigned to its most relevant topic. Therefore, in some cases we may have the impression that only a few sentences refer to the given topic, but all in all this method gives a good model of the thematic structure of the corpus.
The dashboard created for the texts and their metadata has a separate tab for topic analysis. Here is the list of the 15 topics covering the most news items, with the number of items in parentheses:
The dashboard clearly shows which words are typical and which positive and negative expressions are favored when a certain topic is discussed. For instance, the most frequent words of the topic “The EU-Turkey Refugee Deal” are “union”, “refugee”, “state”, “world” and “role”. Among the negative words, “burden”, “nuisance”, “inconvenience” and “problem” stand out, while the positive ones are “important”, “free”, “entitled” and “respect”. By contrast, the most frequent words of the topic “Catching illegal immigrants and human traffickers” are “police officer”, “police station”, “male”, “illegal” and “Syrian”. The word “forbidden” is the most important negative one, whereas the positive expressions seem rather insignificant.
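The simplification of assigning each news item to its most relevant topic can be illustrated as follows. The sketch assumes the per-document topic proportions have already been read into a dictionary; the variable and function names are hypothetical, and nothing here calls Mallet itself.

```python
from collections import Counter


def dominant_topic(proportions):
    """Return the index of the topic with the largest share in one document."""
    return max(range(len(proportions)), key=proportions.__getitem__)


def topic_sizes(doc_topics):
    """Count how many documents have each topic as their dominant one.

    `doc_topics` maps a document id to its list of topic proportions.
    """
    return Counter(dominant_topic(p) for p in doc_topics.values())


# Toy proportions for three documents and three topics (illustrative only).
doc_topics = {
    "article-1": [0.7, 0.2, 0.1],
    "article-2": [0.1, 0.3, 0.6],
    "article-3": [0.5, 0.4, 0.1],
}
sizes = topic_sizes(doc_topics)
```

Note how "article-3" is counted only under topic 0 although topic 1 accounts for 40% of it. This is the information loss the text refers to when an article seems to touch its assigned topic in only a few sentences.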
We chose two topics out of the 47: “The criticism of the EU’s immigration policy (FIDESZ-KDNP)” and “Liberal attitude with the migrants”. These topics are the subject of a further analysis at the end of this article.
With DBpedia Spotlight we extracted the named entities from the collected articles (Named Entity Recognition) and examined three types: personal names, geographical names and institution names. We created graphs in which the nodes represent the entities and the edges show that two entities were mentioned together in one article.
The graph of personal names contains a relatively high number of nodes (2,345 entities altogether) with 13,473 edges. For the sake of clarity, here are some informative graph parameters: the average path length is 3.3; the diameter, i.e. the distance between the two farthest nodes, is 10; and the clustering coefficient, which indicates how often two nodes that are both connected to a third one are also connected to each other, is 0.75. Since the network is relatively complicated, it seemed practical to reduce its size for analysis and display in order to make the central nodes more visible. Therefore, the graph below shows only nodes with at least 12 connections, which is above the average degree in the original network. Each of them belongs to the giant component of the network, i.e. there are no isolated nodes and there is at least one path between any two entities.
In the name graph numerous relevant groups can be identified. The ones with a political character are the most dominant; they form the central core, which is the largest connected component. The entities with the highest degree can also be found here. The prominent blue cluster in the center of the graph essentially gathers the Hungarian political scene. Prime Minister Viktor Orbán has the highest degree not only here but in the entire graph. Other key figures of the Fidesz regime with a relatively high degree are Péter Szijjártó, Antal Rogán and János Lázár, alongside other past and present party leaders such as Gábor Vona and Ferenc Gyurcsány. The political elite of Western Europe also forms a well-defined block (magenta). The graph shows that politicians within the same cluster, whether they hold similar or rather different opinions on migrants, are mentioned several times in the same piece of news. Angela Merkel, with her high degree, is a good example. She is connected to politicians like François Hollande, Federica Mogherini and Martin Schulz, all of whom share her liberal views on refugee policy. Among the politicians supporting anti-migrant policy, Donald Tusk, David Cameron and Nicolas Sarkozy could be mentioned with their connections. Connections spanning the two blocks are not rare either. The green cluster represents the political elite of Russia and America as well as the central figures and terrorists of the wars in Iraq and Syria.
Close to the center we can see the group of Church-related people (shown in light grey) and the circle of Hungarian writers, poets and actors (shown in orange). Groups unrelated to politics, such as Nobel Prize-winning scientists and explorers, footballers, foreign actors and celebrities, are located further away from the core.
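Building such a co-occurrence graph from per-article entity lists can be sketched like this. The example assumes entity extraction has already happened (no DBpedia Spotlight call is made here), and the function name and input layout are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(entities_per_article):
    """Count, for every pair of entities, the articles mentioning both.

    `entities_per_article` is an iterable of entity-name lists, one per article.
    Returns a Counter mapping (name_a, name_b) with name_a < name_b to a weight.
    """
    edges = Counter()
    for entities in entities_per_article:
        # Use a set so repeated mentions within one article count only once,
        # and sort so each undirected pair has a single canonical key.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


# Toy input: three articles with the person names found in each.
articles = [
    ["Viktor Orbán", "Angela Merkel"],
    ["Angela Merkel", "Viktor Orbán", "Martin Schulz"],
    ["Martin Schulz"],
]
edges = cooccurrence_edges(articles)
```

The edge weight is the number of shared articles, which is one plausible basis for drawing thicker edges between entities that are mentioned together more often.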
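The graph parameters mentioned above can be computed with a few short functions. This is a sketch using only the Python standard library on a toy graph. It assumes the graph is connected and uses the average local clustering coefficient, which may differ from the exact variant the analysis tool reported. All function names are illustrative.

```python
from collections import deque


def build_adjacency(edges):
    """Turn an iterable of undirected (a, b) edges into {node: set(neighbours)}."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def average_clustering(adj):
    """Average, over all nodes, of the share of neighbour pairs that are linked."""
    total = 0.0
    for neighbours in adj.values():
        k = len(neighbours)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        links = sum(1 for u in neighbours for v in neighbours
                    if u < v and v in adj[u])
        total += 2 * links / (k * (k - 1))
    return total / len(adj) if adj else 0.0


def path_stats(adj):
    """Return (average shortest path length, diameter) via BFS from every node."""
    total = pairs = diameter = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for target, d in dist.items():
            if target != source:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return (total / pairs if pairs else 0.0), diameter


def filter_by_degree(adj, min_degree):
    """Keep only nodes with at least `min_degree` connections in the original graph."""
    keep = {n for n, neighbours in adj.items() if len(neighbours) >= min_degree}
    return {n: adj[n] & keep for n in keep}


# Toy graph: a triangle A-B-C with a pendant node D attached to C.
adj = build_adjacency([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
```

Running breadth-first search from every node is fine for a few thousand nodes, as in the personal-name graph, but would be slow on much larger networks.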
Connections between personal names
The institution names form a relatively smaller network with 602 nodes and 3,215 edges. Here are some of the graph’s interesting parameters: the average path length is 2.535, the diameter is 6 and the clustering coefficient is 0.74. When displaying the results, we again filtered by degree: entities with at least 10 connections, which is the average degree in the network, were put on the dashboard. The green cluster represents political parties. Fidesz is mentioned together with other parties such as Jobbik, Demokratikus Koalíció and the Ellenzéki Párt in several articles. The graph also shows the strong connection between Jobbik and the latter two. The political parties are intertwined with traditional and social media (TV and radio channels, Facebook and Twitter). A clearly highlighted thick edge is visible between M1 and the governing party. The reddish nodes indicate the German political parties, while the grey nodes refer to the Austrian ones. The light blue cluster shows mostly international organizations. The violet one looks like a “melting pot”, with MTI as its core and telecommunication companies, foreign parties and charity organizations as other members. MTI (Hungarian Telegraphic Office) is the entity with the highest degree, being connected to almost every institution on the graph. Given MTI’s profile (it is the Hungarian news agency and one of the oldest news agencies in the world), this may not be surprising.
Connections between institutions
For the sake of clarity, the nodes of the geographical-name graph are drawn at a uniform size. Altogether 28,147 geographical names and their 46,907 connections are shown. The diameter is 6 and the average path length is 2.667. Most nodes are situated in Hungary. Source countries of migration as well as target countries are also significantly represented on the graph. Hungarian settlements close to the border have the highest degree; these are the ones mentioned most frequently in the news: Bácsborsód and Zákányszék on the Serbian-Hungarian border; Csanádpalota, Mátészalka, Nyírmada and Nyírbogát on the Romanian-Hungarian border. Moving away from Hungary, Brussels has a considerably high degree, with connections spanning continents. Not surprisingly, it is mentioned together with several Hungarian settlements along the border.
Connections between geographical names
Nowadays most articles contain images as well as text, and the role of images is growing, since they make more people read the story. As for social media, a good photo is simply a must. Therefore, together with the news we also collected the images. For a reader, it is easy to decide which image goes with which piece of news, but doing the same is a challenge for a computer. We used several heuristics to tackle this problem; we assumed, for instance, that tiny images were either logos or other design elements. We took the date of first publication into account on several websites because of the visual recommendations at the end of the articles. Finally, some of the most frequent images were removed manually. Since processing images requires extensive hardware resources, it was important to remove duplicates. In the end we had 38,266 images left, appearing 62,762 times altogether in 28,456 documents.
However, it is impossible, and not even worthwhile, to go through all of them. To get any idea of what these images are about, a tool is needed. Luckily there is more than one way of processing images. We chose Clarifai, which adds tags (even in Hungarian) to the photos. Since our dataset is a special one, we could not use the results straight away. Clarifai seems to have done its internship on images of white, middle-class Western people, since photos of crowds taken in refugee camps were constantly tagged as “festival”, and tags like “rally” and “entertainment” were also over-represented. We have to live with these shortcomings, so we simply discarded certain tags (e.g. musician) and kept others (e.g. festival) with a significantly modified meaning. For instance, the festival tag in this case may refer either to a crowd, often behind a wall of law enforcement officers, or to refugees resting somewhere. Although imperfect, the tags enable us to transform visual information into textual information, and this way we can analyze the dataset.
We classified the images into eight topics using the LDA method. For embedded images it is worth consulting research on the visual representation of minorities, for example Bernáth and Messing or Wright. Representation strategies, which often aim to alienate, are well known from the literature and can also be found among the categories. A typical example is when refugees are shown as masses, their faces hardly recognizable, or as “waves of humans” flowing towards Europe. This is in sharp contrast with how politicians are shown: clearly and openly, with their names and faces. The contrast and the negative connotation are intensified by the fact that in most cases the face of a refugee becomes known only when they are wanted by the police. Very often the first photo of the person is taken during police action. Other representation strategies are also revealed by the topic model results. There are images of war zones, or of smaller groups and families with children on their way, which make us more sensitive to their fate. The following photomontages show the images most characteristic of certain topics.
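Two of the heuristics mentioned above, dropping tiny images and removing duplicates, can be sketched as follows. The size thresholds, the dictionary keys and the use of an exact SHA-256 content hash are assumptions made for illustration; the text does not say how duplicates were actually detected, and a byte-level hash will miss re-encoded or resized copies.

```python
import hashlib


def filter_images(images, min_width=100, min_height=100):
    """Drop images that are likely logos or design elements, then exact duplicates.

    `images` is an iterable of dicts with hypothetical keys
    'url', 'width', 'height' and 'data' (the raw image bytes).
    """
    seen_digests = set()
    kept = []
    for image in images:
        # Assumed heuristic: very small images are not article illustrations.
        if image["width"] < min_width or image["height"] < min_height:
            continue
        digest = hashlib.sha256(image["data"]).hexdigest()
        if digest in seen_digests:
            continue  # the same bytes were already kept once
        seen_digests.add(digest)
        kept.append(image)
    return kept
```

Keeping one copy per digest means each image is sent to the tagging service only once, while the mapping from images to the documents they appear in can be stored separately to preserve the occurrence counts.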
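The tag clean-up can be pictured as a small post-processing step on the tag list of each image. The sketch below does not call Clarifai; it assumes the tags are already available as strings, and both the dropped set and the replacement label are hypothetical examples based on the cases named in the text.

```python
# Tags judged to be noise for this dataset (illustrative, from the example in the text).
DROPPED_TAGS = {"musician"}

# Tags kept with a modified meaning; the replacement label is a hypothetical
# way of recording the reinterpretation, not something the tagging service returns.
REINTERPRETED_TAGS = {"festival": "crowd_or_resting_refugees"}


def clean_tags(tags):
    """Normalise, filter and relabel the tags of one image."""
    cleaned = []
    for tag in tags:
        tag = tag.strip().lower()
        if not tag or tag in DROPPED_TAGS:
            continue
        cleaned.append(REINTERPRETED_TAGS.get(tag, tag))
    return cleaned
```

The cleaned tag list of each image can then be treated like the word list of a short document, which is what makes text-oriented methods applicable to the image collection.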
Maps, charts and screenshots
War areas, refugee camps and temporary residence of refugees
Members of armed forces, soldiers of war and target countries
Portraits, close-ups and “wanted” photos
At the border, at the fence, on the road and on the water
Images of smaller groups: children, families and young people
Time distribution of the topics above
Migrants, refugees, immigrants: what is the media suggesting?
The extreme values for the topics “Faceless crowd” and “Images of smaller groups and families” arise partly because we are not yet able to perfectly separate the images belonging to a given article from the other images on the same page.