A first look at the data
After a few months of working on this project, we had to make a decision, as it required more time than was allocated for this fellowship. We knew that the analysis phase was where we would be able to use the most AI techniques (the core of this fellowship). However, the accuracy of any AI model is highly dependent on the quality of the data.
Therefore, we decided to invest the time of this programme in researching and creating a method to obtain quality data, and in collecting enough data to train a model to identify undisclosed partnerships in the future.
This might also be a starting point for addressing other problems and identifying other elements on the social media platform, such as fake followers, bots, or misleading, dangerous, or unhealthy claims in posts.
Although the main analysis will be done in the next phase, we have run a preliminary statistical and text analysis based on the 447 accounts collected. This helped us to understand the population of accounts better, to identify some useful fields such as is_paid_partnership, to find that Meta is classifying mummies' (and daddies') accounts as state-controlled media, to explore hashtags, and to find out that only one in seven of the posts that use advertisement hashtags places them at the beginning of the caption.
Exploring the population
This initial analysis has considered the fields Meta uses to categorise profiles and posts which could help verify whether these are commercial or not. These are:
transparencyProduct for the profiles, and
should_request_ads for posts.
It was not possible to find official explanations of these fields, nor of the range of values each of them can take. The information used in this analysis has therefore been deduced directly from the existing data and may be partial.
With some additional research, however, we found partial explanations on official Instagram pages. For instance, isBusinessAccount might refer to those individuals who actively create a business account following Instagram rules. Similarly, isVerified will follow the rules for verification for celebrities, businesses, highly searched people, or public figures.
Less than a quarter of the accounts we analysed are business accounts and very few of them are verified. The highest proportion of business accounts was in the British population, with 39 out of the 150 users (26%) and the highest proportion of verification was in the Italian population (with four out of 147 accounts).
We found that the most popular categoryName values in our population were digital creator, personal blog, and blogger. We understand that this field is how users define themselves and that it is used regardless of whether the account is a business account, but we have not been able to corroborate this. The analysis also shows that the business accounts with a businessCategoryName use the option “Creators and celebrities.” We could not find official Instagram information, but some blog posts have said there are more than a thousand options for business categories.
Nor have we found an official explanation of the fields transparencyProduct and transparencyLabel, and all the accounts have the same values for these two. transparencyLabel is always NULL, while all the accounts have state_controlled_media under transparencyProduct.
As explained in the section about how to select a list of influencers, we are analysing accounts run by mummies (and some daddies). However, Instagram is classifying all of them as “state-controlled media,” which means:
“(...) media outlets that Instagram believes may be partially or wholly under the editorial control of their government, based on our own research and assessment against a set of criteria developed for this purpose. We hold these accounts to a higher standard of transparency because we believe they combine the influence of a media organization with the backing of a state.
Instagram seeks to identify these organizations by using our definition and standards to review the available information about their ownership, governance, sources of funding, and processes that may help to ensure editorial independence.”
(For a full explanation see here).
We have also analysed almost 100,000 posts. Most of them (two-thirds) used the carousel container, with the feed being the second most popular method of sharing a post.
According to the “commerciality_status” field, all the posts analysed were considered not commercial, even though some of the accounts are set up as business accounts. We could not find any information explaining what this field means.
Similarly, there is no information on the should_request_ads field, which is FALSE in all the posts, and we could not verify whether Instagram uses it as a proxy to assess whether a post is a commercial promotion. We can assume, however, that “is_paid_partnership” refers to those posts for which the branded tool from Instagram has been used.
Very few of the posts analysed (4%) disclosed that they were the result of a commercial relationship, with the Italian population showing the highest rate of disclosure. And fewer than a fifth of the users with business accounts have posts with the “is paid partnership” label turned on.
Proportion of isBusinessAccount in the 3 groups
|Business accounts that have a business contact||2||7||18|
|Posts of non business account having is_paid_partnership=True||2395||777||53|
|Accounts with is_business_account=False but with more than 50% of the posts with is_paid_partnership=True||6||0||0(*)|
|Accounts with is_business_account=False but with more than 10% of the posts with is_paid_partnership=True||23||12||0(*)|
|Most frequent categoryName||Digital creator 60, Personal blog 25||Personal blog 42, Digital creator 23||Personal blog 24|
|Most frequent businessCategoryName||Creators & Celebrities 24||Creators & Celebrities 32||Creators & Celebrities 23|
|categoryName & isBusinessAccount=True||Personal blog|
Personal blog 6
(*) Note: The proportion of is_paid_partnership is just 0.5%
|Number of accounts with posts_count in the upper quartile||37||38||38|
|Most common categories||Digital creator 16, Public figure 4||Personal blog 13||Personal blog 3|
|accounts with is_paid_partnership=True||88||66||16|
|accounts with is_paid_partnership=True and is_business_account=True||13 (15%)||15 (23%)||5 (31%)|
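The rows above that count non-business accounts by their share of paid-partnership posts can be reproduced with a small aggregation. The following is a minimal sketch, not the code used for this analysis: the tuple layout, the sample rows, and the function names are assumptions made for illustration, and only the two thresholds (50% and 10%) come from the table.

```python
from collections import defaultdict

# Hypothetical sample rows: (username, is_business_account, is_paid_partnership).
posts = [
    ("mum_a", False, True),
    ("mum_a", False, True),
    ("mum_a", False, False),
    ("mum_b", False, False),
    ("mum_b", False, False),
    ("shop_c", True, True),
]

def paid_share_by_account(rows):
    """Return {username: (is_business_account, share of paid-partnership posts)}."""
    totals = defaultdict(int)
    paid = defaultdict(int)
    business = {}
    for username, is_business, is_paid in rows:
        totals[username] += 1
        paid[username] += int(is_paid)
        business[username] = is_business
    return {u: (business[u], paid[u] / totals[u]) for u in totals}

def non_business_above(rows, threshold):
    """Non-business accounts whose paid-partnership share exceeds the threshold."""
    shares = paid_share_by_account(rows)
    return sorted(
        u for u, (is_biz, share) in shares.items()
        if not is_biz and share > threshold
    )

print(non_business_above(posts, 0.5))  # accounts above 50%
print(non_business_above(posts, 0.1))  # accounts above 10%
```

Computing the share per account first, and applying the threshold afterwards, keeps the two table rows consistent with each other, since both are read from the same intermediate result.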
For the preliminary text analysis, we have looked at the use of the word mother (or related forms such as mum, mummy, mom, motherhood, mama…) in the biography and username fields, and at the percentage of our full set of accounts that use them.
In this type of analysis we had to take into account some crucial factors. First of all, there is linguistic diversity not only among the 3 groups but also within each group and even within the same text, which can contain sentences in one language with words and hashtags in another.
|PROFILE BIOGRAPHY||LANGUAGE||SOLE24ORE||SKY NEWS||INFOBAE|
(*) 1 empty field
We have, therefore, further refined the language recognition procedure started during the data cleaning phase, obtaining good results (only 4 biographies required manual language attribution).
We also considered that emojis can express the same meaning as words in a different way. In order not to lose this information, we use the anyascii library to replace Unicode characters with the equivalent English strings. This may have somewhat increased the number of profiles classified as English compared to a text-only examination, but it has not affected the analysis in our case, since the goal is to search for keywords in all the languages in which the posts are written.
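To illustrate the idea of turning emojis into searchable English words, here is a minimal sketch. It does not use anyascii: it swaps in Python's standard unicodedata module as a stand-in, so its output will differ from what anyascii produces, and unlike anyascii it leaves accented letters untouched. The function name and the sample strings are hypothetical.

```python
import unicodedata

def emoji_to_words(text):
    """Replace non-ASCII symbols of category 'So' (which covers most emoji)
    with their lowercase Unicode names, leaving letters as they are."""
    out = []
    for ch in text:
        if ord(ch) > 127 and unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            # Pad with spaces so the name becomes a separate word.
            out.append(f" {name.lower()} " if name else " ")
        else:
            out.append(ch)
    # Collapse the extra whitespace introduced by the padding.
    return " ".join("".join(out).split())

print(emoji_to_words("Mum of 2 👶"))
print(emoji_to_words("mamá 👪"))
```

Once the symbol has been replaced by a word, a keyword search over the biography can match it in the same way as any other English token.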
We also had to verify that the list of motherhood-related words in the search keys was reasonably complete in the different languages, and that the procedure can also be adapted to new languages that may be added in the future.
Therefore, starting from a list of keywords compiled manually in the native language, we used the nltk library to evaluate further possible synonyms to add to our search and the deep_translator library to apply it to different languages.
|PROFILE BIOGRAPHY||SOLE24ORE||SKY NEWS||INFOBAE|
|keywords||babbo, dad, famiglia, familia, family, father, genitore, genitori, madre, madres, madri, mami, mamma, mamme, mammina, mamá, mamás, maternidad, maternita, maternity, maternità, mom, mommy, moms, mother, mothers, padre, padres, papa, papas, papi, papà, papá, parent, parents, paternidad, paternita, paternity||dad, dada, dadda, daddy, dads, father, fathers, mama, mammy, mom, momma, mommy, moms, mother, motherhood, mothers, mum, mummy, mums, pappa, parent, parenthood, parenting, parents||dad, dadda, daddy, father, madre, mama, mamá, maternidad, maternity, mom, mother, motherhood, mum, mummy, padre, papa, papá, parent, parenthood, parenting, paternidad, paternity|
|Accounts containing keywords||72||147||42|
|% of total accounts||49.0||98.0||28.0|
Linguistic diversity among the 3 groups
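The keyword search and the percentage row in the table above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure actually used: the keyword set is a shortened sample of the lists in the table, whole-token matching is an assumption (the source does not say whether substrings also count), and the account records and function names are hypothetical.

```python
import re
import unicodedata

def _nfc(text):
    """Normalise to NFC so that accented letters compare equal."""
    return unicodedata.normalize("NFC", text)

# Shortened, illustrative keyword set; the real lists are per group and longer.
KEYWORDS = {_nfc(k) for k in (
    "mum", "mummy", "mom", "mother", "motherhood",
    "mamma", "mamá", "madre", "dad", "papà",
)}

# Runs of letters only, so "mummy_of_2" splits into "mummy" and "of".
TOKEN = re.compile(r"[^\W\d_]+")

def contains_keyword(text, keywords=KEYWORDS):
    """True if any whole token of the text is one of the keywords."""
    tokens = TOKEN.findall(_nfc(text or "").lower())
    return any(tok in keywords for tok in tokens)

def share_with_keywords(accounts):
    """Percentage of accounts with a keyword in the biography or username."""
    if not accounts:
        return 0.0
    hits = sum(
        1 for acc in accounts
        if contains_keyword(acc.get("biography"))
        or contains_keyword(acc.get("username"))
    )
    return round(100 * hits / len(accounts), 1)

# Hypothetical accounts, invented for illustration.
accounts = [
    {"username": "mummy_of_2", "biography": "Tea, chaos and cuddles"},
    {"username": "sofia.travels", "biography": "Mamá de dos"},
    {"username": "the_baker", "biography": "Sourdough and cakes"},
    {"username": "lucia", "biography": "Mamma di Leo"},
]
print(share_with_keywords(accounts))
```

The same arithmetic gives the figures in the table: 72 matching accounts out of 147 rounds to 49.0%.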
We have also started to analyse the captions accompanying the images.
The cleaning procedure has been refined in this case too, but the variability of characters and hashtags is greater than in biographies, so not all language recognition problems have been resolved yet.
The hashtags had already been imported into a separate field, identified by their characteristic # symbol, so language detection was not involved in this first phase.
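Extracting hashtags by their # symbol, and counting the most frequent ones, can be sketched with a regular expression. This is a minimal sketch, assuming a hashtag is a # followed by word characters; the pattern, the lowercasing step, the function names, and the sample captions are illustrative assumptions rather than the rules used for the import.

```python
import re
from collections import Counter

# A '#' followed by one or more word characters (letters, digits, underscore).
HASHTAG = re.compile(r"#(\w+)")

def extract_hashtags(caption):
    """Return the hashtags in a caption, lowercased and without the '#'."""
    return [tag.lower() for tag in HASHTAG.findall(caption or "")]

def hashtag_counts(captions):
    """Count hashtag occurrences across a collection of captions."""
    counts = Counter()
    for caption in captions:
        counts.update(extract_hashtags(caption))
    return counts

# Hypothetical captions, invented for illustration.
captions = [
    "Domenica insieme #MammaeFiglia #love",
    "New week, new chaos #mumsofinstagram #love",
    "No tags here",
]
counts = hashtag_counts(captions)
print(counts.most_common(3))
```

Because the extraction only looks for the # symbol, it works the same way whatever language the rest of the caption is written in.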
We have also analysed the hashtags, in particular those linked to sponsorship and the position where they appear within the caption, given that any sponsorship must be declared at the beginning.
Although the analysis of the advertisement hashtags has not been exhaustive and needs further revision, this preliminary analysis shows that one in six of the posts use these hashtags (#ads and its variations). However, the hashtags appear at the beginning of the post, as recommended, in only one in seven of the posts that contain any of them.
Most frequent hashtags in the 3 groups
|20 most frequent hashtags||love: 2876, vitadamamma: 1725, mammeitaliane: 1480, photooftheday: 1040, mammaefiglia: 1015||mumsofinstagram: 3526, discoverunder10k: 1992, mummybloggeruk: 1987||love: 2917, instagood: 1387, weddingphotography: 1211|
|Caption containing any adv hashtag||False||30183||31505||N/A|
|Caption containing any adv hashtag at the beginning (*)||319||372||N/A|
(*) Within the first 40 characters
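The position check behind the last two rows can be sketched as follows. This is a hedged illustration only: the set of advertisement hashtags is an assumption (the source only mentions #ads and its variations), "at the beginning" is read as the hashtag starting within the first 40 characters of the caption, and the function names are hypothetical.

```python
import re

# Illustrative assumption: the real list of advertisement hashtags is not given.
AD_TAGS = {"ad", "ads", "adv", "advertisement", "sponsored"}
HASHTAG = re.compile(r"#(\w+)")

def ad_hashtag_positions(caption):
    """Character offsets at which an advertisement hashtag starts."""
    return [
        m.start() for m in HASHTAG.finditer(caption or "")
        if m.group(1).lower() in AD_TAGS
    ]

def classify(caption, window=40):
    """Return 'none', 'beginning' or 'later' for a caption."""
    positions = ad_hashtag_positions(caption)
    if not positions:
        return "none"
    return "beginning" if positions[0] < window else "later"

print(classify("#ad Loving this new pram!"))
print(classify("x" * 50 + " #ad"))
print(classify("Our big #adventure starts today"))
```

Matching the whole hashtag against the set, rather than searching for the substring "ad", avoids counting tags such as #adventure as advertisement disclosures.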