Analysis
A first look at the data
After a few months of working on this project, we had to make a decision as it required more time than that allocated for this fellowship. We were aware that the phase in which we would be able to use more AI techniques (the core of this fellowship) was during the analysis. However, the accuracy of any AI model is highly dependent on the quality of the data.
Therefore, we decided to invest the time of this programme in researching and creating a method to obtain quality data and in collecting a substantial amount of data that would allow us to train a model to identify undisclosed partnerships in the future.
This might also be a starting point to address other problems and identify other elements on the social media platform, such as fake followers, bots, or misleading, dangerous, or unhealthy claims in posts.
Although the main analysis will be done in the next phase, we have run a preliminary statistical and text analysis of the 447 accounts collected. This helped us to understand the population of accounts better, to identify some useful fields such as is_paid_partnership, to find that Meta classifies the accounts of mummies (and daddies) as state-controlled media, to explore hashtags, and to find that only one in seven of the posts that use advertisement hashtags places them at the beginning of the caption.
Exploring the population
This initial analysis has considered the fields Meta uses to categorise profiles and posts that could help verify whether these are commercial or not. These are isBusinessAccount, categoryName, businessCategoryName, isVerified, transparencyLabel and transparencyProduct for the profiles, and is_paid_partnership, commerciality_status and should_request_ads for the posts.
It was not possible to find official explanations of these fields, nor of the range of values each of them can take. The information used in this analysis has therefore been deduced directly from the existing data and may be partial.
With some extra research, however, we could find partial explanations on official Instagram pages. For instance, isBusinessAccount might refer to those individuals who actively create a business account following Instagram rules. Similarly, isVerified would follow the rules for the verification of celebrities, businesses, highly searched people, or public figures.
Fewer than a quarter of the accounts we analysed are business accounts, and very few of them are verified. The highest proportion of business accounts was in the British population, with 39 out of the 150 users (26%), and the highest proportion of verified accounts was in the Italian population (four out of 147 accounts, or 2.7%).
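The percentages quoted here follow directly from the counts in the “Influencer profiles” table. As a minimal sketch of that arithmetic (the dictionary layout and the helper name `share` are our own illustrative assumptions, not part of the project's code):

```python
# Counts copied from the "Influencer profiles" table. The layout of this
# dictionary is an illustrative assumption, not the project's data model.
GROUPS = {
    "SOLE24ORE": {"accounts": 147, "business": 27, "verified": 4},
    "SKY NEWS": {"accounts": 150, "business": 39, "verified": 1},
    "INFOBAE": {"accounts": 150, "business": 33, "verified": 1},
}


def share(part, total):
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100 * part / total, 1)


business_share = {name: share(g["business"], g["accounts"]) for name, g in GROUPS.items()}
verified_share = {name: share(g["verified"], g["accounts"]) for name, g in GROUPS.items()}

# Pooled figures across the three groups.
total_accounts = sum(g["accounts"] for g in GROUPS.values())
total_business = sum(g["business"] for g in GROUPS.values())
overall_business_share = share(total_business, total_accounts)

print(business_share)
print(verified_share)
print(total_accounts, overall_business_share)
```

Pooled across the three groups, 99 of the 447 accounts are business accounts, which is consistent with the statement that the proportion stays under a quarter.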
We found that the most popular categoryName values in our population were digital creator, personal blog, and blogger. We understand that this field is how users describe themselves and that it is used whether or not the account is a business account, but we have not been able to corroborate this. The analysis also shows that the business accounts with a businessCategoryName mostly use the option “Creators & Celebrities.” We could not find official Instagram information, but some blog posts have said there are more than a thousand options for business categories.
No official explanation has been found for the fields transparencyProduct and transparencyLabel either, and all the accounts have the same values for these two: transparencyLabel is always NULL, while transparencyProduct is always state_controlled_media.
As explained in the section about how to select a list of influencers, we are analysing accounts run by mummies (and some daddies). However, Instagram is classifying all of them as “state-controlled media,” which means:
“(...) media outlets that Instagram believes may be partially or wholly under the editorial control of their government, based on our own research and assessment against a set of criteria developed for this purpose. We hold these accounts to a higher standard of transparency because we believe they combine the influence of a media organization with the backing of a state.
Instagram seeks to identify these organizations by using our definition and standards to review the available information about their ownership, governance, sources of funding, and processes that may help to ensure editorial independence.”
(For a full explanation see here).
We have also analysed almost 100,000 posts. Most of them (about two-thirds) used the carousel container, with the feed being the second most popular way of sharing a post.
According to the “commerciality_status” field, all the posts analysed were considered not commercial, even though some of the accounts are set up as business accounts. We could not find any information explaining what this field means.
Similarly, there is no information on the should_request_ads field, which is FALSE in all the posts, and we could not verify whether Instagram uses it as a proxy to assess if a post is a commercial promotion. We can assume, however, that “is_paid_partnership” refers to those posts for which the branded tool from Instagram has been used.
Very few of the posts analysed (4%) disclosed that they were the result of a commercial relationship, with the Italian population showing the highest rate of disclosure. And fewer than a fifth of the accounts that have at least one post with the “is paid partnership” label turned on (33 out of 170) are business accounts.
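The per-account figures in the tables that follow (for example, non-business accounts with more than 10% or 50% of their posts labelled as paid partnerships) can be derived by grouping posts by account. The sketch below shows one way to do that. It assumes each post record carries a hypothetical `owner` key linking it to a profile, and the toy records are invented for illustration, so this is not the project's actual pipeline.

```python
from collections import defaultdict


def paid_partnership_share(posts):
    """Map each account to the fraction of its posts with is_paid_partnership=True."""
    totals = defaultdict(int)
    labelled = defaultdict(int)
    for post in posts:
        totals[post["owner"]] += 1
        if post["is_paid_partnership"]:
            labelled[post["owner"]] += 1
    return {owner: labelled[owner] / totals[owner] for owner in totals}


def non_business_above(posts, profiles, threshold):
    """List non-business accounts whose labelled share exceeds the threshold."""
    shares = paid_partnership_share(posts)
    return sorted(
        owner
        for owner, value in shares.items()
        if not profiles[owner]["isBusinessAccount"] and value > threshold
    )


# Toy records, invented for illustration only ("owner" is a hypothetical key).
profiles = {
    "account_a": {"isBusinessAccount": False},
    "account_b": {"isBusinessAccount": True},
    "account_c": {"isBusinessAccount": False},
}
posts = (
    [{"owner": "account_a", "is_paid_partnership": True}] * 3
    + [{"owner": "account_a", "is_paid_partnership": False}] * 1
    + [{"owner": "account_b", "is_paid_partnership": True}] * 2
    + [{"owner": "account_c", "is_paid_partnership": False}] * 5
)

print(paid_partnership_share(posts))
print(non_business_above(posts, profiles, 0.5))
```

Accounts returned by `non_business_above` are the ones that use the paid partnership label heavily without being set up as business accounts.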
Proportion of isBusinessAccount in the 3 groups
Influencer profiles
PROFILE | SOLE24ORE | SKY NEWS | INFOBAE
Total accounts | 147 | 150 | 150
isBusinessAccount = False | 120 | 111 | 117
isBusinessAccount = True | 27 | 39 | 33
isBusinessAccount = False (%) | 81.6 | 74.0 | 78.0
isBusinessAccount = True (%) | 18.4 | 26.0 | 22.0
Business accounts that have a business contact | 2 | 7 | 18
Posts of non-business accounts having is_paid_partnership=True | 2395 | 777 | 53
Accounts with is_business_account=False but with more than 50% of the posts with is_paid_partnership=True | 6 | 0 | 0(*)
Accounts with is_business_account=False but with more than 10% of the posts with is_paid_partnership=True | 23 | 12 | 0(*)
Most frequent categoryName | Digital creator 60, Personal blog 25, Blogger 23 | Blogger 56, Personal blog 42, Digital creator 23 | Personal blog 24, Photographer 12, Blogger 11
Most frequent businessCategoryName | Creators & Celebrities 24 | Creators & Celebrities 32 | Creators & Celebrities 23
categoryName & isBusinessAccount=True | Personal blog 11 | Personal blog 17 | Photographer 9, Personal blog 6
isVerified = False | 143 | 149 | 149
isVerified = True | 4 | 1 | 1
isVerified = False (%) | 97.3 | 99.3 | 99.3
isVerified = True (%) | 2.7 | 0.7 | 0.7
transparencyProduct = STATE_CONTROLLED_MEDIA | 147 | 150 | 150
transparencyLabel = null | 147 | 150 | 150
(*) Note: The proportion of is_paid_partnership is just 0.5%
Influencer posts
POSTS | SOLE24ORE | SKY NEWS | INFOBAE
Total posts | 32713 | 33858 | 33223
postCounts: mean | 931.8 | 1302.0 | 770.2
postCounts: std | 820.2 | 945.3 | 575.5
postCounts: min | 158.0 | 201.0 | 191.0
postCounts: 25% | 378.5 | 241.5 | 296.0
postCounts: 50% | 696.0 | 1343.5 | 676.5
postCounts: 75% | 1067.5 | 1847.7 | 980.7
postCounts: max | 3921.0 | 3991.0 | 2902.0
Number of accounts with posts_count in the upper quartile | 37 | 38 | 38
is_business_account=True | 7 | 7 | 5
most common categories | Digital creator 16, Public figure 4 | Personal blog 13, Blogger 12 | Personal blog 3, Artist 3
commerciality_status = not_commercial | 32713 | 33858 | 33223
should_request_ads = False | 32713 | 33858 | 33223
is_paid_partnership = False | 30170 | 32846 | 33073
is_paid_partnership = True | 2543 | 1012 | 150
is_paid_partnership = False (%) | 92.2 | 97.0 | 99.5
is_paid_partnership = True (%) | 7.8 | 3.0 | 0.5
accounts with is_paid_partnership=True | 88 | 66 | 16
accounts with is_paid_partnership=True and is_business_account=True | 13 (15%) | 15 (23%) | 5 (31%)
isVideo = False | 27810 | 30111 | 29141
isVideo = True | 4903 | 3747 | 4082
isVideo = False (%) | 85.0 | 88.9 | 87.7
isVideo = True (%) | 15.0 | 11.1 | 12.3
product_type = carousel_container | 22717 | 23147 | 22214
product_type = feed | 5995 | 7863 | 8067
product_type = clips | 3863 | 2597 | 2723
product_type = igtv | 138 | 251 | 219
product_type = carousel_container (%) | 69.4 | 68.4 | 66.9
product_type = feed (%) | 18.3 | 23.2 | 24.3
product_type = clips (%) | 11.8 | 7.7 | 8.2
product_type = igtv (%) | 0.4 | 0.7 | 0.7
Profiles analysis
For the preliminary text analysis, we looked at the use of the word mother (or any related form, such as mum, mummy, mom, motherhood, mama…) in the biography and username fields, and at the percentage of our whole set of accounts that use it.
In this type of analysis we had to take some crucial factors into account. First of all, there is linguistic diversity not only among the 3 groups but also within each group and even within the same text, which can contain sentences in one language with words and hashtags in another.
PROFILE BIOGRAPHY LANGUAGE | SOLE24ORE | SKY NEWS | INFOBAE
English | 70 | 149 | 69
Italian | 64 | |
Spanish | 12 | 1 | 81
Other | 1(*) | |
(*) 1 empty field
We have, therefore, further refined the language recognition procedure started during the data cleaning phase, obtaining good results (only 4 biographies required manual language attribution).
We also considered that emojis can express the same meaning as words. In order not to lose this information, we used the anyascii library to replace Unicode characters with equivalent English strings. This may have somewhat increased the number of profiles classified as English compared to a text-only analysis, but it has not affected the analysis in our case, since the goal is to search for keywords in all the languages in which the posts are written.
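To illustrate the idea of turning emojis into searchable text, here is a minimal sketch that uses only Python's standard `unicodedata` module as a stand-in. It is not the anyascii library, its output format (`:name:` tokens) is our own assumption, and unlike a full transliteration it leaves accented letters untouched.

```python
import unicodedata


def emoji_to_text(text):
    """Replace symbol characters (Unicode category "So") with :lowercase_name: tokens."""
    pieces = []
    for char in text:
        name = unicodedata.name(char, "")
        if unicodedata.category(char) == "So" and name:
            # e.g. the character named "BABY" becomes ":baby:"
            pieces.append(":" + name.lower().replace(" ", "_") + ":")
        else:
            pieces.append(char)
    return "".join(pieces)


# "\U0001F476" is the baby emoji, written as an escape to keep the source ASCII-safe.
print(emoji_to_text("Mum of two \U0001F476"))
```

Once emojis are expressed as words, the same keyword search can be applied to them as to the rest of the biography.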
We also had to check that the list of motherhood-related search keywords was reasonably complete in each language, and that the procedure can be adapted to new languages that may be added in the future.
Therefore, starting from a list of keywords compiled manually in the native language, we used the nltk library to find further possible synonyms to add to our search and the deep_translator library to apply the list to the other languages.
PROFILE BIOGRAPHY | SOLE24ORE | SKY NEWS | INFOBAE |
keywords | babbo, dad, famiglia, familia, family, father, genitore, genitori, madre, madres, madri, mami, mamma, mamme, mammina, mamá, mamás, maternidad, maternita, maternity, maternità, mom, mommy, moms, mother, mothers, padre, padres, papa, papas, papi, papà, papá, parent, parents, paternidad, paternita, paternity | dad, dada, dadda, daddy, dads, father, fathers, mama, mammy, mom, momma, mommy, moms, mother, motherhood, mothers, mum, mummy, mums, pappa, parent, parenthood, parenting, parents | dad, dadda, daddy, father, madre, mama, mamá, maternidad, maternity, mom, mother, motherhood, mum, mummy, padre, papa, papá, parent, parenthood, parenting, paternidad, paternity |
Accounts containing keywords | 72 | 147 | 42 |
% of total accounts | 49.0 | 98.0 | 28.0 |
Linguistic diversity among the 3 groups
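The keyword search over biographies and usernames described in this section can be sketched as follows. This is a simplified illustration under our own assumptions: it matches whole tokens only, uses a short excerpt of the keyword lists from the table, and runs on invented accounts. The matching rule actually used in the project may differ.

```python
import re
import unicodedata

# A short excerpt of the keywords listed in the table; the full lists are longer.
KEYWORDS = {"mamma", "madre", "mamá", "mom", "mum", "mummy", "mother", "motherhood"}


def tokens(text):
    """Lower-case the text and split it into runs of letters (accents are kept)."""
    normalised = unicodedata.normalize("NFC", text).lower()
    return re.findall(r"[^\W\d_]+", normalised)


def mentions_keyword(biography, username, keywords=KEYWORDS):
    """True if any token of the biography or username is a keyword."""
    return any(token in keywords for token in tokens(biography) + tokens(username))


def keyword_share(accounts, keywords=KEYWORDS):
    """Return (number of matching accounts, percentage of all accounts)."""
    hits = sum(mentions_keyword(a["biography"], a["username"], keywords) for a in accounts)
    return hits, round(100 * hits / len(accounts), 1)


# Invented accounts, for illustration only.
accounts = [
    {"username": "the_mum_diaries", "biography": "Coffee and chaos"},
    {"username": "luca_e_sofia", "biography": "Mamma di due, Milano"},
    {"username": "travel.snaps", "biography": "Photographer"},
    {"username": "maria_g", "biography": "Mamá de 3"},
]

print(keyword_share(accounts))
```

Whole-token matching avoids false positives such as "mom" inside "moment", at the cost of missing run-together usernames such as "mummyblogger". Substring matching would make the opposite trade-off.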
Posts analysis
We also started to analyse the captions accompanying the images.
The cleaning procedure has been refined in this case too, but the variability of characters and hashtags is greater than in biographies, so not all the language recognition issues have been resolved yet.
The hashtags had already been imported into a separate field by identifying them through their characteristic symbol #, so language played no part in this first phase.
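Extracting hashtags by their # symbol, and counting the most frequent ones, can be sketched with a regular expression and a counter. Lower-casing the tags and the function names below are our own assumptions, and the captions are invented for illustration.

```python
import re
from collections import Counter

# A hashtag is taken to be "#" followed by one or more word characters.
HASHTAG = re.compile(r"#(\w+)")


def extract_hashtags(caption):
    """Return the hashtags in a caption, lower-cased and without the # symbol."""
    return [tag.lower() for tag in HASHTAG.findall(caption)]


def most_frequent(captions, n=20):
    """Return the n most frequent hashtags across all captions with their counts."""
    counts = Counter(tag for caption in captions for tag in extract_hashtags(caption))
    return counts.most_common(n)


# Invented captions, for illustration only.
captions = [
    "Sunday walk #MumLife #love",
    "New shoes #gifted #mumlife",
    "No tags here",
]

print(extract_hashtags(captions[0]))
print(most_frequent(captions, 3))
```

Because the pattern only relies on the # symbol, it works the same way whatever language the caption is written in.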
We have also analysed the hashtags, in particular those linked to sponsorship and the position where they appear within the caption, given that any sponsorship must be declared at the beginning: #ad, #advertisement, #gifted, #collaborazione, #advert, #advertising, #sponsored.
Although the analysis of the advertisement hashtags has not been exhaustive and needs further revision, this preliminary analysis shows that about 7% of the posts in the two groups examined so far (4,883 out of 66,571) use these hashtags (#ad and its variations). However, when they are used, they appear at the beginning of the caption, as recommended, in only one in seven of the posts that contain any of them.
Most frequent hashtags in the 3 groups
POSTS | SOLE24ORE | SKY NEWS | INFOBAE |
Total | 32713 | 33858 | 33223 |
Unique hashtags | 40765 | 41512 | 24480 |
20 most frequent hashtags | love: 2876 family: 2676 momlife: 2215 baby: 2163 vitadamamma: 1725 mammeitaliane: 1480 babygirl: 1465 mom: 1273 instagood: 1231 ootd: 1222 summer: 1209 kids: 1202 kidsfashion: 1197 babyboy: 1168 mamma: 1157 picoftheday: 1155 babymodel: 1122 famiglia: 1041 photooftheday: 1040 mammaefiglia: 1015 | mumsofinstagram: 3526 mumlife: 3375 love: 2829 mummyblogger: 2002 discoverunder10k: 1992 mummybloggeruk: 1987 gifted: 1972 toddlerlife: 1916 babygirl: 1868 family: 1864 kidsofinstagram: 1742 instagood: 1643 momlife: 1515 kidsfashion: 1453 babiesofinstagram: 1433 ad: 1364 mumssupportingmums: 1328 motherhood: 1285 toddler: 1273 baby: 1263 | love: 2917 momlife: 2046 photooftheday: 1763 family: 1621 mexico: 1620 instagood: 1387 weddingphotography: 1211 maternidad: 1200 familia: 1169 photography: 1136 baby: 1078 girl: 1021 babygirl: 954 photo: 857 babyboy: 827 happy: 806 picoftheday: 794 mom: 765 tuxpanveracruz: 749 moments: 741 |
Caption containing any adv hashtag = False | 30183 | 31505 | N/A
Caption containing any adv hashtag = True | 2530 | 2353 | N/A
Caption containing any adv hashtag at the beginning (*) | 319 | 372 | N/A
(*) By the first 40 characters
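The check on the position of advertisement hashtags can be sketched as follows. The hashtag list is the one given in the text and the 40-character window comes from the note above. Treating "at the beginning" as "the hashtag starts within the window" is our own assumption, as are the function names and the invented captions.

```python
import re

# Advertisement hashtags listed in the text, without the # symbol.
AD_HASHTAGS = {
    "ad", "advertisement", "gifted", "collaborazione",
    "advert", "advertising", "sponsored",
}
HASHTAG = re.compile(r"#(\w+)")


def ad_hashtag_offsets(caption):
    """Return the character offsets at which an advertisement hashtag starts."""
    return [
        match.start()
        for match in HASHTAG.finditer(caption)
        if match.group(1).lower() in AD_HASHTAGS
    ]


def classify(caption, window=40):
    """Return (has_ad_hashtag, ad_hashtag_at_beginning) for one caption."""
    offsets = ad_hashtag_offsets(caption)
    return bool(offsets), any(offset < window for offset in offsets)


# Invented captions, for illustration only.
early = "#ad Our new pram has arrived!"
late = "We finally tried the pram everyone keeps talking about, and we love it #gifted"
none = "Such an #adorable afternoon"

for caption in (early, late, none):
    print(classify(caption))
```

With the counts in the table, (319 + 372) / (2530 + 2353) is roughly 14%, which is where the "one in seven" figure comes from.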