A first look at the data

After a few months of work on this project, it became clear that it needed more time than the fellowship allowed, so we had to make a decision. We knew that the analysis phase was where we could make most use of AI techniques (the core of this fellowship). However, the accuracy of any AI model depends heavily on the quality of its data.

Therefore, we decided to spend the time of this programme researching and building a method to obtain quality data, and collecting enough data to train a model to identify undisclosed partnerships in the future.

This might also be a starting point for addressing other problems and identifying other elements on the social media platform, such as fake followers, bots, or misleading, dangerous, or unhealthy claims in posts.

Although the main analysis will be done in the next phase, we have run a preliminary statistical and text analysis of the 447 accounts collected. This helped us to understand the population of accounts better, to identify useful fields such as is_paid_partnership, to find that Meta classifies the accounts of mummies (and daddies) as state-controlled media, to explore hashtags, and to find that only one in seven of the posts that use advertisement hashtags places them at the beginning of the caption.

    Exploring the population

    This initial analysis considers the fields Meta uses to categorise profiles and posts, which could help verify whether these are commercial or not. For profiles, these are isBusinessAccount, categoryName, businessCategoryName, isVerified, transparencyLabel and transparencyProduct. For posts, they are is_paid_partnership, commerciality_status and should_request_ads.

    We could not find official explanations for most of these fields, nor the range of values each of them can take. The information used in this analysis has therefore been deduced directly from the existing data and may be partial.

    With further research, however, we found some explanations on official Instagram pages. For instance, isBusinessAccount might refer to individuals who actively create a business account following Instagram's rules. Similarly, isVerified appears to follow the verification rules for celebrities, businesses, highly searched people, or public figures.

    Fewer than a quarter of the accounts we analysed are business accounts, and very few of them are verified. The highest proportion of business accounts was in the British population, with 39 out of the 150 users (26%), and the highest proportion of verified accounts was in the Italian population (four out of 147 accounts).
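The per-group proportions quoted above can be tallied with a few lines of standard-library Python. This is only a minimal sketch: the record layout and the `group` key are assumptions made for illustration, and the sample records are toy data, not the project's dataset.

```python
from collections import defaultdict


def share_true(records, group_key, flag_key):
    """Return {group: percentage of records whose flag is True}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        if rec.get(flag_key) is True:
            hits[group] += 1
    return {g: round(100 * hits[g] / totals[g], 1) for g in totals}


# Toy records for illustration only (hypothetical layout)
profiles = [
    {"group": "A", "isBusinessAccount": True},
    {"group": "A", "isBusinessAccount": False},
    {"group": "A", "isBusinessAccount": False},
    {"group": "A", "isBusinessAccount": False},
    {"group": "B", "isBusinessAccount": True},
    {"group": "B", "isBusinessAccount": False},
]

print(share_true(profiles, "group", "isBusinessAccount"))
```

The same helper could be pointed at isVerified for profiles or at is_paid_partnership for posts by changing `flag_key`.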

    The most popular categoryName values in our population were digital creator, personal blog, and blogger. We understand that this field is how users define themselves and that it is used whether or not the account is a business account, but we have not been able to corroborate this. The analysis also shows that the business accounts with a businessCategoryName use the option “Creators and celebrities.” We could not find official Instagram information, but some blog posts say there are more than a thousand options for business categories.

    We found no official explanation of the transparencyProduct and transparencyLabel fields either, and all the accounts have the same values for these two. transparencyLabel is NULL for every account, while all the accounts have state_controlled_media as their transparencyProduct.

    As explained in the section about how to select a list of influencers, we are analysing accounts run by mummies (and some daddies). However, Instagram is classifying all of them as “state-controlled media,” which means:

    “(...) media outlets that Instagram believes may be partially or wholly under the editorial control of their government, based on our own research and assessment against a set of criteria developed for this purpose. We hold these accounts to a higher standard of transparency because we believe they combine the influence of a media organization with the backing of a state.

    Instagram seeks to identify these organizations by using our definition and standards to review the available information about their ownership, governance, sources of funding, and processes that may help to ensure editorial independence.”

    (For a full explanation see here).

    We have also analysed almost 100,000 posts. About two-thirds of them used the carousel container, with the feed being the second most popular way to share a post.

    According to the commerciality_status field, all the posts analysed were considered not commercial, even though some of the accounts are set up as business accounts. We could not find any information explaining what this field means.

    Similarly, there is no information on the should_request_ad field, which is FALSE for all the posts, and we could not verify whether Instagram uses it as a proxy to assess if a post is a commercial promotion. We can assume, however, that is_paid_partnership refers to posts for which Instagram's branded tool has been used.

    Very few of the posts analysed (4%) disclosed that they were the result of a commercial relationship, with the Italian population showing the highest rate of disclosure. And fewer than a fifth of the users with business accounts have posts with the “is paid partnership” label turned on.

    Proportion of isBusinessAccount in the 3 groups

    Influencer profiles (columns: Italian group | British group | third group)

    Total accounts: 147 | 150 | 150
    isBusinessAccount False %: 81.6 | 74.0 | 78.0
    isBusinessAccount True %: 18.4 | 26.0 | 22.0
    Business accounts that have a business contact: 2718
    Posts of non-business accounts with is_paid_partnership=True: 239577753
    Accounts with is_business_account=False but with more than 50% of the posts with is_paid_partnership=True: 6 | 0 | 0 (*)
    Accounts with is_business_account=False but with more than 10% of the posts with is_paid_partnership=True: 23 | 12 | 0 (*)
    Most frequent categoryName: Digital creator 60, Personal blog 25, Blogger 23 | Blogger 56, Personal blog 42, Digital creator 23 | Personal blog 24, Photographer 12, Blogger 11
    Most frequent businessCategoryName: Creators & Celebrities 24 | Creators & Celebrities 32 | Creators & Celebrities 23
    categoryName & isBusinessAccount=True: Personal blog; Personal blog; Photographer 9; Personal blog 6
    False %: 97.3 | 99.3 | 99.3
    True %

    (*) Note: The proportion of is_paid_partnership is just 0.5%

    Influencer posts (columns: Italian group | British group | third group)

    Total posts: 32713 | 33858 | 33223
    Number of accounts with posts_count in the upper quartile: 37 | 38 | 38
      Most common categories: Digital creator 16, Public figure 4 | Personal blog 13, Blogger 12 | Personal blog 3, Artist 3
    False %
    True %
    Accounts with is_paid_partnership=True: 88 | 66 | 16
    Accounts with is_paid_partnership=True and is_business_account=True: 13 (15%) | 15 (23%) | 5 (31%)
    False %: 85.0 | 88.9 | 87.7
    True %
    carousel_container %: 69.4 | 68.4 | 66.9
    feed %: 18.3 | 23.2 | 24.3
    clips %
    igtv %

      Profiles analysis

      For the preliminary text analysis, we looked at the use of the word mother (or any of its forms, like mum, mummy, mom, motherhood, mama…) in the biography and username fields, and at the percentage of our full set of accounts that contain it.
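A keyword search of this kind could be sketched as follows. This is an illustration under assumptions, not the procedure used in the project: the dictionary keys `biography` and `username`, the helper names, and the choice to strip accents and match whole words only are all hypothetical.

```python
import re
import unicodedata


def normalise(text):
    """Lower-case and strip accents so that 'mamá' and 'mama' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_pattern(keywords):
    """Compile one regex that matches any keyword not glued to other letters."""
    alternatives = sorted({re.escape(normalise(k)) for k in keywords},
                          key=len, reverse=True)
    return re.compile(r"(?<![a-z])(?:" + "|".join(alternatives) + r")(?![a-z])")


def has_keyword(profile, pattern):
    """Search the biography and the username of one profile."""
    text = (profile.get("biography") or "") + " " + (profile.get("username") or "")
    return bool(pattern.search(normalise(text)))


def keyword_share(profiles, keywords):
    """Return (number of matching profiles, percentage of all profiles)."""
    pattern = build_pattern(keywords)
    hits = sum(has_keyword(p, pattern) for p in profiles)
    return hits, round(100 * hits / len(profiles), 1)


# Toy profiles for illustration only
sample = [
    {"username": "mamma_di_luca", "biography": "Vita da mamma"},
    {"username": "travel.anna", "biography": "Photographer"},
    {"username": "x", "biography": "Mamá de dos"},
    {"username": "moments", "biography": "capturing moments"},
]
print(keyword_share(sample, ["mamma", "mamá", "mom"]))
```

Matching whole words avoids false positives such as "mom" inside "moments", at the cost of missing usernames that join words without a separator. Which trade-off is right depends on the data.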

      In this type of analysis we had to take into account some crucial factors. First of all, there is a linguistic diversity not only among the 3 groups but also within each group itself and even within the same text, as it can contain sentences in one language with words and hashtags in another.


      (*) 1 empty field

      We have, therefore, further refined the language recognition procedure started during the data cleaning phase, obtaining good results (only 4 biographies required manual language attribution).

      We also considered that emojis can express the same meaning as words in a different way. In order not to lose this information, we used the anyascii library to replace Unicode characters with equivalent English strings. This may have somewhat increased the number of profiles identified as English compared with a text-only analysis, but it has not affected the analysis in our case, since the goal is to search for keywords in all the languages in which the posts are written.
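To illustrate the idea of turning emojis into searchable words, here is a small stand-in that uses only Python's built-in unicodedata module rather than anyascii, so its output strings differ from what anyascii produces. The function name is hypothetical, and the sketch only handles single-character symbols.

```python
import unicodedata


def spell_out_symbols(text):
    """Replace symbol characters (most emoji included) with their Unicode names."""
    parts = []
    for ch in text:
        # "So" is the Unicode category "Symbol, other", which covers most emoji
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            parts.append(f" {name.lower()} " if name else " ")
        else:
            parts.append(ch)
    # Collapse the extra spaces introduced around the inserted names
    return " ".join("".join(parts).split())


print(spell_out_symbols("Mum of two 👶🍼"))
```

Once symbols are spelled out in English, a keyword search for "baby" can match biographies that only use the emoji. Variation selectors and joined emoji sequences are left untouched by this sketch and would need extra handling.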

      We also had to check that the list of motherhood-related search keys was reasonably complete in each language, and that the procedure can be adapted to new languages that may be added in the future.

      Therefore, starting from a list of keywords compiled manually in the native language, we used the nltk library to evaluate further possible synonyms to add to our search, and the deep_translator library to apply the list to different languages.

      Keywords (Italian group): babbo, dad, famiglia, familia, family, father, genitore, genitori, madre, madres, madri, mami, mamma, mamme, mammina, mamá, mamás, maternidad, maternita, maternity, maternità, mom, mommy, moms, mother, mothers, padre, padres, papa, papas, papi, papà, papá, parent, parents, paternidad, paternita, paternity
      Keywords (British group): dad, dada, dadda, daddy, dads, father, fathers, mama, mammy, mom, momma, mommy, moms, mother, motherhood, mothers, mum, mummy, mums, pappa, parent, parenthood, parenting, parents
      Keywords (third group): dad, dadda, daddy, father, madre, mama, mamá, maternidad, maternity, mom, mother, motherhood, mum, mummy, padre, papa, papá, parent, parenthood, parenting, paternidad, paternity
      Accounts containing keywords: 72 | 147 | 42
      % of total accounts: 49.0 | 98.0 | 28.0

      Linguistic diversity among the 3 groups

        Posts analysis

        We have also started to analyse the captions accompanying the images.

        The cleaning procedure has been refined in this case too, but the variability of characters and hashtags is greater than in biographies, so not all language recognition issues have been resolved yet.

        The hashtags had already been imported into a separate field, identified by their characteristic # symbol, so language recognition was not needed for them in this first phase.
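Extracting hashtags by their # symbol and counting the most frequent ones might look like the sketch below. The regular expression and function names are assumptions made for illustration, not the project's code, and the captions are toy data.

```python
import re
from collections import Counter

# A hashtag is taken to be '#' followed by letters, digits or underscores
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(caption):
    """Return the lower-cased hashtags of one caption, without the '#'."""
    return [tag.lower() for tag in HASHTAG_RE.findall(caption or "")]


def top_hashtags(captions, n=20):
    """Count hashtags over all captions and return the n most frequent."""
    counts = Counter()
    for caption in captions:
        counts.update(extract_hashtags(caption))
    return counts.most_common(n)


# Toy captions for illustration only
captions = ["Sunday walk #family #love", "#Love this #ootd", None]
print(top_hashtags(captions, 3))
```

Because `\w` matches Unicode letters in Python 3, accented hashtags such as #maternità are kept whole.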

        We have also analysed the hashtags, in particular those linked to sponsorship and the position where they appear within the caption, given that any sponsorship must be declared at the beginning:
        #ad, #advertisement, #gifted, #collaborazione, #advert, #advertising, #sponsored

        Although the analysis of the advertisement hashtags has not been exhaustive and needs further revision, this preliminary analysis shows that one in six of the posts use these hashtags (#ad and its variations). However, they appear at the beginning of the post, as recommended, in only one in seven of the posts that contain any of them.

        Most frequent hashtags in the 3 groups (columns: Italian group | British group | third group)

        Unique hashtags: 40765 | 41512 | 24480

        20 most frequent hashtags (Italian group):
        love: 2876
        family: 2676
        momlife: 2215
        baby: 2163
        vitadamamma: 1725
        mammeitaliane: 1480
        babygirl: 1465
        mom: 1273
        instagood: 1231
        ootd: 1222
        summer: 1209
        kids: 1202
        kidsfashion: 1197
        babyboy: 1168
        mamma: 1157
        picoftheday: 1155
        babymodel: 1122
        famiglia: 1041
        photooftheday: 1040
        mammaefiglia: 1015

        20 most frequent hashtags (British group):
        mumsofinstagram: 3526
        mumlife: 3375
        love: 2829
        mummyblogger: 2002
        discoverunder10k: 1992
        mummybloggeruk: 1987
        gifted: 1972
        toddlerlife: 1916
        babygirl: 1868
        family: 1864
        kidsofinstagram: 1742
        instagood: 1643
        momlife: 1515
        kidsfashion: 1453
        babiesofinstagram: 1433
        ad: 1364
        mumssupportingmums: 1328
        motherhood: 1285
        toddler: 1273
        baby: 1263

        20 most frequent hashtags (third group):
        love: 2917
        momlife: 2046
        photooftheday: 1763
        family: 1621
        mexico: 1620
        instagood: 1387
        weddingphotography: 1211
        maternidad: 1200
        familia: 1169
        photography: 1136
        baby: 1078
        girl: 1021
        babygirl: 954
        photo: 857
        babyboy: 827
        happy: 806
        picoftheday: 794
        mom: 765
        tuxpanveracruz: 749
        moments: 741

        Caption containing any adv hashtag = False: 30183 | 31505 | N/A
        Caption containing any adv hashtag at the beginning (*): 319 | 372 | N/A

        (*) By the first 40 characters
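The position check can be sketched as below, using the advertisement hashtags listed earlier and the 40-character rule from the note. Treating "at the beginning" as "the hashtag starts within the first 40 characters" is an assumption about how the rule was applied, and the function names are hypothetical.

```python
import re

# Advertisement hashtags taken from the list in the text
AD_HASHTAGS = {"ad", "advertisement", "gifted", "collaborazione",
               "advert", "advertising", "sponsored"}
HASHTAG_RE = re.compile(r"#(\w+)")
LEAD_CHARS = 40  # "beginning" of the caption, per the note above


def first_ad_hashtag_position(caption):
    """Return the character offset of the first ad hashtag, or None."""
    for match in HASHTAG_RE.finditer(caption or ""):
        if match.group(1).lower() in AD_HASHTAGS:
            return match.start()
    return None


def classify(caption):
    """Label a caption as 'none', 'beginning' or 'later'."""
    pos = first_ad_hashtag_position(caption)
    if pos is None:
        return "none"
    return "beginning" if pos < LEAD_CHARS else "later"


print(classify("#ad Loving our new pram"))
print(classify("We had the best day at the park with the little ones today! #gifted"))
print(classify("Our next #adventure starts here"))
```

Matching the full hashtag rather than a prefix keeps tags such as #adventure from being counted as #ad.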