Platform and accounts selection
Masochism, of course 😀.
Jokes aside, we have decided to focus primarily on this platform taking into account the relevance of Instagram in social media marketing, as explained in the section above.
But dealing with Instagram is not for the faint of heart. Not surprisingly, most of the existing research on social media content and advertising avoids Instagram and focuses instead on platforms where gathering content is easier.
We hope that sharing our trials and tribulations with Instagram will contribute to the field and to related social media studies, if only by helping others avoid the mistakes we have already made.
Unlike other platforms such as Twitter, where accessing content is relatively easy and researchers and universities have privileged access, the current version of the Instagram API is very restrictive. It is linked to the Facebook API and only allows managing content from one’s own account, making it impossible to search for other users’ content or accounts by hashtag or keyword, even though everything is publicly available online.
This restriction is specified in the official documentation, and we were able to verify it using our Facebook developer account with the Graph API Explorer tool. We could not obtain privileged access to Instagram content as a research group either.
In 2021 Facebook introduced the Facebook Open Research & Transparency (FORT) Analytics API - FORT Pages API - which provides access to “public pages and public posts where steps have been taken to reduce the exposure of personal information”. The content is accessible via JupyterHub over VPN, with access granted only to approved Facebook Research Partners. Unfortunately, it is limited to Facebook content: although it mentions that work on access to Instagram is ongoing, at the moment it does not provide that data.
To be able to access the content from non-personal accounts, we then found external solutions: using social media platforms’ dashboards to search for content and accounts and scraping techniques to gather content from a selected list of accounts.
See “Gathering the data” section for further details.
List of influencers
Several studies have faced the same problem: how to select a significant pool of influencers. This has been one of the core decisions for our project. Prior to the selection of the accounts, we also had conversations with several experts who directly or indirectly helped shape our project.
The first impact of these conversations was the selection of the industry. Our original idea was to focus on the gaming industry, but we revised it, as this would have steered us toward YouTube and Twitch, with Instagram playing at best a marginal role. We therefore chose to focus on “kids influencers”, an industry of increasing popularity that moves millions. The digital market for kids’ advertising on Instagram is one of the fastest growing, and it is prompting legislators to consider further regulation of the sector.
Following a similar approach to that taken by Goanta and her team, we used Tensor Social’s dashboard to select the accounts. The editorial team spent some time familiarising themselves with the tool and did several searches, adjusting the parameters each time, to better understand the type of accounts and the precision of the searches.
This was a useful exercise to understand the parameters and relevant hashtags, and how each of them affects the results. It also taught us how to arrive at around 2,000 accounts per country, a significant but still manageable number for our analysis.
The selection of relevant hashtags is painstakingly manual and slow, and we had to proceed by accumulation and repeated attempts. Luckily, Tensor Social offers a word cloud tool that shows related hashtags and their popularity, based on the hashtag entered first. In our case, the option to select related hashtags to fill in the blanks was quite handy. One limitation of the dashboard is that it does not allow an OR search across the selected hashtags; it only works on an AND basis. This limitation does not exist if the search is done with an API request.
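To make the difference between the two search modes concrete, here is a minimal sketch in Python. It does not use Tensor Social's dashboard or API; the account handles, the sample data and the function names `match_all` and `match_any` are all hypothetical and only illustrate the logic.

```python
# Hypothetical sample data: account handle -> set of hashtags it uses.
accounts = {
    "account_a": {"#instakids", "#kidsofinstagram"},
    "account_b": {"#instakids"},
    "account_c": {"#babiesofinstagram"},
    "account_d": {"#travel"},
}


def match_all(accounts, hashtags):
    """AND search: keep accounts that use every selected hashtag."""
    wanted = set(hashtags)
    return sorted(name for name, tags in accounts.items() if wanted <= tags)


def match_any(accounts, hashtags):
    """OR search: keep accounts that use at least one selected hashtag."""
    wanted = set(hashtags)
    return sorted(name for name, tags in accounts.items() if wanted & tags)


selected = ["#instakids", "#kidsofinstagram"]
and_result = match_all(accounts, selected)
or_result = match_any(accounts, selected)
```

With several hashtags selected, the AND search shrinks the result with every hashtag added, while the OR search grows it, which is why the OR mode suits a selection built by accumulating related hashtags.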
Apart from the hashtags, we also used some of the other filters available on the dashboard to narrow down the number of users. However, as mentioned in the section above, some of these filters are based on a proprietary algorithm, and we wanted to avoid, as far as possible, relying on external logic that could affect the results.
Using filters driven by algorithms means accepting the influence of those algorithms on our project. The impact is not necessarily negative, but it is something to consider: the more filters, the more external interference.
Each of the newsrooms in the JournalismAI Fellowship decided on its own specific criteria, to take into account different geo-locations and languages. There is some variation, but most of us based the selection method on relevant hashtags. Those used by Sky News, to give a concrete example, were: #kidofinstagram, #kidsofinstagram, #instakids, #instakid, #miniinfluencer, #miniinfluencers, #babiesofinstagram, #babyofinstagram, #toddlerofinstagram. Although we worked with three languages (English, Spanish and Italian), we should note that hashtag jargon is spoken globally.
The focus on hashtags supports the goal of a non-biased method for selecting accounts, but we also used other filters, always balancing external interference against the usefulness of the filter for the project. Using the creator's country and language filters, we narrowed the list down to the three countries of interest, and we also applied some performance-related filters (engagement over 1%, followers between 5k and 500k) to reduce it further.
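As an illustration only, the country and performance filters described above could be expressed as follows. This is a sketch, not the dashboard's logic: the field names, the country codes in `COUNTRIES`, the representation of the engagement rate as a fraction and the inclusive follower bounds are all assumptions.

```python
# Placeholder country codes; the actual list depends on each newsroom's criteria.
COUNTRIES = {"GB", "ES", "IT"}


def passes_filters(account, countries, min_engagement_rate=0.01,
                   min_followers=5_000, max_followers=500_000):
    """Return True if an account meets the country and performance filters.

    Assumes engagement_rate is a fraction (0.01 == 1%) and that the
    follower bounds are inclusive.
    """
    return (
        account["country"] in countries
        and account["engagement_rate"] > min_engagement_rate
        and min_followers <= account["followers"] <= max_followers
    )


# Hypothetical sample records.
accounts = [
    {"handle": "a", "country": "GB", "engagement_rate": 0.03, "followers": 12_000},
    {"handle": "b", "country": "GB", "engagement_rate": 0.005, "followers": 12_000},
    {"handle": "c", "country": "IT", "engagement_rate": 0.02, "followers": 900_000},
    {"handle": "d", "country": "FR", "engagement_rate": 0.05, "followers": 20_000},
    {"handle": "e", "country": "ES", "engagement_rate": 0.02, "followers": 5_000},
]

selected = [a["handle"] for a in accounts if passes_filters(a, COUNTRIES)]
```

Keeping the thresholds as explicit parameters makes each filtering decision visible, in line with the aim of limiting opaque external logic.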
Once the parameters were clear, the editorial members of the team used the Chrome extension described in the section above to get the JSON file for the technical team. This was the basis of the API query, and it populated a basic list of influencers in our database, which includes:
- the name of the account
- the name of the user
- Tensor Social account ID
- the URL of the account
- the picture
- the number of followers
- the engagement
- the engagement rate
- the country
- some geographical coordinates
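The fields listed above could be modelled roughly as below. This is a sketch under assumptions: the class name `Influencer`, the attribute names, the types and the JSON layout are hypothetical, and the real Tensor Social response and our database schema may differ.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Influencer:
    """One row of the basic influencer list (hypothetical field names)."""
    account_name: str
    user_name: str
    tensor_social_id: str
    url: str
    picture_url: str
    followers: int
    engagement: int
    engagement_rate: float
    country: str
    latitude: Optional[float] = None
    longitude: Optional[float] = None


def load_influencers(raw_json: str) -> List[Influencer]:
    """Parse a JSON array of objects whose keys match the dataclass fields."""
    return [Influencer(**item) for item in json.loads(raw_json)]


# Made-up sample record for illustration.
SAMPLE_JSON = """
[
  {
    "account_name": "example_account",
    "user_name": "Example User",
    "tensor_social_id": "12345",
    "url": "https://www.instagram.com/example_account/",
    "picture_url": "https://example.com/picture.jpg",
    "followers": 12000,
    "engagement": 360,
    "engagement_rate": 0.03,
    "country": "GB",
    "latitude": 51.5,
    "longitude": -0.12
  }
]
"""

influencers = load_influencers(SAMPLE_JSON)
```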
This list was further reduced due to complications during the gathering phase explained in the following section.
We narrowed it down by considering only those accounts that use words related to mummy/daddy, as we found that a common pattern was the use of “account run by mum” and similar sentences.
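A keyword filter of this kind might look like the sketch below. The word list in `PARENT_WORDS` is an illustrative assumption, not the list we used, and so is the idea of applying it to the account's biography text.

```python
import re

# Illustrative word list only; extend per language as needed.
PARENT_WORDS = ["mum", "mummy", "mom", "mommy", "mama", "mamma",
                "dad", "daddy", "papa"]

# Match whole words only, ignoring case, so "maximum" does not match "mum".
PARENT_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in PARENT_WORDS) + r")\b",
    re.IGNORECASE,
)


def mentions_parent(bio):
    """Return True if the text contains one of the parent-related words."""
    return bool(PARENT_PATTERN.search(bio or ""))
```

For example, `mentions_parent("Account run by mum")` is true, while a biography such as "Maximum fun every day" is not matched because the pattern requires whole words.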
We also excluded accounts with very few or too many posts and followers. However, given that the characteristics of the population of accounts were different for each of the three countries considered, the criteria here were also different. For instance, the bottom 5% of accounts by number of posts was excluded from the Italian population, whereas the margin was 20% for the English one, as the number of accounts was far larger. All the decisions taken to reduce the list can be found on GitHub.
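The percentage-based exclusion can be sketched as follows. This is one simple way to drop the bottom share of a ranked list, assuming each account carries a `posts` count. The function name `drop_bottom_percent` is hypothetical, and the handling of ties and rounding here may not match the decisions recorded in the repository.

```python
def drop_bottom_percent(accounts, key, percent):
    """Drop the lowest `percent` of accounts when ranked by `key`.

    Uses integer arithmetic, so the number dropped is rounded down.
    """
    ranked = sorted(accounts, key=lambda account: account[key])
    n_drop = len(ranked) * percent // 100
    return ranked[n_drop:]


# Hypothetical population of 20 accounts with 1 to 20 posts.
population = [{"handle": f"acc_{i}", "posts": i} for i in range(1, 21)]

kept_5 = drop_bottom_percent(population, "posts", 5)    # e.g. a smaller population
kept_20 = drop_bottom_percent(population, "posts", 20)  # e.g. a larger population
```

The same function applied to a reversed ranking, or to the `followers` field, would cover the upper-end exclusions mentioned above.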
- Cordeiro, V. C., “Kidfluencers” and Social Media: The Evolution of Child Exploitation in the Digital Age. Humanium
- Milmo, D., UK must protect child influencers from exploitation, MPs say. The Guardian
- Digital, Culture, Media and Sport Committee, Influencers: lights, camera, inaction?