Platform and accounts selection

Why Instagram?

Masochism, of course 😀. 

Jokes aside, we have decided to focus primarily on this platform taking into account the relevance of Instagram in social media marketing, as explained in the section above.

But dealing with Instagram is not for the faint of heart, and not surprisingly most of the existing research on social media content and advertising does not use Instagram but instead focuses on other platforms for which gathering content is easier.

We hope that sharing our trials and tribulations with Instagram will contribute to the field and to related social media studies, if only by helping others avoid the mistakes we have already made.

Unlike other platforms such as Twitter, where accessing content is relatively easy and researchers and universities get privileged access, the current version of the Instagram API is very restrictive. It is linked to the Facebook API and only allows you to manage content from your own account, making it impossible to search for other users’ content or accounts by hashtag or keyword, even though everything is publicly available online.

This restriction is specified in the official documentation, and we were able to verify it using our Facebook developer account with the Graph API Explorer tool. We could not obtain privileged access to Instagram content as a research group either.

In 2021 Facebook introduced the Facebook Open Research & Transparency (FORT) Analytics API (FORT Pages API), which provides access to “public pages and public posts where steps have been taken to reduce the exposure of personal information”. The content is accessible via JupyterHub over VPN, with access granted only to approved Facebook Research Partners. Unfortunately, it is limited to Facebook content only: although it mentions that work on access to Instagram is ongoing, it does not provide Instagram data at the moment.

To be able to access the content from non-personal accounts, we turned to external solutions: using social media marketing platforms’ dashboards to search for content and accounts, and scraping techniques to gather content from a selected list of accounts.

See “Gathering the data” section for further details.
 

Photo by Souvik Banerjee on Unsplash

    Social media marketing platforms

    Social media marketing platforms have proliferated with the increase of social media marketing. What they sell is ‘access’ to influencers. Their clients are companies interested in advertising on social media. Access means in this case tools to contact and enroll the best creators for the task at hand on the market of reference. In some cases, the marketing platform also manages the actual campaigns with rich analytics, in others, the aim is purely that of matchmaking brands and creators.

    One of the products these companies tend to offer is a dashboard which usually includes a “discovery” option. This tool – which comes with a price tag of course – allows you to search for influencers based on a wide array of criteria, such as the hashtags or keywords mentioned, the number of followers, engagement rate, or the language used. 

    Professor Catalina Goanta and her team have used one of these platforms to select a list of accounts on which to focus their research. We have adopted a similar approach to come up with lists of accounts from a targeted area of interest. Discovery lists come with a degree of imprecision and noise, but using an external tool to select our datasets helped us avoid bias, and we did not have to cherry-pick accounts.

    There are dozens of marketing platforms on the market. To select the one that best suited our needs, we had to do some research and test a few. We scouted lists of best-performing marketing platforms (such as this and this) and shortlisted some of the top ones to contact. Apart from pricing, we had specific needs and questions related to the off-label use we intended to make of these tools. The answers are not always clear on the websites, so we went through a series of product demonstrations and got to meet some of the teams behind them.

    While at first glance they all seem to be doing and offering the same thing, the differences are quite significant. 

    The first item of interest is the number of accounts they track which in turn affects the size of their database. This is important as this is our fish pool and it has a direct impact on the list of accounts that match specific criteria. We did not get in touch with all of them so this is just an indicative ballpark, but among the ones we did get information from, we have seen databases range from 1.6 million to 123 million users.  

    The second question we had was about how accounts are collected for their databases. Some platforms use opt-in methods, in which Instagram users have to pay the marketing platform to be included in its database. In others, the marketing platforms only track people who promote brands with which they have an agreement. Another method of getting users' information is scraping Instagram.

    The nuances all contribute to diversifying the offer: more or less bias, more engaged influencers, more or less noise, proven experience in dealing with advertisers. Our needs hinted at selecting a tool that had as little bias as possible.

    The final issue we cared about was how easy it would be to access these databases and export the information to continue our research, and whether there were limits on the amount of data. Ideally, we needed REST API access, which is rarely offered and, when it is, comes at a much higher cost. Of course, the very small budget for the project was a big issue as well, and the price is proportional to the amount of data accessed from the database.

    Most of these dashboards have a similar structure, and they include filters such as language, country, age of the audience, age of the creators, number of followers, etc. The extraction of the information is quite straightforward for some of these fields (e.g. number of followers). But other filters rely on algorithms to derive the information, and these are not always transparent.

    For instance, it is not clear how the engagement rate is calculated, or how the age of the user or the country is estimated, until you actively ask. It is important to be aware of the use of algorithms, as this can affect the selection of the accounts. As an example, we were told that one of these dashboards classified influencers with multiple flags as being from all of the countries on the flags.
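    To illustrate why the definition matters, here is a minimal sketch of two commonly used ways to compute an engagement rate. Neither is confirmed to be the formula used by any of the dashboards we looked at; the function names and the sample numbers are made up for illustration.

```python
from statistics import mean


def engagement_rate_by_followers(posts, followers):
    """Average (likes + comments) per post, divided by the follower count."""
    if followers <= 0 or not posts:
        return 0.0
    interactions = mean(p["likes"] + p["comments"] for p in posts)
    return interactions / followers


def engagement_rate_likes_only(posts, followers):
    """Same idea, but counting likes only (comments are ignored)."""
    if followers <= 0 or not posts:
        return 0.0
    return mean(p["likes"] for p in posts) / followers


# Hypothetical account: 10,000 followers and three recent posts.
sample_posts = [
    {"likes": 200, "comments": 20},
    {"likes": 100, "comments": 10},
    {"likes": 300, "comments": 30},
]
rate_a = engagement_rate_by_followers(sample_posts, 10_000)
rate_b = engagement_rate_likes_only(sample_posts, 10_000)
```

    The two definitions give slightly different values for the same account, so a filter set between them would keep the account under the first definition and drop it under the second.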

    After doing some research and interviewing a few companies, we chose Tensor Social. They offered us a research/university package with API access to their database of 123 million accounts. This database is built by scraping Instagram (so it is not opt-in). It also covers TikTok and YouTube accounts, which may allow us to extend our research in the future.

    There is a limit, however, on the number of accounts we can export per month. This is set to 10,000 exports per month and 250 reports per month. 

      Tensor Social 'Reports'

      The Reports are the second tier of information on the influencers made available to us on Tensor Social. It is a long page with deep dives on each user account. Some of this information is descriptive, such as the number of followers over time, the bio, the gender, and the hashtags. It also includes some historical and extra information such as the brands mentioned or an estimation of the audience's credibility.

      Tensor Social offers API access to the data, which was essential given our specific needs. Clear documentation and a rich series of examples in different languages are provided here, but the definition of some parameters is not clear (e.g. how the engagement rate is calculated). However, the service managers were very helpful in resolving any doubts.

      The team has used the dashboard to manually define and refine the criteria for our searches but ultimately we needed to pull the data from Tensor Social with an automated API request. To facilitate the collection of the queries and convey them to our engineers, we implemented a Chrome extension to retrieve the parameters from the dashboard in JSON format and store them locally. These queries were then matched to a series of hashtags grouped in hashtag groups, two of the content types in our Tracking Influencers API.

      This JSON payload then becomes the body of the final API request to Tensor Social, which allows for a more sophisticated configuration of parameters than the dashboard one.
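
      As a rough sketch of that last step, the snippet below wraps a set of saved parameters into a POST request. The endpoint URL, the authentication header and all field names in the payload are placeholders we made up for illustration; they are not taken from the Tensor Social documentation.

```python
import json
import urllib.request

# Placeholder endpoint: the real API URL is not reproduced here.
API_URL = "https://api.example.com/v1/search"


def build_search_request(dashboard_params, api_key):
    """Wrap parameters saved from the dashboard into a POST request."""
    body = json.dumps(dashboard_params).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Assumed bearer-token scheme, for illustration only.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Parameters as the extension might store them (hypothetical field names).
saved_query = {
    "hashtags": ["#instakids", "#kidsofinstagram"],
    "followers": {"min": 5000, "max": 500000},
    "engagement_rate_min": 0.01,
    "country": "GB",
}
request = build_search_request(saved_query, api_key="YOUR_API_KEY")
# The request is only built here; urllib.request.urlopen(request) would send it.
```

      Keeping the saved parameters as plain JSON means the editorial team can refine a search in the dashboard while the engineers reuse the same payload unchanged.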

      The Reports come at a higher cost and we have a strict limit on the number of requests. Our plan is to use this resource after the analysis phase, after the scraping of the influencers' posts.

        List of influencers

        Several studies have faced the same problem: how to select a significant pool of influencers. This has been one of the core decisions for our project. Prior to the selection of the accounts, we also had conversations with several experts who directly or indirectly helped shape our project. 

        The first impact of these conversations was the selection of the industry. Our original idea was to focus on the gaming industry, but we revised it, as this would have steered us toward YouTube and Twitch, while Instagram was at best a marginal player. We therefore chose to focus on “kids influencers”, an industry with increasing popularity that moves millions. The digital market of kids advertising on Instagram is one of the fastest-growing ones, and it is prompting legislators to consider further regulation of the sector.

        Following a similar approach to that taken by Goanta and her team, we used Tensor Social’s dashboard to select the accounts. The editorial team spent some time familiarising themselves with the tool and did several searches, adjusting the parameters each time, to better understand the type of accounts and the precision of the searches. 

        This was a useful exercise to understand the parameters and relevant hashtags and how each of them impacts the results. And it was also important to learn how to get to a number of around 2,000 accounts per country (a significant but also manageable number) for our analysis. 

        The selection of relevant hashtags is painstakingly manual and slow, and we had to proceed by accumulation and repeated attempts. Luckily, Tensor Social offers a word cloud tool that shows related hashtags and their popularity, based on the hashtag entered first. In our case, the option to select related hashtags to fill in the blanks was quite handy. One limitation of the dashboard is that it does not support an OR search across the selected hashtags: it only combines them with AND. This limitation does not exist if the search is done with an API request.
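
        The difference between the two search modes can be sketched with plain set operations. This is only an illustration of the logic, with made-up accounts; it does not reproduce how the dashboard or the API implements the search.

```python
def matches_all(account_tags, wanted):
    """AND search: the account must use every wanted hashtag."""
    return set(wanted) <= set(account_tags)


def matches_any(account_tags, wanted):
    """OR search: the account must use at least one wanted hashtag."""
    return bool(set(wanted) & set(account_tags))


wanted = ["#instakids", "#kidsofinstagram"]

# Made-up accounts and the hashtags they use.
accounts = {
    "account_a": ["#instakids", "#kidsofinstagram", "#family"],
    "account_b": ["#instakids"],
    "account_c": ["#travel"],
}

and_hits = [name for name, tags in accounts.items() if matches_all(tags, wanted)]
or_hits = [name for name, tags in accounts.items() if matches_any(tags, wanted)]
```

        With a long list of near-synonymous hashtags, the AND mode quickly shrinks the results to almost nothing, which is why the OR mode matters for this kind of selection.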

        Apart from the hashtags, we also used some of the other filters available on the dashboard to narrow down the number of users. However, as mentioned in the section above, some of these filters are based on proprietary algorithms, and we wanted to avoid, as much as possible, relying on external logic that could affect the results.

        Using filters with algorithms means accepting the influence of those algorithms in our project. This does not mean a negative impact, but it is something to consider. The more filters the more external interference.

        Each of the newsrooms in the JournalismAI Fellowship decided on its own specific criteria, to take into account different locations and languages. There is some variation, but most of us based the selection method on relevant hashtags. Those used by Sky News, to give a concrete example, were: #kidofinstagram, #kidsofinstagram, #instakids, #instakid, #miniinfluencer, #miniinfluencers, #babiesofinstagram, #babyofinstagram, #toddlerofinstagram. Although we have worked with three languages (English, Spanish and Italian), we should note that hashtag jargon is spoken globally.

        The focus on hashtags supports the idea of an unbiased method for selecting the accounts, but we also used other filters (always balancing external interference against the usefulness of the filter for the project). Using the country and language filters for the creator, we narrowed the results down to the three countries of interest, and we also applied some performance-related filters (engagement over 1%, followers between 5k and 500k) to reduce the list.
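
        The performance thresholds mentioned above can be expressed as a simple predicate. This is a minimal sketch: the field names and sample accounts are invented, and treating the follower bounds as inclusive is our assumption.

```python
MIN_ENGAGEMENT_RATE = 0.01  # "engagement over 1%"
MIN_FOLLOWERS = 5_000       # "followers between 5k and 500k"
MAX_FOLLOWERS = 500_000


def passes_performance_filters(account):
    """Return True if the account falls inside the thresholds above."""
    return (
        account["engagement_rate"] > MIN_ENGAGEMENT_RATE
        and MIN_FOLLOWERS <= account["followers"] <= MAX_FOLLOWERS
    )


# Made-up candidates, one passing and three failing for different reasons.
candidates = [
    {"username": "example_one", "followers": 12_000, "engagement_rate": 0.034},
    {"username": "example_two", "followers": 2_500, "engagement_rate": 0.050},
    {"username": "example_three", "followers": 80_000, "engagement_rate": 0.004},
    {"username": "example_four", "followers": 750_000, "engagement_rate": 0.020},
]
kept = [a["username"] for a in candidates if passes_performance_filters(a)]
```

        Here only the first account is kept: the others have too few followers, too little engagement, or too many followers.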

        Once the parameters were clear, the editorial members of the team used the Chrome extension described in the section above to get the JSON file for the technical team. This was the basis of the API query, and it populated a basic list of influencers in our database, which includes:

        • the name of the account
        • the name of the user
        • Tensor Social account ID
        • the URL of the account
        • the picture
        • the number of followers
        • the engagement
        • the engagement rate
        • the country
        • language
        • some geographical coordinates
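
        One way to picture such a record is as a small data class. The field names and types below are our own guesses for illustration; they do not reproduce the actual database schema or the API response format.

```python
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass
class InfluencerRecord:
    """One row of the basic influencer list (hypothetical field names)."""

    account_name: str
    user_name: str
    tensor_social_id: str
    url: str
    picture_url: Optional[str]
    followers: int
    engagement: int          # assumed to be an absolute interaction count
    engagement_rate: float   # assumed to be a fraction, e.g. 0.034 for 3.4%
    country: str
    language: str
    coordinates: Optional[Tuple[float, float]] = None  # (latitude, longitude)


# Made-up example values.
record = InfluencerRecord(
    account_name="example_account",
    user_name="Example User",
    tensor_social_id="abc123",
    url="https://www.instagram.com/example_account/",
    picture_url=None,
    followers=12_000,
    engagement=408,
    engagement_rate=0.034,
    country="GB",
    language="en",
)
as_row = asdict(record)  # plain dict, convenient for storing or exporting
```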

        This list was further reduced due to complications during the gathering phase explained in the following section. 

        We narrowed it down by considering only those accounts that use words related to mummy/daddy, as we found that a common pattern was the use of “account run by mum” and similar sentences.
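
        A keyword check of this kind can be sketched with a regular expression. The word list below is illustrative only and is not the list we actually used; we also assume here that the check runs on the account's profile text.

```python
import re

# Illustrative keywords only; the real per-language lists are not reproduced here.
PARENT_WORDS = [
    "mum", "mummy", "mom", "mommy", "dad", "daddy",
    "mama", "mamma", "papa",
]

# \b keeps "mum" from matching inside longer words such as "maximum".
PARENT_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in PARENT_WORDS) + r")\b",
    re.IGNORECASE,
)


def mentions_parent(profile_text):
    """Return True if the profile text contains one of the parent keywords."""
    return bool(PARENT_PATTERN.search(profile_text or ""))


# Made-up profile texts.
run_by_mum = mentions_parent("Account run by mum")
unrelated = mentions_parent("Maximum fun every day")
```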

        We also excluded accounts with very few or too many posts and followers. However, given that the characteristics of the population of accounts were different for each of the three countries considered, the criteria here were also different. For instance, the bottom 5% of accounts by number of posts was excluded from the Italian population, but this margin was 20% for the English ones, as the number of accounts was far bigger. All the decisions to reduce the list can be found on GitHub.
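
        The percentage-based exclusion can be sketched as follows. This is a simplified illustration on a made-up population, not the code from our repository; the function name and the rank-based cut-off are assumptions.

```python
def drop_bottom_percent(accounts, key, percent):
    """Drop the lowest `percent` of accounts when ranked by `key`."""
    ranked = sorted(accounts, key=lambda account: account[key])
    n_drop = len(ranked) * percent // 100  # integer arithmetic, rounds down
    return ranked[n_drop:]


# Made-up population of 20 accounts with 1 to 20 posts.
population = [{"username": f"acct_{i}", "posts": i} for i in range(1, 21)]

five_percent_cut = drop_bottom_percent(population, "posts", 5)     # drops 1 account
twenty_percent_cut = drop_bottom_percent(population, "posts", 20)  # drops 4 accounts
```

        The same function covers both cases; only the percentage changes from one country to another.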