The Project (in 10 Minutes)

Tracking Influencers: four lessons when dealing with Instagram

Can you spot the difference between these two posts?

Only one of them fully complies with the rules (the UK recommendations, at least).

The second one clearly states that there is a relationship with the sponsoring brand by using the Instagram branded content tool (see the top part of the post).

But the first one does not disclose any relationship with the brand through an official or recommended method. (If you are new to this topic, we have summarized our research on social media marketing, why this is important, and what Instagram and some governments say - and don’t say - about hidden advertising here).

It would be hard to fully prove with only that post that there has been compensation from the brand to the Instagram user. However, the type of picture (with the product in a relevant centered position), the content of the text (all about how amazing the product is), and the hashtag #gifted buried with the other 14 hashtags at the bottom of the post are indications of a potential relationship not clearly disclosed as such.

This is only an example examined manually. But is it possible to create a method that would allow us to classify posts and accounts under a “potentially not disclosed commercial relationship” category in a more automatic way?

To answer this question, we embarked on a six-month journey in June 2022. We have extensively documented our ups and downs during this process, but for those who don’t need to dive so deep, here is a summary of what we have learned.

Lesson 1: What do you want to monitor?

Influencer marketing is not exclusive to Instagram, but several reports have pointed to this platform as the place where the money is made. It is also a platform rarely covered in academic or journalistic research, given how difficult it is to work with and to extract its content.

So, a nice challenge.

But setting up a model to monitor and analyse the vast Instagram universe is not a journey but an odyssey, so if you have limited resources (as is always the case in journalism), our first recommendation is to select what you want to investigate.

This selection can be straightforward if there is a defined list of accounts (e.g. all the members of a parliament or all the football players in the Champions League). Or it can be a real challenge.

Guess what? We were in this second group.

We did not have a list of accounts but a topic: kid influencers (basically, parents making money or receiving gifts out of their children's images). We, therefore, needed to find a non-biased method to identify a list of Instagram accounts whose content we could monitor. Following several experts’ advice, we looked for hashtags and specific words to identify users posting content related to our topic without cherry-picking.

Re-reading this last paragraph, it sounds like this was a walk around Britain’s rolling hills, but anyone with previous experience of any Facebook product knows that it feels more like an expedition in the Chilean Andes. If you are packing for Chile, our second recommendation is the dashboards offered by marketing platform companies.

We made an agreement with one company and used its dashboard to make a subset of the universe by filtering by hashtags, followers, number of posts, engagement rate, etc. This way we came up with the list of users we wanted to extract content from.

We highly recommend reading our section about these social media marketing companies before making a deal with any of them. There we have written about the differences between their products, the most important questions to ask before closing a deal, and the strengths and limitations of using them to select accounts.
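To give an idea of what this kind of filtering looks like outside a dashboard, here is a minimal sketch in Python. It is not the company's tool or our actual selection: the Account fields, the #kidsfashion hashtag and every threshold are assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    # Hypothetical fields; a real dashboard export will use its own names.
    username: str
    followers: int
    posts: int
    engagement_rate: float  # a fraction: 0.03 means 3%
    hashtags: set[str] = field(default_factory=set)


def select_accounts(accounts, topic_hashtags, min_followers, min_posts, min_engagement):
    """Keep accounts that use at least one topic hashtag and pass every threshold."""
    topic = {tag.lower() for tag in topic_hashtags}
    selected = []
    for acc in accounts:
        uses_topic = bool(topic & {tag.lower() for tag in acc.hashtags})
        if (uses_topic
                and acc.followers >= min_followers
                and acc.posts >= min_posts
                and acc.engagement_rate >= min_engagement):
            selected.append(acc)
    return selected


# Invented sample data, only to show the function running.
sample = [
    Account("family_a", 12000, 340, 0.041, {"#kidsfashion", "#gifted"}),
    Account("travel_b", 90000, 800, 0.020, {"#travel"}),
    Account("family_c", 800, 50, 0.060, {"#KidsFashion"}),
]
chosen = select_accounts(
    sample, {"#kidsfashion"}, min_followers=5000, min_posts=100, min_engagement=0.03
)
print([acc.username for acc in chosen])
```

Applying the same thresholds to every account, rather than picking accounts by hand, is what keeps the selection free of cherry-picking.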

Lesson 2: How to download the content? Spoiler alert: Be patient

We have created our own scraper and we are sharing it on GitHub. But by the time you think of using it, Meta will probably have introduced a new change and it won’t run smoothly. Instagram is a moving target.

You might want to get in touch with us to suggest improvements (which we would highly appreciate), but we have to pay rent and this has been an additional project to our current jobs. So, we might not be as fast as you need to fix problems with our scraper. Also, you might need to collect different parts of the data.

There are other tools you might want to test, as we did, such as Selenium, Instaloader, and Instascrape (check the Resources section for more). We did not use them because we realized that our approach (building our own scraper) allowed us to pivot easily and gave us access to a wider pool of data.

That is the upside of our approach, but:

We have learned that all scrapers so far, including ours, will sooner or later result in a blocked account. The initial ban can be lifted via email and mobile verification, but eventually the account will be blocked for good, so it is better to create multiple working accounts before you proceed.

We have had setbacks from continuous changes to the Instagram API, forcing us to keep updating our solution to collect content from Instagram.

And we had to limit the amount of content and budget more time to gather the data, as there is a limit of content per request (which also changed during the process of this project).
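A per-request limit usually translates into a loop that asks for small pages and waits between calls. The sketch below shows that general pattern in Python; it is not our scraper and does not talk to Instagram. The fetch_page callable, the page size and the pause are all hypothetical, and a fake in-memory source stands in for the real thing.

```python
import time


def collect_posts(fetch_page, max_posts, page_size=12, pause_seconds=2.0, sleep=time.sleep):
    """Collect up to max_posts by requesting small pages and pausing between requests.

    fetch_page(cursor, page_size) is a hypothetical callable returning
    (posts, next_cursor); next_cursor is None when nothing is left.
    """
    collected = []
    cursor = None
    while len(collected) < max_posts:
        wanted = min(page_size, max_posts - len(collected))
        posts, cursor = fetch_page(cursor, wanted)
        collected.extend(posts)
        if cursor is None or not posts:
            break
        sleep(pause_seconds)  # spread the requests out over time
    return collected


# Stand-in for a real source: 30 fake posts served from memory.
FAKE_POSTS = [{"id": i} for i in range(30)]


def fake_fetch_page(cursor, page_size):
    start = cursor or 0
    end = start + page_size
    next_cursor = end if end < len(FAKE_POSTS) else None
    return FAKE_POSTS[start:end], next_cursor


posts = collect_posts(fake_fetch_page, max_posts=25, page_size=10, pause_seconds=0)
print(len(posts))
```

Keeping max_posts and page_size as parameters makes it easy to adjust when the limit per request changes, which happened to us during the project.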

All these lessons meant time, patience, and some creative solutions that we have detailed here. If you fancy following in our steps, we recommend first checking for updates to the Facebook Open Research & Transparency (FORT) Analytics API, which we learned about while developing this project.

Lesson 3: Have you thought about your infrastructure?

We have been ambitious (too ambitious, some have said), and when we started this project we wanted to collect all written posts, images and videos of thousands of influencers. We even thought about setting up an ongoing scraper to collect Stories.

While this might all be technically doable, there is another lesson we would like to share. Even if this is achievable, it requires time, time, more time (as explained above) and suitable infrastructure to store all the content, to process the data and to train any model.

We found little documentation on “how big” the content would be when we were doing our initial research, so we hope our process, and how we reduced the amount of content to gather, can help similar projects work out the type of infrastructure they need.

We did not have any pre-existing infrastructure or environment built for this particular project, but we had conversations with AWS and Google Cloud Platform about the architecture we would need to build – which comes at a cost. We have also learned about the Google Cloud Research Credits, which might be an option to explore for academia.

If you have already jumped into the About us section, you might be wondering why we did not use our companies’ infrastructure. Well, this has been a problem and a solution. There are security restrictions that prevent external users from accessing some of our companies’ products. But one organization in the team managed to get the keys to its storage room and open it up for its new international colleagues.

This happened just a few weeks before the end of the fellowship. So, little can be said about this solution or whether it is enough to conclude the project.

Lesson 4: Behind Instagram data

After months, we finally managed to get a considerable amount of good data to start the analysis process (hooray!) but we ran out of time (booo!). We were not too disappointed about not building any model though, because we had considered this outcome quite early in the project.

Although using AI techniques to do journalism might be the core of this fellowship, any AI model is only as good as the data it is trained on. We prioritised creating a method and a process to obtain a substantial amount of quality data. The model can come later.

But we were curious… and had a glimpse of that precious Instagram data that cost so much effort (and some money). There were a few surprises.

For instance, there is a useful field called “is_paid_partnership” which was “turned on” in very few of the posts. We also found that Meta is classifying as “state-controlled media” a bunch of (mostly mummies’) accounts. And we counted that only one in seven of the posts with an advertising hashtag places it at the top of the caption.
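For readers who want to repeat this kind of count on their own data, here is a minimal sketch in Python. Only the “is_paid_partnership” field name comes from our data; the “caption” key, the list of advertising hashtags and the rule that “at the top” means “the first word of the caption” are assumptions for illustration, and the three posts are invented.

```python
import re

# Hypothetical list of disclosure hashtags; adjust it to the rules you follow.
AD_HASHTAGS = {"#ad", "#advert", "#gifted", "#sponsored"}
HASHTAG_RE = re.compile(r"#\w+")


def caption_of(post):
    """Return the caption text, or an empty string when it is missing."""
    return post.get("caption") or ""


def ad_hashtags_in(caption):
    """Return the disclosure hashtags found anywhere in a caption, lowercased."""
    return [tag.lower() for tag in HASHTAG_RE.findall(caption) if tag.lower() in AD_HASHTAGS]


def starts_with_ad_hashtag(caption):
    """True when the first word of the caption is a disclosure hashtag."""
    words = caption.split()
    return bool(words) and words[0].rstrip(".,:;!").lower() in AD_HASHTAGS


def summarise(posts):
    """Count flagged posts, posts with an ad hashtag, and those with it up front."""
    with_tag = [p for p in posts if ad_hashtags_in(caption_of(p))]
    return {
        "total": len(posts),
        "paid_partnership_flag": sum(1 for p in posts if p.get("is_paid_partnership")),
        "with_ad_hashtag": len(with_tag),
        "ad_hashtag_at_top": sum(1 for p in with_tag if starts_with_ad_hashtag(caption_of(p))),
    }


# Invented posts, only to show the functions running.
posts = [
    {"caption": "#ad Loving our new pram!", "is_paid_partnership": True},
    {"caption": "Best day ever #family #love #gifted", "is_paid_partnership": False},
    {"caption": "Sunday walk #adorable", "is_paid_partnership": False},
]
summary = summarise(posts)
print(summary)
```

The regular expression matches whole hashtags, so a word like #adorable is not mistaken for #ad.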

These are not conclusive findings, as our analysis was not exhaustive and needs further revision, but they open a box of Lego bricks (for a data nerd) to play with. Although our initial goal was to build a model to identify undisclosed partnerships, this is not the only construction possible.

Marketing is not the only challenge the social media influencer industry faces, although it has taken a prominent role. The promotion of fake brands or unhealthy products and practices, the spreading of misleading rumours, the negative effects on people’s mental health and the deadly consequences for some teenagers are other problems social media has.

We just picked one.