Gathering the data


Since 2020, Instagram has limited the information that can be gathered from public users, which has made scraping harder. Public endpoints of the Instagram API return profile data and a number of the latest posts, but the data available through them is quite limited and does not include the historical content we needed.

Meta (Instagram’s parent company) introduced a research API in 2021 but, at the time of developing this project, it did not include Instagram data, and there was little information about whether Instagram data would become part of it, which types of data could be retrieved, or at what level of detail. It was therefore not a solution to this problem.

There have been several attempts to facilitate this task, and during this project we tested the most up-to-date Python libraries for scraping Instagram:

- Instascrape (See here the test script)
- Instaloader (See here the test script)
- Selenium (See here the test script)

All of them required a user login or a session id to get access and collect different types of data. 

The most suitable for our project was Instaloader, which allows us to collect basic influencer information, posts, images and videos. This is the type of content included in our current analysis. Stories were considered at first but, because this type of content is time-limited, gathering it would have required continuous scraping and added extra work and storage resources. We therefore decided to exclude Stories from this project.
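As an illustration of the kind of data Instaloader exposes, here is a minimal sketch rather than the project's own test script. The helper names `post_to_record` and `latest_posts`, the choice of fields, and the session file saved beforehand are all assumptions made for this example; `instaloader` is a third-party package and is only imported when `latest_posts` is called.

```python
from itertools import islice


def post_to_record(post):
    """Flatten an Instaloader Post (or any object with the same attributes) into a dict."""
    return {
        "shortcode": post.shortcode,
        "date_utc": post.date_utc.isoformat(),
        "caption": post.caption or "",  # caption is None when a post has no text
        "likes": post.likes,
        "is_video": post.is_video,
        "typename": post.typename,  # e.g. "GraphImage", "GraphVideo", "GraphSidecar"
    }


def latest_posts(username, session_user, limit=100):
    """Return up to `limit` posts from a profile, in the order Instaloader yields them."""
    import instaloader  # third-party: pip install instaloader

    loader = instaloader.Instaloader(download_comments=False, save_metadata=False)
    # Assumes a session file for `session_user` was saved beforehand.
    loader.load_session_from_file(session_user)
    profile = instaloader.Profile.from_username(loader.context, username)
    return [post_to_record(p) for p in islice(profile.get_posts(), limit)]
```

Calling `latest_posts("some_public_account", "my_scraper_account")` needs network access and a valid session, so it is subject to the same login and blocking constraints as any other scraper.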

After several initial tests utilising public scrapers, we built our own scraping solution (which we have shared here). This gathers content using the internal API used by the Instagram website itself and it is based on serverless cloud infrastructure utilising cloud functions, storage, and other workflows.

Although using an already-built public scraper seemed to save time initially, it meant waiting for the libraries to be updated whenever breaking changes to the API were introduced or the scraper code broke. Our approach has the advantage of allowing us to pivot easily and giving us access to a wider pool of data to focus on.

But this does not mean our approach has no obstacles. 

Facebook tracks the accounts, IPs and User-Agent of scraper requests and, while this can be evaded for some time, all scrapers, including ours, eventually suffer the same outcome: the accounts used to log in get banned or blocked. The initial ban can usually be lifted via email and mobile verification, but sooner or later the account is blocked altogether and the user has to upload a static selfie with a sequence of random numbers, or a video selfie, to unlock it. To collect the content, we had to use multiple accounts in rotation. It is difficult to estimate the exact number of accounts needed, as it depends solely on the frequency of requests and the scale of the content extracted.
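The rotation idea can be sketched as a small round-robin pool that rests an account once it has been reported as blocked. This is an illustrative sketch only, not the released scraper's logic: the `AccountPool` class, its method names and the one-hour cooldown are hypothetical.

```python
import time
from collections import deque


class AccountPool:
    """Round-robin pool that rests accounts after a block (hypothetical helper)."""

    def __init__(self, usernames, cooldown_seconds=3600, clock=time.monotonic):
        self._queue = deque(usernames)
        self._resting = {}  # username -> time at which it may be used again
        self._cooldown = cooldown_seconds  # assumed value, tune to observed blocks
        self._clock = clock

    def acquire(self):
        """Return the next usable account, or None if every account is resting."""
        now = self._clock()
        for _ in range(len(self._queue)):
            user = self._queue[0]
            self._queue.rotate(-1)  # move this account to the back of the queue
            if self._resting.get(user, 0) <= now:
                return user
        return None

    def report_blocked(self, user):
        """Mark an account as blocked so it is skipped until the cooldown ends."""
        self._resting[user] = self._clock() + self._cooldown
```

A scraper loop would call `acquire()` before each batch of requests and `report_blocked()` when a login challenge or ban response is detected. Accounts that need manual verification would still have to be unlocked by hand.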

Apart from the problem of getting blocked by Facebook, we had to deal with continuous changes to the Instagram API. This meant that the way in which our scraper accesses the content had to be updated each time, forcing an iterative approach to the solution. For instance, while in the past Instagram's internal API allowed 50 posts per request, recent changes reduced that number to 12 posts per request, meaning roughly four times as many requests must be made to retrieve the same amount of data.
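The effect of the smaller page size can be checked with a short calculation. The function name `requests_needed` is made up for this example; the page sizes of 50 and 12 come from the text above.

```python
def requests_needed(post_count, page_size):
    """Number of paginated requests needed to fetch post_count posts."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return -(-post_count // page_size)  # ceiling division on integers


for page_size in (50, 12):
    print(page_size, requests_needed(100, page_size))
```

For the latest 100 posts of a single account, this gives 2 requests at 50 posts per page and 9 requests at 12 posts per page.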

And the time available and the amount of content we were trying to gather forced us to find an alternative solution once more. 

Based on the list of accounts selected by the editorial team via the use of Tensor Social (see what is Tensor Social and how we select our first set of accounts), we had an initial list of 4,016 accounts from which we wanted to scrape content. But prior to the data gathering, we had to verify whether the accounts still existed and, for this, we had to scrape their profiles to distinguish between public, private and deleted accounts.

Only content from public users can be gathered, unless the account used to retrieve the information was already a follower of an account before it became private. This first profile scrape resulted in around 3.5% of the 4,016 initial accounts being discarded (140 no longer exist and 1 is private), which left us with 3,875 valid accounts and around 2.3M posts between them.
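The profile check amounts to sorting each lookup result into one of three buckets and counting them. The sketch below assumes a hypothetical result shape (`None` when no account is found, otherwise a dict with an `is_private` flag) and rebuilds the counts reported above from synthetic entries.

```python
from collections import Counter


def classify(profile):
    """Label a profile lookup result; `None` means no account was found (assumed shape)."""
    if profile is None:
        return "missing"
    return "private" if profile.get("is_private") else "public"


def summarise(profiles):
    """Return the per-status counts and the share of accounts that cannot be scraped."""
    counts = Counter(classify(p) for p in profiles)
    total = sum(counts.values())
    discarded = total - counts["public"]
    return counts, (discarded / total if total else 0.0)


# Synthetic stand-ins matching the figures in the text: 140 missing, 1 private, 3,875 public.
profiles = [None] * 140 + [{"is_private": True}] + [{"is_private": False}] * 3875
counts, share = summarise(profiles)
print(counts["public"], f"{share:.1%}")
```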

A single post on Instagram can be a mixture of multiple images and videos, which we aimed to collect as well. Based on the structure of previously scraped users (see here), we identified that around 33% to 38% of the posts include multiple images and videos, and we estimated that this would increase the total number of items to collect from 2.3 million posts to around 3 million.

Using the number of posts we aimed to collect and the average size of the images on Instagram, we estimated that, overall, we would need between 1 TB and 1.4 TB of storage for images alone.
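The storage estimate is a simple product of item count and average file size. The text does not state the average image size that was used, so the sketch below works backwards from the published range to the average size it implies. Decimal units (1 TB = 10^9 KB) and the function names are assumptions of this example.

```python
KB_PER_TB = 1e9  # decimal units: 1 TB = 1,000,000,000 KB


def storage_tb(item_count, avg_kb):
    """Total storage in TB for item_count files of avg_kb kilobytes each."""
    return item_count * avg_kb / KB_PER_TB


def implied_avg_kb(total_tb, item_count):
    """Average file size in KB implied by a total storage figure."""
    return total_tb * KB_PER_TB / item_count


items = 3_000_000  # around 3 million items, as estimated above
for total in (1.0, 1.4):
    print(total, round(implied_avg_kb(total, items)))
```

Under these assumptions, the 1 TB to 1.4 TB range corresponds to an average of roughly 333 KB to 467 KB per image.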

But we had to consider not only space but also time. To collect all the required data, that is, all posts across the 3,875 listed accounts, with a single scraper running 24/7 we would have needed more than a month, and it might have taken even longer due to a significant increase in the rate at which accounts are blocked or banned.

Initial analysis indicated that, on average, the top ten accounts contain the same number of posts as the bottom 900 accounts. It was then important to pivot and create a subset of the original list of accounts. This new reduced list needed to balance between the number of accounts (so as to have a representative population), the number of posts collected (so as to have enough content), and the time and resources available to complete the task on time. 

As we explained in the section above (List of influencers), we narrowed down the list by keeping the accounts that specifically stated they are run by parents and by excluding the bottom and top accounts by number of posts and followers. As explained in that section and detailed in the analysis script, the thresholds used were different for each country, as the characteristics of the populations were also different.

This final reduced list included 896 accounts, with another 34 becoming private since the first profile scrape was carried out. But even with this reduced list, the time and effort required to gather all the content were considerable, so we decided to proceed in groups of 50 accounts, gathering the latest 100 posts per account, over a period of around ten days.
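Splitting the reduced list into groups of 50 can be sketched as follows. The `batches` helper and the placeholder account names are illustrative; only the list size of 896 and the group size of 50 come from the text.

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


accounts = [f"account_{i}" for i in range(896)]  # placeholder names
groups = list(batches(accounts, 50))
print(len(groups), len(groups[-1]))
```

With 896 accounts this yields 18 groups: 17 full groups of 50 and a final group of 46.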

Our final list included 447 accounts (around half of the reduced list), with around 52,000 posts and over 110,000 associated images (around 22GB). We can therefore roughly estimate that the full reduced list would have yielded around 100,000 posts and 210,000 images (around 42GB).

This scraper has a very low operational cost thanks to the use of ephemeral cloud functions and fully serverless infrastructure. It uses cloud storage for images and BigQuery for the post data, storing everything immediately after the scrape. But all this content was stored directly in the cloud infrastructure of one of the members, whose company restricts access to the platform for external users.

As storage infrastructure was not provided by default during this fellowship, we then moved all the content to another member’s private environment, which has no limitations for external users. The released version is therefore based on a hybrid approach in which content is stored locally as JSON and NDJSON. It does not require in-depth knowledge of cloud infrastructure to get started, and it allows for local processing and analysis.
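NDJSON (newline-delimited JSON) stores one JSON object per line, which makes it easy to append scraped posts and to read them back one at a time. A minimal sketch of writing and reading such a file is shown below. The function names and the file path in the usage note are hypothetical and do not describe the released scraper's layout.

```python
import json
from pathlib import Path


def write_ndjson(records, path):
    """Write an iterable of dicts to `path`, one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_ndjson(path):
    """Yield one dict per non-empty line of an NDJSON file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

For example, `write_ndjson(posts, "posts.ndjson")` followed by `list(read_ndjson("posts.ndjson"))` returns the same records, and the file can be processed line by line without loading it all into memory.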

Scraper architecture structure

    Brands dataset

    In parallel with collecting data from Instagram based on the accounts taken from Tensor Social, we also looked for existing datasets on Instagram influencer accounts. We found only two available datasets (shared upon request), both created by University of South Florida researcher Seungbae Kim.

    The first dataset contains 33,935 Instagram influencer accounts classified into 9 categories. It contains 300 posts per influencer, for a total of 10,180,500 posts and 12,933,406 image files. The second dataset consists of 26,910 brands and 1,601,074 Instagram posts.

    The first dataset will be useful in our project as a control dataset, to verify the outcome of our analysis and how reliable our results are. The second one will be used to extract a list of brands to monitor, with the aim of verifying whether these brands are named in the posts we collected.
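One simple way to check whether brands from such a list are named in a caption is a case-insensitive whole-word match that also accepts `@` mentions and `#` hashtags. The sketch below is an illustration under that assumption, not the project's method; the function names and the two brand names used in the usage note are made up for the example.

```python
import re


def build_brand_pattern(brands):
    """Compile one regex that matches any brand as a whole word, mention or hashtag."""
    # Longest names first so that a longer brand wins over a shorter prefix.
    ordered = sorted({b.strip().lower() for b in brands if b.strip()}, key=len, reverse=True)
    if not ordered:
        return re.compile(r"(?!)")  # pattern that never matches
    alternation = "|".join(re.escape(b) for b in ordered)
    return re.compile(rf"(?<!\w)[@#]?({alternation})(?!\w)", re.IGNORECASE)


def brands_in_caption(caption, pattern):
    """Return the sorted, lower-cased brand names found in a caption."""
    return sorted({m.group(1).lower() for m in pattern.finditer(caption or "")})
```

With `pattern = build_brand_pattern(["Nike", "Adidas"])`, the caption `"Loving my new @Nike shoes #adidas"` yields `["adidas", "nike"]`, while a word that merely contains a brand name is not matched. Real brand lists would need extra handling for ambiguous names and spelling variants.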

    The Python script to import this data is available in this repository.

      REST API

      We have chosen to use Strapi to build our REST API to expose the data. Strapi is an excellent open-source project that is at the same time highly scalable and suitable for rapid development.

      The data structure, which has evolved over time, was devised to look at multiple platforms, even though our goal for the Fellowship was to focus our attention solely on the elusive Instagram.

      In choosing entities and their relations, we did our best to create a flexible scaffold to contain the scraped data, adapting and extending it over time. We also wanted to be able to extract data and visualizations to more easily navigate the data and to make it available to some extent on this site. The alternative is to dive into JSON and CSV files.

      Data is available upon request, only for research or investigative purposes.