Storing the data
Cleaning and parsing data
The process of storing the data had to go hand in hand with the cleaning phase, as it was important both to preserve the raw data and to add missing information.
Data gathered from the scraper and stored in the AWS S3 bucket was structured by country (media company), with a sub-folder for each influencer containing:
- a profile file in JSON and JSONL formats (sample)
- a file for the posts in JSON and JSONL formats (sample)
- a file for post tags in JSON and JSONL formats (sample)
- a file for the locations in KML format (sample)
- a folder with images in JPEG and WebP formats
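As an illustration of this layout, the short sketch below lists the objects stored under a single influencer's sub-folder; the bucket name, country prefix and influencer handle are hypothetical placeholders rather than the names we actually used.

```python
# Minimal sketch: list the files kept for one influencer in the S3 bucket.
# Bucket name, country prefix and influencer handle are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

BUCKET = "influencer-research-data"    # hypothetical bucket name
PREFIX = "spain/example_influencer/"   # country (media company) / influencer

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # Expected keys: profile.json(l), posts.json(l), tags.json(l),
        # locations.kml and an images/ folder with .jpeg / .webp files.
        print(obj["Key"], obj["Size"])
```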
These files are linked to one another through the user ID, and each contains part of the information about the Instagram user. This data was broken down into smaller elements so as to be compatible with our database architecture, managed via Strapi.
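The exact mapping between these raw files and our Strapi collections is specific to our setup, but a minimal sketch of the destructuring step might look like the following, assuming a Strapi v4 instance with a hypothetical `posts` collection and hypothetical field names.

```python
# Minimal sketch: destructure a posts .jsonl file into smaller records and
# push them to Strapi. Collection name, field names, STRAPI_URL and
# STRAPI_TOKEN are hypothetical placeholders.
import json
import requests

STRAPI_URL = "http://localhost:1337"   # hypothetical Strapi instance
STRAPI_TOKEN = "..."                   # hypothetical API token

def load_jsonl(path):
    """Yield one raw post per line of a .jsonl file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

def push_post(raw_post):
    """Keep only the fields the (hypothetical) Strapi collection expects."""
    entry = {
        "user_id": raw_post.get("user_id"),   # key that links all the files
        "post_id": raw_post.get("id"),
        "caption": raw_post.get("caption"),
        "taken_at": raw_post.get("taken_at"),
    }
    resp = requests.post(
        f"{STRAPI_URL}/api/posts",            # Strapi v4 REST endpoint
        headers={"Authorization": f"Bearer {STRAPI_TOKEN}"},
        json={"data": entry},
    )
    resp.raise_for_status()

for post in load_jsonl("posts.jsonl"):
    push_post(post)
```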
Technical infrastructure
As described in the section above, the amount of data and the size of the content we aimed to collect were considerable, and data storage became a crucial issue from the beginning of this project.
During our initial research we could not find any information or statistics on storage space and cost estimates from similar studies, but we knew that the volume of data produced by each influencer varies significantly - especially as Instagram limits the size of stories to 4GB for every 15 seconds of video.
We therefore decided to collect a sample of data to make a rough estimate of the storage needed. The sample was based on 30 accounts and all the content they produced during one month of 2022 - usually August.
We ran this test quite early in the project (the results can be found here), even before deciding how many accounts we wanted to extract content from and while we were still completing earlier steps of the project. This calculation was key to understanding that the space and cost of storage would be quite high. So, to keep the final volume under control, we decided to limit ourselves to collecting text, images and metadata, excluding videos.
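For reference, the kind of back-of-the-envelope extrapolation behind this decision can be sketched as below; all figures are hypothetical placeholders, not the measured results of our sample, which are linked above.

```python
# Back-of-the-envelope storage estimate extrapolated from a one-month sample.
# All numbers below are hypothetical placeholders, not our measured results.
sample_accounts = 30     # accounts in the test sample
sample_size_gb = 50.0    # hypothetical: total size collected over one month
target_accounts = 600    # hypothetical: accounts planned for the full study
months = 12              # hypothetical: planned collection period in months

gb_per_account_month = sample_size_gb / sample_accounts
estimated_total_gb = gb_per_account_month * target_accounts * months

print(f"~{gb_per_account_month:.2f} GB per account per month")
print(f"~{estimated_total_gb:,.0f} GB (~{estimated_total_gb / 1024:.1f} TB) for the full collection")
```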
The fellowship programme did not include any environment or infrastructure that we could use for storage, but it provided us with some budget and help with networking to find a solution for our needs. A significant amount of time and effort went into finding a suitable option.
During the course of the project, we had conversations with AWS and Google Cloud Platform to consider their architecture proposals.
While either AWS or Google Cloud Platform could easily provide the storage capacity required for the data, most of the cost would come from the actual analysis tasks: processing the data, training the models required for the analysis, and image detection.
As our budget was limited compared to the usual costs of these providers, we were also advised to apply for free storage via the Google Cloud Research Credits programme, which grants access to most products in Google Cloud Platform; however, we had not received a resolution on this application by the end of the fellowship.
In parallel with the conversations with AWS and GCP and the application to the Google Cloud Research Credits programme, we had to find alternative solutions to keep making progress while waiting for a resolution.
Data collected from Tensor Social via its API was stored in a PostgreSQL database hosted on an on-premises machine, from which a temporary file was generated to allow for manual scrambling of the data (this is the Python script used).
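The linked script is the authoritative version of this step; the snippet below is only a minimal sketch of what such an export could look like, assuming a hypothetical `influencer_accounts` table, column names and output file name.

```python
# Minimal sketch: export account records from the on-premises PostgreSQL
# database to a temporary CSV file; the resulting list was then scrambled
# manually. Connection details, table and column names are hypothetical.
import csv
import psycopg2

conn = psycopg2.connect(
    host="localhost", dbname="tensor_social", user="etl", password="..."
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT user_id, username, country FROM influencer_accounts;")
    rows = cur.fetchall()

with open("accounts_export.csv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh)
    writer.writerow(["user_id", "username", "country"])
    writer.writerows(rows)
```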
This file was later used for the scraping phase. The data collected via the scraper was then temporarily stored on a cloud system belonging to one of our companies. However, given security restrictions, external users could not access that storage, so we transferred the content to an AWS S3 bucket already at our disposal within another of the member companies, which can be accessed by users external to that media organisation.
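As an illustration of that final transfer, a minimal sketch using boto3 is shown below; the local folder and bucket names are hypothetical placeholders, and the sketch simply mirrors the local folder structure as S3 keys.

```python
# Minimal sketch: copy the scraped content from temporary local storage to the
# shared AWS S3 bucket. Local path and bucket name are hypothetical placeholders.
from pathlib import Path
import boto3

s3 = boto3.client("s3")

LOCAL_ROOT = Path("scraped_data")        # hypothetical local export folder
BUCKET = "influencer-research-data"      # hypothetical shared bucket

for path in LOCAL_ROOT.rglob("*"):
    if path.is_file():
        # Preserve the country/influencer folder structure as the S3 key.
        key = path.relative_to(LOCAL_ROOT).as_posix()
        s3.upload_file(str(path), BUCKET, key)
        print(f"uploaded {key}")
```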