Tech Company News Data Dump on HuggingFace: 3M Most Cited Posts About 3k Most Valued Tech Companies

8 Feb 2024

HackerNoon curated 3M+ of the internet's most cited tech company news articles and blog posts about the 3k+ most valuable tech companies in 2022 and 2023. These stories were curated to power HackerNoon.com/Companies, where we update daily news on top technology companies like Microsoft, Google, and HuggingFace. This dataset is open sourced under the Creative Commons license on HuggingFace as Tech Company News Data Dump. Please use this tech company news data freely for your project :-)

https://huggingface.co/datasets/HackerNoon/tech-company-news-data-dump

How the tech companies were curated

Our team made a list of the most valuable technology companies and added companies as they started to trend in the news and on HackerNoon. The first one and a half thousand were public companies, selected by market cap. Then, as companies were mentioned in HackerNoon stories and performed well in our startup of the year voting, we created tech company news pages for them. Once a tech company news page is created, our system curates and stores the trending news, articles, and blog posts about that company, based on our rules and prompts that define what counts as a trending story.

How the stories, articles and blog posts were sourced

Stories were sourced using a combination of custom rules, prompts, and conditions for relevance, specificity, and trendiness, applied to the Bing News API, the Brave News API, and the HackerNoon API. We drilled down on the industry match for each company and heavily favored more trusted, high-ranking sites, while also allowing for relevant lower-ranking niche publishers. For each company, we surface the 10-20 most relevant stories on their main /company page (Microsoft as an example), and then feature the complete list of the company’s news, stories, mentions, articles, and notable links in internet history on company-name/news (Google as an example).

How this tech company news data is organized

The columns are companyName, company URL, publishedAT, (story) url, title, featured image, and (meta) description. This follows how we organize data in our database. Every article is connected to at least one company. Some companies have more articles than others, based on their share of voice. For example, using the dataset viewer, you can see that Google has 99,152 results, 3M has 20,608 results, Adobe has 13,449 results, and NVIDIA has 19,811 results.
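To make the layout concrete, here is a minimal sketch of what one row might look like and how per-company article counts, like the ones quoted above, could be tallied. The field spellings (companyUrl, publishedAt, featuredImage) are assumptions based on the column list, and the rows are invented placeholders, not real records from the dataset. Check the dataset viewer for the exact column names before relying on them.

```python
from collections import Counter

# Placeholder rows. Field names mirror the columns described above, but the
# exact spellings are assumptions; the values are invented for illustration.
rows = [
    {
        "companyName": "ExampleCo",
        "companyUrl": "https://example.com",
        "publishedAt": "2023-01-15",
        "url": "https://news.example.com/exampleco-launch",
        "title": "ExampleCo launches a new product",
        "featuredImage": "https://news.example.com/img/launch.png",
        "description": "A short meta description of the story.",
    },
    {
        "companyName": "ExampleCo",
        "companyUrl": "https://example.com",
        "publishedAt": "2023-02-02",
        "url": "https://news.example.com/exampleco-earnings",
        "title": "ExampleCo reports earnings",
        "featuredImage": "",
        "description": "",
    },
    {
        "companyName": "SampleSoft",
        "companyUrl": "https://example.org",
        "publishedAt": "2022-11-30",
        "url": "https://news.example.com/samplesoft-update",
        "title": "SampleSoft ships an update",
        "featuredImage": "",
        "description": "",
    },
]


def articles_per_company(rows):
    """Count how many rows (articles) are attached to each company name."""
    return Counter(row["companyName"] for row in rows)


counts = articles_per_company(rows)

# Print companies from most to fewest articles.
for name, n in counts.most_common():
    print(name, n)
```

The same function should work on the real rows once they are loaded into a list of dictionaries, provided the company column is spelled the way this sketch assumes.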

Without even downloading the data, you can search for company or publication names in the dataset viewer, like NVIDIA pictured below:

This dataset is open sourced under the Creative Commons license on HuggingFace as Tech Company News Data Dump. Please use this tech company news data freely for your project :-) You could quantify a company’s aggregate share of voice online, analyze the sentiment of a company’s digital news coverage, train a model to predict which headlines will be published about which companies in the future, or pursue whatever other research about large tech companies and media coverage your heart desires.
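As one illustration of the share-of-voice idea, the sketch below computes each company's fraction of all rows. It is a hypothetical helper, not part of any HackerNoon tooling, and the company names are placeholders. Because an article can be connected to more than one company, the fractions describe shares of company mentions (rows), not shares of unique articles.

```python
from collections import Counter


def share_of_voice(company_names):
    """Return each company's fraction of all rows, keyed by company name.

    An empty input yields an empty dict, so no division by zero occurs.
    """
    counts = Counter(company_names)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Placeholder company names standing in for the dataset's company column.
names = ["ExampleCo", "ExampleCo", "SampleSoft", "ExampleCo"]
shares = share_of_voice(names)

# Print companies from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {share:.0%}")
```

With the Hugging Face `datasets` library, the rows could presumably be loaded by passing the repository name from the URL above to `load_dataset`, but the available splits and exact column names are best confirmed in the dataset viewer first.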

Check out this open data here:

https://huggingface.co/datasets/HackerNoon/tech-company-news-data-dump?embedable=true