What is changing?
Lightcast is improving its postings deduplication process, which will result in more robust reporting on the number of postings a given company has.
When will this change take place?
For Canada, the change is planned for 6th October.
For the US, the change is planned for 20th October.
What will this change mean for me?
We expect national postings counts to change as follows:
| Period | United States | Canada |
| --- | --- | --- |
| 2018 | -1.2% | 1.2% |
| 2019 | 0.3% | -0.3% |
| 2020 | -0.1% | -0.2% |
| 2021 | -2.1% | -1.2% |
| 2022 | -2.6% | -3.9% |
| 1 Jan - 31 Jul 2023 | -1.6% | -1.0% |
Why are you changing the deduplication process?
Lightcast has invested in better postings classification. We are now applying these improvements to our deduplication process, which has remained largely untouched since its inception.
What are you changing in the deduplication process?
We are changing two items:
The input for Company Name. The new deduplication uses our most up-to-date company classifier to identify and normalize the company name.
The use of expiration dates. The new deduplication uses expiration dates, instead of a fixed 60-day check, to find duplicates.
Why am I seeing decreased numbers?
Because we now normalize company names more accurately, we can better match and find duplicate postings within each company. More duplicates are removed, so counts decrease.
What does this mean for my analysis?
Users should expect to see some variation in specific company counts.
Because posting numbers per company have shifted due to better deduplication, we have seen a proportional change in all other data elements, so customers may choose to rerun some analyses.
From a ranking and distribution standpoint, we see the same or similar rankings in occupations, skills, titles, and all other data points as we did previously.
What is your new deduplication process?
Lightcast collects job advertisements from over 51,000 sources to provide the most up-to-date and comprehensive labor market information. We apply a process to identify and remove duplicates, and the remaining advertisements become the postings seen across Lightcast products. As a result, up to 80% of all advertisements we collect are identified as duplicates.
The first step:
On a source-by-source basis, our spiders identify and collect newly posted advertisements.
The second step:
Because the same new advertisement can appear on multiple sources, we compare advertisements across sources to find duplicates.
We do this in the following way:
On a daily basis, we code each advertisement with a normalized company name, a title, a location, and other fields. These fields are used to create a duplicate key.
The posting date and expiration date are determined for every advertisement.
The advertisements are then grouped by the duplicate key.
We check whether the advertisements in the group have at least one day in common within the last 120 days.
If they do, only one deduplicated posting is returned.
For example, take three advertisements:
Advertisement 1: Posted on Day 1, Expired Day 60
Advertisement 2: Posted on Day 59, Expired Day 118
Advertisement 3: Posted on Day 100, Expired Day 140
If these were all grouped together, the posting would have a range of 140 days, which is longer than the 120-day window. Therefore, they are split into two deduplicated postings:
Posting 1: Containing Advertisements 1 and 2
Posting 2: Containing Advertisement 3
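The grouping above can be illustrated with a short Python sketch. This is not Lightcast's implementation. It assumes one possible reading of the rule: advertisements with the same duplicate key are merged when they share at least one day and the merged posting would span no more than 120 days. The `Advertisement` class, the `deduplicate` function, the greedy ordering by posting day, and the example duplicate key are all hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from itertools import groupby

# 120-day window taken from the text; how it is enforced below is an assumption.
MAX_SPAN_DAYS = 120


@dataclass(frozen=True)
class Advertisement:
    """Hypothetical record for one collected advertisement."""
    ad_id: str
    duplicate_key: tuple  # e.g. (normalized company name, title, location)
    posted_day: int
    expired_day: int


def deduplicate(ads):
    """Group advertisements into deduplicated postings (illustrative only)."""
    postings = []
    ordered = sorted(ads, key=lambda a: (a.duplicate_key, a.posted_day))
    for _, group in groupby(ordered, key=lambda a: a.duplicate_key):
        current, start, end = [], None, None
        for ad in group:
            if current:
                # The ad shares at least one day with the current posting.
                overlaps = ad.posted_day <= end
                new_end = max(end, ad.expired_day)
                # Inclusive day count of the merged posting.
                fits = (new_end - start + 1) <= MAX_SPAN_DAYS
                if overlaps and fits:
                    current.append(ad)
                    end = new_end
                    continue
                postings.append(current)
            # Start a new posting with this advertisement.
            current, start, end = [ad], ad.posted_day, ad.expired_day
        if current:
            postings.append(current)
    return postings


# The three advertisements from the example, with a made-up duplicate key.
key = ("example company", "example title", "example location")
ads = [
    Advertisement("Advertisement 1", key, posted_day=1, expired_day=60),
    Advertisement("Advertisement 2", key, posted_day=59, expired_day=118),
    Advertisement("Advertisement 3", key, posted_day=100, expired_day=140),
]
postings = deduplicate(ads)
for number, posting in enumerate(postings, start=1):
    print(f"Posting {number}:", [ad.ad_id for ad in posting])
```

Under these assumptions, Advertisements 1 and 2 merge into a posting covering days 1 to 118, while adding Advertisement 3 would stretch it to 140 days, so Advertisement 3 starts a second posting.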
How does your expiration process work?
The expiration process can be found here