This article outlines the creation of Lightcast’s job postings data, from the collection of postings to enrichment of the data.
It is important to note that job postings are not necessarily the same as job vacancies; there is a correlation, but many recruitment practices make it an imperfect relationship. Job postings are a measure of recruitment marketing by employers purportedly looking to fill job vacancies.
The methodology used to obtain job advertisements from publicly available online job boards and company websites is based on Lightcast’s advanced scraping technology. Once Lightcast identifies an online site as a valid source of employment opportunities, a dedicated spider is programmed, tested, and activated. The spider visits the site regularly and pulls job information for all jobs posted; the information is then stored in a database. The sites with the newest jobs or with the highest frequency of change in postings are visited most frequently. Lightcast currently scrapes more than 65,000 sites worldwide.
Lightcast’s database is a full reflection of job listings posted across the Internet; as such, robust processes are required to identify and remove duplicate listings. Lightcast applies a unique two-step approach to deduplication that results in up to 80% of all collected advertisements being removed as duplicates.
The first step: On a source-level basis, we use intelligence contained within the scraping spiders to identify a new advertisement for that source. The spiders refrain from collecting advertisements that have previously been aggregated.
The second step: Because the same new advertisement can be found across multiple sources, we use normalized fields, including job title, company, and location, and check whether these fields have appeared in new advertisements found in another source. This is checked across 60 days of data to identify duplicates.
To illustrate step two, here is an example: if a job for a Marketing Specialist at Google is posted for the first time on March 1st, Lightcast considers this the ‘original posting’; for the next 60 days, Lightcast considers any matching advertisements it finds to be duplicates. In theory, if Google posts the same ad every day for the entire year on different sources, Lightcast will count it 6 times.
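The second deduplication step can be sketched in a few lines of Python. This is a minimal illustration, not Lightcast's implementation: the `normalize` helper, the tuple layout, and the location value are hypothetical, and the sketch assumes an advertisement counts as a duplicate when it appears up to and including 60 days after the original posting with the same normalized key.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=60)  # deduplication window stated in the text


def normalize(value: str) -> str:
    """Hypothetical normalization: lowercase and collapse whitespace."""
    return " ".join(value.lower().split())


def count_postings(ads):
    """Count unique postings among (title, company, location, posted_on) tuples.

    Assumption: an ad is a duplicate if it appears no more than 60 days
    after the current original posting with the same normalized key.
    """
    originals = {}  # normalized key -> date of the current original posting
    count = 0
    for title, company, location, posted_on in sorted(ads, key=lambda ad: ad[3]):
        key = (normalize(title), normalize(company), normalize(location))
        original = originals.get(key)
        if original is not None and posted_on - original <= WINDOW:
            continue  # duplicate of an earlier posting inside the window
        originals[key] = posted_on  # a new original posting starts a new window
        count += 1
    return count


# The same ad posted every day for a year (location is illustrative only).
start = date(2022, 3, 1)
daily_ads = [
    ("Marketing Specialist", "Google", "Mountain View, CA", start + timedelta(days=i))
    for i in range(365)
]
print(count_postings(daily_ads))
```

Under these assumptions, the daily ad is counted 6 times over 365 days, which agrees with the example above.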
Once postings are collected, Lightcast technologies parse, extract, and code dozens of data elements, including the following: Lightcast job title, occupation, company, and detailed data about the specific skills, educational credentials, certifications, experience levels, and work activities required for a specific job, as well as data about salary, number of openings, and job type. The high level of detail enables users to look beyond summary statistics to discover specific skills in demand and skills that job seekers can identify and acquire if needed.
Job Postings Expiration
To expire a posting, we use:
Fetch-based expiration: we revisit previously scraped job advertisements and look for key phrases such as "Page not found" or "This job is no longer active", among others. When these are found, we determine that advertisement to be expired.
Age-based expiration: a job advertisement is expired at no more than 60 days.
For a posting made up of one advertisement, the above results in the posting itself being expired.
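The two advertisement-level rules can be combined as in the sketch below. It is a hedged illustration only: the function name and parameters are hypothetical, the phrase list contains just the two phrases quoted above, and fetching the page is left out (the caller supplies the page text).

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=60)  # age-based limit from the text
# Only the two phrases quoted in the text; the real list is longer.
EXPIRED_PHRASES = ("page not found", "this job is no longer active")


def is_expired(posted_on: date, today: date, page_text: str) -> bool:
    """Return True if either expiration rule applies to an advertisement."""
    # Fetch-based rule: the revisited page contains an expiration phrase.
    fetch_expired = any(phrase in page_text.lower() for phrase in EXPIRED_PHRASES)
    # Age-based rule (assumption: expired once the ad is 60 days old).
    age_expired = today - posted_on >= MAX_AGE
    return fetch_expired or age_expired


print(is_expired(date(2022, 1, 1), date(2022, 1, 10), "This job is no longer active"))
print(is_expired(date(2022, 1, 1), date(2022, 3, 2), "Apply now"))
```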
A job posting can also be made up of many job advertisements. For these, we look at all the advertisements that make up that posting; when they have expired across all sources, we mark the posting as expired, at a maximum of 121 days, as follows:
The advertisements that make up the posting are combined into an array.
The expiration of the posting is based on both the first and last seen advertisements.
For example, take one posting that is made up of two advertisements:
Advertisement 1: Posted January 1, 2022; Expired March 2, 2022
Advertisement 2: Posted March 2, 2022; Expired May 1, 2022
The posting will therefore be active from January 1, 2022 and will expire on May 1, 2022.
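The worked example can be reproduced with a short sketch. The function name and the (posted, expired) pair layout are assumptions made for illustration; the logic simply takes the first-seen and last-expired dates across the posting's advertisements.

```python
from datetime import date


def posting_window(advertisements):
    """Return (active_from, expires_on) for a posting.

    `advertisements` is a list of (posted_on, expired_on) date pairs,
    a hypothetical layout standing in for the array described in the text.
    """
    active_from = min(posted for posted, _ in advertisements)
    expires_on = max(expired for _, expired in advertisements)
    return active_from, expires_on


ads = [
    (date(2022, 1, 1), date(2022, 3, 2)),  # Advertisement 1
    (date(2022, 3, 2), date(2022, 5, 1)),  # Advertisement 2
]
active_from, expires_on = posting_window(ads)
print(active_from, expires_on)
print((expires_on - active_from).days + 1)  # days spanned, counting both endpoints
```

For this example the posting spans 120 days from first posting to expiration, or 121 when both endpoint dates are counted, which is one plausible reading of the 121-day maximum.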
Active and Newly Posted
Newly Posted measures all postings that were posted in that month.
Active measures how many postings were live during that month (even if originally posted in a previous month but left active by the employer).
Active postings are a good way to view the total open demand present in a given month, while Newly Posted gives a better view of the behavior of the market in a given month and over time.
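The difference between the two measures can be shown with a small sketch. The data layout and function names are hypothetical, and the sketch assumes a posting counts as active in a month if it was posted before the month ended and had not expired before the month began.

```python
from datetime import date


def month_bounds(year: int, month: int):
    """Return the first day of the month and the first day of the next month."""
    first = date(year, month, 1)
    next_first = date(year + month // 12, month % 12 + 1, 1)
    return first, next_first


def monthly_counts(postings, year: int, month: int):
    """Return (newly_posted, active) for the given month.

    `postings` is a list of (posted_on, expired_on) pairs; expired_on is
    None for a posting that is still live.
    """
    first, next_first = month_bounds(year, month)
    newly_posted = sum(1 for posted, _ in postings if first <= posted < next_first)
    active = sum(
        1
        for posted, expired in postings
        if posted < next_first and (expired is None or expired >= first)
    )
    return newly_posted, active


postings = [
    (date(2022, 1, 1), date(2022, 5, 1)),   # posted in January, live through April
    (date(2022, 3, 15), None),              # posted in March, still live
    (date(2022, 1, 5), date(2022, 2, 10)),  # expired before March
]
print(monthly_counts(postings, 2022, 3))
```

In March 2022, only the second posting is newly posted, but both the first and second are active.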
Company Normalization and Metadata
Starting with raw company names, we normalize these names using a set of proprietary criteria. This strips information from the name that is irrelevant to identifying the company correctly (e.g. LLC, Inc.). This leaves a normalized name with all the ingredients needed to classify it to our Companies Taxonomy.
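As a rough illustration of this kind of cleanup, the sketch below strips trailing legal suffixes from a raw name. The actual criteria are proprietary, so the suffix list, the function name, and the example company name are all assumptions made for illustration.

```python
import re

# Hypothetical suffix list; the real normalization criteria are proprietary.
LEGAL_SUFFIXES = ("llc", "inc", "ltd", "corp", "co")
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:" + "|".join(LEGAL_SUFFIXES) + r")\.?$", re.IGNORECASE
)


def normalize_company(raw_name: str) -> str:
    """Collapse whitespace and strip trailing legal suffixes such as LLC or Inc."""
    name = " ".join(raw_name.split())
    previous = None
    while previous != name:  # repeat to handle stacked suffixes like "Co., Ltd."
        previous = name
        name = _SUFFIX_RE.sub("", name).rstrip(" ,")
    return name


print(normalize_company("Acme Widgets, LLC"))
print(normalize_company("Acme  Widgets Co., Ltd."))
```

Both made-up inputs reduce to the same clean name, which could then be matched against a taxonomy.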
After normalization, we match the clean name to the best fit in our Companies Taxonomy. Each company has associated metadata, including Tradestyle, NAICS codes, and staffing labels. If a company is a subsidiary or establishment of another company, we generally roll it up into the main company when the establishment or subsidiary has the parent in its name. For example, “Walmart Canada” would be classified as “Walmart.”
We do make exceptions for companies that advertise jobs under a brand or product name, such as social media platforms. For example, postings may be advertised as TikTok but will appear in the taxonomy under ByteDance, Ltd, which is the actual employer. The same exception applies to hospitals that maintain a different name and operate self-sufficiently but, where appropriate, sit under the umbrella of a parent company. These will also appear within the taxonomy under the parent company.
Lightcast assigns an education level to each posting using a machine learning model to detect the presence of required or preferred education levels. If more than one education level is mentioned, the posting will be tagged with all levels mentioned. Potential values include High School/GED, Associate’s Degree, Bachelor’s Degree, Master’s Degree, or Ph.D./Professional Degree. In the case that the posting does not contain any educational requirements it will be tagged as Unspecified.
Postings are tagged as full-time (more than 32 hours), part-time (32 hours or less), flexible hours (if the posting mentions both full and part time, or a range of working hours that span both categories), or intern. If the posting does not specify, full-time is assumed.
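The tagging rules above can be written out as a small decision function. This is a sketch under stated assumptions: the weekly hours are taken as already extracted from the posting, the function and parameter names are hypothetical, and giving the intern flag precedence over hours is an assumption rather than something the text specifies.

```python
from typing import Optional

FULL_TIME_THRESHOLD = 32  # hours per week, from the text


def job_type(
    min_hours: Optional[float],
    max_hours: Optional[float],
    is_intern: bool = False,
) -> str:
    """Tag a posting from its (already extracted) weekly hours range."""
    if is_intern:
        return "Intern"  # assumption: internship takes precedence over hours
    if min_hours is None or max_hours is None:
        return "Full-time"  # unspecified postings are assumed full-time
    if max_hours <= FULL_TIME_THRESHOLD:
        return "Part-time"
    if min_hours > FULL_TIME_THRESHOLD:
        return "Full-time"
    return "Flexible hours"  # the range spans both categories


print(job_type(40, 40))
print(job_type(20, 40))
print(job_type(None, None))
```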
Years of experience required for the position are captured where available. Not all postings include an experience level; unspecified postings will not be displayed when the Minimum Experience Required filter is applied.
Country, city, and state information is captured during the scraping process when present. Lightcast also maps postings to traditional MSAs using a mapping based on Google geo-coding that links MSAs to the city-state combinations found in job postings. A similar process is used to map city-states to counties.
Skills data are extracted from the text of the posting: Lightcast scans the text for sequences of words that indicate skills. Lightcast distinguishes between specialized skills, common skills, software skills, and certifications.
Specialized Skills, also known as technical skills or hard skills, are skills that are primarily required within a subset of occupations or that equip one to perform a specific task (e.g. “NumPy” or “Hotel Management”).
Common Skills, also known as soft skills, human skills, and competencies, are skills that are prevalent across many different occupations and industries, including both personal attributes and learned skills (e.g. “Communication” or “Microsoft Excel”).
Software Skills are any software tool or programming component used to help with a job (e.g. Python, Workday, AutoCAD, Microsoft Excel, React.Js, Accounting Software, and 3D Modeling Software would all be considered “Software Skills”).
Certifications are recognizable qualification standards assigned by industry or education bodies (e.g. “Cosmetology License” or “Certified Cytotechnologist”).
Some job postings include the salary or salary range of the vacancy. Lightcast extracts and cleans this information and includes it in the dataset when it is a likely and reasonable reflection of the position. Lightcast uses 2,080 work hours per year (40 hours per week for 52 weeks) to convert salaries listed in an hourly format to an annual format, and vice versa.
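The hourly and annual conversion is simple arithmetic, sketched below. The function names are hypothetical; the only figure taken from the text is the 2,080-hour work year.

```python
HOURS_PER_YEAR = 2080  # work hours in a year, as stated in the text


def hourly_to_annual(hourly_rate: float) -> float:
    """Convert an hourly wage to an annual salary."""
    return hourly_rate * HOURS_PER_YEAR


def annual_to_hourly(annual_salary: float) -> float:
    """Convert an annual salary to an hourly wage."""
    return annual_salary / HOURS_PER_YEAR


print(hourly_to_annual(25.0))   # 52000.0
print(annual_to_hourly(52000))  # 25.0
```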
Remote or Hybrid
All job postings are scanned for the presence of language indicating that the advertised position can be filled by a remote or partially remote worker. This involves analyzing the text of each posting’s title and body for job location language. Many words and phrases are used to indicate a remote or hybrid position, including “remote”, “position can be located anywhere”, “work from home”, “telecommute”, “partially remote”, and others. Postings containing language indicative of job location are tagged as Remote, Hybrid, or Non-Remote. Note that the definition of Remote is broad enough to include postings that require a person to live in a particular region, as long as coming into an office is not required. Postings that contain no such language are tagged as Unknown.
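A simple keyword scan gives a feel for this tagging, although it is only a sketch and not the production method. The remote phrases are those quoted above; treating "partially remote" and "hybrid" as Hybrid indicators, and the on-site phrase list used for Non-Remote, are assumptions made for illustration.

```python
# Assumption: these phrases indicate a hybrid role.
HYBRID_PHRASES = ("partially remote", "hybrid")
# Phrases quoted in the text as indicating remote work.
REMOTE_PHRASES = (
    "remote",
    "position can be located anywhere",
    "work from home",
    "telecommute",
)
# Hypothetical phrases indicating an on-site role.
ON_SITE_PHRASES = ("on-site", "onsite")


def tag_job_location(title: str, body: str) -> str:
    """Tag a posting as Hybrid, Remote, Non-Remote, or Unknown."""
    text = f"{title} {body}".lower()
    # Check hybrid first, because "partially remote" also contains "remote".
    if any(phrase in text for phrase in HYBRID_PHRASES):
        return "Hybrid"
    if any(phrase in text for phrase in REMOTE_PHRASES):
        return "Remote"
    if any(phrase in text for phrase in ON_SITE_PHRASES):
        return "Non-Remote"
    return "Unknown"


print(tag_job_location("Data Analyst (Remote)", ""))
print(tag_job_location("Engineer", "This role is partially remote."))
print(tag_job_location("Cashier", "Apply today."))
```

Plain substring matching of this kind would misread phrases such as "remote sensing", so a real system would need more context than this sketch uses.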
Raw titles are collected at aggregation and are then cleaned and normalized to our Lightcast Titles taxonomy. For example, a posting for Facebook might have a raw job title of “Data Science Manager, Messenger”; once the posting is run through a tagging system, the job title is normalized to “Data Science Manager.”
Lightcast uses machine learning models and rules to code occupations from the raw title and job description of the job posting.