This article outlines the creation of Lightcast’s job postings data, from the collection of postings to enrichment of the data across 165+ countries.

It is important to note that job postings are not necessarily the same as job vacancies. There is a correlation, but many recruitment practices make this relationship imperfect. Job postings are a measure of recruitment marketing by employers purportedly looking to fill job vacancies.

Aggregation

Lightcast collects job postings worldwide from over 220,000 current and historical sources. Our advanced scraping technology obtains job advertisements from publicly available online job boards and company websites. Once a site is identified as a valid source of employment opportunities, a dedicated spider is programmed, tested, and activated. This spider regularly visits the site, extracts job information for all posted positions, and stores it in our database. Sites with the newest jobs or the highest frequency of posting changes are visited most frequently.

Lightcast revisits sites for up to 14 days from the initial scrape date to ensure all postings are collected. Therefore, a small percentage of postings for a given date may be added up to 14 days later. For sites with daily collected postings, the time window from posting to scraping, processing, and publishing in our data is 36 hours.

More information on our Supported Countries can be found here.

More information on our Global Sets can be found here.

Deduplication

Lightcast’s database is a full reflection of job listings posted across the internet. As such, robust processes are required to identify and remove duplicate listings. Lightcast applies a unique two-step approach to deduplication, resulting in up to 80% of the jobs we collect being deduplicated.

The first step: On a source-level basis, we use intelligence contained within the scraping spiders to identify new advertisements for that source. The spiders refrain from collecting advertisements that have previously been aggregated.

The second step: Since the same new advertisement can be found across multiple sources, we use normalized fields such as job title, company, and location, and check if these fields have been used in new advertisements found in another source. This process is checked across 60 days of data to identify duplicates.

To illustrate step two, here is an example: If a job for a Marketing Specialist at Google is posted for the first time on March 1st, Lightcast considers this the “original posting.” For the next 60 days, Lightcast considers any advertisements for this job found elsewhere as duplicates. In theory, if Google posts the same ad every day for a year on different sources, Lightcast will count it six times.
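The cross-source check in step two can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Lightcast's implementation: the `Ad` record, the `normalize` helper, the exact-match key, and the treatment of the 60-day boundary are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta

WINDOW = timedelta(days=60)  # duplicate window described in the text


@dataclass(frozen=True)
class Ad:
    title: str
    company: str
    location: str
    source: str
    seen: date


def normalize(text: str) -> str:
    # Hypothetical normalization: lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def deduplicate(ads: list[Ad]) -> list[Ad]:
    """Return the advertisements treated as original postings."""
    originals: list[Ad] = []
    last_original: dict[tuple[str, str, str], date] = {}
    for ad in sorted(ads, key=lambda a: a.seen):
        key = (normalize(ad.title), normalize(ad.company), normalize(ad.location))
        first_seen = last_original.get(key)
        # Assumption: an ad seen up to and including 60 days after the
        # original counts as a duplicate of it.
        if first_seen is not None and ad.seen - first_seen <= WINDOW:
            continue
        last_original[key] = ad.seen
        originals.append(ad)
    return originals
```

Under these assumptions, the same advertisement seen on a different source every day for 365 days produces six original postings, in line with the example above.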

Data Curation

The posting data seen in Lightcast products is not smoothed. The deduplication process uses a fixed algorithm that deduplicates up to 80% of all postings collected. This high rate serves as redundancy and ensures stable data.

We curate the data to remove outliers, usually around 1% of the data monthly. This is done by a proprietary tool that removes postings considered bad data or noise in the dataset. Examples of these include:

Postings where the employee must invest their own money
Pyramid schemes/MLM postings
Sexually explicit postings
Discriminatory content postings
Spamming, including items such as gig economy, military, or trucking.

Job Postings Expiration

To expire a posting, we use the following methods:

Fetch-based expiration: We revisit previously scraped job advertisements and look for key phrases such as "Page not found" or "This job is no longer active." When these are found, we determine the advertisement to be expired.
Age-based expiration: A job advertisement expires after no more than 120 days.

For a posting made up of one advertisement, the above results in the posting itself being expired.

When a job posting is made up of many job advertisements, we look for all the advertisements that make up that posting. Once they have expired across all sources, we mark the posting as expired, after a maximum of 120 days.

The advertisements that make up the posting are combined into an array.
The expiration of the posting is based on both the first and last seen advertisements.
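The expiration rules above can be sketched as follows. This is a simplified illustration: the function names and the `(first_seen, page_text)` representation of an advertisement are assumptions, the phrase list contains only the two examples quoted above, and the first-seen/last-seen handling for multi-advertisement postings is reduced to "every advertisement has expired".

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=120)  # age-based limit from the text
EXPIRED_PHRASES = ("page not found", "this job is no longer active")


def ad_is_expired(first_seen: date, page_text: str, today: date) -> bool:
    """Apply fetch-based, then age-based expiration to one advertisement."""
    text = page_text.lower()
    if any(phrase in text for phrase in EXPIRED_PHRASES):
        return True  # fetch-based expiration
    return today - first_seen >= MAX_AGE  # age-based expiration


def posting_is_expired(ads: list[tuple[date, str]], today: date) -> bool:
    """Assumption: a posting expires once all of its advertisements have."""
    return all(ad_is_expired(seen, text, today) for seen, text in ads)
```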

Active and Newly Posted

Newly Posted measures all postings first posted in that month.
Active measures how many postings were live during that month (even if originally posted in a previous month but still active by the employer).

Active postings provide insight into the total open demand for a given month, while Newly Posted offers a view of the market's behavior in that month and over time.
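The difference between the two measures can be illustrated with a small sketch. The representation of a posting as a `(posted, expired)` pair, with `expired` set to `None` while the posting is still live, is an assumption made for the example.

```python
from datetime import date
from typing import Optional


def month_bounds(year: int, month: int) -> tuple[date, date]:
    """First day of the month and first day of the following month."""
    start = date(year, month, 1)
    end = date(year + month // 12, month % 12 + 1, 1)
    return start, end


def count_postings(
    postings: list[tuple[date, Optional[date]]], year: int, month: int
) -> tuple[int, int]:
    """Return (newly_posted, active) counts for the given month."""
    start, end = month_bounds(year, month)
    # Newly Posted: first posted within the month.
    newly = sum(1 for posted, _ in postings if start <= posted < end)
    # Active: posted before the month ended and not expired before it began.
    active = sum(
        1
        for posted, expired in postings
        if posted < end and (expired is None or expired >= start)
    )
    return newly, active
```

A posting first seen in January and still live in March counts toward Active in March but toward Newly Posted only in January.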

Enrichment Process

Once postings are collected, Lightcast technologies parse, extract, and code dozens of data elements, including Lightcast job titles, occupations, companies, and detailed data about the specific skills, educational credentials, certifications, experience levels, and work activities required for the job. We also collect data about salary, the number of openings, and job types. This high level of detail enables users to look beyond summary statistics to discover specific skills in demand and skills that job seekers can identify and acquire if needed.

Company Normalization

Starting with raw company names, we normalize them using proprietary criteria, stripping irrelevant information (e.g., LLC, Inc.) to classify them in our Companies Taxonomy.

After normalization, the clean name is matched to the best fit in our Companies Taxonomy, which includes metadata such as Tradestyle, NAICS codes, and staffing labels. Subsidiaries or establishments are generally rolled up under the parent company unless they advertise as a separate brand or product. For instance, “Walmart Canada” is classified as “Walmart.”
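As a rough illustration of these two stages, the sketch below strips a few legal suffixes and then looks the clean name up in a parent-company table. The suffix list, the `PARENT_LOOKUP` table, and the function name are hypothetical stand-ins; the actual criteria and the Companies Taxonomy are proprietary.

```python
import re

# Hypothetical suffix list; the real criteria are proprietary.
LEGAL_SUFFIXES = re.compile(r"[,\s]+(llc|inc|ltd|corp)\.?$", re.IGNORECASE)

# Hypothetical stand-in for a parent-company lookup in the taxonomy.
PARENT_LOOKUP = {"walmart canada": "Walmart"}


def normalize_company(raw: str) -> str:
    """Strip a trailing legal suffix, then roll up to a parent if known."""
    name = " ".join(raw.split())  # collapse stray whitespace
    name = LEGAL_SUFFIXES.sub("", name)
    return PARENT_LOOKUP.get(name.lower(), name)
```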

We make exceptions for companies that advertise jobs under a brand or product name, such as social media platforms. For example, postings may be advertised as TikTok but will appear in the taxonomy under ByteDance, Ltd, which is the actual employer. The same exception applies to hospitals that maintain a different name and operate self-sufficiently but, where appropriate, fall under the umbrella of a parent company. These will also appear in the taxonomy under the parent company.

In some labor marketplaces it is commonplace to 'hide' the employer information from public view. You would only know which company you are applying to after submitting some information in a screening process. For these geographies, our recall for company tagging will be lower.

More information on the above and industry classification can be found here.

Education Level

Lightcast assigns an education level to each posting using a machine learning model. If multiple education levels are mentioned, the posting will be tagged with all levels. Possible values include High School/GED, Associate’s Degree, Bachelor’s Degree, Master’s Degree, or Ph.D./Professional Degree. If no education requirements are listed, the posting will be tagged as Unspecified.

Note: Education Levels are currently unavailable in Global JPA.

More information can be found here.

Employment Type

Postings are tagged as full-time, part-time, flexible hours, or intern. If the posting does not specify, full-time is assumed.

Full-time: More than 32 hours
Part-time: 32 hours or less
Flexible hours: If the posting mentions both full and part time, or a range of working hours
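The thresholds above can be expressed as a small decision function. This is a sketch only: the function signature is an assumption, the intern tag is left out, and it presumes the hours and full-time/part-time mentions have already been extracted from the posting.

```python
def employment_type(hours=None, mentions_full_time=False, mentions_part_time=False):
    """Tag a posting from already-extracted signals.

    hours: a single number, a (low, high) range, or None if not stated.
    """
    if mentions_full_time and mentions_part_time:
        return "Flexible hours"
    if isinstance(hours, tuple):
        return "Flexible hours"  # a range of working hours
    if hours is not None:
        return "Full-time" if hours > 32 else "Part-time"
    if mentions_part_time:
        return "Part-time"
    return "Full-time"  # default when the posting does not specify
```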

Note: Employment Type is currently unavailable in Global JPA.

Experience

Years of required experience are captured when available. Postings that do not include an experience level will not be displayed when the Minimum Experience Required filter is applied.

Location

Country, city, and state information is captured during the scraping process. Most postings list a major city as the reference point, not a specific neighborhood. If multiple cities are mentioned, only the first city is captured.

For example, if the posting contains:

"Driver Wanted in London, OH, Logan, OH, Mount Gilead, OH. Excellent pay."

London, OH would be the city designated for this posting.
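The first-city rule can be illustrated with a simple pattern match. The regular expression below is an assumption made for this example: it only recognizes capitalized city names followed by a two-letter state code and will misfire on text such as "Wanted London, OH", so it is not a substitute for a real location parser.

```python
import re
from typing import Optional

# Assumed pattern: one or more capitalized words, a comma, a two-letter code.
CITY_STATE = re.compile(
    r"\b((?:[A-Z][A-Za-z.'-]*\s)*[A-Z][A-Za-z.'-]*),\s([A-Z]{2})\b"
)


def first_city(text: str) -> Optional[tuple[str, str]]:
    """Return the first (city, state) pair mentioned, or None."""
    match = CITY_STATE.search(text)
    return (match.group(1), match.group(2)) if match else None
```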

Lightcast maps postings to traditional MSAs using a mapping based on Google geo-coding that links MSAs to the city-state combinations found in job postings. A similar process is used to map city-states to counties.

Note: Available location options in JPA for Global data are Country or Metropolitan. More information can be found here.

Skills

For job postings we utilize the Lightcast Skills taxonomy as described here. On average, 13 skills are extracted per posting. Each skill in the Lightcast Taxonomy has one display name, but our models also use aliases, acronyms, abbreviations, and historic names to extract skills. The process begins by segmenting and tokenizing job postings to remove extra characters (punctuation) and new lines. The model then scans the text for word sequences that indicate skills in the proper context. For example, when "AWS" appears, the surrounding context helps determine whether it refers to the "American Welding Society" or "Amazon Web Services." A confidence score is assigned, and accuracy thresholds ensure quality predictions are displayed.
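To make the "AWS" example concrete, here is a toy disambiguation sketch. It is a keyword-counting stand-in, not the model described above: the cue words, the `resolve_aws` function, the way the confidence score is computed, and the 0.6 threshold are all assumptions chosen for illustration.

```python
import re
from typing import Optional

# Hypothetical context cues for each meaning of "AWS".
AWS_CONTEXT = {
    "Amazon Web Services": {"cloud", "ec2", "s3", "lambda", "devops"},
    "American Welding Society": {"welding", "welder", "weld", "fabrication"},
}


def resolve_aws(text: str, threshold: float = 0.6) -> Optional[str]:
    """Pick a meaning for "AWS" from surrounding words, or None if unsure."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # crude tokenization
    if "aws" not in tokens:
        return None
    hits = {
        skill: sum(tok in cues for tok in tokens)
        for skill, cues in AWS_CONTEXT.items()
    }
    total = sum(hits.values())
    if total == 0:
        return None  # no context to decide from
    best = max(hits, key=hits.get)
    confidence = hits[best] / total
    return best if confidence >= threshold else None
```

Returning `None` when the confidence falls below the threshold mirrors the idea that only predictions meeting an accuracy threshold are displayed.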

To analyze Global job postings accurately, it is important to understand the language of each posting, both to avoid misinterpretations or missed details and to address language barriers that can affect analysis. We do not translate non-English job postings; we use the native language of the job posting to find and tag the relevant skills. More information on Global skills, including supported languages, can be found here.

Advertised Salary

Some job postings include the salary or salary range. Lightcast extracts and cleans this information and includes it when it reflects the position accurately. Lightcast does not present an 'estimated salary' for the individual job posting.

Hourly salaries are converted to annual formats, and vice versa, using appropriate work hours for each country. For example, in Turkey, we use the annual standard of 2,340 work hours.
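The conversion itself is a single multiplication or division by the country's annual work hours. In the sketch below, only the Turkey figure comes from the text; the `ANNUAL_WORK_HOURS` table, its country-code keys, and the function names are assumptions, and other countries would need their own values.

```python
# Annual standard work hours by country code. Only the Turkey value is
# taken from the text; any further entries would have to be supplied.
ANNUAL_WORK_HOURS = {"TR": 2340}


def hourly_to_annual(hourly: float, country: str) -> float:
    return hourly * ANNUAL_WORK_HOURS[country]


def annual_to_hourly(annual: float, country: str) -> float:
    return annual / ANNUAL_WORK_HOURS[country]
```

For example, an advertised hourly rate of 100 in Turkey corresponds to 100 × 2,340 = 234,000 per year in the same currency.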

For Global data, it is important to scrutinize advertised salaries at a detailed level to ensure that local nuances are taken into account. Factors such as cultural norms, expectations, and conventions can be difficult to convey with a single number. For instance, some French employers provide free lunches at nearby restaurants, but this may not be reflected in the advertised salary number.

Our currency conversion rates are checked approximately every 4 weeks against openexchangerates.org.

Job Location

All job postings are analyzed for language that indicates whether the position can be fully or partially remote. This involves scanning the title and body field of each posting for job location-related terms. Common phrases include “remote,” “position can be located anywhere,” “work from home,” “telecommute,” and “partially remote.” Based on this, postings are tagged as Remote, Hybrid, or Non-Remote. The definition of "Remote" includes jobs that require living in a specific region but don’t require office attendance. Postings without any clear job location information are tagged as Unknown.
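A phrase scan of this kind might look like the sketch below. The remote phrases are the ones quoted above; the "hybrid" keyword, the non-remote phrase list, and the order in which the checks run are assumptions, since the text does not spell out how Hybrid and Non-Remote are decided.

```python
HYBRID_PHRASES = ("partially remote", "hybrid")  # "hybrid" is an assumption
REMOTE_PHRASES = (
    "remote",
    "position can be located anywhere",
    "work from home",
    "telecommute",
)
NON_REMOTE_PHRASES = ("on-site", "onsite", "in-office")  # assumed examples


def tag_job_location(title: str, body: str) -> str:
    """Tag a posting as Hybrid, Remote, Non-Remote, or Unknown."""
    text = f"{title} {body}".lower()
    # Hybrid is checked first because "partially remote" contains "remote".
    if any(phrase in text for phrase in HYBRID_PHRASES):
        return "Hybrid"
    if any(phrase in text for phrase in REMOTE_PHRASES):
        return "Remote"
    if any(phrase in text for phrase in NON_REMOTE_PHRASES):
        return "Non-Remote"
    return "Unknown"
```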

Titles

Raw titles are cleaned and normalized to our Lightcast Titles taxonomy, simplifying complex titles to more general ones. For example, a posting listed as “Data Science Manager, Messenger” for Facebook would be normalized to “Data Science Manager.”

Occupations

We classify occupations using a proprietary process that combines machine learning (ML) models with rules curated by our in-house taxonomy team.

Rules are applied first, and the model is used only if no rule matches. When a job posting matches a predefined rule pattern, it is assigned a Lightcast Specialized Occupation. This ensures that job postings with clear rule-based classifications are efficiently processed without further analysis by the machine learning model. Nonproprietary occupation taxonomies are then applied to postings based on the classified Lightcast Specialized Occupation and an additional set of disambiguation rules.
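The rules-first, model-second flow can be sketched as follows. The single rule, its occupation label, and the `model` callable are hypothetical placeholders; the actual rules and models are proprietary.

```python
import re
from typing import Callable

# Hypothetical rule list: (pattern, occupation label).
RULES = [
    (re.compile(r"\bregistered nurse\b", re.IGNORECASE), "Registered Nurse"),
]


def classify_occupation(title: str, model: Callable[[str], str]) -> str:
    """Return the first matching rule's occupation, else the model's guess."""
    for pattern, occupation in RULES:
        if pattern.search(title):
            return occupation  # rule matched: the model is never called
    return model(title)
```

Because the function returns as soon as a rule matches, postings with clear rule-based classifications never reach the model.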

As we continuously improve the rules and models based on internal quality checks and client feedback, the data becomes increasingly accurate. This improved coding is then used to retrain the model for further refinement.

Throughout the process, a dedicated team performs hand curation and quality checks to ensure accuracy.

Note: Global JPA currently supports only Lightcast Occupation Taxonomy (LOT). More information can be found here.

Contract Type, Apprenticeships, and Internships

Lightcast classifies the following contract types: Temporary, Temporary/Permanent, Permanent.

For Apprenticeship and Internship, Lightcast assigns a contract type to a posting using a machine learning model.

For the Temporary, Temporary/Permanent, and Permanent contract types, Lightcast defaults all postings to Permanent. Lightcast then uses machine learning as well as rules curated by our in-house taxonomy team to look for patterns in the title or job posting body that would indicate the posting is a temporary position. For instance, patterns such as 'employment type, temporary', 'seasonal', 'summer opportunity', or 'temp to hire' would classify a posting as a temporary position.
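The default-then-override logic can be sketched with the example patterns quoted above. This covers only the rule-based part, using plain substring matching as an assumption; the machine learning component and the Temporary/Permanent type are not modeled.

```python
# Example patterns taken from the text; the real rule set is larger.
TEMPORARY_PATTERNS = (
    "employment type, temporary",
    "seasonal",
    "summer opportunity",
    "temp to hire",
)


def contract_type(title: str, body: str) -> str:
    """Default to Permanent unless a temporary pattern is found."""
    text = f"{title} {body}".lower()
    if any(pattern in text for pattern in TEMPORARY_PATTERNS):
        return "Temporary"
    return "Permanent"
```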

Note: Contract Type, Apprenticeship, and Internships are not currently available in Global JPA.

Unclassified Postings

Some job postings may have "Unclassified" data elements. This occurs either because the information is not present in the job advertisement or because our coding process is unable to extract and classify that element. We closely monitor the number of unclassified entries in the dataset and work to reduce them as part of our regular monthly maintenance cycles.

Job Posting Classifier Updates

Lightcast reclassifies postings data every 4 weeks. We take all of the historic and current raw postings data, reclassify it with the most up-to-date versions of our classifiers, deduplicate it, and publish it.

This means that the latest skills and any other new classifications and normalizations are applied not only to new data but to all historic data as well.

For example, when we introduced the skill 'Generative AI Agents' in January 2025, we were able to identify this skill in all historic data.

This creates no downtime and is seamless for our customers.

Further links to change logs can be found here:
https://docs.lightcast.dev/taxonomies
https://docs.lightcast.dev/updates/postings-volume-changelog

Job Posting Analytics (JPA) Methodology