daily log 03-03-21

less than 1 minute read

PROBLEM:

AWS has a zillion jobs with the title “Data Scientist” – YET all the descriptions and “Must Have Qualifications” are different depending on which department is needing the data scientist…

SOLUTION?:

  1. Attempt to scrape using beautifulsoup
  2. Realize it’s a react site (so it’s not rendering everything by the time soup is scraping)
  3. Attempt to timeout the scraping (i.e. please wait until all components are loaded before scraping) – this did not work
  4. Sigh, realize it’s silly to try to scrape when we can just USE THE NETWORK REQUESTS DUH
  5. Find the json populating this page…
  6. Use that…
  7. Do this…
import requests
import pandas as pd

all_jobs = []
def get_data(i):
    url = "https://www.amazon.jobs/en/search.json?radius=24km&facets[]=location&facets[]=business_category&facets[]=category&facets[]=schedule_type_id&facets[]=employee_class&facets[]=normalized_location&facets[]=job_function_id&offset={}&result_limit=10&sort=relevant&latitude=&longitude=&loc_group_id=&loc_query=&base_query=data%20scientist&city=&country=&region=&county=&query_options=&".format(i)
    with urllib.request.urlopen(url) as url:
        data = json.loads(url.read().decode())
        return data

def add_data_to_df(data):
    for job in data['jobs']:
        all_jobs.append(job)

def do_the_thing():
    i = 10
#     while i < 21: for testing lololol
    while i < 3071:
        data = get_data(i)
        add_data_to_df(data)
        i += 10
df = pd.DataFrame(all_jobs)
df.to_csv('aws_jobs.csv')

Oh yeah, what team are they??

def get_label(team):
    try:
        return team['label']
    except:
        return 'no label'
df['team-label'] = df.apply(lambda x: get_label(x['team']), axis=1)