daily log 03-03-21
PROBLEM:
AWS has a zillion jobs with the title “Data Scientist” – YET all the descriptions and “Must Have Qualifications” are different depending on which department is needing the data scientist…
SOLUTION?:
- Attempt to scrape using beautifulsoup
- Realize it’s a react site (so it’s not rendering everything by the time soup is scraping)
- Attempt to timeout the scraping (i.e. please wait until all components are loaded before scraping) – this did not work
- Sigh, realize it’s silly to try to scrape when we can just USE THE NETWORK REQUESTS DUH
- Find the json populating this page…
- Use that…
- Do this…
import requests
import pandas as pd
all_jobs = []
def get_data(i):
url = "https://www.amazon.jobs/en/search.json?radius=24km&facets[]=location&facets[]=business_category&facets[]=category&facets[]=schedule_type_id&facets[]=employee_class&facets[]=normalized_location&facets[]=job_function_id&offset={}&result_limit=10&sort=relevant&latitude=&longitude=&loc_group_id=&loc_query=&base_query=data%20scientist&city=&country=®ion=&county=&query_options=&".format(i)
with urllib.request.urlopen(url) as url:
data = json.loads(url.read().decode())
return data
def add_data_to_df(data):
for job in data['jobs']:
all_jobs.append(job)
def do_the_thing():
i = 10
# while i < 21: for testing lololol
while i < 3071:
data = get_data(i)
add_data_to_df(data)
i += 10
df = pd.DataFrame(all_jobs)
df.to_csv('aws_jobs.csv')
Oh yeah, what team are they??
def get_label(team):
try:
return team['label']
except:
return 'no label'
df['team-label'] = df.apply(lambda x: get_label(x['team']), axis=1)