Files
testing123/Desk_Review_Project_Documentation_Corrected.md
2026-05-21 08:40:24 -04:00

46 KiB
Raw Permalink Blame History

Desk Reviews Section 1 Project Documentation

Contents

  • Desk Reviews Section 1 Project Documentation 1
  • Contents 1
  • Folder manifest: 4
  • A note on project organization and re-usability 4
  • Report Running Guide: 5
  • Environment Setup 5
  • Exporting Datasets from Neoserra 7
  • Funding Milestones 7
  • Client List 11
  • New Business Start Milestones 12
  • Counselling Interval Snapshot 13
  • Satisfaction Survey 15
  • Training Events 17
  • Running The Reports 18
  • Network Wide Trainings Report 18
  • Graph Image Generation 18
  • Word Document Generation 20
  • Counselling Interval Snapshot Report 21
  • Word Document Generation 22
  • Milestone Analysis Reports 22
  • New Business Start Report Graph Generation 23
  • Capital Funding Report Graph Generation 23
  • Word Document Generation 24
  • Satisfaction Survey Report 25
  • Word Document Generation 26
  • NAICS Coverage Report 27
  • Word Document Generation 27
  • Changing Script Mappings 29
  • mappings.json key Dictionary 30
  • Per file notes/docs 33
  • scripts/dataset/desk_review_dataset_generator/make_client_list_dataset.py 33
  • External data sources queried 34
  • Script data prerequisites: 34
  • Script command line arguments 34
  • Column Mapping Configuration 35
  • scripts/dataset/desk_review_dataset_generator/dataset/make_county_naics_dataset.py 35
  • Script data prerequisites 36
  • Script command line arguments 36
  • Column Mapping Configuration 36
  • Example usage 37
  • scripts/dataset/desk_review_dataset_generator/make_trainings_dataset.py 37
  • Description 37
  • Script Data Prerequisites 37
  • Script Command Line Arguments 37
  • Column Mapping Configuration 38
  • scripts/dataset/desk_review_dataset_generator/make_satisfaction_survey_dataset.py 38
  • Script data prerequisites 39
  • Script command line arguments 39
  • Column Mapping Configuration 39
  • scripts/dataset/desk_review_dataset_generator/make_nps_dataset.py 40
  • Script data prerequisites: 40
  • Script command line arguments 40
  • Column Mapping Configuration 40
  • Graph Image Generation Scripts 41
  • A Note on How Image Filenames are Derived 41
  • How Files Are Retrieved Once They are Written to Disk 41
  • scripts/graph_generation_scripts/pasbdc_desk_review_graph_scripts/counselling_interval_script.py 41
  • scripts/graph_generation_scripts/pasbdc_desk_review_graph_scripts/pasbdc_funding_analysis_script.py 44
  • scripts/graph_generation_scripts/pasbdc_desk_review_graph_scripts/nbs_analysis_script.py 46
  • scripts/graph_generation_scripts/pasbdc_desk_review_graph_scripts/naics_census_analysis_script.py 49
  • scripts/graph_generation_scripts/pasbdc_desk_review_graph_scripts/pasbdc_survey_analysis_script.py 52
  • scripts/graph_generation_scripts/pasbdc_desk_review_graph_scripts/pasbdc_trainings_analysis_script.py 55
  • scripts/graph_generation_scripts/pasbdc_desk_review_graph_scripts/pasbdc_center_topic_analysis_script.py 58
  • Word Document Generation Scripts 60
  • scripts/word_generation_scripts/counselling_interval_analysis.py 60
  • scripts/word_generation_scripts/generate_naics_word.py 61
  • scripts/word_generation_scripts/generate_trainings_analysis_word.py 62
  • scripts/word_generation_scripts/make_nbs_funding_analysis.py 64
  • scripts/word_generation_scripts/satisfaction_survey_report_controller.py 66

Folder manifest:

  • libs Python libraries written by me which are dependencies of this project
  • notebooks Python Jupyter notebooks that were used to prototype the finalized data analysis programs and graph generation. These were very early versions of the final analysis. You probably do not need these, but they have been kept in case they are needed for reference
  • reports Final exported report files
  • scripts Finalized data generation and analysis programs for this project

A note on project organization and re-usability

Within the scripts folder (the final working application for this report) there are several sub folders. Many of them are configured as python modules, any folder with a pyproject.toml can be installed as a module. Therefore, they can be taken out of this project and used somewhere else as long as that projects virtual environment has them installed as editable packages. (See https://setuptools.pypa.io/en/latest/userguide/development_mode.html)

Internally this project uses this mechanism to install internal library code to the virtual environment so it can be imported anywhere it is needed in the code (See https://stackoverflow.com/questions/72257466/can-i-have-an-editable-entry-in-my-requirements-txt)


Report Running Guide:

Environment Setup

All final utilities contained within this project are meant to be used on the command line. Below are the Windows instructions for how to setup your computer to run these programs.

  1. Download python and install it from here: https://www.python.org/ftp/python/3.14.2/python-3.14.2-amd64.exe

  2. Open the Windows Powershell application as an administrator

    [Image Placeholder: Windows PowerShell Run as Administrator]

  3. Navigate to this project folder using the cd command

    [Image Placeholder: PowerShell cd command]

  4. Execute the command python -m venv .venv

    • This will create the place where dependencies will be downloaded and stored
  5. Execute the command .\.venv\Scripts\Activate.ps1

    • This enables the environment for scripts to execute in
  6. Execute the command pip install -r ./requirements.txt

    • This installs the necessary packages to run the scripts as well as registers the dependencies I have written myself
  7. Now your environment is set up, simply execute .\.venv\Scripts\Activate.ps1 from the root of the project folder before executing any of the scripts to ensure they work properly

Additional note: The plotly package which is used to generate all of the graph figures in this project depends on a python package called kelido. Kelido requires an instance of Google Chrome to generate static image files. Before attempting to run any of these scripts ensure you have a copy of google chrome installed on your system. Either the official Google chrome or Chromium will work.

Exporting Datasets from Neoserra

Funding Milestones

To acquire the data for the capital infusion transaction milestones we will export the data from the “SBDC Minimum Deliverable Expectations” scorecard for the year of interest for the report.

  1. Navigate to https://pasbdc.neoserra.com/scorecards

  2. Click on “Goal Sets”

    [Image Placeholder: Goal Sets Menu]

  3. Locate the goalset starting with “SBDC Minimum Deliverable Expectations” for the year you are interested in. You may have to flip through the pages of goalsets to find it. (Tip: Use Ctrl + F (Cmd + F on Mac) to search the website to find the correct goalset)

  4. For this example we will be using FY25

    [Image Placeholder: Goal Sets List FY25]

  5. In the “Data Elements” table, click on the number in the “Value” column for the “Capital Infusion Transactions” row

    [Image Placeholder: Data Elements Table]

  6. Now we need to select the correct columns for the report to run. In the top right corner of the screen select “Filter > Select Columns.” Then, once the below dialog launches, select the columns as shown.

    [Image Placeholder: Filter Menu] [Image Placeholder: Select Columns Dialog]

  7. To export the dataset as a CSV: In the top right corner select “Reports > Quick Report.“

  8. On the resulting dialog click “Save.” This will bring you to a new tab with the contents of the CSV displayed.

  9. Copy this text and paste it in a file named funding_milestones_raw_FY25.csv (Or any other filename of your choice). Notepad is a good tool for this.

    [Image Placeholder: Quick Report Save Dialog]

TIP: You can save a scorecard to your “Dashboard” in Neoserra.

  1. Locate the “Accounts” tab on the right side of the screen and select “Link”

    [Image Placeholder: Link Accounts Menu]

  2. Type in your name in the “Accounts” box and click your name once it appears

  3. Then click “Save”

    [Image Placeholder: Link Accounts To Goal Set]

The scorecard should show up at: https://pasbdc.neoserra.com/dashboard

Client List

  1. Stay on the same “SBDC Minimum Deliverable Expectations” scorecard as the previous dataset

  2. Click on the “Value” number for the “Clients Advised (excluding training)” data element

  3. Select the following columns from “Filter > Select Columns”

    [Image Placeholder: Select Columns Dialog]

  4. Export the data as csv using “Report > Quick Report” to a csv file (client_list_raw_FY25.csv)

New Business Start Milestones

Note: While there is an option for this data within the deliverables scorecard, it does not allow the export of the columns we need, therefore we must source this data from a different section in Neoserra.

  1. Navigate to https://pasbdc.neoserra.com/activity/list/7?restore=1

  2. Recrate the following filter, changing the milestone date to the desired range, then click “Apply”

    [Image Placeholder: Filter Parameters]

  3. In the same manner as before, in the top right corner, select “Filter > Select Columns” and check the shown columns

    [Image Placeholder: Select Columns Dialog]

  4. Again, select “Reports > Quick Report”, click “Save,” then copy the resulting data into a file named nbs_milestones_raw_FY25.csv

Counselling Interval Snapshot

  1. Go back to the “Goal Sets” page (https://pasbdc.neoserra.com/scorecards/goalsets) and search for “PASBDC Date Interval Snapshot FY” and click the scorecard for the fiscal year of interest

    [Image Placeholder: Scorecards / Goal Sets]

The following steps should be repeated for the following data elements (numbers 2-6). Filenames are for demonstration; you can, and should, use different names as long as you are consistent when you run the scripts.

  • “Average # of Days between Client Sign-up and Client Start Date” -> days_client_signup_to_start_FY25.csv
  • “Average # of Days between Client Sign-up and 1st Counseling Session” -> days_client_signup_to_counselling_FY25.csv
  • “Average # of Days between Client Start Date to 1st Counseling Session” -> days_client_start_to_counselling_FY25.csv
  • “Average # of Days between Initial Counseling in the period to 1st Followup Counseling” -> days_initial_to_followup_FY25.csv
  • “Avg. # of Days between Training in the period to 1st Counseling Session” -> days_training_to_counselling_FY25.csv
  1. Click on the number in the “Value” column for the data element you wish to export

  2. Select the shown columns from “Filter > Select Columns”

    [Image Placeholder: Select Columns Dialog]

  3. Then export the CSV file with “Reports > Quick Report”

  4. Repeat steps 2-4 for each of the data elements listed above.

Satisfaction Survey

  1. Navigate to the goal sets page: https://pasbdc.neoserra.com/scorecards/goalsets

  2. Find the scorecard beginning with “PA SBDC Client Satisfaction Survey Responses FY” for the fiscal year of interest.

    [Image Placeholder: Goal Sets List]

  3. Click on the number in the “Value” column for the “Total Survey Responses” row

    [Image Placeholder: Data Element Table - Total Survey Responses]

  4. Select the shown columns via “Filter > Select Columns”

    [Image Placeholder: Select Columns Dialog]

  5. Create a quick report via “Reports > Quick Report”, and save the resulting file to satisfaction_survey_FY25.csv

Training Events

  1. Navigate to the training events page: https://pasbdc.neoserra.com/conferences

  2. Ensure the following filter is set, apply it by clicking “Apply”

    [Image Placeholder: Filter Parameters - Training Events]

  3. Select the below columns with “Filter > Select Columns”

    [Image Placeholder: Select Columns Dialog]

  4. Create a quick report and save it to the file trainings_raw_FY25.csv via “Reports > Quick Report”


Running the Reports With Make

Included in the root directory of the project is a makefile. Running this file will allow the entire report to be generated with a single command. Make is a code build system that works by tracking file modification times and allowing you to define steps that create files and other steps that depend on files. Therefore if an input file is changed, or an output file is deleted, then only the required commands will be run to update those parts of the report.

At the top of the makefile are the variables that control which files the script will use when it runs and how it will produce the output files. A makefile variable looks like this VARIABLE_NAME:=VALUE.

The makefile has 2 main sections that can be modified.

  1. The first is the CSV file section, each of the variables ending in CSV is the path to an input dataset file that the scripts expect. Change these variables to match where your csv files are located and what their names are.
  2. Below this section is the set of variables that control the naming of the output word documents. Leaving these as default is fine, but they can be customized if desired.

The make file will run on Linux or Windows. For Linux users, install make through your distributions package manager. For Windows users, follow the instructions at this link: https://leangaurav.medium.com/how-to-setup-install-gnu-make-on-windows-324480f1da69 For a crash course on make see: https://github.com/ahockersten/makefile_tutorial


Manually Running The Reports

Network Wide Trainings Report

Graph Image Generation

This report requires the raw trainings dataset we exported earlier.

  1. Navigate to the following subfolder in the project directory using an administrator Powershell window, either by using the cd command followed by the path you would like to navigate to, or by opening the path in file explorer and selecting “Open in Terminal”

    • Folder: section_1_graph_export_module/
  2. Within your powershell window, type: python trainings_analysis_script.py --fiscalyear <Fiscal year for the report> --inputcsv <path to the trainings csv file> --outpath <path to save the generated images to>

    [Image Placeholder: PowerShell Command Execution]

    The script may take some time to complete depending on the speed of your computer. Once it is done, all of the relevant files will be in the folder you specified when entering the command.

  3. Next we need to generate the center specific trainings report by changing directories to section_3_graph_export_module/ then typing: python trainings_topic_per_center_script.py --fiscalyear <Fiscal year for the report> --inputcsv <path to the trainings csv file> --outpath <path to save the generated images to> and hitting enter.

    [Image Placeholder: PowerShell Command Execution]

    This will find each center that is in the input trainings csv and will create the same set of graphs for each center. For ease of use select the same folder you used for the last script.

Word Document Generation

Now that we have the image files, we can generate the word document.

  1. Navigate to scripts\section_1_word_export_scripts/ and open a powershell terminal as before.
  2. Type the following command and press enter: python .\generate_trainings_analysis_word.py --images trainingsanalysis --output trainingsreport.docx -c "Lead Office" -c Wilkes -c Shippensburg -c Widener -c "Penn State" -c Bucknell -c "St. Francis" -c Gannon -c Clarion -c Temple -c Scranton -c Lehigh -c Duquesne -c Pittsburgh -c Kutztown -c “St. Vincent”

For each center that a center specific report was generated for, a -c argument must be supplied for the program to include the appropriate sections for that center. The name must be exactly as it appears in the filenames generated by the graph generation step. Notice how any center name that contains a space is wrapped in double quotes. This process should be very quick; the resulting Word document will be saved to the filename you specified in the --output parameter.

Counselling Interval Snapshot Report

This report requires the 5 CSV files exported from the Counselling Interval Snapshot scorecard.

  1. Navigate to section_1_graph_export_module and open a PowerShell terminal in the folder.
  2. Execute the command: python counselling_interval_script.py --signuptostartcsv ..\.. esting_datasets\days_client_signup_to_start_FY25.csv --signuptocounsellingcsv ..\.. esting_datasets\days_client_signup_to_counselling_FY25.csv --starttocounsellingcsv ..\.. esting_datasets\days_client_start_to_counselling_FY25.csv --initialtofollowupcsv ..\.. esting_datasets\days_inital_to_followup_FY25.csv --trainingtocounsellingcsv ..\.. esting_datasets\days_training_to_counselling_FY25.csv --fiscalyear FY25 --outpath intervalanalysis

Ensure that a fiscal year is provided with --fiscalyear (if the desired fiscal year has spaces, surround it in double quotes). Each of the arguments ending in csv should be pointed to the dataset CSV file they describe.

Word Document Generation

  1. Navigate back to section_1_word_export_scripts/

  2. Execute the following command: python3 counselling_interval_analysis.py --images <path to the folder containing the generated image files> --output example_document_name.docx

    [Image Placeholder: PowerShell Command Execution]

  3. Once the script completes, you should have a new word document at the location you specified.

Milestone Analysis Reports

There are two different graph generation scripts in this step. One for the new business start milestones and one for the capital funding milestones. They were kept as separate scripts as they deal with wholly separate sets of data. They are grouped together as there is only 1 word document generation script that pertains to these images.

New Business Start Report Graph Generation

  1. Navigate to section_1_graph_export_module/ and open a Powershell window into the folder.

  2. Execute the following python command: python . bs_analysis_script.py --inputcsv <path to the raw NBS dataset> --fiscalyear <the fiscal year of the data> --outpath <folder to place the graphs in>

    [Image Placeholder: PowerShell Command Execution]

All of the NBS related graphs will now be in the folder specified by --output.

Capital Funding Report Graph Generation

  1. Remain in the same folder you executed the NBS script in.

  2. Execute the following python command: python . unding_analysis_script.py --inputcsv ..\.. esting_datasets unding_milestones_raw_FY25.csv --fiscalyear FY25 --outpath milestoneanalysis

    [Image Placeholder: PowerShell Command Execution]

Now all the milestone report images should be in the folder you supplied to both scripts.

Word Document Generation

  1. Navigate back to the scripts\word_generation_scripts\ folder and open a Powershell terminal in the folder.

  2. Execute the following python command: python .\make_nbs_funding_analysis.py --images .\milestoneanalysis\ --output milestone_report.docx

    [Image Placeholder: PowerShell Command Execution]

Now a word document containing both the NBS and funding milestone reports should be at the location specified in the command.

Satisfaction Survey Report

  1. Navigate back to the section_1_graph_export_module/ directory and open a Powershell terminal.

  2. Execute the following python command: python .\satisfaction_survey_analysis_script.py --inputcsv <satisfaction survey csv path> --clientlistcsv <client list csv path> --outpath <image output directory>

    [Image Placeholder: PowerShell Command Execution]

Word Document Generation

  1. Navigate to the section_1_word_export_scripts/ directory (or appropriate folder).

  2. Execute the python command: python .\satisfaction_survey_word.py --images <path to the images folder> --output example_document_name.docx

    [Image Placeholder: PowerShell Command Execution]

The created word document will now be at the location specified by --output.

NAICS Coverage Report

This script uses three US government APIs to source data for census NAICS codes, BLS statistics, and USDA statistics. As such, they are flakey at very best. If the script fails with an error message that mentions SSL certificate issue, just keep re-running the script till it works.

  1. Navigate back to scripts\graph_generation_scripts\pasbdc_desk_review_graph_scripts and open a Powershell terminal in the folder.

  2. Execute the following python command: python . aics_census_analysis_script.py --inputcsv <path to raw client list CSV file> --fiscalyear <the fiscal year of the data> --outpath <desired output directory path>

    [Image Placeholder: PowerShell Command Execution]

The NAICs coverage report graphs will now be contained in the folder specified by --output.

Word Document Generation

  1. Navigate to section_1_word_export_scripts/ and open a Powershell terminal.

  2. Execute the following python command: python .\generate_naics_word.py --images <path to report images> --output example_document.name.docx

    [Image Placeholder: PowerShell Command Execution]

The report document will now be written to the path and filename specified by --output.


Changing Script Mappings

To balance both customizability and limit complexity for the end user of these scripts I have devised a simple single file solution for changing the naming of columns that the script expects in CSV files. Also included is the ability to change key sentinel values used to make data cleaning decisions within the script.

Each of the scripts in the project that need to ingest a CSV file also have a --mapping argument. This argument expects a JSON format file containing a list of strings that map to other strings. (See https://developer.mozilla.org/en-US/docs/Learn_web_development/Core/Scripting/JSON for more info on JSON).

In the documentation below, each script that uses the key value mapping system details which keys it will use if present in the provided mappings.json. You can then use that information in combination with the mappings.json key dictionary section to determine how to set your column mappings if you need to change them.

For example, take this mappings.json:

{
  "col_unified_naics": "NAICS_CODE",
  "col_census_pct": "Percentage",
  "col_naics_2": "NAICS_2",
  "col_naics_label":"NAICS2022_LABEL",
  "col_pa_naics_pct": "PA NAICs Code Percentage",
  "col_pasbdc_pct": "PASBDC NAICs Code Percentage",
  "col_census_estab": "ESTAB",
  "col_bls_industry": "industry_code",
  "col_bls_estab": "annual_avg_estabs",
  "col_usda_value": "Value",
  "col_neo_primary_naics": "Primary NAICS",
  "col_neo_naics": "NAICs"
}

The left side of the : is the name that these programs refer to the column names as, this is referred to as the key. The right side of the : is the value that replaces the left side value when the program runs, this is referred to as the value.

Additionally, if you only need to override one or two of the column names you can omit any of the keys that a script is listed as accepting and the defaults will be used. Each script stores its default values as a constant python dictionary at the top of the script called DEFAULT_CONFIG.

mappings.json key Dictionary

  • business_established_val: The tag to identify Business Established milestones.
  • business_start_impact_val: The tag to identify Business Start impacts.
  • col_attendees_range: The column containing the range of attendees that the total attendees value falls into
  • col_attendees_total: The number of attendees that came to an event
  • col_attrib_source: The column in the data at which the attribution source for a milestone can be found
  • col_bls_estab: The column in the BLS stats API to look for the count of establishments for government run institutions
  • col_bls_industry: The column in the BLS stats API to look for the NAICs code of an industry
  • col_census_estab: The column in the US Census CSV API to look for the count of establishments for a certain NAICs code
  • col_census_naics: The column in the US census data for the NAICs code
  • col_census_pct: The column for the percentage a NAICs code represents of the PA census data in the industry descriptions table
  • col_doc_level: The column in the data at which the documentation level can be found for a milestone
  • col_event_title: The column containing the training event title
  • col_funding_source: The funding source for an event
  • col_interval_data_value: Maps to the column containing the numerical interval/day values (Default: "Value").
  • col_is_preplanning: The output column that shows if an event is preplanning or not
  • col_naics_2: The output column in the clients dataset for the combined and parsed NAICs codes. Neoserra has many digits for specific industry codes, but we only care about the first 2 which signify the main industry.
  • col_naics_label: The output column at which the description of a NAICS code is stored
  • col_neo_affirmation: The column indicating if the funding was affirmed.
  • col_neo_answers: The column in the Neoserra data to look for the response survey answers
  • col_neo_attendees_total: The column containing the total number of attendees for a training event
  • col_neo_attribution_date: The date the attribution was recorded.
  • col_neo_attribution_source: The source of the milestone attribution.
  • col_neo_center: The column in the data at which the Center can be found
  • col_neo_client_id: Unique identifier for the client.
  • col_neo_county: The column in the Neoserra data to look for the county of a client
  • col_neo_event_title: The column containing the training event title
  • col_neo_funding_source: The column containing the funding source for a training event
  • col_neo_milestone_date: The date a milestone occurred.
  • col_neo_milestone_type: The category of a milestone.
  • col_neo_naics: The secondary column in the Neoserra data corresponding to the NAICs code of a client. This data is combined with the primary column.
  • col_neo_primary_naics: The column in the Neoserra data corresponding to the primary NAICs code of a client
  • col_neo_primary_topic: The column containing the primary topic of a training event
  • col_neo_program_format: The column containing information on if a training event was on demand (recorded and uploaded to the hub) or was in person
  • col_neo_training_topics: The list of all training topics associated with a training event
  • col_out_attendees_range: The output column for the tagged range of attendees for an event.
  • col_out_county: The output column for the county of a row of data
  • col_out_documentation_level: The output column name created during sanitization to store the ranked documentation status.
  • col_out_fips: The output column for the fips code for a row of data
  • col_out_missing: The output column for the number of missing NAICs codes for a county out of all possible codes
  • col_out_of_state: The column in the cleaned Neoserra data that indicates whether a client is in state or out of state. This is only used to exclude any counties outside of PA as this data later gets used to create a county map.
  • col_out_pct_missing: The percentage of possible NAICs codes that are missing from a county
  • col_out_total: The output column for the total number of clients in a county
  • col_out_unique: The output column for the number of unique NAICs codes that appear within a county.
  • col_pa_naics_pct: The output column in the final clients dataset that contains the percentage a NAICs code makes up of all business in the PA census
  • col_pasbdc_pct: The output column in the final clients dataset that contains the percentage that a NAICs code makes up of PASBDC client businesses
  • col_satisfaction_score: The column containing the question to derive a net promoter score from.
  • col_score: The column containing the answer to the question to derive a net promoter score from
  • col_training_id: The unique ID for a training event
  • col_training_topics: The column containing the additional training topics for a training event
  • col_unified_naics: The column for the NAICs code to be placed in for the industry description table which will be combined from the col_neo_primary_naics and col_neo_naics columns as for some reason there are two spots to put a NAICs code in Neoserra
  • col_usda_value: The column in the USDA stats API to look for the count of farm establishments
  • val_affirmation_missing: The string value used to identify milestones that have documentation but lack proper affirmation.
  • val_documented: The value to look for the the documentation level column that causes a milestone to be considered documented
  • val_preplanning: The value that if seen in the primary training topic will cause the training event to be considered a preplanning event
  • val_wrong_spot: The value to look for in the documentation level column that causes a milestone to be considered documented but in the wrong spot

Per file notes/docs

Each python file within this project that has something usable by the end user in it has some amount of documentation here. If it is not included here then it is most likely library code. All python files have inline documentation that will be useful if you want to use them. Listed here are required prerequisites for each script, what parameters they take as command line arguments, and a brief description of what they output.

scripts/dataset/desk_review_dataset_generator/make_client_list_dataset.py

Description: This file implements the data cleaning code to tag the Neoserra clients list score card with:

  • A 2 digit NAICs code derived from the multiple NAICs columns in the data
  • A percentage of how much that clients NAICs code makes up of all NAICs codes across the PASBDC
  • A percentage of how much that clients NAICs code makes up of all business in the PA Census
  • Whether the clients physical address county is in or out of state

It also generates a table that contains each 2 digit NAICs code with both a description of what that NAICs code means and how much it makes up of the PA census.

External data sources queried:

Script data prerequisites:

  • The client list dataset export from Neoserra. See report running guide for how to export this.

Script command line arguments:

  • -c / --censusyear: The census year to use (must be a valid US census year). Optional, defaults to 2022.
  • -u / --usdaapikey: The API key used to access the USDA census API. Optional, defaults to a key valid as of 12/17/25. A new key can be obtained from: https://quickstats.nass.usda.gov/api
  • -i / --inputcsv: The path to the exported client list CSV file from Neoserra. Required.
  • -m / --mapping: The path to a JSON file containing CSV column name overrides. Optional, defaults are provided.
  • -o / --out: The path to save the tagged client list to. Optional.
  • -t / --tableout: The path to save the census NAICs table data to. Optional.

scripts/dataset/desk_review_dataset_generator/dataset/make_county_naics_dataset.py

Description: This file implements the data cleaning code to turn the dataset generated by make_client_list_dataset.py into a form that can be used to create a county heat map of which countys clients have missing NAICs codes in their profiles. The output data contains the following for each county:

  • County name
  • Fips code (an identifier used to identify a specific state or county)
  • The count of unique valid NAICs codes present
  • The count of missing NAICs codes in client profiles
  • The total number of clients in the county
  • The percentage of clients with missing NAICs codes (derived from the previous attributes)

Script data prerequisites: This script requires the output file from running make_client_list_dataset.py. Pass this datasets file as the --inputcsv to this script.

Script command line arguments:

  • -i / --inputcsv: The path to the raw clients list dataset. Required.
  • -m / --mapping: The path to a JSON file containing CSV column name overrides. Optional.
  • -o / --out: The path to save the output dataset to. Optional.

Example usage: python make_county_naics_dataset.py --inputcsv client_data.csv This command would: Ingest the provided data, calculate the per county statistics above, and save the resulting file.

scripts/dataset/desk_review_dataset_generator/make_trainings_dataset.py

Description: This script contains all of the needed functionality to turn the raw export from Neoserra of all training events for a fiscal year into the dataset that can be used in our reports.

Script Data Prerequisites: This script requires the raw trainings dataset as exported in the report running guide.

Script Command Line Arguments:

  • --inputcsv: The path to the input dataset from Neoserra. Required.
  • -o / --out: The filename to write the cleaned dataset to. Required.
  • -m / --mapping: The path to a .json mappings file that allows the user to override column mappings. Optional.
  • --mode: Whether to make the master all trainings dataset or the center specific statistics dataset. Valid options: clean, stats.
  • -f / --fundingsources: The list of funding sources to include in the output dataset.

scripts/dataset/desk_review_dataset_generator/make_satisfaction_survey_dataset.py

Description: This file implements the data cleaning steps required to extract both individual question prompts and their answers from survey data extracted from Neoserra. While this script was tested on the client satisfaction survey, this code should work so long as there is a Center column and an Answers column that follows this format:

  • Survey questions start with a number and a . (ex 1., 2., 3. … etc)
  • Following each survey question is a new line and an answer

This script then transforms the data by adding 2 columns for each question parsed. The first column contains the questions text and the second column contains the recorded answer for the question.

Script data prerequisites: A CSV exported from the satisfaction survey scorecard as described in the report running guide.

Script command line arguments:

  • -i / --inputcsv: The path to the survey scorecard data. Required.
  • -m / --mapping: The path to a JSON file containing CSV column name overrides. Optional.
  • -o / --out: The path to save the output dataset to. Optional.

scripts/dataset/desk_review_dataset_generator/make_nps_dataset.py

Description: This file implements the data cleaning code to take the processed satisfaction survey dataset and derive a per center net promoter score (NPS). The value is multiplied by 100 to obtain a percentage-like figure.

Script data prerequisites: This script requires the output of make_satisfaction_survey_dataset.py as its input. The computation done within this script can apply to any numeric 1-10 scale score column. Question 1 is the main question of interest for this analysis so that is the default.

Script command line arguments:

  • --inputcsv: The path to the satisfaction survey dataset. Required.
  • -m / --mapping: The path to a JSON file containing CSV column name overrides. Optional.
  • -o / --out: The path to save the tagged client list to. Optional.

Graph Image Generation Scripts

A Note on How Image Filenames are Derived

A structured file naming scheme was developed to allow the word document generation scripts to load in all of the files they require without the user needing to manually specify each exact path. The format is structured as follows <report>_<graph name>_<graph variant>.png. For example, trainingsreport_attendee-bins_no-first-no-pre-percent.png.

How Files Are Retrieved Once They are Written to Disk

Once these graph generation scripts are executed the word document generation scripts must construct a list of graph images somehow. This is where the ImageRegistry class comes in, it can be found at scripts/word_generation_scripts/image_registry.py. It works by first collecting all .png files from the target folder. Once this data is acquired, it splits each filename on the underscores to get 3 separate strings. The class organizes this data within two nested python dictionaries that maps graph name -> variant -> image path.

scripts/graph_generation_scripts/pasbdc_desk_review_graph_scripts/counselling_interval_script.py

Description: This file implements the control code needed to process yearly satisfaction survey data (specifically counseling intervals) exported from Neoserra. It passes the data to library functions to generate comparative graph images. The script focuses on several key performance intervals:

  • Days between Client Sign-up and Client Start Date
  • Days between Client Sign-up and 1st Counseling Session
  • Days between Client Start Date and 1st Counseling Session
  • Days between Initial Counseling and 1st Followup Counseling
  • Days between Training and 1st Counseling Session

Script command line arguments:

  • --signuptostartcsv: The path to the dataset containing "Average # of Days between Client Sign-up and Client Start Date". Required.
  • --signuptostartfigure: The filename for the generated sign-up to start date graph. Optional, defaults to "signuptostart".
  • --signuptocounsellingcsv: The path to the dataset. Required.
  • --signuptocounsellingfigure: The filename. Optional.
  • --starttocounsellingcsv: The path to the dataset. Required.
  • --starttocounsellingfigure: The filename. Optional.
  • --initialtofollowupcsv: The path to the dataset. Required.
  • --initialtofollowupfigure: The filename. Optional.
  • --trainingtocounsellingcsv: The path to the dataset. Required.
  • --trainingtocounsellingfigure: The filename. Optional.
  • --fiscalyear: The fiscal year tag (e.g., "FY2025") to be appended to graph titles. Required.
  • --outpath: The base directory path where all generated PNG files will be saved. Required.
  • --report: The report prefix used for naming the output files. Optional, defaults to "counselling-interval".
  • --mapping: The path to a JSON file containing CSV column name overrides. Optional.

scripts/graph_generation_scripts/pasbdc_desk_review_graph_scripts/pasbdc_funding_analysis_script.py

Description: This script processes raw capital funding data from Neoserra to perform a detailed attribution analysis. It cleans the data and generates several visualization charts and supporting datasets to evaluate how funding milestones are documented and affirmed across the SBDC network. The script generates four primary analyses: Network-Wide Attribution, Attribution Rate, Theoretical Rate, and Director Confirmed.

Script command line arguments:

  • -i / --inputcsv: The path to the raw capital funding CSV dataset. Required.
  • -o / --outpath: The base directory path where all generated PNG graphs and CSV datasets will be saved. Required.
  • -f / --fiscalyear: The fiscal year associated with the input data (e.g., "2025"). Required.
  • -m / --mapping: Path to a JSON file containing CSV column name overrides. Optional.

scripts/graph_generation_scripts/pasbdc_desk_review_graph_scripts/nbs_analysis_script.py

Description: This script implements the control logic for analyzing New Business Starts (NBS) milestones within the PASBDC network. It performs data cleaning and sanitization specifically for NBS milestones before generating a suite of visualization graphs and intermediate datasets used for reporting.

Script command line arguments:

  • --inputcsv: The path to the raw NBS analysis CSV dataset. Required.
  • --outpath: The base directory path where generated PNG graphs and CSV data exports will be saved. Required.
  • --fiscalyear: The fiscal year tag (e.g., "2025") used for labeling graph titles and naming output files. Required.
  • --mapping: Path to a JSON file for overriding default column name mappings. Optional.

scripts/graph_generation_scripts/pasbdc_desk_review_graph_scripts/naics_census_analysis_script.py

Description: This file serves as the main control script for the NAICS analysis pipeline. It processes an input dataset of Neoserra client records to tag the client list with a unified 2-digit NAICS code, enrich the data with external census statistics, calculate internal statistics, and generate a "County Coverage" dataset that tracks missing NAICS codes per county.

Script command line arguments:

  • -inputcsv / --inputcsv: The path to the exported client data list from Neoserra. Required.
  • -outpath / --outpath: The directory path where all generated files (CSVs and Images) will be saved. Required.
  • -fiscalyear / --fiscalyear: The fiscal year string (e.g., "FY 25") used to label graphs and titles. Required.
  • -usdaapikey / --usdaapikey: The API key used to access the USDA census API. Optional.
  • -censusyear / --censusyear: The census year to use for fetching external data. Optional, defaults to "2022".

scripts/graph_generation_scripts/pasbdc_desk_review_graph_scripts/pasbdc_survey_analysis_script.py

Description: This file implements the data cleaning and analysis pipeline for Neoserra client satisfaction survey data. It produces four key performance indicator (KPI) graphs: Response Counts, Average Score, Responses Per Client, Net Promoter Score (NPS).

Script command line arguments:

  • -inputcsv / --inputcsv: The path to the raw Neoserra survey export CSV. Required.
  • -clientlistcsv / --clientlistcsv: The path to the client list CSV. Required.
  • -outpath / --outpath: The base directory path to place generated files into. Required.

scripts/graph_generation_scripts/pasbdc_desk_review_graph_scripts/pasbdc_trainings_analysis_script.py

Description: This file implements the end-to-end data pipeline for analyzing Neoserra training event data. It performs three main functions: Data Cleaning, Statistics Generation, and Multi-Variant Graphing.

Script command line arguments:

  • -inputcsv / --inputcsv: The path to the RAW Neoserra trainings CSV export. Required.
  • -outpath / --outpath: The base directory path to place generated files into. Required.
  • -cleanedfilename / --cleanedfilename: The name to give the intermediate cleaned dataset. Optional.
  • -fiscalyear / --fiscalyear: The label for the fiscal year to appear in graph titles (e.g., "FY25"). Optional.

scripts/graph_generation_scripts/pasbdc_desk_review_graph_scripts/pasbdc_center_topic_analysis_script.py

Description: This file is a specialized analysis script designed to generate individual training topic breakdowns for every SBDC center within the network. Unlike the aggregate reports, this script iterates through each center found in the dataset and produces a unique topic analysis chart for that specific location.


Word Document Generation Scripts

scripts/word_generation_scripts/counselling_interval_analysis.py

Description: This file is a report assembly script that generates the "Counseling Interval" section of the Neoserra scorecard. It uses the WordDocumentBuilder class to construct a Word document (.docx) populated with pre-generated analysis charts.

Script command line arguments:

  • --images: The path to the directory containing the generated analysis images/charts. Required.
  • --output: The filename (including path) to give the resulting Word document. Required.
  • --reportname: The filename prefix used to identify which images belong to this specific report within the images directory. Optional, defaults to "counselling-interval".

scripts/word_generation_scripts/generate_naics_word.py

Description: This file is the report assembly script for the "NAICS Penetration" section of the scorecard. It utilizes the WordDocumentBuilder to compile a two-page Word document containing the analysis generated by the NAICS census analysis pipeline.

Script command line arguments:

  • --images: The path to the directory containing the generated analysis images/charts. Required.
  • --output: The filename (including path) to give the resulting Word document. Required.
  • --reportname: The filename prefix used to identify which images belong to this specific report. Optional, defaults to "naicsanalysis".

scripts/word_generation_scripts/generate_trainings_analysis_word.py

Description: This file is the comprehensive report assembly script for the "Trainings Analysis" section of the scorecard. It iterates through a list of centers provided via command-line arguments to generate a dedicated topic analysis page for each specified center.

Script command line arguments:

  • --images: The path to the directory containing the generated analysis images/charts. Required.
  • --output: The filename (including path) to give the resulting Word document. Required.
  • -c / --center: The name of a center to include in the report (e.g., -c Kutztown). This argument can be repeated multiple times to include multiple centers. Required.

scripts/word_generation_scripts/make_nbs_funding_analysis.py

Description: Generates the "Capital Acquisition" (Funding) and "New Business Starts" (NBS) attribution sections of the scorecard. It utilizes two separate ImageRegistry instances to pull charts from independent analysis runs and compiles them into a single report.

Script command line arguments:

  • --images: The path to the directory containing the generated analysis images/charts. Required.
  • --output: The filename (including path) to give the resulting Word document. Required.

scripts/word_generation_scripts/satisfaction_survey_report_controller.py

Description: Generates the "Satisfaction Survey" section of the scorecard. It utilizes the ImageRegistry to retrieve the four Key Performance Indicator (KPI) charts generated by the survey analysis pipeline and compiles them into a formatted one-page Word report.

Script command line arguments:

  • --images: The path to the folder containing the images for this report. Required.
  • --output: The .docx file to save the final report to. Must end in .docx. Required.