<?xml version="1.0"?>
<puzzles xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://www.0pdd.com/puzzles.xsd" date="2025-02-06T11:53:02+00:00" version="BUILD">
  <puzzle alive="false">
    <issue href="https://github.com/h1alexbel/sr-detection/issues/6" closed="2024-09-03T06:23:54+00:00">6</issue>
    <ticket>5</ticket>
    <estimate>30</estimate>
    <role>DEV</role>
    <id>5-ef5f22b3</id>
    <lines>32-35</lines>
    <body>Make the rultor merge script stricter. Let's adopt more rigorous and strict build procedures: test execution as well as static analysis (pylint, flake8, mypy) should be mandatory both here and in .github/workflows/poetry.yml.</body>
    <file>.rultor.yml</file>
    <author>@h1alexbel</author>
    <email>aliaksei.bialiauski@hey.com</email>
    <time>2024-07-04T10:21:42Z</time>
    <children/>
  </puzzle>
  <puzzle alive="false">
    <issue href="https://github.com/h1alexbel/sr-detection/issues/7" closed="2024-10-07T12:05:55+00:00">7</issue>
    <ticket>5</ticket>
    <estimate>30</estimate>
    <role>DEV</role>
    <id>5-1c14f85c</id>
    <lines>40-45</lines>
    <body>Create a release script for the whole project. Let's create one for the whole project, including all modules. We should release all the modules (`sr-data|train|detector`) to PyPI. In doing so, we should ensure that `sr-detector` can be installed as a standalone CLI tool. For `sr-data` and `sr-train` we should output some prominent artifacts, such as CSV files and model files. Let's skip sr-paper for now.</body>
    <file>.rultor.yml</file>
    <author>@h1alexbel</author>
    <email>aliaksei.bialiauski@hey.com</email>
    <time>2024-07-04T10:21:42Z</time>
    <children/>
  </puzzle>
  <puzzle alive="false">
    <issue href="https://github.com/h1alexbel/sr-detection/issues/19" closed="2024-08-26T14:23:49+00:00">19</issue>
    <ticket>9</ticket>
    <estimate>35</estimate>
    <role>DEV</role>
    <id>9-92736c7b</id>
    <lines>47-50</lines>
    <body>Develop a prompt for README annotation. We should create a prompt that will help the model annotate repositories with &lt;SR&gt; and &lt;non&gt; tokens. Let's start with the one we specified in the paper draft.</body>
    <file>sr-data/src/sr_data/tasks/highlight.py</file>
    <author>@h1alexbel</author>
    <email>aliaksei.bialiauski@hey.com</email>
    <time>2024-07-09T15:48:01Z</time>
    <children/>
  </puzzle>
  <puzzle alive="false">
    <issue href="https://github.com/h1alexbel/sr-detection/issues/23" closed="2024-09-10T10:00:21+00:00">23</issue>
    <ticket>9</ticket>
    <estimate>45</estimate>
    <role>DEV</role>
    <id>9-856f4873</id>
    <lines>29-32</lines>
    <body>JSONDecodeError "Expecting value: line 1 column 1 (char 0)" on some records in the CSV. When sending records to the endpoint, some READMEs get rejected with that error. We should identify those rows and either remove or recover them.</body>
    <file>sr-data/src/sr_data/tasks/embed.py</file>
    <author>@h1alexbel</author>
    <email>aliaksei.bialiauski@hey.com</email>
    <time>2024-07-12T14:17:49Z</time>
    <children/>
  </puzzle>
  <puzzle alive="false">
    <issue href="https://github.com/h1alexbel/sr-detection/issues/24" closed="2024-09-27T10:28:35+00:00">24</issue>
    <ticket>9</ticket>
    <estimate>30</estimate>
    <role>DEV</role>
    <id>9-c9e75cb4</id>
    <lines>65-68</lines>
    <body>Use an ExtendWith analogue or some temp directory for 'out'. We shouldn't manage removal of output files directly; instead, let's use a more elegant solution: a temp directory or something like extensions from JUnit. We should fix the same problem in test_filter.py as well.</body>
    <file>sr-data/src/tests/test_embed.py</file>
    <author>@h1alexbel</author>
    <email>aliaksei.bialiauski@hey.com</email>
    <time>2024-07-12T14:17:49Z</time>
    <children/>
  </puzzle>
  <puzzle alive="false">
    <issue href="https://github.com/h1alexbel/sr-detection/issues/117" closed="2024-10-04T10:57:40+00:00">117</issue>
    <ticket>74</ticket>
    <estimate>45</estimate>
    <role>DEV</role>
    <id>74-2e746059</id>
    <lines>33-35</lines>
    <body>Exclude repositories that don't have any Maven projects. We need to exclude all repositories that don't contain any 'pom.xml'. Don't forget to create a unit test for this.</body>
    <file>sr-data/src/sr_data/steps/maven.py</file>
    <author>@h1alexbel</author>
    <email>h1alexbelx@gmail.com</email>
    <time>2024-10-03T14:58:50Z</time>
    <children/>
  </puzzle>
  <puzzle alive="false">
    <issue href="https://github.com/h1alexbel/sr-detection/issues/118" closed="2024-10-04T15:15:18+00:00">118</issue>
    <ticket>74</ticket>
    <estimate>60</estimate>
    <role>DEV</role>
    <id>74-d4549137</id>
    <lines>36-39</lines>
    <body>Parse the 'build' JSON array of Maven projects into the most valuable information for the embedding step. We should parse all Maven projects from the JSON array, extract useful information from each, and merge them into a single input. For pom.xml parsing we can use XPath, and XSLT for merging.</body>
    <file>sr-data/src/sr_data/steps/maven.py</file>
    <author>@h1alexbel</author>
    <email>h1alexbelx@gmail.com</email>
    <time>2024-10-03T14:58:50Z</time>
    <children>
      <puzzle alive="true">
        <issue href="https://github.com/h1alexbel/sr-detection/issues/122">122</issue>
        <ticket>118</ticket>
        <estimate>35</estimate>
        <role>DEV</role>
        <id>118-582c0e32</id>
        <lines>58-61</lines>
        <body>Remove the branch that returns None if found == 0. We should remove this ugly branch that currently returns None if we didn't find any files. Let's handle this more elegantly. This should also affect the main() method, where we check `if profile is not None`.</body>
        <file>sr-data/src/sr_data/steps/maven.py</file>
        <author>@h1alexbel</author>
        <email>aliaksei.bialiauski@hey.com</email>
        <time>2024-10-04T15:15:14Z</time>
        <children/>
      </puzzle>
    </children>
  </puzzle>
  <puzzle alive="true">
    <issue href="https://github.com/h1alexbel/sr-detection/issues/142">142</issue>
    <ticket>137</ticket>
    <estimate>35</estimate>
    <role>DEV</role>
    <id>137-386c23be</id>
    <lines>35-39</lines>
    <body>Resolve code duplication in preprocessing methods. Ideally, we should reuse the `remove_stop_words` and `lemmatize` methods from the extract.py step. Currently we duplicate the logic, only slightly changing it to fit the input; it would be more maintainable to reuse the existing methods located in extract.py.</body>
    <file>sr-data/src/sr_data/steps/mcw.py</file>
    <author>@h1alexbel</author>
    <email>h1alexbelx@gmail.com</email>
    <time>2024-10-15T10:01:30Z</time>
    <children/>
  </puzzle>
  <puzzle alive="false">
    <issue href="https://github.com/h1alexbel/sr-detection/issues/143" closed="2024-10-16T08:36:04+00:00">143</issue>
    <ticket>137</ticket>
    <estimate>45</estimate>
    <role>DEV</role>
    <id>137-9eb7df67</id>
    <lines>40-43</lines>
    <body>Stop-word filtering is weak. The remove_stop_words method doesn't remove words such as ['the', 'to', 'and', 'you', 'a'], etc. We should remove such words too. Don't forget to create unit tests.</body>
    <file>sr-data/src/sr_data/steps/mcw.py</file>
    <author>@h1alexbel</author>
    <email>h1alexbelx@gmail.com</email>
    <time>2024-10-15T10:01:30Z</time>
    <children/>
  </puzzle>
  <puzzle alive="true">
    <issue href="https://github.com/h1alexbel/sr-detection/issues/201">201</issue>
    <ticket>134</ticket>
    <estimate>35</estimate>
    <role>DEV</role>
    <id>134-1fb1708e</id>
    <lines>139-142</lines>
    <body>Remove the ad-hoc 'run' solution for just command resolution. Currently, we pass the run parameter from the recipe to nested just invocations in order to resolve the just command. We should refine our usage of the full path across the entire justfile.</body>
    <file>justfile</file>
    <author>@h1alexbel</author>
    <email>h1alexbelx@gmail.com</email>
    <time>2024-11-13T10:36:53Z</time>
    <children/>
  </puzzle>
  <puzzle alive="true">
    <issue href="https://github.com/h1alexbel/sr-detection/issues/202">202</issue>
    <ticket>134</ticket>
    <estimate>45</estimate>
    <role>DEV</role>
    <id>134-bbddaab9</id>
    <lines>143-146</lines>
    <body>Refactor recipes to a more optimal granularity. We should create more major recipes in order to reuse them across the project. An example of such a recipe is `@experiment`. Let's do the same for the script inside `data.sh`, so it can be invoked from just via the datasets step.</body>
    <file>justfile</file>
    <author>@h1alexbel</author>
    <email>h1alexbelx@gmail.com</email>
    <time>2024-11-13T10:36:53Z</time>
    <children/>
  </puzzle>
  <puzzle alive="true">
    <issue href="https://github.com/h1alexbel/sr-detection/issues/211">211</issue>
    <ticket>153</ticket>
    <estimate>35</estimate>
    <role>DEV</role>
    <id>153-c5f98a2f</id>
    <lines>54-56</lines>
    <body>Extract the first 15-20 words from the first heading. Instead of extracting just n characters, we should extract as many words from the heading as fit within n characters.</body>
    <file>sr-data/src/sr_data/steps/sentiments.py</file>
    <author>@h1alexbel</author>
    <email>h1alexbelx@gmail.com</email>
    <time>2024-11-18T16:31:18Z</time>
    <children/>
  </puzzle>
  <puzzle alive="false">
    <issue href="https://github.com/h1alexbel/sr-detection/issues/226" closed="2024-11-21T10:28:57+00:00">226</issue>
    <ticket>75</ticket>
    <estimate>60</estimate>
    <role>DEV</role>
    <id>75-c94dc9cc</id>
    <lines>57-60</lines>
    <body>Parse the fetched YAML files, and calculate their complexity/strictness. We should retrieve the following information from each fetched workflow: 1) number of jobs, 2) number of OSs, 3) number of steps in each job, 4) number of versions in ${{ matrix }}.</body>
    <file>sr-data/src/sr_data/steps/workflows.py</file>
    <author>@h1alexbel</author>
    <email>h1alexbelx@gmail.com</email>
    <time>2024-11-21T08:26:16Z</time>
    <children/>
  </puzzle>
  <puzzle alive="false">
    <issue href="https://github.com/h1alexbel/sr-detection/issues/227" closed="2024-11-21T14:53:27+00:00">227</issue>
    <ticket>75</ticket>
    <estimate>60</estimate>
    <role>DEV</role>
    <id>75-378d74a3</id>
    <lines>61-64</lines>
    <body>Find the release workflow among the collected workflows. We should find the workflow that releases the repo artifacts to some target platform. Once we have the parsed workflows, we can try to find the one that makes releases. It is probably the one that uses on:push:tags. For instance: &lt;a href="https://github.com/objectionary/eo/blob/master/.github/workflows/telegram.yml"&gt;telegram.yml&lt;/a&gt;.</body>
    <file>sr-data/src/sr_data/steps/workflows.py</file>
    <author>@h1alexbel</author>
    <email>h1alexbelx@gmail.com</email>
    <time>2024-11-21T08:26:16Z</time>
    <children/>
  </puzzle>
  <puzzle alive="true">
    <issue href="https://github.com/h1alexbel/sr-detection/issues/234">234</issue>
    <ticket>207</ticket>
    <estimate>60</estimate>
    <role>DEV</role>
    <id>207-acf6634d</id>
    <lines>34-40</lines>
    <body>Enable tests for sentiments.py once rultor is able to install torch and transformers. Currently, when rultor tries to install the torch and transformers dependencies in order to run steps/sentiments.py, it fails with exit code 137. The versions used were torch = "2.2.2" and transformers = "4.41.2". You can check an example of such a build &lt;a href="https://github.com/h1alexbel/sr-detection/pull/233#issuecomment-2497707744"&gt;here&lt;/a&gt;. After the issue is resolved, we should uncomment the lines in `sentiments.py` and enable the respective tests in test_sentiments.py.</body>
    <file>sr-data/src/tests/test_sentiments.py</file>
    <author>@h1alexbel</author>
    <email>h1alexbelx@gmail.com</email>
    <time>2024-11-25T11:55:28Z</time>
    <children/>
  </puzzle>
  <puzzle alive="true">
    <issue href="https://github.com/h1alexbel/sr-detection/issues/242">242</issue>
    <ticket>240</ticket>
    <estimate>30</estimate>
    <role>DEV</role>
    <id>240-306f9450</id>
    <lines>37-40</lines>
    <body>Add a test case for when clusters are not empty. We should add one more test case in which the clustering model generates clusters. To do so, we need to prepare a bigger dataset for the model, in order to find useful centroids and entries distributed close to them.</body>
    <file>sr-train/src/tests/test_clusterstat.py</file>
    <author>@rultor</author>
    <email>me@rultor.com</email>
    <time>2024-11-26T12:41:12Z</time>
    <children/>
  </puzzle>
  <puzzle alive="true">
    <issue href="https://github.com/h1alexbel/sr-detection/issues/291">291</issue>
    <ticket>244</ticket>
    <estimate>35</estimate>
    <role>DEV</role>
    <id>244-70f0b4e2</id>
    <lines>91-94</lines>
    <body>Enhance the workflow simplicity score with a min and max adjustment. Currently, we just subtract the collected value from 1. We should adjust it with the min and max values from the dataset, so the formula should look like: 1 - (row - min) / (max - min).</body>
    <file>sr-data/src/sr_data/steps/workflows.py</file>
    <author>@rultor</author>
    <email>me@rultor.com</email>
    <time>2024-12-30T08:03:27Z</time>
    <children/>
  </puzzle>
  <puzzle alive="true">
    <issue href="https://github.com/h1alexbel/sr-detection/issues/321">321</issue>
    <ticket>319</ticket>
    <estimate>45</estimate>
    <role>DEV</role>
    <id>319-4abf3b34</id>
    <lines>44-48</lines>
    <body>Set up unit tests for merge. Currently, there are no unit tests that can catch errors in dataset merging; only an integration test exists in sr-train/test_dataset.py. It would be crucial to add unit tests too, to check the merging functionality and help us catch bugs faster.</body>
    <file>sr-data/src/sr_data/steps/merge.py</file>
    <author>@h1alexbel</author>
    <email>h1alexbelx@gmail.com</email>
    <time>2025-02-04T08:14:41Z</time>
    <children/>
  </puzzle>
  <puzzle alive="true">
    <issue href="https://github.com/h1alexbel/sr-detection/issues/329">329</issue>
    <ticket>324</ticket>
    <estimate>45</estimate>
    <role>DEV</role>
    <id>324-d307104c</id>
    <lines>43-46</lines>
    <body>Package and publish the sr CLI in a Docker registry. Currently, we are only releasing the toolchain to PyPI. Let's bring back our Docker pipeline, but instead of sr-data and justfile scripts, we should use the sr CLI. Don't forget to restore `.github/workflows/docker.yml`.</body>
    <file>.rultor.yml</file>
    <author>@h1alexbel</author>
    <email>h1alexbelx@gmail.com</email>
    <time>2025-02-06T11:38:01Z</time>
    <children/>
  </puzzle>
</puzzles>
