9 1 30 DEV 1-46ce8519 28-31
Remove dummy test. This is a dummy test that exists only to keep the build from failing. Let's remove it once we have tests for real code. Don't forget to remove this puzzle.
tests/test_cli.py @h1alexbel hialexbel@gmail.com 2024-04-16T16:41:35Z

10 1 45 DEV 1-5b932efb 31-33
Add support for the --repositories and --out parameters. We should add support in our CLI for two options: --repositories and --out. Both indicate the name of the file to read/create.
objects/cli.py @h1alexbel hialexbel@gmail.com 2024-04-16T16:47:42Z

18 10 45 DEV 10-1fb8be50 48-50
Filter the repositories using a general interface. We should execute filtering here through some general interface, so that it would be easy to use either LLM or ML filters.
objects/cli.py @rultor me@rultor.com 2024-04-17T13:32:42Z

43 18 25 DEV 18-2d5b9487 45-48
Give the ability to provide input_text in the request. We should make it possible to pass input_text in the request. I suggest refactoring this script into an object that we can call and that will respond as it does here. Don't forget to remove this puzzle.
model/predict.py @rultor me@rultor.com 2024-04-23T08:39:09Z

62 18 30 DEV 18-d3f80dfc 58-64
Find an effective way to process the readme. For now we are not processing the readme because of <a href="https://github.com/h1alexbel/samples-filter/issues/39">this</a>. We need to find a practical way to process the readme too, since it can be crucial data as model input. Let's study the papers outlined <a href="https://github.com/yegor256/cam/issues/227#issue-2200080559">here</a> first, rethink the approach, and try to implement it here.
objects/cli.py @rultor me@rultor.com 2024-04-26T11:55:11Z

63 18 60 DEV 18-1f7b4ec1 39-42
Create integration test case for filter_pipe.py. We should create an integration test that checks filtering together with model prediction, file creation, and the other things happening in #apply(). Don't forget to remove this puzzle.
objects/filter_pipe.py @rultor me@rultor.com 2024-04-26T11:55:11Z

19 10 35 DEV 10-3a34afdb 51-53
Create {out} file with output result. We should create a file with the name provided for {out}. Don't forget to remove this puzzle.
objects/cli.py @rultor me@rultor.com 2024-04-17T13:32:42Z

22 19 45 DEV 19-b3973cba 59-63
Implement chain of csv transformations. We should implement a transformation chain for csv files. For now we are just adding separate objects to this script. Let's create a class (let's call it `train` or `pipeline`) that would execute all transformations one by one.
objects/cli.py @rultor me@rultor.com 2024-04-17T15:28:50Z

23 19 25 DEV 19-a9982129 36-39
Prepare files in /pipeline directory. We should create them before running the actual processing. For now we need to create the files manually. Don't forget to remove this puzzle.
objects/input.py @rultor me@rultor.com 2024-04-17T15:28:50Z

24 19 45 DEV 19-03ce99ad 35-38
Use raw.githubusercontent.com instead of api.github.com. We should use GitHub's static content hosting instead of their API in order to avoid problems with rate limits. Don't forget to remove this puzzle.
objects/readme.py @rultor me@rultor.com 2024-04-17T15:28:50Z

11 1 25 DEV 1-ce00ea3a 34-37
Clean parameters that we don't need. Let's remove parameters that we don't really need in this CLI. The fewer parameters we have, the better. Don't forget to remove this puzzle.
objects/cli.py @h1alexbel hialexbel@gmail.com 2024-04-16T16:47:42Z

14 1 30 DEV 1-85705910 8-11
Create merge script for rultor. We should create a merge script for rultor to invoke on pull requests. Let's take .github/workflows/py.yml as a basis and do something similar. See also https://github.com/h1alexbel/samples-filter/issues/13.
.rultor.yml @h1alexbel hialexbel@gmail.com 2024-04-17T08:32:10Z

34 30 90 DEV 30-28a95c17 31-34
Prepare labels and encodings for model training. We should collect and label a dataset for model training. Besides that, we need to split the dataset into 80% training data, 10% validation data, and 10% test data.
model/train.py @h1alexbel hialexbel@gmail.com 2024-04-18T17:24:49Z

38 34 60 DEV 34-c3c562de 29-32
Load dataset file from huggingface dataset repository. We should load the dataset from the huggingface dataset repo. It's located <a href="https://huggingface.co/datasets/h1alexbel/github-samples">here</a>. Let's pull it and use it.
model/load.py @rultor me@rultor.com 2024-04-19T14:59:26Z

39 34 60 DEV 34-c96d2341 51-57
CSV column 'readme' is too big to read with reader. The column that contains the readme is in some cases over the field size limit, which is `131072`. This problem causes an exception and stops processing of the following rows. Let's try to compress + decompress the readme when writing/reading it. Another option is to present the whole dataset in another format, for instance JSON. Don't forget to remove this puzzle.
objects/dataset.py @rultor me@rultor.com 2024-04-19T14:59:26Z

35 30 35 DEV 30-60d34e7c 35-38
Document model training in model/README.md. We should document our model: what it is, how we train it, and how to re-train it. Let's create README.md in the `model` directory with this info.
model/train.py @h1alexbel hialexbel@gmail.com 2024-04-18T17:24:49Z

44 41 45 DEV 41-fd32bc04 34-37
Combine all the csv files into merge.csv. We should combine all 4 files: default.csv, examples.csv, samples.csv, and tutorials.csv (all should have readme, use input.py for this) into one big file named merge.csv. Don't forget to remove this puzzle.
model/combine.py @rultor me@rultor.com 2024-04-23T08:39:09Z

45 30 90 DEV 30-025c8878 25-28
Resolve issue with all-time negative prediction. We should resolve the issue where the model predicts negative every time after it has been trained. The issue is probably inside train.py, where we may ignore or mismatch some parameters that are crucial for training.
model/predict.py @rultor me@rultor.com 2024-04-23T08:39:09Z

52 45 60 DEV 45-34bf1ac1 44-49
Train model on target columns instead of just full_name. We should probably re-train the model on description, topics, and readme too. Let's try to standardize the input: first full_name and description, then topics. The README's content can be very big and can break the model. For the readme we can use some summarization model that would provide a limited response.
model/train.py @rultor me@rultor.com 2024-04-24T13:56:55Z

46 30 60 DEV 30-56532a3d 33-37
Fetch pretrained model saved in HuggingFace. Let's fetch and download the pretrained model from the Hugging Face model hub. When the model is trained, we should pack it and deploy it into a Hugging Face repository, and it should be legal to use it here. Don't forget to remove this puzzle.
model/predict.py @rultor me@rultor.com 2024-04-23T08:39:09Z

81 75 45 DEV 75-fdce32da 25-29
Find a way to fetch date of last commit without using API. GitHub API usage has quite strict rate limits. Even with authentication, it can't process the whole dataset (approx. 1k rows). Moreover, we can't allow credentials to be passed in the CLI package. Let's investigate ways to parse the last commit date without the API.
model/data/last_commit.py @rultor me@rultor.com 2024-05-03T16:34:49Z

82 75 30 DEV 75-2238b4fc 47-50
Remove code duplication in matches_keywords.py. Let's make it more generic or decorate the logic in #matches. The only difference between these two functions is that we don't need extra processing for the repository name before passing it to the #any matcher.
model/data/matches_keywords.py @rultor me@rultor.com 2024-05-03T16:34:49Z

88 75 25 DEV 75-e67d119b 46-51
Commands `trainrf` and `transformer` won't work without switching directories. For now, the commands `trainrf` and `transformer` won't work without switching directories before execution. Let's move these Makefile commands to the root Makefile, since it will probably be more convenient to execute one command without switching directories. To make this happen, each command should switch to the directory it needs itself.
model/data/Makefile @rultor me@rultor.com 2024-05-06T11:43:07Z

89 75 30 DEV 75-51f268ee 35-37
Read CSV file from HuggingFace. We should fetch the whole CSV file uploaded to HuggingFace. This will let us use the dataset without building it each time.
model/rf/random-forest.py @rultor me@rultor.com 2024-05-06T11:43:07Z

90 75 40 DEV 75-e88026eb 80-83
Upload model and vectorizer as .joblib files to some file storage. In order to reuse the already trained model, we should upload those files to file storage, or maybe even to HuggingFace. After they are uploaded, we can fetch them and use the #load function.
model/rf/random-forest.py @rultor me@rultor.com 2024-05-06T11:43:07Z

104 75 45 DEV 75-676fe262 36-40
Compose train.csv data as input text. We should compose the data into one string, as described <a href="https://github.com/h1alexbel/samples-filter/issues/75#issuecomment-2094153280">here</a>. Before training, let's concatenate the strings and embed them into a string template, like this: "Description: {description}, created at: {created}, ...".
model/transformer.py @rultor me@rultor.com 2024-05-07T17:21:14Z

119 104 90 DEV 104-39f3530d 32-39
Split README.md into important segments. We should split the README.md file into the segments that are important for processing. For now the problem is the README.md size. If the file is reasonably big, then the model will classify it as negative, even though the README is taken from an example repository. Let's fix it by splitting the README.md file into segments and then picking the segment that represents the repository most accurately. Let's look for techniques that will help us with README splitting in this paper: https://www.researchgate.net/publication/317489055_Cataloging_GitHub_Repositories
model/data/compose.py @rultor me@rultor.com 2024-05-12T11:03:07Z

107 105 90 DEV 105-983cf5f9 45-48
Bump code coverage to 55. After the massive refactoring, code coverage dropped to 44.84%. Let's bring our coverage back to 55%. Don't forget to remove this puzzle.
Makefile @rultor me@rultor.com 2024-05-08T07:09:27Z

109 105 60 DEV 105-16a41eb4 33-37
Process all fields required as inputs. We should process all fields required as inputs: full_name, readme, created_at, last_commit. In the case of the transformer we should do it in a prompt-like way, as an advanced repository description. Check <a href="https://github.com/h1alexbel/samples-filter/issues/75#issuecomment-2094153280">this</a>.
src/feed.py @h1alexbel aliaksei.bialiauski@hey.com 2024-05-08T09:28:46Z

113 109 90 DEV 109-cde4ae97 33-36
Feed `readme`, `last_commit`, `created_at`, and `commits`. We should feed other important fields too. For now we can feed readme, but the transformer model can't process it since the input tensor is too big. Let's resolve that problem and feed readme.
src/feed.py @h1alexbel aliaksei.bialiauski@hey.com 2024-05-10T07:47:21Z

139 129 90 DEV 129-4dfd935d 32-35
Train isolation forest model. We need to find out how to properly train an isolation forest model on the gathered data in order to detect anomalies (SRs). Don't forget to remove this puzzle.
model/Makefile @rultor me@rultor.com 2024-05-24T13:06:04Z

153 143 30 DEV 143-2cf5479a 42-46
We generate embeddings for each token instead of the whole unit. For now, we generate embeddings for each token. We should probably generate embeddings for the joined tokens as one unit. In this case we can try to replace the preprocessing steps with huggingface tokenizers. Let's validate this assumption.
models/model/pre/embeddings.py @rultor me@rultor.com 2024-06-06T06:09:57Z

154 129 90 DEV 129-f9fe4ae9 24-27
Train BERT transformer model. We need to find out how to properly train a BERT model on the gathered data in order to detect anomalies (SRs). Don't forget to remove this puzzle.
models/model/t_bert.py @rultor me@rultor.com 2024-06-06T06:09:57Z