This repository has been archived on 2026-04-05. You can view files and clone it, but you cannot make any changes to its state, such as pushing and creating new issues, pull requests or comments.

PhenoGenRLib

The goal of PhenoGenRLib is to simplify nucleotide variant analysis.

As next-generation sequencing (NGS) takes off, more and more data is becoming readily available. Arguably, there is an overabundance of data that has yet to be used to its fullest potential. PhenoGenRLib aims to provide simple ways of loading VCFs, associating them with sample metadata, and running association studies using that metadata.

Installation

You can install the development version of PhenoGenRLib like so:

require("devtools")
devtools::install_github("RealYHD/PhenoGenRLib",
                         build_vignettes = TRUE)
library("PhenoGenRLib")

Getting Started

To get started, have a datasheet ready as a CSV file. At the very least, this datasheet should contain one column in which each row holds the filename of a VCF, including the .vcf extension. For the following example, we will assume that the file is called huntingtons_datasheet_shortened.csv, that it is located at ./inst/extdata/huntingtons_datasheet_shortened.csv, and that the column containing the VCF filenames is named vcfs. We will also need the location of the VCFs. Let's assume they are in the same directory as the metadata CSV, ./inst/extdata/. Then:

library(PhenoGenRLib)
variants <- PhenoGenRLib::linkVariantsWithMetadata(
  metadata = "inst/extdata/huntingtons_datasheet_shortened.csv",
  vcfDir = "inst/extdata/",
  vcfColName = "vcfs"
)
#> Rows: 8 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): vcfs, dummy_pheno
#> dbl (1): chromosome
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> READING VCF
#>  * checking if file exists... PASS
#>  * Reading vcf header...
#>    Done
#>  * Reading vcf body...
#>    Done
#>  * Parse vcf header...
#>    Done
#>  * Split info...
#>  * Done
#>  * Split samples...
#>    Done
#> READING VCF
#>  * checking if file exists... PASS
#>  * Reading vcf header...
#>    Done
#>  * Reading vcf body...
#>    Done
#>  * Parse vcf header...
#>    Done
#>  * Split info...
#>  * Done
#>  * Split samples...
#>    Done
#> READING VCF
#>  * checking if file exists... PASS
#>  * Reading vcf header...
#>    Done
#>  * Reading vcf body...
#>    Done
#>  * Parse vcf header...
#>    Done
#>  * Split info...
#>  * Done
#>  * Split samples...
#>    Done
#> READING VCF
#>  * checking if file exists... PASS
#>  * Reading vcf header...
#>    Done
#>  * Reading vcf body...
#>    Done
#>  * Parse vcf header...
#>    Done
#>  * Split info...
#>  * Done
#>  * Split samples...
#>    Done
#> READING VCF
#>  * checking if file exists... PASS
#>  * Reading vcf header...
#>    Done
#>  * Reading vcf body...
#>    Done
#>  * Parse vcf header...
#>    Done
#>  * Split info...
#>  * Done
#>  * Split samples...
#>    Done
#> READING VCF
#>  * checking if file exists... PASS
#>  * Reading vcf header...
#>    Done
#>  * Reading vcf body...
#>    Done
#>  * Parse vcf header...
#>    Done
#>  * Split info...
#>  * Done
#>  * Split samples...
#>    Done
#> READING VCF
#>  * checking if file exists... PASS
#>  * Reading vcf header...
#>    Done
#>  * Reading vcf body...
#>    Done
#>  * Parse vcf header...
#>    Done
#>  * Split info...
#>  * Done
#>  * Split samples...
#>    Done
#> READING VCF
#>  * checking if file exists... PASS
#>  * Reading vcf header...
#>    Done
#>  * Reading vcf body...
#>    Done
#>  * Parse vcf header...
#>    Done
#>  * Split info...
#>  * Done
#>  * Split samples...
#>    Done
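
Before making the call above, it can help to confirm that every file named in the datasheet actually exists in the VCF directory. The following is a minimal sketch of such a check using only base R. It is not part of PhenoGenRLib: the helper `find_missing_vcfs`, the file names, and the example column values are all hypothetical and chosen for illustration. It builds a tiny datasheet in a temporary directory so it can be run on its own.

```r
# Work in a temporary directory so the sketch does not touch real data.
# All file names below are hypothetical.
work_dir <- file.path(tempdir(), "phenogen_example")
dir.create(work_dir, showWarnings = FALSE)

# A datasheet needs at least one column holding the VCF filenames.
datasheet <- data.frame(
  vcfs = c("sample_a.vcf", "sample_b.vcf"),  # filenames, including .vcf
  dummy_pheno = c("case", "control"),        # example phenotype column
  chromosome = c(4, 4),                      # example numeric metadata
  stringsAsFactors = FALSE
)
datasheet_path <- file.path(work_dir, "datasheet.csv")
write.csv(datasheet, datasheet_path, row.names = FALSE)

# Create only one of the two VCFs so the check has something to report.
writeLines("##fileformat=VCFv4.2", file.path(work_dir, "sample_a.vcf"))

# Hypothetical helper (not a PhenoGenRLib function): returns the filenames
# listed in the datasheet column that are absent from vcf_dir.
find_missing_vcfs <- function(metadata, vcf_dir, vcf_col_name) {
  sheet <- read.csv(metadata, stringsAsFactors = FALSE)
  if (!vcf_col_name %in% names(sheet)) {
    stop("Column '", vcf_col_name, "' not found in ", metadata)
  }
  vcf_files <- sheet[[vcf_col_name]]
  vcf_files[!file.exists(file.path(vcf_dir, vcf_files))]
}

missing_vcfs <- find_missing_vcfs(datasheet_path, work_dir, "vcfs")
print(missing_vcfs)
```

The arguments deliberately mirror `metadata`, `vcfDir`, and `vcfColName` from the call above, so the same three values can be passed to both. An empty result means every listed VCF was found.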

Check out the documentation and vignettes for where to go from here!

Contributions

PhenoGenRLib stands on the shoulders of giants, and it would be a disservice not to name them:

  • Thank you to Syed Haider et al. for providing bedr. It was a great help in simplifying the data ingress features.
  • Thanks to the entire BioMart team for providing an awesome and easy-to-use interface to large public databases!
  • ggplot2 was very helpful in generating figures. Thanks to Wickham et al.!
  • This entire project wouldn't have been possible without the help of the tidyverse team. Although not every package from their collection is used, much of the work and diagnostics relied on their tools.
  • tibble helped simplify data storage and access. Thanks, Müller et al.!

No generative AI was used directly for this project. However, ChatGPT helped with learning how R works and how some of its syntax differs from other languages.

This was a BCB410H1 UofT Bioinformatics project by Harrison Deng.

Citations

Müller K, Wickham H (2023). _tibble: Simple Data Frames_. R package
  version 3.2.1, <https://CRAN.R-project.org/package=tibble>.

Haider S, Waggott D, C. Boutros P (2019). _bedr: Genomic Region
  Processing using Tools Such as 'BEDTools', 'BEDOPS' and 'Tabix'_. R
  package version 1.0.7, <https://CRAN.R-project.org/package=bedr>.

BioMart and Bioconductor: a powerful link between biological
  databases and microarray data analysis. Steffen Durinck, Yves Moreau,
  Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma and Wolfgang
  Huber, Bioinformatics 21, 3439-3440 (2005).

H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
  Springer-Verlag New York, 2016.

Wickham H, Hester J, Bryan J (2023). _readr: Read Rectangular Text
  Data_. R package version 2.1.4,
  <https://CRAN.R-project.org/package=readr>.

Acknowledgements

This package was developed as part of an assessment for the 2023 BCB410H: Applied Bioinformatics course at the University of Toronto, Toronto, Canada. PhenoGenRLib welcomes issues, enhancement requests, and other contributions. To submit an issue, use GitHub Issues.