Basic approach to data analysis of longitudinal samples


jamesaries

Participant

Posts: 14

Joined: Thu Sep 22, 2016 2:49 pm

Post Sun Feb 11, 2018 10:51 pm

Basic approach to data analysis of longitudinal samples

Hi all,

I am looking for some basic help with data analysis - as someone with limited experience of this and with little knowledge of R programming.

I am using the CyTOF to measure patient samples longitudinally at various time points post treatment. I am trying to identify an immune signature with 22 markers - mainly cell surface, chemokine and a few transcription factors. I am using PBMCs and mainly interested in T cell subsets.

I am aiming to run approx 200-250 patient samples and within each sample (2.5 million cells) I have spiked 0.5 million healthy PBMCs from one donor (frozen previously into aliquots), as a way of controlling for staining variability over time (healthy and patient samples each labelled with a different CD45 metal conjugate, then 3 million total cells stained with panel).

In terms of the data, firstly, I am looking at it manually in FlowJo - separating the patient and healthy samples by CD45 gating (after cleaning the data). This is clearly not the best approach long-term but is hopefully enabling me to check that the experiment ‘worked’ and give me any broad clues as to changes in markers / subsets etc. One thing I’m surprised about - or maybe shouldn’t be - is the variation in the healthy sample between batches (given these are the same cells - stored in batches of 10 million / vial and defrosted for each run). Not so much with CD4 / CD8 / CD27 but certainly with some of the other markers (e.g. CCR9, iNKT etc). I wondered if anyone had any comment on this? At present, the variation in the healthy stained samples makes it very hard to interpret any changes in the patient samples. Is anyone familiar with ways of correcting for the staining variation, and how might I approach this?

For the longer-term I’m thinking of using CITRUS to look at differences between groups and possibly Phenograph as well.

I have limited biostatistician support where I am, so any recommendations on helpful biostats guys would also be much appreciated. In fact - any words of wisdom at all!

Best wishes,

James
<<

sgranjeaud

Master

Posts: 123

Joined: Wed Dec 21, 2016 9:22 pm

Location: Marseille, France

Post Mon Feb 12, 2018 8:28 am

Re: Basic approach to data analysis of longitudinal samples

Hi James,
This is a nice experiment with a lot of data. I recommend finding help from a computational biologist or biostatistician.
Longitudinal data means repeated measures of the same patient, as you know. The analysis usually involves removing the variability between patients in order to focus on common trends. Mixed models or repeated-measures ANOVA are the typical tools for such analyses. A simpler approach is to compare two time points (or visits) with a paired test, such as a paired t-test on the within-patient differences. I think this may be available in Citrus, since it is based on the SAM package, which provides such an analysis. Alternatively, you should look at the current developments in the MD Robinson lab concerning linear models, or at the cydar package. With Citrus, avoid the default 5% minimum cluster size, as it will prevent you from finding populations smaller than 5%.
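As a rough illustration of the paired comparison described above, here is a minimal base R sketch. The patient IDs, visit names, and cluster percentages are all made up for the example; only `t.test` itself is standard R.

```r
# Hypothetical data: % of one T cell cluster in 6 patients at two visits.
freq <- data.frame(
  patient  = paste0("P", 1:6),
  baseline = c(12.1, 8.4, 15.3, 10.2, 9.8, 13.5),
  week4    = c(14.0, 9.9, 17.1, 11.0, 11.6, 15.2)
)

# Paired test: each patient serves as their own control, so stable
# between-patient differences cancel out of the comparison.
paired <- t.test(freq$week4, freq$baseline, paired = TRUE)

# Unpaired test on the same numbers, shown only for contrast.
unpaired <- t.test(freq$week4, freq$baseline, paired = FALSE)

print(c(paired = paired$p.value, unpaired = unpaired$p.value))
```

In this toy data every patient shifts by a similar amount, so the paired test picks up the change while the unpaired test is swamped by the spread between patients. With more than two visits, a mixed model is likely the better fit.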
Best.
<<

bc2zbUVA

Contributor

Posts: 22

Joined: Thu Nov 19, 2015 4:23 pm

Post Mon Feb 12, 2018 3:29 pm

Re: Basic approach to data analysis of longitudinal samples

Seconding what has been said already. Look at Robinson and cydar. The thing I like about cydar is that it lays out an approach to analyzing cytometry abundance data using the edgeR framework, which is extremely powerful and makes time course analysis in R quite efficient. If any of your biostatisticians have done RNA-seq analysis, they should be able to help you with your cytometry data once it's clustered. We've performed several time course and covariate analyses at this point leveraging the edgeR framework, and it works as expected with cluster data derived from various algorithms (flowType, Citrus, FlowSOM, everything in cytofkit), not just the hyperspheres that cydar generates. The biggest hangup you might have is 250 patient samples. I don't know how efficient Citrus is at sample sizes that large. From my personal experience, FlowSOM does quite well with 100s of samples, and comes highly vetted by the community.

-Brian
<<

dtelad11

Master

Posts: 129

Joined: Mon Oct 31, 2016 6:26 pm

Post Mon Feb 12, 2018 4:39 pm

Re: Basic approach to data analysis of longitudinal samples

I recently launched an online platform for analyzing mass cytometry data that should answer many of the analytical needs you raised in this post. Our approach is very similar to what Brian mentioned -- we use FlowSOM for clustering and edgeR for the statistics. We also offer a graphical user interface, tech support, and several exports if you want to do follow-up bioinformatics. Here's a short video that quickly demos the platform:

https://www.youtube.com/watch?v=OS8iT3bgBpg

Email me at el-ad@astrolabediagnostics.com for more details. I'm happy to schedule a free demo.
<<

sgranjeaud

Master

Posts: 123

Joined: Wed Dec 21, 2016 9:22 pm

Location: Marseille, France

Post Mon Feb 12, 2018 7:36 pm

Re: Basic approach to data analysis of longitudinal samples

I share Brian's opinion concerning the limits of Citrus. Citrus depends on hierarchical clustering, which requires holding all the data in memory and whose running time grows much faster than linearly (at least quadratically) with the number of cells. IMHO, subsampling the FCS files is needed, and it also lets you balance your biological samples. But depending on how many patients you have at each time point, and if you want to analyse two distant time points, Citrus is interesting.

Brian, could you tell me what kind of values you push to edgeR? counts or percentages?

Best.
<<

bc2zbUVA

Contributor

Posts: 22

Joined: Thu Nov 19, 2015 4:23 pm

Post Mon Feb 12, 2018 7:57 pm

Re: Basic approach to data analysis of longitudinal samples

Lun has a nice writeup of how to use edgeR with abundance data in the cydar vignette, in the section "Testing for significant differences in abundance".

http://bioconductor.org/packages/release/bioc/vignettes/cydar/inst/doc/cydar.html

Essentially, you create a table of counts for your data. This is demonstrated in detail in the Nowicka paper, in the section "Differential cell population abundance".

https://f1000research.com/articles/6-748/v2
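To make the "table of counts" concrete, here is a minimal base R sketch. It assumes you already have one row per cell with a sample label and a cluster label; the `cells` data frame and its labels are invented for illustration, not taken from either paper.

```r
# Hypothetical per-cell annotations: one row per cell, as you might have after
# clustering (cluster IDs) and tracking which FCS file each cell came from.
cells <- data.frame(
  sample  = c("S1", "S1", "S1", "S2", "S2", "S2", "S2", "S3", "S3"),
  cluster = c("CD4", "CD8", "CD4", "CD4", "CD8", "CD8", "NK", "CD4", "NK")
)

# Cross-tabulate: rows = clusters, columns = samples, entries = cell counts.
count_table <- unclass(table(cells$cluster, cells$sample))

# For flat (non-nested) clusters, the column sums are the cells per sample.
lib_sizes <- colSums(count_table)

print(count_table)
print(lib_sizes)
```

A matrix shaped like this (clusters in rows, samples in columns) is the layout that count-based tools from the RNA-seq world expect, with clusters playing the role of genes.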

Once you have your table, you can follow the standard linear modeling approach in R, building up your design matrix. For anyone who wants to see which designs can be analyzed in the edgeR framework, which itself builds on the limma framework, read the user manuals for both packages. They are exhaustive, and most experimental designs you can think of are covered there. Chapter 9 of the limma manual covers paired designs, batch effects, and time course analysis. All of the approaches outlined in that chapter can be carried over to edgeR by replacing the limma commands with their edgeR equivalents. For anyone who would like to know:

limma set of commands:

library(limma)
design <- model.matrix(~condition)
fit <- lmFit(eset, design)
fit2 <- eBayes(fit)
de_genes <- topTable(fit2)

edgeR set of commands:

library(edgeR)
design <- model.matrix(~condition)
# BE VERY CAREFUL ABOUT YOUR lib.size parameter: this is the number of cells you collected.
# If you perform a hierarchical clustering (e.g. Citrus), the clusters are nested, so your lib.size is not the sum of your columns.
dge <- DGEList(count_table, lib.size = colSums(count_table))

# First deviation from limma: estimating dispersion based on your design helps account for the variability in cell counting.
dge <- estimateDisp(dge, design)
fit <- glmQLFit(dge, design, robust = TRUE) # analogous to the lmFit step above

res <- glmQLFTest(fit) # analogous to the combination of the eBayes AND topTable commands above; tests the last design column by default
res_table <- topTags(res, n = Inf)$table # extracts the full table of results, with FDR-adjusted p values
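
For a longitudinal study like the one in this thread, the `~condition` design above would probably need a patient term. Here is a minimal sketch of a paired design matrix in base R, in the style of the paired-samples examples in the limma manual. The sample sheet, patient IDs, and time labels are hypothetical.

```r
# Hypothetical sample sheet: 3 patients, each measured at 2 time points.
samples <- data.frame(
  patient = factor(c("P1", "P1", "P2", "P2", "P3", "P3")),
  time    = factor(c("pre", "post", "pre", "post", "pre", "post"),
                   levels = c("pre", "post"))
)

# Patient enters as a blocking factor; the last column is the
# post-vs-pre effect within patients.
design <- model.matrix(~ patient + time, data = samples)
print(colnames(design))
```

Because the time effect is the last column, passing this `design` through the edgeR commands above would test post versus pre while accounting for patient-to-patient differences. The columns of the count table must be in the same order as the rows of the sample sheet.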

I think that's about everything and I apologize if I miscommunicated anything.
<<

sgranjeaud

Master

Posts: 123

Joined: Wed Dec 21, 2016 9:22 pm

Location: Marseille, France

Post Mon Feb 12, 2018 9:52 pm

Re: Basic approach to data analysis of longitudinal samples

Many thanks Brian. This sounds perfect and very useful for all participants here.

Just a personal opinion concerning p-values. Large counts can quickly yield a highly significant p-value even when the ratio between conditions is very moderate. One has to keep an eye on the fold change and its meaning. Lukas or Mark might correct me if I am wrong.
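
A toy illustration of this point, using made-up counts and a simple pooled test of proportions (`prop.test` from base R). This is not what edgeR does, since edgeR models the variability between replicate samples, so take it only as a sketch of how the same fold change becomes "significant" once counts get large.

```r
# Hypothetical counts for one cluster in two conditions, same 5% relative
# increase (10.0% -> 10.5% of cells), at two different depths.
p_small <- prop.test(x = c(105, 100),     n = c(1000, 1000))$p.value
p_large <- prop.test(x = c(10500, 10000), n = c(1e5, 1e5))$p.value

fold_change <- (105 / 1000) / (100 / 1000)  # 1.05 in both cases

print(c(fold_change = fold_change, p_small = p_small, p_large = p_large))
```

The fold change is identical in both cases; only the number of cells differs, which is why the effect size needs to be read alongside the p-value.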
<<

bc2zbUVA

Contributor

Posts: 22

Joined: Thu Nov 19, 2015 4:23 pm

Post Mon Feb 12, 2018 10:08 pm

Re: Basic approach to data analysis of longitudinal samples

Excellent points. We also keep a tight watch on the number of events per subpopulation. We tend to remove or merge subpopulations (or clusters if you prefer) that have fewer than 100 events. It is my understanding that it's difficult to sort out such rare populations (I'm just a bioinformatician), and from a computational perspective, rare populations can be very misleading: they can show a significant p value and a large effect size on the strength of very few cells, so they should be accounted for. Citrus lets you limit how small the clusters go, cydar incorporates a filtering step to discard rare populations, and the Nowicka F1000 paper leverages consensus clustering. I know Astrolabe also accounts for this. If you are rolling your own implementation of a clustering algorithm, keep your eye on all three: p values, effect size, and cell count. edgeR outputs the logCPM alongside its p value, FDR, and fold changes, so you can look at that value and see if your differentially abundant cluster is actually abundant.
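
For anyone rolling their own pipeline, the rare-cluster filter could look something like this minimal base R sketch. The count table is invented, and applying the 100-event cutoff to the total across samples is an assumption made for the example, since a per-sample cutoff would be an equally reasonable reading.

```r
# Hypothetical cluster-by-sample count table (rows = clusters).
count_table <- matrix(
  c(5200, 4800, 5100,
    1500, 1700, 1600,
      20,   35,   15),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("CD4", "CD8", "rare"), c("S1", "S2", "S3"))
)

min_events <- 100  # cutoff applied to the row total here (an assumption)
keep <- rowSums(count_table) >= min_events
filtered <- count_table[keep, , drop = FALSE]

# Keep library sizes from the unfiltered table, so proportions still refer
# to all cells collected per sample.
lib_sizes <- colSums(count_table)
print(rownames(filtered))
```

Computing `lib_sizes` before filtering follows the earlier advice that lib.size should reflect the number of cells collected, not whatever is left after clusters are dropped.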
<<

jamesaries

Participant

Posts: 14

Joined: Thu Sep 22, 2016 2:49 pm

Post Mon Feb 12, 2018 11:16 pm

Re: Basic approach to data analysis of longitudinal samples

Thanks everyone,

As always, great forum, great contributions. Appreciated.

James
